Hugging Face speeds up ZeroGPU with AoT compilation in PyTorch

ZeroGPU is Hugging Face's solution for running demos on Spaces without keeping a GPU busy all the time. What's the catch? Demos that generate images or video suffer from cold starts, because just-in-time compilation takes too long to pay off in short-lived GPU processes. Hugging Face proposes using PyTorch's ahead-of-time compilation to prepare optimized binaries once and reload them on the fly, for faster responses and smoother demos. (huggingface.co)

What is ZeroGPU and why does it matter?

ZeroGPU lets a Space use powerful GPUs like the H200 only when there’s traffic. Instead of reserving a GPU permanently, ZeroGPU spins up short-lived processes that initialize CUDA, run the job, and shut down — so you don't pay for idle hardware when there are no users. That makes it ideal for public demos with intermittent traffic. (huggingface.co)

Note: the feature is available to Pro users and teams, although anyone can use ZeroGPU Spaces created by those accounts. (huggingface.co)

What does ahead-of-time (AoT) compilation bring?

PyTorch offers two compilation paths: torch.compile, which compiles just in time at runtime, and an AoT flow (torch.export + AOTInductor) that compiles before serving. In environments where GPU processes are ephemeral, like ZeroGPU, torch.compile can't reuse its work well, so each short-lived process would pay the compilation cost again. AoT lets you export an already-optimized model once and reload it almost instantly in every new process.

That reduces framework overhead and removes a big chunk of startup time. Basically, you pay the compilation cost once and then get near-instant launches. (huggingface.co)
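
Outside the spaces helpers, the raw PyTorch flow looks roughly like this (a minimal sketch using torch.export and the AOTInductor packaging APIs; the tiny module, shapes, and file name are only illustrative):

import torch

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.proj(x))

model = TinyBlock().eval()
example = (torch.randn(8, 64),)

# 1. Trace the model once into an ExportedProgram.
exported = torch.export.export(model, example)

# 2. Compile ahead of time with AOTInductor and package the artifact to disk.
package = torch._inductor.aoti_compile_and_package(exported, package_path="tiny_block.pt2")

# 3. Later, even in a fresh process, reload the optimized binary almost instantly.
compiled = torch._inductor.aoti_load_package(package)
out = compiled(torch.randn(8, 64))  # same call signature as the original module

On ZeroGPU, the spaces helpers described below wrap this export-compile-apply cycle, so you rarely call these functions directly.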

How they implement it in a Space (practical summary)

The Hugging Face guide shows a concrete flow with these steps:

  • Capture example inputs from the heavy component (for example pipe.transformer).
  • Export the component to an ExportedProgram with torch.export.export using those examples.
  • Compile the export with a helper spaces.aoti_compile that wraps torch._inductor.aot_compile.
  • Apply the compiled binary to the pipeline with spaces.aoti_apply, which swaps it in without leaving the old parameters lingering in memory.
  • Run the whole compilation process inside a function decorated with @spaces.GPU, because compilation has to happen on the real hardware the Space will run on. (huggingface.co)

Example of minimal code changes:

import gradio as gr
import spaces
import torch
from diffusers import DiffusionPipeline

MODEL_ID = "black-forest-labs/FLUX.1-dev"  # any diffusers pipeline works; FLUX.1-dev is the checkpoint cited in the results below

pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
pipe.to('cuda')

@spaces.GPU
def generate(prompt):
    return pipe(prompt).images

# Additional AoT code (abridged):
@spaces.GPU(duration=1500)  # a long GPU slot: compilation runs once, on real hardware
def compile_transformer():
    # 1. Capture the real inputs the transformer receives during a normal call
    with spaces.aoti_capture(pipe.transformer) as call:
        pipe("example")
    # 2. Export the transformer with those captured inputs
    exported = torch.export.export(pipe.transformer, args=call.args, kwargs=call.kwargs)
    # 3. Compile the exported program with AOTInductor
    return spaces.aoti_compile(exported)

compiled = compile_transformer()
# 4. Swap the compiled binary into the pipeline
spaces.aoti_apply(compiled, pipe.transformer)

This is only a few extra lines but can noticeably improve latency. (huggingface.co)

Things to keep in mind before compiling AoT

  • Quantization: combining AoT with quantization (for example FP8) gives extra gains, but FP8 requires CUDA capability 9.0 or higher. On ZeroGPU the H200 already supports these techniques (see the torchao sketch after this list). (huggingface.co)

  • Dynamic shapes: images and videos come in variable resolutions. When exporting, you need to mark which dimensions are dynamic and set min/max ranges for torch.export (see the sketch after this list). That takes inspection and some trial and error depending on the model. (huggingface.co)

  • Multi-compile: if variability is very large (for example many video sizes), one strategy is to compile multiple versions per resolution and share weights between them, dispatching the correct binary at runtime (see the dispatch sketch after this list). (huggingface.co)

  • FlashAttention-3: FA3 is supported and speeds things up further; Hugging Face recommends using their Hub of precompiled kernels to avoid building kernels from scratch and wasting GPU time. (huggingface.co)
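
For the quantization bullet, the usual pattern is to quantize the heavy component with torchao before exporting it, so the compiled binary bakes the FP8 paths in. A minimal sketch, assuming a recent torchao where Float8DynamicActivationFloat8WeightConfig is available and that pipe is the pipeline from the example above:

from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig

# Quantize in place; requires a GPU with CUDA capability >= 9.0 (e.g. H100/H200).
quantize_(pipe.transformer, Float8DynamicActivationFloat8WeightConfig())

# Export and AoT-compile *after* quantizing, exactly as in the earlier snippet,
# so the resulting binary already contains the FP8 kernels.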
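
For dynamic shapes, torch.export lets you declare which input dimensions may vary and within which bounds via torch.export.Dim. A minimal sketch on a toy module (the dimension names and ranges are illustrative; a real diffusion transformer takes several tensors, so you repeat the spec per argument):

import torch
from torch.export import Dim

# Toy stand-in for a model whose spatial dimensions change with the requested resolution.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).eval()

height = Dim("height", min=32, max=128)
width = Dim("width", min=32, max=128)

example = (torch.randn(1, 4, 64, 64),)
# Dimensions not listed are treated as static; dims 2 and 3 may vary within the ranges above.
dynamic_shapes = ({2: height, 3: width},)

exported = torch.export.export(model, example, dynamic_shapes=dynamic_shapes)
print(exported)  # inspect the captured graph and its symbolic shapes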
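
And for multi-compile, the dispatch itself can be as simple as a dict of compiled binaries keyed by resolution. A hypothetical sketch with the raw AOTInductor APIs (on ZeroGPU you would compile each variant with spaces.aoti_compile, and the blog additionally shares the weights between variants, which this toy example does not show):

import torch

# Toy stand-in for the transformer; a real Space would reuse pipe.transformer.
model = torch.nn.Conv2d(4, 4, kernel_size=3, padding=1).eval()

def compile_for(height, width):
    # One ahead-of-time compile per supported resolution.
    ep = torch.export.export(model, (torch.randn(1, 4, height, width),))
    path = torch._inductor.aoti_compile_and_package(ep, package_path=f"model_{height}x{width}.pt2")
    return torch._inductor.aoti_load_package(path)

compiled = {res: compile_for(*res) for res in [(32, 32), (64, 64)]}

def run(latent):
    # Pick the binary that matches the incoming resolution.
    return compiled[tuple(latent.shape[-2:])](latent)

out = run(torch.randn(1, 4, 64, 64))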

Results and examples you can try

In their published examples, improvements range from about 1.3× to 1.8× depending on the model and options applied. With FLUX.1-dev they reported around a 1.7× latency improvement, and combined with FA3 some cases reach 1.75×. If you want to experiment, Hugging Face gathered several examples in an organization called ZeroGPU-AOTI for direct testing. (huggingface.co)

Demos and resources are collected in the ZeroGPU-AOTI organization on Hugging Face. It's also worth reviewing the AOTInductor tutorial in the PyTorch docs to understand the advanced options. (huggingface.co)

What does this mean for you?

If you publish Spaces demos that generate images, video, or run any other GPU-heavy task, AoT on ZeroGPU is a practical lever to reduce latency and costs. Do you want a demo that responds quickly, or would you rather pay for GPUs that are always on? AoT gives you the fast response without the always-on bill.

If you're a developer, try this flow on a branch of your Space: capture real examples, export, compile with spaces.aoti_compile, and apply the binary. If you're putting together a public showcase or a client prototype, this makes the experience feel more professional without increasing the bill for idle GPUs. (huggingface.co)

If you want to dive deeper, the original post includes code, demos, and links to resources like precompiled kernels and community discussions. (huggingface.co)
