Waypoint-1 is Overworld's proposal to bring interactive world models to real time, controllable by text, mouse, and keyboard. Can you imagine giving a model just a few frames and being able to step into that world, move the camera freely, and press any key with no noticeable latency? That's exactly what Waypoint-1 aims for.
What is Waypoint-1
Waypoint-1 is a latent video diffusion model designed from the ground up for interactivity. There are two announced variants: Waypoint-1-Small and Waypoint-1-Medium (coming soon). It's trained on 10,000 hours of gameplay footage, including control inputs and textual captions that describe the scene.
Unlike many world models that take a pretrained video model and fine-tune it with simple controls, Waypoint-1 is trained from scratch with complex controls in mind. The result? Essentially unlimited controls: move the camera with the mouse, send any keyboard key — all with virtually zero latency because each frame is generated conditioned on the current controls.
Architecture and training (technical but clear)
- Backbone: a frame-causal transformer. Frame-causal means each token can only attend to its own frame and past frames, never to future frames.
- Latent space: the model operates on compressed frame latents rather than raw pixels, which cuts compute and memory.
- Primary training: diffusion forcing. The model learns to denoise future frames conditioned on previous frames. Each frame is noised with an independently sampled noise level and denoised independently, which is what enables generating frames one by one at inference time.
- Identified problem: random per-frame noise creates a mismatch with autoregressive inference; errors accumulate in long rollouts and noisy artifacts appear.
- Solution: post-training with self-forcing via DMD (Distribution Matching Distillation). This aligns the training regime with inference behavior, reduces error accumulation, and enables few-step denoising along with a form of single-pass CFG.
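The frame-causal constraint can be sketched as an attention mask. The sizes below are toy values for illustration (Waypoint-1 uses 256 tokens per frame), not the model's real dimensions:

```python
import numpy as np

# Frame-causal attention mask: with F frames and K tokens per frame, a token in
# frame f may attend to any token in frames 0..f (its own frame included), never later.
F, K = 4, 3                                    # toy sizes for illustration
frame_of = np.repeat(np.arange(F), K)          # frame index of each of the F*K tokens
mask = frame_of[:, None] >= frame_of[None, :]  # query in frame f sees key in frame g iff g <= f

# Tokens within one frame attend to each other fully; nothing attends forward in time.
assert mask[0, :K].all() and not mask[0, K:].any()
assert mask[-1].all()  # tokens in the last frame see the entire context
```

Because the mask depends only on frame indices, past frames' keys and values can be cached and reused as the stream grows, which is what makes frame-by-frame generation cheap.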
Important: the frame-causal design and the self-forcing phase are key so the model remains usable in long interactive streams without rapid degradation.
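The per-frame noising scheme behind diffusion forcing can be sketched in a few lines. Everything here is a toy illustration: the variance-preserving mixing formula and the array sizes are assumptions for the sketch, not the model's actual noise schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent video: T frames, each a D-dim latent (stand-in for real frame latents).
T, D = 8, 16
clean = rng.normal(size=(T, D))

# Diffusion forcing: each frame gets its own independently sampled noise level,
# so training covers every mix of "clean past, noisy future" the model will
# encounter when generating frames one at a time.
t = rng.uniform(0.0, 1.0, size=(T, 1))               # per-frame noise level in [0, 1]
noise = rng.normal(size=(T, D))
noisy = np.sqrt(1.0 - t) * clean + np.sqrt(t) * noise  # simple variance-preserving mix

# A frame-causal denoiser would be trained to recover the noise in frame f
# from noisy[: f + 1]; here we only construct the training pair.
assert noisy.shape == clean.shape == (T, D)
```

The mismatch the article mentions follows directly: at inference, the "past" frames the model conditions on are its own generations rather than noised ground truth, which is what the self-forcing post-training phase corrects.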
WorldEngine: the inference library
WorldEngine is the high-performance library Overworld publishes to run Waypoint-1 in real time. It's built for interactive applications and written in Python, optimized for low latency and high throughput.
The runtime loop consumes: context frames, keyboard/mouse and text inputs, and produces output frames ready for streaming. It's designed so you can integrate the model into games, interactive demos, or procedural creation pipelines.
Performance and concrete metrics
On Waypoint-1-Small (2.3B parameters) running on an RTX 5090, WorldEngine achieves:
- ~30,000 token-passes/second (a single denoising step; 256 tokens per frame).
- 30 FPS at 4 steps.
- 60 FPS at 2 steps.
That means on modern consumer hardware you can get a smooth experience without needing huge clusters.
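As a sanity check, the three published numbers are mutually consistent. A quick back-of-the-envelope:

```python
# ~30,000 token-passes/s at 256 tokens per frame means roughly 117
# single-step frame passes per second; dividing by the number of
# denoising steps per frame gives the achievable frame rate.
token_passes_per_s = 30_000
tokens_per_frame = 256

frame_passes_per_s = token_passes_per_s / tokens_per_frame  # ~117.2

fps_at_4_steps = frame_passes_per_s / 4  # ~29.3, i.e. ~30 FPS
fps_at_2_steps = frame_passes_per_s / 2  # ~58.6, i.e. ~60 FPS
```

So the 30 FPS and 60 FPS figures follow almost exactly from the single-step throughput; fewer denoising steps trade a little quality for frame rate.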
Optimizations that make the difference
WorldEngine bundles several optimizations that together produce the observed performance:
- AdaLN feature caching: avoids recomputing AdaLN conditioning projections when the prompt and timesteps don't change between passes.
- Static Rolling KV Cache + Flex Attention: cache design and flexible attention scheme for efficient key/value access across frames.
- Matmul fusion: fuses standard QKV operations to reduce inference overhead.
- Use of torch.compile with an aggressive configuration:
torch.compile(fullgraph=True, mode="max-autotune", dynamic=False)
These pieces together keep latency low and throughput high in the interactive loop.
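The rolling-cache idea can be illustrated with a minimal sketch. This is not WorldEngine's actual data structure, just the general pattern: preallocate a fixed window of key/value slots and overwrite the oldest frame, so memory stays constant over an unbounded interactive stream and no reallocation happens in the hot loop:

```python
import numpy as np

class StaticRollingKVCache:
    """Minimal sketch of a static rolling K/V cache: a fixed-size,
    preallocated buffer where new frames overwrite the oldest slot."""

    def __init__(self, max_frames: int, tokens_per_frame: int, dim: int):
        self.k = np.zeros((max_frames, tokens_per_frame, dim), dtype=np.float32)
        self.v = np.zeros_like(self.k)
        self.max_frames = max_frames
        self.count = 0  # total frames written so far

    def append(self, k_frame: np.ndarray, v_frame: np.ndarray) -> None:
        slot = self.count % self.max_frames  # overwrite the oldest slot in place
        self.k[slot] = k_frame
        self.v[slot] = v_frame
        self.count += 1

    def window(self):
        """Return the cached K/V in temporal order (oldest -> newest)."""
        n = min(self.count, self.max_frames)
        start = self.count % self.max_frames if self.count > self.max_frames else 0
        order = [(start + i) % self.max_frames for i in range(n)]
        return self.k[order], self.v[order]
```

Because the buffer shapes never change, this layout also plays well with `torch.compile(dynamic=False)`-style compilation, where static shapes avoid recompilation.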
Usage example (practical)
WorldEngine exposes a simple API to prototype an interactive experience. Here’s an example that shows the idea:
from world_engine import WorldEngine, CtrlInput

# Create the inference engine
engine = WorldEngine("Overworld/Waypoint-1-Small", device="cuda")

# Set the text prompt
engine.set_prompt("A game where you herd goats in a beautiful valley")

# Optionally force the next frame from an image
img = engine.append_frame(uint8_img)  # uint8 array of shape (H, W, 3)

# Generate 3 frames conditioned on controller inputs
for controller_input in [
    CtrlInput(button={48, 42}, mouse=[0.4, 0.3]),
    CtrlInput(mouse=[0.1, 0.2]),
    CtrlInput(button={95, 32, 105}),
]:
    img = engine.gen_frame(ctrl=controller_input)
With this you can prototype anything from camera-only demos to full games where each key triggers actions in the generated world.
Applications and practical limitations
What is this useful for today? Some concrete ideas:
- Rapid prototyping of levels and game mechanics without 3D modeling.
- Interactive creative experiences for installations or digital art.
- AI-assisted procedural design tools.
Limitations to keep in mind:
- Waypoint-1 is strong in game-like dynamics thanks to its dataset, but its behavior outside that domain (for example, realistic footage from handheld cameras) can be less robust.
- Long-term state persistence and semantic coherence over very long horizons remain general challenges for world models.
Events and resources
Overworld is running a world_engine hackathon on January 20, 2026. It's a good opportunity to try the library, compete for a 5090 GPU, and meet the founders and engineers. To see live demos, visit: https://overworld.stream
It’s designed so small teams and developers can experiment and extend the runtime; if you want to explore practical applications, it’s a strong starting point.
Final thoughts
Waypoint-1 combines architecture choices and inference optimizations to bring interactive video models to consumer hardware. Techniques like diffusion forcing and the subsequent fix with self-forcing show that training with inference in mind pays off when the goal is real interactivity.
Are you interested in building interactive AI experiences without relying on massive infrastructure? Waypoint-1 and WorldEngine are a step in that direction.
