Fine-tuning Cosmos Predict 2.5 with LoRA/DoRA for Robot Videos | Keryc
NVIDIA publishes a technical guide to adapt Cosmos Predict 2.5 to concrete robotics tasks, showing how to use LoRA and DoRA to generate synthetic robot trajectories without retraining the whole model. What's the goal? Create physically plausible videos conditioned on text and images, and use them as scalable data to train robot policies.
What NVIDIA announces
Cosmos Predict 2.5 is a large-scale world model that generates physically consistent videos conditioned on text, images, or clips. NVIDIA shows a parameter-efficient fine-tuning pipeline using LoRA and DoRA to adapt the model to specific domains (for example, robot manipulation or particular camera views).
The practical novelty: instead of retraining the model's 2B parameters (expensive and prone to forgetting general knowledge), you inject small, portable adapters that let you train on a single powerful GPU and then swap adapters per task.
Why this is useful for robotics
Collecting real trajectories is slow and costly. What if you could generate thousands of synthetic trajectories that are physically plausible and specific to your camera or robot setup? That speeds up policy iteration.
LoRA and DoRA keep the backbone frozen and train only a small set of low-rank parameters. The resulting adapter is small, easy to share, and avoids the catastrophic forgetting you get when retraining the whole model.
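To make "small" concrete, here is a rough parameter count for a single linear layer. The layer width of 2048 is a hypothetical value picked for illustration, not a figure from the guide; the formula itself is the standard LoRA factorization (two matrices of rank r), plus one magnitude per output feature for DoRA.

```python
def lora_param_count(d_in: int, d_out: int, r: int, use_dora: bool = False) -> int:
    """Trainable parameters a rank-r adapter adds to one d_in -> d_out linear layer."""
    # LoRA factors: A has shape (r, d_in), B has shape (d_out, r).
    count = r * d_in + d_out * r
    if use_dora:
        # DoRA additionally learns one magnitude value per output feature.
        count += d_out
    return count


# Hypothetical layer width, chosen only for illustration (not from the guide).
D = 2048
full = D * D  # weights in one frozen D x D projection

for r in (8, 32):
    lora = lora_param_count(D, D, r)
    dora = lora_param_count(D, D, r, use_dora=True)
    print(f"r={r}: LoRA adds {lora} params ({lora / full:.2%} of the layer), DoRA adds {dora}")
```

Under these assumptions the adapter stays at a few percent of the layer even at r=32, which is why the saved checkpoint is so much smaller than the backbone.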
How LoRA and DoRA are implemented in Cosmos Predict 2.5
Cosmos Predict 2.5 has three submodules: a VAE that encodes videos to latents, a text encoder, and a DiT that performs latent diffusion. During tuning, all base weights stay frozen and adapters are injected at key points of the DiT.
Typical target modules are attention projections to_q, to_k, to_v, to_out.0 and some feedforward layers ff.net.0.proj, ff.net.2.
In summary, loading the pipeline and adding LoRA works like this: you configure LoraConfig with r (rank), lora_alpha and target_modules, and then call dit.add_adapter(lora_config).
For numerical stability, the adapter's trainable parameters are upcast to float32 when training with bf16.
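If you enable use_dora=True, the adapter uses weight decomposition (DoRA): each adapted weight is split into a magnitude and a direction, the low-rank update is applied to the direction, and the magnitude is learned separately. You don't need to change the training loop.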
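The difference between the two update rules is easier to see on a toy matrix. The sketch below is plain Python, not the PEFT implementation. It assumes weights are stored as (out_features, in_features) with one magnitude per output row, B initialized to zero, and magnitudes initialized to the row norms of the frozen weight. All names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, b, a, scale):
    """Plain LoRA: W = W0 + scale * (B @ A)."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


def dora_weight(w0, b, a, scale, magnitude):
    """DoRA: normalize each row of the LoRA-updated weight, then rescale it
    by a separately learned magnitude."""
    v = lora_weight(w0, b, a, scale)
    out = []
    for row, m in zip(v, magnitude):
        norm = math.sqrt(sum(x * x for x in row))
        out.append([m * x / norm for x in row])
    return out


# Toy 2x2 frozen weight with a rank-1 adapter.
w0 = [[3.0, 4.0], [0.0, 2.0]]
b = [[0.0], [0.0]]   # B starts at zero, so the adapter begins as a no-op
a = [[1.0, 1.0]]
# Assumed initialization: magnitude equals the row norm of the frozen weight.
magnitude = [math.sqrt(sum(x * x for x in row)) for row in w0]

print(dora_weight(w0, b, a, 1.0, magnitude))
```

With B at zero, both variants reproduce the frozen weight, so training starts from the base model's behavior. Once B moves, DoRA keeps each row's length pinned to its magnitude parameter, and the low-rank update only changes direction.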
Data and training pipeline
Training dataset: 92 robot manipulation videos with text prompts (pick-and-place).
Evaluation dataset: 50 (prompt, initial image) pairs, with one video generated per pair.
Expected format after preprocessing:
gr1_dataset/train with videos/*.mp4, metas/*.txt and metadata.csv.
gr1_dataset/test with pairs filename1.txt, filename1.png, etc.
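For the test split, prompts and first frames are matched by file name. A minimal sketch of that pairing could look like this. The function name load_eval_pairs is hypothetical, and the only assumption about the layout is that each .txt prompt sits next to a .png with the same stem.

```python
from pathlib import Path


def load_eval_pairs(test_dir):
    """Return (prompt, image_path) pairs matched by file stem."""
    test_dir = Path(test_dir)
    pairs = []
    for txt in sorted(test_dir.glob("*.txt")):
        image = txt.with_suffix(".png")
        if not image.exists():
            continue  # skip prompts that have no matching first frame
        prompt = txt.read_text(encoding="utf-8").strip()
        pairs.append((prompt, image))
    return pairs
```

Calling load_eval_pairs("gr1_dataset/test") would then yield one entry per evaluation video to generate.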
The VideoDataset loads each example as (caption, video). If the video is longer than num_frames, it extracts a random window each epoch, which acts as temporal augmentation.
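The random-window step can be sketched in a few lines. This is an illustration of the idea, not the guide's VideoDataset code. The helper name sample_window is hypothetical, and returning short videos unchanged is an assumption.

```python
import random


def sample_window(total_frames, num_frames, rng=random):
    """Pick the [start, end) frame range of a clip with num_frames frames."""
    if total_frames <= num_frames:
        # Short video: use everything (any padding policy is out of scope here).
        return 0, total_frames
    # randint is inclusive on both ends, so the last valid start is allowed.
    start = rng.randint(0, total_frames - num_frames)
    return start, start + num_frames


# A fresh window is drawn on every call, e.g. once per epoch for each video.
start, end = sample_window(total_frames=200, num_frames=93)
print(start, end)
```

Because the start index is redrawn each time, the same source video contributes a slightly different clip in every epoch.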
Training algorithm and loss
Cosmos Predict uses rectified flow. In short: for a noise level σ_t you build x_t = σ_t * noise + (1 - σ_t) * clean, and the model predicts the velocity along this straight path between the clean sample and the noise; sampling follows that velocity back from noise to a clean sample. The first two frames serve as conditioning and are not noised.
The loss is an MSE between the predicted velocity and the target velocity (noise - clean_latent) computed only on the non-conditioned frames.
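The interpolation and the masked loss fit in a short toy implementation. The sketch below works on plain Python lists instead of latent tensors, and predict_velocity is a stand-in for the DiT; both are assumptions made to keep the example self-contained.

```python
def rectified_flow_loss(clean, noise, sigma, predict_velocity, num_cond_frames=2):
    """MSE between predicted and target velocity on the non-conditioned frames.

    clean, noise: lists of frames, each frame a list of floats (toy latents).
    predict_velocity: stand-in for the model, called as predict_velocity(x_t, sigma).
    """
    if len(clean) <= num_cond_frames:
        raise ValueError("need at least one non-conditioned frame")

    x_t = []
    for i, (c_frame, n_frame) in enumerate(zip(clean, noise)):
        if i < num_cond_frames:
            x_t.append(list(c_frame))  # conditioning frames stay clean
        else:
            # x_t = sigma * noise + (1 - sigma) * clean
            x_t.append([sigma * n + (1 - sigma) * c for c, n in zip(c_frame, n_frame)])

    pred = predict_velocity(x_t, sigma)

    sq_err, count = 0.0, 0
    for i in range(num_cond_frames, len(clean)):
        for p, c, n in zip(pred[i], clean[i], noise[i]):
            target = n - c  # target velocity: noise - clean
            sq_err += (p - target) ** 2
            count += 1
    return sq_err / count
```

A predictor that returns exactly noise - clean on the noised frames gets a loss of zero, and whatever it outputs for the first two frames has no effect on the result.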
The optimizer is AdamW over the adapter parameters only. The scheduler warms the learning rate up linearly, then decays it according to the configured schedule.
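The decay policy is not spelled out here, so the sketch below assumes cosine decay after the linear warmup. That is a common choice, not necessarily the one the guide uses. The function name and all step counts are hypothetical.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate with linear warmup followed by cosine decay (assumed policy)."""
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical settings, for illustration only.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at_step(step, base_lr=1e-4, warmup_steps=10, total_steps=100))
```

In a PyTorch training loop, a function of this shape, divided by base_lr, is what you would typically hand to a lambda-based scheduler.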
Commands, hardware, and checkpoints
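Main dependencies: Python 3.10+, PyTorch 2.5+ with CUDA, diffusers, accelerate, transformers, peft. All of them are available as pip packages, so a typical setup installs a CUDA build of torch first and then runs pip install diffusers accelerate transformers peft.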
Hardware requirements: at least one 80 GB GPU for single-device training; 8× H100 for fast iterations. NVIDIA reports that 100 epochs give solid results: ~17 h on one H100, ~2.5 h on eight H100s.
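Those two timings imply a scaling efficiency you can check with simple arithmetic. Since both figures are approximate, treat the output as a ballpark rather than a benchmark.

```python
def scaling_summary(hours_single, hours_multi, num_gpus):
    """Speedup, parallel efficiency and total GPU-hours of the multi-GPU run."""
    speedup = hours_single / hours_multi
    efficiency = speedup / num_gpus
    gpu_hours_multi = hours_multi * num_gpus
    return speedup, efficiency, gpu_hours_multi


# Approximate figures quoted above: ~17 h on 1 GPU, ~2.5 h on 8 GPUs.
speedup, efficiency, gpu_hours = scaling_summary(17.0, 2.5, 8)
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.0%}, "
      f"GPU-hours {gpu_hours:.0f} on 8 GPUs vs 17 on 1")
```

So the eight-GPU run is roughly 6.8 times faster at about 85% efficiency, and costs around 20 GPU-hours instead of 17: you pay a little extra compute for a much shorter wait.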
Adapters are saved as pytorch_lora_weights.safetensors every checkpointing_epochs epochs.
At inference you load the pipeline, apply pipe.load_lora_weights("/path/to/lora/checkpoint") and optionally pipe.fuse_lora(lora_scale=1.0) to fuse weights and avoid runtime overhead.
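Why fusing removes the runtime overhead: a LoRA layer computes the base projection plus a scaled low-rank correction, and both terms are linear in the input, so they can be folded into one matrix ahead of time. The toy check below illustrates that identity in plain Python. It is not the diffusers implementation, and scale stands in for whatever combined scaling factor the library applies.

```python
def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Matrix-matrix product for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def unfused_forward(w0, b, a, scale, x):
    """Adapter kept separate: two extra small matmuls on every call."""
    base = matvec(w0, x)
    update = matvec(b, matvec(a, x))
    return [y + scale * u for y, u in zip(base, update)]


def fuse(w0, b, a, scale):
    """Fold the adapter into the weight once: W = W0 + scale * (B @ A)."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]


w0 = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0], [2.0]]
a = [[0.5, -1.0]]
x = [2.0, 4.0]

fused = fuse(w0, b, a, scale=2.0)
print(unfused_forward(w0, b, a, 2.0, x), matvec(fused, x))
```

Both paths give the same output. The trade-off is that a fused weight is tied to one adapter, so swapping adapters per task means unfusing or reloading the base weights first.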
There is also an option to make noise initialization reproducible across architectures: arch_invariant_rand.
Evaluation: metrics and LLM judge
To measure geometric quality they use the Sampson error:
Temporal Sampson Error: consistency between consecutive frames.
Cross-view Sampson Error: consistency between simultaneous views.
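For reference, the Sampson error of a point correspondence under a fundamental matrix F is a first-order approximation of how far the pair is from satisfying the epipolar constraint. The sketch below implements the textbook formula. How the guide obtains F and the matched points is not described here, so those inputs are assumed to be given.

```python
def sampson_error(F, x1, x2):
    """Sampson error of the correspondence x1 <-> x2 under fundamental matrix F.

    F: 3x3 matrix as a list of rows; x1, x2: (u, v) pixel coordinates.
    """
    p1 = (x1[0], x1[1], 1.0)
    p2 = (x2[0], x2[1], 1.0)
    Fp1 = [sum(F[i][j] * p1[j] for j in range(3)) for i in range(3)]   # F @ p1
    Ftp2 = [sum(F[j][i] * p2[j] for j in range(3)) for i in range(3)]  # F^T @ p2
    residual = sum(p2[i] * Fp1[i] for i in range(3))                   # p2^T F p1
    denom = Fp1[0] ** 2 + Fp1[1] ** 2 + Ftp2[0] ** 2 + Ftp2[1] ** 2
    return residual ** 2 / denom


# Toy case: a camera translating purely along x, so matching points
# must lie on the same image row.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

print(sampson_error(F, (10.0, 20.0), (15.0, 20.0)))  # same row: consistent
print(sampson_error(F, (10.0, 20.0), (15.0, 22.0)))  # 2 px off the epipolar line
```

For the temporal variant, x1 and x2 would come from consecutive frames; for the cross-view variant, from two views at the same time step. Averaging the error over many matches gives a single consistency score, with lower values meaning better geometric agreement.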
They also use Cosmos Reason2 (an LLM) as a judge with two YAML rubrics:
video_physics.yaml: judges physical plausibility without seeing the prompt.
video_IF.yaml: evaluates whether the video follows the instruction (prompt + video).
Each video gets a 1–5 score on both dimensions.
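Aggregating the judge's output is then a matter of averaging per dimension. The snippet below is a sketch of that bookkeeping only. The dictionary keys are hypothetical, and the scores are placeholders made up for the example, not results from the guide.

```python
from statistics import mean

DIMENSIONS = ("physics", "instruction_following")


def summarize_scores(results):
    """Average the 1-5 judge scores per dimension over all videos."""
    for r in results:
        for key in DIMENSIONS:
            if not 1 <= r[key] <= 5:
                raise ValueError(f"{key} score out of range: {r[key]}")
    return {key: mean(r[key] for r in results) for key in DIMENSIONS}


# Placeholder scores, invented purely to exercise the function.
dummy_results = [
    {"physics": 4, "instruction_following": 5},
    {"physics": 5, "instruction_following": 3},
]
print(summarize_scores(dummy_results))
```

Running the same summary for the base model and for each adapter gives directly comparable numbers across configurations.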
Results and practical lessons
They compared the base model (no fine-tuning), LoRA and DoRA with r=8 and r=32. Key findings:
Before tuning, the base model tends to fail with robot hands (it hallucinates human hands), doesn't always use the correct hand, and shows jitter.
Fine-tuning with LoRA and DoRA reduces those issues: better instruction following and less jitter.
Higher rank (r=32 vs r=8) improves instruction following (for example, using the correct hand or grasping the indicated object). However, geometric consistency and physical plausibility don't improve much with higher rank.
A possible interpretation: geometric and physical priors are already well encoded in the frozen backbone; the adapter only needs to shift the distribution toward the robotic domain's appearance and behavior. That's why r=8 can be enough to fix appearance and basic structure.
Practical recommendation:
If your memory is limited or adapter size matters: start with LoRA r=8.
If you see instability with low-rank LoRA, or you have memory to spare, try DoRA r=32. DoRA can stabilize learning via its magnitude-direction decomposition.
For reproducibility, use arch_invariant_rand. For fast deployment, use pipe.fuse_lora().
Conclusion
NVIDIA's guide shows that adapting a large world model to a robotic domain is practical with parameter-efficient training strategies. Generating synthetic trajectories with Cosmos Predict 2.5 + LoRA/DoRA opens the door to faster development cycles for imitation learning and policy simulation.
If you work with robots and have scarce real data, this approach lets you create camera-/robot-specific datasets without sacrificing the model's general knowledge. The takeaway? You don't always need to retrain everything: a well-designed adapter can be the piece that speeds up your project.