NVIDIA introduces Cosmos 3, an open omni-model for physical AI | Keryc
NVIDIA has launched Cosmos 3, an open omni-model designed to help machines understand and act in the physical world. Can you imagine a single model that generates video, reasons about physics, and produces action sequences for robots or self-driving cars? That is what Cosmos 3 aims to deliver.
What's in Cosmos 3
Cosmos 3 ships with several practical components for developers and researchers: models on Hugging Face with their model cards and licenses, integration with Diffusers for generation pipelines, post-training scripts on GitHub, and synthetic data generation (SDG) datasets focused on physical AI. If you work in robotics, autonomous driving, or smart-space simulation, these pieces are intended to be ready for practical use.
Key capabilities
Cosmos 3 is an omni-model: instead of one model to generate worlds, another to understand scenes, and a third to produce policies, a single model covers all three. What can you do with it?
Generate realistic, physically plausible video worlds from text, image, video, or action inputs.
Reason about physical properties: motion, causality, and spatial relationships.
Predict future sequences of video and actions from the current state.
Produce policies and actions (forward/inverse dynamics) without changing the architecture.
That opens the door to using the same foundation for very different tasks: training a robot that folds clothes, simulating test scenarios for a self-driving car, or generating synthetic data for warehouse safety.
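To make the "forward/inverse dynamics" terminology concrete, here is a toy sketch on a one-dimensional point mass. It is hand-written physics, not how Cosmos 3 works internally (the model learns these mappings from data); the names `State`, `forward_dynamics`, `inverse_dynamics`, and the time step `DT` are assumptions made up for this illustration.

```python
from dataclasses import dataclass

DT = 0.1  # time step in seconds (illustrative value)


@dataclass(frozen=True)
class State:
    position: float  # metres
    velocity: float  # metres per second


def forward_dynamics(state: State, action: float) -> State:
    """Predict the next state from the current state and an acceleration command."""
    velocity = state.velocity + action * DT
    position = state.position + velocity * DT
    return State(position, velocity)


def inverse_dynamics(state: State, next_state: State) -> float:
    """Recover the acceleration that moved the system from state to next_state."""
    return (next_state.velocity - state.velocity) / DT


start = State(position=0.0, velocity=1.0)
predicted = forward_dynamics(start, action=2.0)  # state + action -> next state
recovered = inverse_dynamics(start, predicted)   # two states -> action
print(predicted, recovered)
```

Forward dynamics answers "what happens if I do this?", while inverse dynamics answers "what action explains this transition?". A policy sits on top of both by choosing the action in the first place.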
Architecture: Mixture-of-Transformers (MoT)
The main technical novelty is the Mixture-of-Transformers (MoT) architecture. Cosmos 3 processes text, image, video, audio, and actions inside a single network. First, each modality passes through its own encoder: a ViT for vision, a VAE for visual and audio generation, and specialized vectors for actions. Then everything is projected into a shared representation space.
The input is divided into two subsequences:
Autoregressive (AR): handles reasoning and understanding with next-token prediction.
Diffusive (DM): handles generation via iterative denoising.
AR and DM use separate parameters within each transformer layer but can interact via joint attention. That interaction allows the model to act as a VLM, a video generator, a dynamics model, or a policy without changing the backbone.
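The idea of separate parameters per stream with joint attention can be sketched in a few lines of plain Python. This is a minimal illustration under strong assumptions, not the Cosmos 3 implementation: it uses a single attention head, no masking, no residual connections, and no feed-forward block, and every name (`mot_layer`, `make_stream_params`, `DIM`) is hypothetical.

```python
import math
import random

DIM = 4  # toy hidden size


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_stream_params(rng):
    # One independent set of projection weights per stream ("ar" or "dm").
    return {name: rand_matrix(rng, DIM, DIM) for name in ("q", "k", "v", "out")}


def mot_layer(tokens, streams, params):
    """tokens: list of DIM-vectors; streams: parallel list of 'ar' or 'dm' tags."""
    # Stream-specific projections: each token uses the weights of its own stream.
    q = [matvec(params[s]["q"], t) for t, s in zip(tokens, streams)]
    k = [matvec(params[s]["k"], t) for t, s in zip(tokens, streams)]
    v = [matvec(params[s]["v"], t) for t, s in zip(tokens, streams)]

    outputs = []
    for qi, s in zip(q, streams):
        # Joint attention: every query scores the keys of both streams.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(DIM) for kj in k]
        weights = softmax(scores)
        mixed = [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(DIM)]
        # The output projection is again specific to the token's stream.
        outputs.append(matvec(params[s]["out"], mixed))
    return outputs


rng = random.Random(0)
params = {"ar": make_stream_params(rng), "dm": make_stream_params(rng)}
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
streams = ["ar", "ar", "ar", "dm", "dm"]
outputs = mot_layer(tokens, streams, params)
print(len(outputs), len(outputs[0]))  # prints: 5 4
```

The only place the two streams meet is the attention step, so a change to a DM token can influence an AR output while the weights of each stream stay separate.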
Useful technical details
Modality encoders preserve domain-specific inductive biases before projecting to the shared space.
The AR/DM separation preserves stability for reasoning tasks while leaving flexibility for stochastic generation.
The design facilitates post-training in concrete domains (for example, a specific robot or a warehouse environment) because the overall structure doesn't change.
Model versions and deployment
This release includes two sizes optimized for different uses:
Cosmos 3 Nano: an 8B architecture (8B reasoner + 8B generator) designed for efficient inference. Intended to run on workstation-class GPUs like the RTX PRO 6000. Available on Hugging Face as nvidia/Cosmos3-Nano.
Cosmos 3 Super: a 32B architecture (32B reasoner + 32B generator) oriented to large-scale synthetic data generation and research. Requires datacenter-class GPUs (Hopper and Blackwell). Available on Hugging Face as nvidia/Cosmos3-Super.
Practical recommendation: use Nano for prototypes and local deployments, and Super for large-scale synthetic data generation (SDG) or experiments that need to scale.
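That recommendation can be written down as a small selection helper. The two repository IDs come from the list above; the function `choose_checkpoint`, its parameters, and the GPU class labels are hypothetical names introduced only for this sketch.

```python
NANO = "nvidia/Cosmos3-Nano"    # workstation-class GPUs, efficient inference
SUPER = "nvidia/Cosmos3-Super"  # datacenter-class GPUs, large-scale SDG and research


def choose_checkpoint(gpu_class: str, large_scale_sdg: bool = False) -> str:
    """Pick a checkpoint from the GPU class and the intended workload.

    gpu_class is either "workstation" or "datacenter" (labels assumed here).
    """
    if gpu_class not in ("workstation", "datacenter"):
        raise ValueError(f"unknown GPU class: {gpu_class!r}")
    # Super is only selected when the hardware can host it and the workload needs it.
    if gpu_class == "datacenter" and large_scale_sdg:
        return SUPER
    return NANO


print(choose_checkpoint("workstation"))
print(choose_checkpoint("datacenter", large_scale_sdg=True))
```

Falling back to Nano on workstation hardware reflects the stated requirement that Super needs datacenter-class GPUs.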
Integration with Diffusers and practical example
Cosmos 3 integrates with the Diffusers library via Cosmos3OmniPipeline, which makes it easy to incorporate pipelines into existing projects. Here's an example of Text-to-Image with Cosmos 3 Nano:
import torch
from diffusers import Cosmos3OmniPipeline

# Load the Nano checkpoint in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = (
    "A medium shot of a modern robotics research laboratory with white walls and a gray floor. "
    "A robotic arm with a metallic finish is mounted on a clean white workbench, its gripper positioned "
    "above a row of small colored objects. A laptop and neatly arranged tools sit beside the robot. "
    "A large monitor on the wall behind displays a software interface. The scene is brightly lit by "
    "overhead fluorescent lights."
)

# num_frames=1 requests a single frame, which is what makes this text-to-image.
result = pipe(prompt=prompt, num_frames=1, height=720, width=1280)

# Save the generated frame as a JPEG.
result.video[0].save("cosmos3_t2i.jpg", format="JPEG", quality=85)
This example shows how simple it can be to generate a frame or a short clip and then plug it into a simulation or synthetic labeling pipeline.
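One way to hand a generated frame to a later labeling step is to log it in a JSON Lines manifest. The sketch below uses only the Python standard library; the manifest layout, the field names, and the function `append_manifest_record` are assumptions for illustration and are not part of Cosmos 3 or Diffusers.

```python
import json
import tempfile
from pathlib import Path


def append_manifest_record(manifest_path, image_path, prompt, height, width):
    """Append one generated-image record to a JSON Lines manifest."""
    record = {
        "image": str(image_path),
        "prompt": prompt,
        "height": height,
        "width": width,
        "source_model": "nvidia/Cosmos3-Nano",
        "labels": [],  # left empty here; a later labeling step would fill it
    }
    with Path(manifest_path).open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


# Demo in a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest.jsonl"
    append_manifest_record(
        manifest, "cosmos3_t2i.jpg", "A modern robotics research laboratory", 720, 1280
    )
    lines = manifest.read_text(encoding="utf-8").splitlines()

print(lines[0])
```

Keeping the prompt and resolution next to each file path makes it easier to filter or regenerate samples later.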
Datasets for physical AI (SDG)
NVIDIA publishes several synthetic datasets aimed at physical interaction and embodied problems, useful for post-training or evaluation:
Embodied-Robot-Scenes: robotic simulation data.
Physical-Interaction-Scenes: physical simulation data (Isaac Sim).
Spatial-Reasoning: spatial reasoning tasks.
Digital-Human-Scenes: synthetic human motion.
Autonomous-Driving-Scenarios: driving scenarios.
Warehouse-Operations-Scenes: warehouse operations and safety.
These SDG datasets help reduce reliance on expensive real-world data and speed up model validation, whether at the edge or in data centers.
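When assembling a post-training or evaluation set, it can help to index the datasets by task. The sketch below uses the dataset names from the list above; the task tags and the helper `datasets_for` are illustrative assumptions, not official metadata.

```python
# Dataset names come from the article; the tags are assumptions for this example.
SDG_DATASETS = {
    "Embodied-Robot-Scenes": {"robotics"},
    "Physical-Interaction-Scenes": {"robotics", "physics"},
    "Spatial-Reasoning": {"reasoning"},
    "Digital-Human-Scenes": {"humans"},
    "Autonomous-Driving-Scenarios": {"driving"},
    "Warehouse-Operations-Scenes": {"warehouse", "safety"},
}


def datasets_for(tag: str) -> list[str]:
    """Return the dataset names carrying the given tag, in listing order."""
    return [name for name, tags in SDG_DATASETS.items() if tag in tags]


print(datasets_for("robotics"))
```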
Post-training, Cosmos Framework and agent skills
Although Cosmos 3 comes ready for many tasks, NVIDIA recommends post-training to adapt the model to specific robots, sensors, and environments. The repo includes post-training scripts and NIM microservices for production.
The Cosmos Framework offers a full stack: inference scripts, post-training utilities, and skills (small agents) that automate tests, setups, and prompt examples. It's a good way to get started quickly and avoid common pitfalls when integrating a world foundation model (WFM).
Why does this matter now?
Because Cosmos 3 isn't just another generative model: it's an attempt to unify physical reasoning, multimodal generation, and control in a single backbone. For you as a developer, that can mean less friction between modules; for researchers, an easier playground to experiment with dynamics and physical reasoning; and for companies, a more direct path to create simulations and high-quality synthetic data.
If you're working in robotics or simulation, treat this as a base: prototype with Nano, scale with Super, and adapt with post-training. It's not magic; it's engineering to help models understand motion, cause, and effect.
References and resources
On the technical blog and repo you'll find full documentation, a prompt guide, Text-to-Video and Image-to-Video examples, and post-training and deployment instructions with NIM.