NVIDIA introduces Cosmos 3, an open omni-model for physical AI

May 31, 2026 · Keryc Díaz · 4 minutes

NVIDIA launches Cosmos 3, an open omni-model designed so machines can understand and act in the physical world. Can you imagine a single model capable of generating video, reasoning about physics, and producing action sequences for robots or self-driving cars? That's exactly what Cosmos 3 aims to be.

What's in Cosmos 3

Cosmos 3 arrives with several practical components for developers and researchers: models on Hugging Face with their model cards and licenses, integration with Diffusers for generation pipelines, post-training scripts on GitHub, and synthetic datasets (SDG) focused on physical AI. If you work in robotics, autonomous driving, or smart-space simulation, these are pieces you can start building on right away.

Key capabilities

Cosmos 3 is an omni-model: instead of having one model to generate worlds, another to understand scenes, and a third for policies, everything is unified here. What can you do with a single model?

Generate realistic, physically plausible video worlds from text, image, video, or action inputs.
Reason about physical properties: motion, causality, and spatial relationships.
Predict future sequences of video and actions from the current state.
Produce policies and actions (forward/inverse dynamics) without changing the architecture.

That opens the door to using the same foundation for very different tasks: training a robot that folds clothes, simulating test scenarios for a self-driving car, or generating synthetic data for warehouse safety.
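To make the forward/inverse dynamics terminology from the capability list concrete, here is a toy sketch. It is not Cosmos 3 code: the 1D cart, the function names, and the time step are all assumptions chosen for illustration. A forward model predicts the next state from a state and an action; an inverse model recovers the action that explains an observed transition.

```python
from typing import Tuple

State = Tuple[float, float]  # (position, velocity) of a toy 1D cart
DT = 0.1  # illustrative time step in seconds (assumption)


def forward_dynamics(state: State, action: float) -> State:
    """Predict the next state from the current state and an acceleration."""
    pos, vel = state
    new_vel = vel + action * DT
    new_pos = pos + new_vel * DT
    return (new_pos, new_vel)


def inverse_dynamics(state: State, next_state: State) -> float:
    """Recover the acceleration that explains the observed transition."""
    _, vel = state
    _, next_vel = next_state
    return (next_vel - vel) / DT


start = (0.0, 1.0)
nxt = forward_dynamics(start, action=2.0)
recovered = inverse_dynamics(start, nxt)  # approximately 2.0
```

In a learned world model both directions are predicted by the network from video and action tokens rather than computed from closed-form physics, but the input/output roles are the same.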

Architecture: Mixture-of-Transformers (MoT)

The big technical novelty is the Mixture-of-Transformers (MoT) architecture. Cosmos 3 processes text, image, video, audio, and actions inside a single network. First, each modality goes through its dedicated encoder: a ViT for vision, a VAE for visual/audio generation, and specialized vectors for actions. Then everything is projected into a shared representation space.

The input is divided into two subsequences:

Autoregressive (AR): handles reasoning and understanding with next-token prediction.
Diffusive (DM): handles generation via iterative denoising.

AR and DM use separate parameters within each transformer layer but can interact via joint attention. That interaction allows the model to act as a VLM, a video generator, a dynamics model, or a policy without changing the backbone.
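The idea of "separate parameters, joint attention" can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Cosmos 3 implementation: the hidden size, the random weights, the `PARAMS` layout, and the `joint_attention` helper are hypothetical, and the sketch leaves out masking, multiple heads, normalization, and feed-forward blocks, none of which the article describes.

```python
import math
import random

random.seed(0)
DIM = 4  # toy hidden size (assumption)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Each branch owns its own query/key/value projections.
PARAMS = {
    branch: {name: rand_matrix(DIM, DIM) for name in ("q", "k", "v")}
    for branch in ("AR", "DM")
}


def joint_attention(tokens, branches):
    """tokens: list of DIM-vectors; branches: parallel list of "AR"/"DM" tags."""
    # Every token is projected with the weights of its own branch...
    q = [matvec(PARAMS[b]["q"], t) for t, b in zip(tokens, branches)]
    k = [matvec(PARAMS[b]["k"], t) for t, b in zip(tokens, branches)]
    v = [matvec(PARAMS[b]["v"], t) for t, b in zip(tokens, branches)]
    out = []
    for qi in q:
        # ...but scores are computed over the whole sequence, so AR and DM
        # tokens can attend to each other.
        scores = [sum(a * c for a, c in zip(qi, kj)) / math.sqrt(DIM) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(DIM)])
    return out


tokens = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
branches = ["AR", "AR", "AR", "DM", "DM"]
mixed = joint_attention(tokens, branches)
```

Because the attention step is shared, changing a DM token changes the output at AR positions and vice versa, which is the interaction the paragraph above refers to.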

Useful technical detail

Modality-specific encoders preserve domain-specific inductive biases before projecting to the shared space.
The AR/DM separation preserves stability for reasoning tasks and flexibility for stochastic generation.
The design facilitates post-training in concrete domains (for example, a specific robot or a warehouse environment) because the overall structure doesn't change.

Model versions and deployment

This release includes two sizes optimized for different uses:

Cosmos 3 Nano: an 8B architecture (8B reasoner + 8B generator) designed for efficient inference. Intended to run on workstation-class GPUs like the RTX PRO 6000. Available on Hugging Face as nvidia/Cosmos3-Nano.
Cosmos 3 Super: a 32B architecture (32B reasoner + 32B generator) oriented to large-scale synthetic data generation and research. Requires datacenter-class GPUs (Hopper and Blackwell). Available on Hugging Face as nvidia/Cosmos3-Super.

Practical recommendation: use Nano for prototypes and local deployments, and Super for large-scale synthetic data generation or experiments that need more model capacity.
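A rough back-of-the-envelope check helps explain the hardware split. The sketch below assumes the parenthetical sizes mean two components of that size each, and that weights are stored in bfloat16 as in the Diffusers example in this article. It counts weights only; encoders, activations, and caches are ignored, so real requirements will be higher. The `weight_memory_gb` helper is hypothetical.

```python
BYTES_PER_PARAM_BF16 = 2  # bfloat16 stores each weight in 2 bytes


def weight_memory_gb(reasoner_b: float, generator_b: float,
                     bytes_per_param: int = BYTES_PER_PARAM_BF16) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes) for both components."""
    total_params = (reasoner_b + generator_b) * 1e9
    return total_params * bytes_per_param / 1e9


nano_gb = weight_memory_gb(8, 8)      # 32.0
super_gb = weight_memory_gb(32, 32)   # 128.0
```

Under these assumptions Nano's weights fit on a single large workstation GPU, while Super's weights alone exceed what most single cards offer, which is consistent with the datacenter-class recommendation.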

Integration with Diffusers and practical example

Cosmos 3 integrates with the Diffusers library via Cosmos3OmniPipeline, which makes it easy to incorporate pipelines into existing projects. Here's an example of Text-to-Image with Cosmos 3 Nano:

import torch
from diffusers import Cosmos3OmniPipeline

# Load the Nano checkpoint in bfloat16 and place it on the GPU.
pipe = Cosmos3OmniPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano", torch_dtype=torch.bfloat16, device_map="cuda"
)

# A detailed scene description: subject, layout, materials, and lighting.
prompt = (
    "A medium shot of a modern robotics research laboratory with white walls and a gray floor. "
    "A robotic arm with a metallic finish is mounted on a clean white workbench, its gripper positioned "
    "above a row of small colored objects. A laptop and neatly arranged tools sit beside the robot. "
    "A large monitor on the wall behind displays a software interface. The scene is brightly lit by "
    "overhead fluorescent lights."
)

# num_frames=1 requests a single 1280x720 frame, i.e. a still image.
result = pipe(prompt=prompt, num_frames=1, height=720, width=1280)
# Save the generated frame as a JPEG.
result.video[0].save("cosmos3_t2i.jpg", format="JPEG", quality=85)

This example shows how simple it can be to generate a frame or a short clip and then plug it into a simulation or synthetic labeling pipeline.
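One way to plug generated frames into a synthetic labeling pipeline is to log each output alongside the prompt and settings that produced it. The sketch below is an assumption about what such bookkeeping could look like; it is not part of Cosmos 3 or Diffusers, and the `GenerationRecord` fields and helper names are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class GenerationRecord:
    output_path: str
    prompt: str
    model_id: str
    num_frames: int
    height: int
    width: int


def to_jsonl_line(record: GenerationRecord) -> str:
    """Serialize one record as a single JSON line (no trailing newline)."""
    return json.dumps(asdict(record), sort_keys=True)


def from_jsonl_line(line: str) -> GenerationRecord:
    """Rebuild a record from one line of the manifest."""
    return GenerationRecord(**json.loads(line))


record = GenerationRecord(
    output_path="cosmos3_t2i.jpg",
    prompt="A medium shot of a modern robotics research laboratory...",
    model_id="nvidia/Cosmos3-Nano",
    num_frames=1,
    height=720,
    width=1280,
)
line = to_jsonl_line(record)
```

Appending `line + "\n"` to a manifest file after each generation gives downstream labeling or filtering jobs a simple, reproducible index of what was generated and how.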

Datasets for physical AI (SDG)

NVIDIA publishes several synthetic datasets aimed at physical interaction and embodied problems, useful for post-training or evaluation:

Embodied-Robot-Scenes: robotic simulation data.
Physical-Interaction-Scenes: physical simulation data (Isaac Sim).
Spatial-Reasoning: spatial reasoning tasks.
Digital-Human-Scenes: synthetic human motion.
Autonomous-Driving-Scenarios: driving scenarios.
Warehouse-Operations-Scenes: warehouse operations and safety.

These synthetic datasets help reduce reliance on expensive real-world data and speed up model validation, whether on edge devices or in data centers.

Post-training, Cosmos Framework and agent skills

Although Cosmos 3 comes ready for many tasks, NVIDIA recommends post-training to adapt the model to specific robots, sensors, and environments. The repo includes post-training scripts and NIM microservices for production.

The Cosmos Framework offers a full stack: inference scripts, post-training utilities, and skills (small agents) that automate tests, setup, and prompt examples. It's a good way to get started quickly and avoid common pitfalls when integrating a world foundation model (WFM).

Why does this matter now?

Because Cosmos 3 isn't just another generative model: it's an attempt to unify physical reasoning, multimodal generation, and control in a single backbone. For you as a developer, that can mean less friction between modules; for researchers, a simpler playground for experimenting with dynamics and physical reasoning; and for companies, a more direct path to creating simulations and high-quality synthetic data.

If you're working in robotics or simulation, treat this as a base: prototype with Nano, scale with Super, and adapt with post-training. It's not magic; it's engineering to help models understand motion, cause, and effect.

References and resources

On the technical blog and repo you'll find full documentation, a prompt guide, Text-to-Video and Image-to-Video examples, and post-training and deployment instructions with NIM.

Original source

https://huggingface.co/blog/nvidia/cosmos-3-for-physical-ai
