NVIDIA introduces Nemotron 3 Nano Omni, an omni-modal model designed to understand long documents, complex images, audio and video together with deep reasoning. Sound like a mix of OCR, speech recognition and video understanding all in one? Exactly: that’s the goal, and the results already lead several benchmarks.
What is Nemotron 3 Nano Omni
Nemotron 3 Nano Omni is the evolution of the Nemotron line: it moves from a strong vision-text system to a model that integrates text, image, video and audio across very long contexts. It’s built for real, complex tasks: analyzing contracts and papers over 100 pages, transcribing and understanding long audio, joint reasoning in narrated video, and agents that interact with graphical interfaces.
It’s not just perception. It’s perception plus reasoning: structured extraction, reading tables and charts, multi-step reasoning and the ability to abstain when evidence is insufficient.
Key architecture and how it handles long context
At its core it combines the Nemotron 3 hybrid Mamba-Transformer Mixture-of-Experts (MoE) backbone with specialized encoders: C-RADIOv4-H for vision and Parakeet-TDT-0.6B-v2 for audio. The connection between encoders and the LLM uses lightweight 2-layer MLP projectors that merge features into a shared space.
Main components:
- 23-layer Mamba selective state-space to process long context efficiently.
- 23-layer MoE with 128 experts and
top-6routing plus one shared expert for conditional capacity. - 6 layers of grouped-query attention to keep strong global interactions.
- Interleaving of text, vision and audio tokens for truly multimodal reasoning.
Vision and video:
- Dynamic-resolution processing with 16x16 patches. Each image can use between 1,024 and 13,312 visual patches (roughly ~512x512 up to ~1840x1840), which helps preserve fine detail and global structure.
- For video they use a
Conv3Dpath that fuses frame pairs into "tubelets", halving the visual tokens the LLM must attend to. - EVS (video token pruning) during inference removes static tokens and keeps dynamic ones, lowering latency without losing accuracy.
Audio:
- Parakeet-TDT processes audio at 16 kHz, trained with inputs up to 1,200 seconds (20 minutes) and the backbone supports contexts of 5+ hours.
- Audio is projected into the shared space and modeled alongside image and text, enabling joint temporal reasoning (for example, spotting what’s said exactly when a particular frame appears).
Result? A modular design that enables true multimodal fusion inside the backbone and scales to very long contexts.
Training and technical recipe
The training recipe is staged: multimodal alignment, context extension, preference optimization and multimodal reinforcement learning (omni RL). The SFT stages trained on NVIDIA H100 clusters scaling from 32 to 128 nodes, using Megatron-LM, Transformer Engine and Megatron Energon with tensor, expert, sequence and context parallelisms.
Post-SFT uses NeMo-RL and NeMo Gym with distributed Ray infrastructure, plus multimodal deduplication measures to prevent repeated rollouts from multiplying image, video and audio memory. NVIDIA open-sources key parts of the training code.
Data and synthetic:
- About ~11.4M synthetic QA pairs (~45B tokens) were generated from real PDFs with NeMo Data Designer to reinforce reasoning on long documents. That yielded a 2.19x improvement on MMLongBench-Doc.
- Multi-stage pipelines were used to create synthetic data plus a verification suite that includes unanswerable cases to teach the model to abstain instead of hallucinating.
Benchmarks and efficiency
Nemotron 3 Nano Omni delivers notable gains in accuracy and efficiency versus open-weight alternatives. Key numbers:
| Task | Benchmark | Nemotron 3 Nano Omni |
|---|---|---|
| Document understanding | OCRBenchV2-En | 65.8 |
| MMLongBench-Doc | MMLongBench-Doc | 57.5 |
| CharXiv reasoning | CharXiv | 63.6 |
| GUI (ScreenSpot-Pro) | GUI | 57.8 |
| Video understanding | Video-MME | 72.2 |
| Video + Audio | WorldSense | 55.4 |
| Voice interaction | VoiceBench | 89.4 |
| ASR (lower better) | HF Open ASR | 5.95 |
Efficiency:
- Up to 9x higher throughput and 2.9x faster single-stream reasoning in multimodal use-cases compared to alternatives.
- 7.4x more system efficiency on multi-document workloads and 9.2x on video for interactive per-user loads.
In short: better metrics and better cost-performance for tasks that combine documents, audio and video.
Practical use cases
- Long, messy documents: contracts, reports, manuals and 100+ page PDFs with tables, figures, formulas and cross-references.
- Transcription and analysis of long audio: meetings, interviews and lectures with multiple speakers and background noise.
- Joint video + audio: screen recordings with narration, tutorials, demos and archived videos where voice changes visual meaning.
- GUI agents: interpret screenshots, monitor interface state and perform actions (includes examples using
pyautoguiand functions likecomputer.waitorcomputer.terminate).
Real example shown: extracting financial metrics from a 100+ page document in a single pass, reading tables and reasoning across pages.
Limitations and practical considerations
- Hardware requirements: although the model is efficient versus alternatives, training and serving a 30B multimodal LLM demands resources (H100, distributed infra) or using optimized checkpoints (
BF16,FP8,NVFP4). - Quality of synthetic data: very helpful, but you must audit and validate results in sensitive domains.
- Hallucination risks: training includes verification and unanswerable cases to reduce inventions, but you should always verify critical outputs.
- Privacy and compliance: when you integrate company documents and audio, apply privacy controls and data governance.
How to try it and technical resources
Checkpoints and official resources:
- BF16 checkpoint: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
- FP8 checkpoint: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8
- NVFP4 checkpoint: https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
- Technical report PDF: https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Omni-report.pdf
- Image training dataset v3: https://huggingface.co/datasets/nvidia/Nemotron-Image-Training-v3
- Megatron-Bridge examples: https://github.com/NVIDIA-NeMo/Megatron-Bridge/tree/main/examples/models/vlm/nemotron_3_omni
- Nemo-RL docs: https://github.com/NVIDIA-NeMo/RL/blob/nano-v3-omni/docs/guides/nemotron-3-nano-omni.md
- NeMo Data Designer recipes: https://github.com/NVIDIA-NeMo/DataDesigner/tree/main/docs/assets/recipes/vlm_long_doc
Related papers and models:
- Nemotron Nano V2 VL (report): https://arxiv.org/abs/2511.03929
- Nemotron 3 general (report): https://arxiv.org/abs/2512.20856
- C-RADIOv4-H: https://huggingface.co/nvidia/C-RADIOv4-H
- Parakeet-TDT: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3
- Megatron-LM: https://github.com/NVIDIA/Megatron-LM
- Transformer Engine: https://github.com/NVIDIA/TransformerEngine
- Megatron Energon: https://github.com/NVIDIA/Megatron-Energon
Final reflection
Nemotron 3 Nano Omni shows that native integration of audio, video and image with an LLM backbone can deliver both better accuracy and better efficiency in real scenarios. What does that mean for you? If you work with long documents, multimedia files or interactive streams, there are now open models that reduce the friction between perception and reasoning. Still, responsible adoption requires validation, governance and human review for critical tasks.
Original source
https://huggingface.co/blog/nvidia/nemotron-3-nano-omni-multimodal-intelligence
