Nemotron 3 Nano: open, efficient model for agents
NVIDIA presents Nemotron 3 Nano, a model built for the next generation of agents and multi-agent systems: fast, with ultra-long context and fully open. Why does this matter to you? When millions of tokens flow between agents, speed, memory, and reliability stop being luxuries and become requirements.
What is Nemotron 3 Nano
Nemotron 3 Nano is a 31.6B-parameter model designed to behave like a much larger one thanks to a hybrid architecture and sparse layers. It combines Mamba-2 for long context and low latency with high-precision Transformer layers, and replaces traditional FFNs with a Mixture-of-Experts (MoE) that activates only a fraction of parameters per token.
31.6B total parameters, approximately 3.6B active per token thanks to MoE routing.
Hybrid Mamba-Transformer architecture with interleaved layers and GQA attention.
Learned MLP router that activates 6 of 128 experts per forward step (see the routing sketch after this list).
Context window up to 1M tokens for long-range workflows.
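To make the sparse activation concrete, here is a minimal sketch of top-k expert routing in the spirit of the list above: a learned router scores 128 experts and only the top 6 run for each token. The layer sizes and class names are illustrative assumptions, not the model's actual dimensions or implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Illustrative sparse MoE layer: route each token to k of n_experts experts."""

    def __init__(self, d_model=2048, d_ff=8192, n_experts=128, k=6):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # learned router over experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = self.router(x)                      # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)   # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)         # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # naive loop; production kernels batch this
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out
```

Only 6 expert MLPs run per token, so per-token compute tracks the roughly 3.6B active parameters rather than the 31.6B total.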
So what's the result? High reasoning capacity and low latency — ideal for agents that must keep persistent memory and run specialized subtasks at scale.
Architecture and performance
Nemotron 3 Nano interleaves Mamba-2 layers, which keep latency low over long windows, with Transformer attention layers for fine-grained reasoning. The key is the sparse MoE, which lets the model run fewer parameters per token without losing quality.
In benchmarks, Nano reaches up to 3.3x higher throughput than Qwen3-30B in an 8K/16K setup on a single H200 GPU, and 2.2x over GPT-OSS-20B at the same scale. NVIDIA reports up to 4x improvements versus Nemotron Nano 2. It also keeps explicit controls for reasoning: Reasoning ON/OFF modes and a configurable thinking budget to cap “thought” tokens and make costs predictable.
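As a rough illustration of what the thinking budget means in practice, here is a hedged sketch against an OpenAI-compatible endpoint (for example one served by vLLM). The system-prompt toggle and the max_thinking_tokens field are placeholders assumed for the example; the real control strings and parameter names are defined in the model card, not here.

```python
from openai import OpenAI

# Assumes an OpenAI-compatible server (e.g. vLLM) exposing the model locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="nvidia/nemotron-3-nano",  # placeholder model id
    messages=[
        # Hypothetical toggle: the actual ON/OFF switch comes from the model card.
        {"role": "system", "content": "Reasoning: on"},
        {"role": "user", "content": "Plan a three-step refactor for a 500-file repo."},
    ],
    max_tokens=1024,
    # Hypothetical knob for the thinking budget; the name and placement may differ.
    extra_body={"max_thinking_tokens": 256},
)
print(resp.choices[0].message.content)
```

The point of the budget is predictability: you cap how many tokens the model may spend "thinking" before it must answer, so the cost of a call stays bounded.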
Training data and pipeline
Training was massive and multi-stage:
Pretraining on ~25 trillion tokens (including 2.5T of new Common Crawl).
Open release of ~3T additional tokens for the Nemotron-Pretraining series.
Post-training: 13 million cross-disciplinary samples to refine reasoning.
SFT + two RL stages (RLVR and RLHF) to specialize for agents, tool use and chat.
To extend the context, they added a continued pre-training (CPT) stage with sequences of up to 512k tokens, and mixed 512k and 4k sequences during training to preserve short-benchmark performance while scaling long-context ability.
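As a toy illustration of that mixing, the sketch below samples each CPT batch at either 512k or 4k tokens according to a fixed ratio. The ratio is an assumption for the example, not the published recipe.

```python
import random

LONG_LEN, SHORT_LEN = 512_000, 4_000
P_LONG = 0.3  # assumed fraction of long-context batches; not the actual published mix

def sample_batch_length(rng=random):
    """Pick the target sequence length for the next CPT batch."""
    return LONG_LEN if rng.random() < P_LONG else SHORT_LEN

# Sketch a schedule of 10 batches: long batches stretch the context window,
# short ones keep the model sharp on ordinary-length benchmarks.
print([sample_batch_length() for _ in range(10)])
```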
Synthetic data was used to support long-range retrieval, multi-hop reasoning and aggregation across documents. NVIDIA emphasizes quality over sheer volume: stronger filters, rewriting and recovery of about half a trillion tokens of code and math that were previously discarded.
Reinforcement training and NeMo Gym
Nemotron 3 Nano followed a combined path: SFT, then RLVR (using GRPO, synchronous Group Relative Policy Optimization), then RLHF. They also trained a generative reward model (GenRM), based on Qwen3-235B, that compares and scores candidate responses to guide RLHF.
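To see what "group relative" means in GRPO, here is a minimal sketch of the advantage computation: several responses are sampled for the same prompt, scored by a verifiable reward or the GenRM, and each response's advantage is its reward normalized against its own group. This is a simplified illustration, not NVIDIA's training code.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward against its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled responses to one prompt, scored by a verifier in [0, 1].
group_rewards = [1.0, 0.0, 1.0, 0.5]
print(grpo_advantages(group_rewards))
# Responses above the group mean get positive advantages and are reinforced;
# those below get negative ones, all without training a separate value network.
```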
To make this reproducible and scalable, NVIDIA releases NeMo Gym, an open-source library that:
Provides ready environments for math, code, tool use and agents.
Integrates with NeMo RL and allows interoperability with other frameworks.
Orchestrates high-rate rollouts, eases building environments with verifiable reward logic, and supports distributed deployments.
NeMo Gym was created to separate the RL environment from the training loop, making reuse, auditability and rollout scaling easier in complex training runs.
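To make "verifiable reward logic" concrete, here is a small environment-style sketch for a math task, where the reward comes from checking the model's final answer rather than from a judge model. The class and method names are generic illustrations, not the actual NeMo Gym API.

```python
import re

class MathVerifyEnv:
    """Toy environment with a verifiable reward: correct final answer gives 1.0."""

    def __init__(self, problem: str, expected_answer: str):
        self.problem = problem
        self.expected = expected_answer.strip()

    def prompt(self) -> str:
        return f"Solve and end with 'Answer: <value>'.\n\n{self.problem}"

    def reward(self, completion: str) -> float:
        # Verifiable check: parse the final answer and compare it exactly.
        match = re.search(r"Answer:\s*(.+)", completion)
        return 1.0 if match and match.group(1).strip() == self.expected else 0.0

# Scoring two rollouts.
env = MathVerifyEnv("What is 12 * 7?", "84")
print(env.reward("12 * 7 = 84. Answer: 84"))      # 1.0
print(env.reward("I think it's 85. Answer: 85"))  # 0.0
```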
Safety and tools for responsible deployment
NVIDIA publishes nearly 11k traces labeled for agent safety, useful for evaluating risks in tool-based workflows. They also release a large portion of the datasets, training recipes and frameworks so external teams can test, extend or mitigate failures before production.
The license is nvidia-open-model-license and the explicit intent is to open weights, recipes and data so the community can reproduce and improve results.
Use cases and practical deployment
What can you do with this today? A few examples:
Agents that program and debug code with context from huge projects.
Scientific assistants that aggregate evidence across long documents and perform multi-hop reasoning.
Parallel agent systems in companies that need fast responses and persistent memory.
Deployment options already supported: vLLM, TRT-LLM, SGLang, endpoints on OpenRouter and build.nvidia.com, and local or edge execution via llama.cpp, LM Studio and Unsloth.
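As a quick sketch of local serving with vLLM's Python API: the model identifier below is a placeholder, so check the actual checkpoint name on Hugging Face or build.nvidia.com before running.

```python
from vllm import LLM, SamplingParams

# Placeholder model id: substitute the real checkpoint name from the release.
llm = LLM(model="nvidia/Nemotron-3-Nano", trust_remote_code=True)

params = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(
    ["Summarize the tradeoffs of sparse MoE routing in three bullet points."],
    params,
)
print(outputs[0].outputs[0].text)
```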
Balance of tradeoffs and why it matters
NVIDIA bets on a middle ground: keep latency low and costs reasonable without sacrificing reasoning quality. The MoE and the Mamba-Transformer mix aim to make a mid-sized model act like a large one when it matters, without multiplying costs when many agents run in parallel.
Does this mean Nemotron 3 Nano is the perfect solution? No. Every MoE adds routing and monitoring complexity, and large-scale RL environments remain challenging. But by opening weights, data and tools, NVIDIA makes it easier for the community to test, improve and understand those tradeoffs.
If you work on agents, long-running dialogue systems, or products that need persistent memory and reliable reasoning, this release gives you a practical base to experiment with today.