Nemotron 3 Nano 4B: Compact AI optimized for the edge | Keryc
Nemotron 3 Nano 4B is NVIDIA's new bet to bring powerful models to the edge: a hybrid Mamba-Transformer model with 4 billion parameters designed to run on Jetson devices, GeForce GPUs and clusters like DGX Spark, with a small VRAM footprint and good instruction following and tool-use behavior.
What is Nemotron 3 Nano 4B?
It's a hybrid model that combines Mamba (SSM) components with transformer-style layers to achieve efficient reasoning. With 4B parameters, it's specifically optimized for local and edge deployments: Jetson Thor, Jetson Orin Nano, RTX and DGX Spark.
Why does this matter to you? Because it lets you run conversational agents and "agentic" behaviors close to your data, with lower latency, better privacy guarantees and reduced inference costs.
Performance and key benchmarks
NVIDIA reports top-of-class results for several relevant metrics:
- Instruction following: state of the art in its class (IFBench, IFEval).
- Agentic gaming intelligence (Orak): also leading at its size, evaluated on tactical games such as Super Mario, Darkest Dungeon and Stardew Valley.
- VRAM efficiency: smallest footprint in its class under both low and high ISL/OSL configurations.
- Latency: best TTFT (time to first token) in its class under high ISL.
Efficiency tests were measured on an RTX 4070 using Llama.cpp with quantized Q4_K_M builds.
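TTFT is also easy to measure on your own hardware. A minimal sketch, assuming any engine's streaming API stands in for the dummy generator used here (the `fake_stream` helper is purely illustrative):

```python
import time

def measure_ttft(token_stream):
    """Return (ttft_seconds, total_tokens) for a streaming generator.

    TTFT is the delay between issuing the request and receiving the
    first token; throughput follows from the remaining tokens.
    """
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in token_stream:
        if ttft is None:
            ttft = time.perf_counter() - start
        count += 1
    return ttft, count

# Dummy stand-in for a real engine's streaming output.
def fake_stream(n_tokens=8, delay=0.001):
    for _ in range(n_tokens):
        time.sleep(delay)
        yield "tok"

ttft, n = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms over {n} tokens")
```

Swap the dummy generator for your engine's streaming call and you get comparable TTFT numbers for your own prompts.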
How it was compressed and why it's different
Nemotron 3 Nano 4B wasn't trained from scratch: it was created by pruning and distillation from Nemotron Nano 9B v2 using Nemotron Elastic technology. Instead of separate stages, Nemotron Elastic trains a router that performs an architecture search jointly with distillation, optimizing what to prune and by how much to meet a parameter budget.
The router considered four pruning axes:
- Mamba heads (number of SSM heads)
- Hidden dimension (embedding dimension)
- FFN channels (MLP intermediate channels)
- Depth (full layers)
Based on convergence for the 4B target, the decisions were (summary):
| Axis | Parent 9B v2 | Nemotron 3 Nano 4B |
| --- | --- | --- |
| Depth | 56 layers (27 Mamba, 4 Attention, 25 MLP) | 42 layers (21 Mamba, 4 Attention, 17 MLP) |
| Mamba heads | 128 | 96 |
| FFN intermediate dim | 15680 | 12544 |
| Embedding dim | 4480 | 3136 |
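The per-axis shrinkage works out to 70–80% of the parent on each dimension. Quick arithmetic on the published numbers (the overall ~0.42x figure is a rough back-of-the-envelope product for MLP-style blocks, not NVIDIA's accounting):

```python
# Published pruning decisions: (parent 9B v2, Nemotron 3 Nano 4B)
axes = {
    "depth":       (56, 42),
    "mamba_heads": (128, 96),
    "ffn_dim":     (15680, 12544),
    "embedding":   (4480, 3136),
}

ratios = {name: child / parent for name, (parent, child) in axes.items()}
for name, r in ratios.items():
    print(f"{name}: kept {r:.0%}")

# Very rough scaling estimate: MLP parameters grow with
# depth * embedding * ffn_dim (ignores attention/Mamba details).
rough = ratios["depth"] * ratios["embedding"] * ratios["ffn_dim"]
print(f"rough MLP-parameter scaling: {rough:.2f}x")
```

That ~0.42x rough estimate lines up with the 9B-to-4B parameter budget the router was asked to hit.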
After defining the pruned architecture, the student model was retrained with distillation from the original 9B.
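The distillation objective itself is the standard one: train the student to match the teacher's output distribution. A minimal sketch of temperature-scaled KL distillation on toy logits (illustrative only, not NVIDIA's training code; the temperature value is an assumption):

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [2.0, 1.0, 0.1]  # toy per-token logits from the 9B parent
student = [1.5, 1.2, 0.3]  # toy logits from the pruned 4B student
print(f"KD loss: {kd_loss(teacher, student):.4f}")
```

Minimizing this loss over the training corpus pulls the pruned 4B student's token distribution back toward the 9B parent's.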
Recovery and post-training stages
Precision recovery was done in two main stages:
1. Short-context distillation: 8K window, 63B tokens, roughly a 70% post-training / 30% pretraining mix from the parent. This stage recovers initial accuracy.
2. Long-context extension: 49K window, 150B tokens, to restore performance on long reasoning chains.
After that came two SFT phases with Megatron-LM (on reasoning and non-reasoning data), followed by a three-stage RL pipeline with NeMo-RL to refine instruction following and tool use, moving from single-turn to multi-turn with NeMo-Gym environments and a preliminary Nemotron-RL-Agentic-Conversational-Tool-Use-Pivot-v1. Training kept a 50-50 balance between reasoning and non-reasoning data and progressively increased the KL penalty.
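The "progressively increased KL penalty" refers to the standard KL-regularized RL objective: the policy is rewarded for the task but penalized for drifting from a reference model. A toy sketch of the per-token shaped reward (the schedule values and log-probabilities are made up for illustration; the real NeMo-RL pipeline is far more involved):

```python
import math

def kl_regularized_reward(task_reward, logprob_policy, logprob_ref, kl_coef):
    """Shaped reward: task_reward - beta * KL, with KL approximated
    per-token as log pi(a) - log pi_ref(a)."""
    kl = logprob_policy - logprob_ref
    return task_reward - kl_coef * kl

# A progressive KL schedule across RL stages: the penalty grows.
kl_schedule = [0.01, 0.05, 0.1]

for stage, beta in enumerate(kl_schedule, start=1):
    r = kl_regularized_reward(
        task_reward=1.0,
        logprob_policy=math.log(0.6),  # toy numbers
        logprob_ref=math.log(0.5),
        kl_coef=beta,
    )
    print(f"stage {stage}: beta={beta}, shaped reward={r:.4f}")
```

Raising the coefficient over successive stages lets early RL explore freely while later stages anchor the policy closer to the reference, which helps keep instruction following stable.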
Quantization and on-device deployment
For the edge, quantization is key. Nemotron 3 Nano 4B is released in FP8 and Q4_K_M GGUF for Llama.cpp:
FP8: PTQ with ModelOpt using 1k samples for calibration. Selective quantization was used: keeping some self-attention layers and the 4 prior Mamba layers in BF16 gave the best balance. Weights, activations and KV-cache in FP8; Conv1D in Mamba in BF16. Result: 100% median accuracy recovery versus BF16 and up to 1.8x improvement in latency and throughput on DGX Spark and Jetson Thor.
Q4_K_M (GGUF): the 4-bit version used in Llama.cpp also reached 100% median accuracy recovery and is suitable for Jetson. On Jetson Orin Nano 8GB, the Q4_K_M checkpoint with Llama.cpp delivers 18 tokens/s, up to 2x throughput compared to Nemotron Nano 9B v2.
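The reported 18 tokens/s (roughly 2x the 9B parent) translates directly into latency budgets for interactive use. Quick arithmetic on the published figure (the 9B number is implied by the ~2x claim, not separately reported):

```python
def seconds_for(n_tokens, tokens_per_s):
    """Wall-clock time to generate n_tokens at a steady throughput."""
    return n_tokens / tokens_per_s

nano_4b_tps = 18.0             # reported Q4_K_M throughput on Orin Nano 8GB
nano_9b_tps = nano_4b_tps / 2  # implied by the ~2x claim

for label, tps in [("Nano 4B", nano_4b_tps), ("Nano 9B v2", nano_9b_tps)]:
    print(f"{label}: 256-token reply in {seconds_for(256, tps):.1f} s")
```

For a chat-style agent on an 8GB Jetson, that difference is roughly a 14-second reply versus a 28-second one.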
The model supports multiple inference engines: Transformers, vLLM, TRT-LLM and Llama.cpp, so you can pick the stack that fits your use case.
Where it fits and practical use cases
Want a conversational agent that replies fast without sending data to the cloud? Or a robot or a local game NPC that reasons and calls tools? Nemotron 3 Nano 4B is made for that: local agents, embedded assistants, inference on robot fleets, and gaming scenarios with tactical logic.
Because it's open source, you can fine-tune it for a specific domain, experiment with more quantizations, or integrate it with SDKs like NVIGI to accelerate inference alongside graphics workloads.
Quick recommendations if you're going to try it
- For Jetson: follow the Jetson AI Lab guides and try the Q4_K_M version in Llama.cpp first to evaluate throughput.
- If you need maximum server-side accuracy, use FP8 on compatible hardware and compare with BF16 on your workload.
- If you're fine-tuning, remember it started as a distillation from 9B: the architecture already retains structured reasoning, so SFT/RL tuning can be more efficient.
Nemotron 3 Nano 4B shows how combining guided structured pruning and distillation can deliver practical models for the edge without giving up reasoning and tool-use capabilities. Ready to try an LLM that fits on embedded devices and performs like a much larger one?