NVIDIA publishes open recipe to evaluate Nemotron 3 Nano
Model evaluation can no longer be sleight of hand: how do you know whether a score reflects a real improvement or just a change in configuration? NVIDIA’s answer is to publish Nemotron 3 Nano 30B A3B together with its full evaluation recipe, built with the open NeMo Evaluator library, so anyone can repeat, inspect and audit the results.
What NVIDIA publishes and why it matters
NVIDIA isn’t just sharing numbers; they’re sharing the methodology. The release includes the YAML configuration used for the evaluation, the generated artifacts, and the standard way to run the whole flow with NeMo Evaluator. Why does that matter? Because many published evaluations leave out critical details: prompts, harness versions, runtime parameters, retries, timeouts and logs.
Without that information, model comparisons become unreliable. With the open recipe you get a reproducible baseline: if you change something, you document it; if you reproduce, you verify. That turns a table of scores into an auditable experiment.
NeMo Evaluator: architecture and advantages
NeMo Evaluator acts as a consistent orchestration layer. It’s not a new benchmark runner trying to replace others; rather it unifies multiple harnesses under a common interface. It keeps each harness’s native logic, but standardizes how they are configured, executed and recorded.
The main technical ideas:
Separation between evaluation pipeline and inference backend: you can point to hosted endpoints, local deployments or third-party providers without rewriting the evaluation.
Integration of multiple harnesses: NeMo Skills, LM Evaluation Harness and others, each with its own scoring semantics, but all registered consistently.
Structured and reproducible outputs: results.json per task, execution logs and artifacts organized by task for audit and deep analysis.
This makes two practical things easier: running heterogeneous suites with a single configuration and keeping comparable results even if the inference infrastructure changes.
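Those per-task artifacts also make quick inspection straightforward. As a rough sketch (the exact directory names and JSON fields depend on the harness and your run, so treat the paths as placeholders), you can walk the output and pretty-print each results.json from the shell:
RESULTS_DIR=./results   # placeholder: wherever the launcher wrote its artifacts
find "$RESULTS_DIR" -name results.json | while read -r f; do
  echo "== $f =="
  python -m json.tool "$f" | head -n 20   # quick look at each per-task result
done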
Advantages for teams and projects
Reuse methodology across releases and models.
Avoid ad-hoc scripts that change from one release to another.
Support everything from quick tests to large-scale evaluations with the same launcher, artifact layout and reusable configuration.
Benchmarks and results for Nemotron 3 Nano 30B A3B
The recipe reproduces a diverse suite of benchmarks. Here are the published scores:
Benchmark | Accuracy | Category | Description
BFCL v4 | 53.8 | Function Calling | Berkeley Function Calling Leaderboard v4
LiveCodeBench (v6, 2024-08 to 2025-05) | 68.3 | Coding | Real-world programming problems
MMLU-Pro | 78.3 | Knowledge | Multi-task language evaluation (10-choice)
GPQA | 73.0 | Science | Graduate-level science questions
AIME 2025 | 89.1 | Mathematics | American Invitational Mathematics Exam
SciCode | 33.3 | Scientific Coding | Scientific programming challenges
IFBench | 71.5 | Instruction Following | Instruction-following benchmarks
HLE | 10.6 | Humanity's Last Exam | Expert questions across domains
For model card and technical report details, NVIDIA also publishes the Nemotron 3 Nano 30B A3B Model Card and the Nemotron Technical Report.
How to reproduce the evaluation (step by step)
If you’re a developer or you research models, here’s the minimal flow to reproduce the evaluation in your environment.
Install the tool:
pip install nemo-evaluator-launcher
Prepare credentials (example):
export NGC_API_KEY=your-ngc-api-key
export HF_TOKEN=your-huggingface-token
export JUDGE_API_KEY=your-judge-api-key # only for judge-based benchmarks
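A quick sanity check that those variables are actually set can save a long run from failing midway; this bash sketch uses only the variable names from the export lines above:
for v in NGC_API_KEY HF_TOKEN JUDGE_API_KEY; do
  [ -n "${!v:-}" ] || echo "Warning: $v is not set (JUDGE_API_KEY is only needed for judge-based benchmarks)"
done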
Run the evaluation using the published configuration (example pointing to NVIDIA’s endpoint):
nemo-evaluator-launcher run \
--config /path/to/examples/nemotron/local_nvidia_nemotron_3_nano_30b_a3b.yaml
To use another endpoint (for example local):
nemo-evaluator-launcher run \
--config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
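Before launching the full suite against a local deployment, it can be worth confirming that the OpenAI-compatible endpoint actually responds. A minimal check with curl, where the model name is a placeholder for whatever your server exposes:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'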
To preview without running: --dry-run.
For quick tests, limit samples: -o evaluation.nemo_evaluator_config.config.params.limit_samples=10.
Run specific benchmarks with -t, for example -t ns_mmlu_pro; these options can be combined, as in the example below.
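Putting those options together, a quick smoke test might look like this; it assumes the flags can be combined in a single invocation, which is how launchers of this kind typically behave, and that the published YAML sits in your working directory:
nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o evaluation.nemo_evaluator_config.config.params.limit_samples=10 \
  -t ns_mmlu_pro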
Sources of variation and best practices to reproduce
Don’t expect bit-wise identical outputs. LLM evaluations carry nondeterminism from several sources: decoding settings, repeats, scoring by automated judges, parallel execution and differences in serving infrastructure.
To align your evaluation with the reference:
Use the published YAML without changes, or explicitly document any modification.
Run the benchmark versions and prompt templates indicated.
Verify you’re pointing to the correct model and chat template on the endpoint.
Keep runtime parameters the same: repeats, parallelism, timeouts and retries.
Check that artifacts and logs follow the expected structure.
If those elements match, your reproduction is valid even if there are small numeric fluctuations.
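A simple habit that supports all of the above is snapshotting the exact configuration and environment next to each run’s artifacts; the paths here are illustrative:
RUN_DIR="runs/nemotron-3-nano-$(date +%Y%m%d-%H%M)"
mkdir -p "$RUN_DIR"
cp local_nvidia_nemotron_3_nano_30b_a3b.yaml "$RUN_DIR/config-as-run.yaml"   # the config you actually ran
pip freeze > "$RUN_DIR/pip-freeze.txt"                                       # exact package versions
echo "no modifications" > "$RUN_DIR/CHANGES.txt"                             # record any deliberate deviation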
Impact and next steps for the community
This recipe represents a cultural shift: moving from closed results to evaluations with full traceability. What does this mean for you? More confidence when comparing models, better ability to audit claims and a foundation to build automated evaluations and CI pipelines.
NeMo Evaluator is open source and welcomes collaboration. Want a new benchmark or improvements to the infrastructure? Open an issue or contribute on GitHub. For organizations that need automated evaluations at scale, NVIDIA also offers an enterprise microservice option built on the same principles.
Separating methodology from infrastructure, recording every artifact and publishing the full recipe turns a number into a verifiable claim. That’s the goal: auditable, reproducible evaluations that are useful to the community.