NVIDIA publishes open recipe to evaluate Nemotron 3 Nano
Model evaluation can no longer be sleight of hand: how do you know whether a score reflects a real improvement or just a change in configuration? NVIDIA’s answer is to publish Nemotron 3 Nano 30B A3B together with its full evaluation recipe, built with the open NeMo Evaluator library, so anyone can repeat, inspect and audit the results.
What NVIDIA publishes and why it matters
NVIDIA isn’t just sharing numbers; they’re sharing the methodology. The release includes the YAML configuration used for the evaluation, the generated artifacts, and the standard way to run the whole flow with NeMo Evaluator. Why does that matter? Because many published evaluations leave out critical details: prompts, harness versions, runtime parameters, retries, timeouts and logs.
Without that information, model comparisons become unreliable. With the open recipe you get a reproducible baseline: if you change something, you document it; if you reproduce, you verify. That turns a table of scores into an auditable experiment.
NeMo Evaluator: architecture and advantages
NeMo Evaluator acts as a consistent orchestration layer. It’s not a new benchmark runner trying to replace others; rather it unifies multiple harnesses under a common interface. It keeps each harness’s native logic, but standardizes how they are configured, executed and recorded.
The main technical ideas:
Separation between evaluation pipeline and inference backend: you can point to hosted endpoints, local deployments or third-party providers without rewriting the evaluation.
Integration of multiple harnesses: NeMo Skills, LM Evaluation Harness and others, each with its own scoring semantics, but all registered consistently.
Structured and reproducible outputs: results.json per task, execution logs and artifacts organized by task for audit and deep analysis.
This makes two practical things easier: running heterogeneous suites with a single configuration and keeping comparable results even if the inference infrastructure changes.
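Those per-task artifacts also make quick inspection straightforward. As a rough sketch (the exact directory names and JSON fields depend on the harness and your run, so treat the paths as placeholders), you can walk the output and pretty-print each results.json from the shell:
RESULTS_DIR=./results   # placeholder: wherever the launcher wrote its artifacts
find "$RESULTS_DIR" -name results.json | while read -r f; do
  echo "== $f =="
  python -m json.tool "$f" | head -n 20   # quick look at each per-task result
done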
Advantages for teams and projects
Reuse methodology across releases and models.
Avoid ad-hoc scripts that change from one release to another.
Support everything from quick tests to large-scale evaluations with the same launcher, artifact layout and reusable configuration.
Benchmarks and results for Nemotron 3 Nano 30B A3B
The recipe reproduces a diverse suite of benchmarks. Here are the published scores:
Benchmark | Accuracy | Category | Description
BFCL v4 | 53.8 | Function Calling | Berkeley Function Calling Leaderboard v4
LiveCodeBench (v6, 2024-08 to 2025-05) | 68.3 | Coding | Real-world programming problems
MMLU-Pro | 78.3 | Knowledge | Multi-task language evaluation (10-choice)
GPQA | 73.0 | Science | Graduate-level science questions
AIME 2025 | 89.1 | Mathematics | American Invitational Mathematics Exam
SciCode | 33.3 | Scientific Coding | Scientific programming challenges
IFBench | 71.5 | Instruction Following | Instruction-following benchmarks
HLE | 10.6 | Humanity's Last Exam | Expert questions across domains
For model card and technical report details, NVIDIA also publishes the Nemotron 3 Nano 30B A3B Model Card and the Nemotron Technical Report.
How to reproduce the evaluation (step by step)
If you’re a developer or you research models, here’s the minimal flow to reproduce the evaluation in your environment.
Install the tool:
pip install nemo-evaluator-launcher
Prepare credentials (example):
export NGC_API_KEY=your-ngc-api-key
export HF_TOKEN=your-huggingface-token
export JUDGE_API_KEY=your-judge-api-key # only for judge-based benchmarks
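A quick sanity check that those variables are actually set can save a long run from failing midway; this bash sketch uses only the variable names from the export lines above:
for v in NGC_API_KEY HF_TOKEN JUDGE_API_KEY; do
  [ -n "${!v:-}" ] || echo "Warning: $v is not set (JUDGE_API_KEY is only needed for judge-based benchmarks)"
done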
Run the evaluation using the published configuration (example pointing to NVIDIA’s endpoint):
nemo-evaluator-launcher run \
--config /path/to/examples/nemotron/local_nvidia_nemotron_3_nano_30b_a3b.yaml
To use another endpoint (for example local):
nemo-evaluator-launcher run \
--config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions
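Before launching the full suite against a local deployment, it can be worth confirming that the OpenAI-compatible endpoint actually responds. A minimal check with curl, where the model name is a placeholder for whatever your server exposes:
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 8}'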
To preview without running: --dry-run.
For quick tests, limit samples: -o evaluation.nemo_evaluator_config.config.params.limit_samples=10.
Run specific benchmarks with -t, for example -t ns_mmlu_pro; these options can be combined, as in the example below.
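Putting those options together, a quick smoke test might look like this; it assumes the flags can be combined in a single invocation, which is how launchers of this kind typically behave, and that the published YAML sits in your working directory:
nemo-evaluator-launcher run \
  --config local_nvidia_nemotron_3_nano_30b_a3b.yaml \
  -o evaluation.nemo_evaluator_config.config.params.limit_samples=10 \
  -t ns_mmlu_pro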
Sources of variation and best practices to reproduce
Don’t expect bit-wise identical outputs. LLM evaluations carry nondeterminism from several sources: decoding settings, repeats, scoring by automated judges, parallel execution and differences in serving infrastructure.
To align your evaluation with the reference:
Use the published YAML without changes, or explicitly document any modification.
Run the benchmark versions and prompt templates indicated.
Verify you’re pointing to the correct model and chat template on the endpoint.
Keep runtime parameters the same: repeats, parallelism, timeouts and retries.
Check that artifacts and logs follow the expected structure.
If those elements match, your reproduction is valid even if there are small numeric fluctuations.
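A simple habit that supports all of the above is snapshotting the exact configuration and environment next to each run’s artifacts; the paths here are illustrative:
RUN_DIR="runs/nemotron-3-nano-$(date +%Y%m%d-%H%M)"
mkdir -p "$RUN_DIR"
cp local_nvidia_nemotron_3_nano_30b_a3b.yaml "$RUN_DIR/config-as-run.yaml"   # the config you actually ran
pip freeze > "$RUN_DIR/pip-freeze.txt"                                       # exact package versions
echo "no modifications" > "$RUN_DIR/CHANGES.txt"                             # record any deliberate deviation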
Impact and next steps for the community
This recipe represents a cultural shift: moving from closed results to evaluations with full traceability. What does this mean for you? More confidence when comparing models, better ability to audit claims and a foundation to build automated evaluations and CI pipelines.
NeMo Evaluator is open source and welcomes collaboration. Want a new benchmark or improvements to the infrastructure? Open an issue or contribute on GitHub. For organizations that need automated evaluations at scale, NVIDIA also offers an enterprise microservice option built on the same principles.
Separating methodology from infrastructure, recording every artifact and publishing the full recipe turns a number into a verifiable claim. That’s the goal: auditable, reproducible evaluations that are useful to the community.