NVIDIA NeMo launches a skill to evaluate LLMs in minutes

Mar 6, 2026 · Keryc Díaz · 3 minutes

NVIDIA introduced nel-assistant, a "skill" for agents that turns the setup and execution of LLM evaluations into a practical conversation. Have you ever been frustrated building 200+ line YAML files just to run a single evaluation? This is for you: describe what you want and the agent generates, validates, and runs the right configuration.

What nel-assistant is and why it matters

nel-assistant is a skill built on the NVIDIA NeMo Evaluator library that lets agents (Cursor, Claude Code, Codex and other agentic IDEs) configure, run, and monitor LLM evaluations without you writing YAML by hand.

In practice this means: instead of wrestling with scattered parameters across docs and model cards, the agent asks a few questions, reads the model card, computes hardware tweaks, and generates a structured, validated YAML ready for production.

How it works (technical, but clear)

The typical flow is conversational. The agent starts with five key questions (execution environment, deployment backend, export target, model type, and benchmark category). With those answers it runs a command like:

nel skills build-config \
  --execution local \
  --deployment vllm \
  --model-type chat \
  --benchmarks standard

Behind that command there’s an important technical process:

Modular templates and deep-merge: the skill merges validated YAML snippets (execution, deployment, benchmarks, export) into a final config. That merge prevents syntax errors and invalid combinations.
Automatic model card extraction: it uses web search + extraction (regex and heuristics) to get temperature, top_p, max_model_len and system/chat templates.
Hardware logic: it computes appropriate tensor-parallel and data-parallel settings based on model size and available GPU memory (for example, TP=8 for 2x H100 if applicable).
Reasoning detection: it looks for cues like "reasoning" or "chain-of-thought" and adjusts interceptors (e.g., enabling enable_thinking or parsing <think> tokens for trace caching).
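The deep-merge step in the list above can be illustrated with a short Python sketch. This is not the skill's actual implementation: the fragment keys (`output_dir`, `export`, `destinations`) are hypothetical, chosen only to show how separately validated snippets could combine into one config without overwriting each other.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with `override` merged recursively into `base`."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys: the later fragment wins.
            merged[key] = deepcopy(value)
    return merged


# Hypothetical fragments; the real template keys are not shown in the article.
execution = {"execution": {"type": "local", "output_dir": "./results"}}
deployment = {"deployment": {"type": "vllm"}}
export = {"execution": {"export": {"destinations": ["wandb"]}}}

config = {}
for fragment in (execution, deployment, export):
    config = deep_merge(config, fragment)

print(config)
```

Because nested mappings are merged rather than replaced, the export fragment can add keys under `execution` without dropping `type` or `output_dir`.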

The result: production-ready configs, rather than free-form YAML in which an agent might hallucinate nonexistent flags or mix incompatible backends.
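To make the reasoning-detection step more concrete, here is a minimal Python sketch of separating `<think>` traces from a model's final answer. The function name `split_reasoning` and the exact tag handling are assumptions for illustration; the article does not show how the skill's interceptors do this.

```python
import re

# Non-greedy match so several <think> blocks in one response are handled separately.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a raw model response."""
    traces = [block.strip() for block in THINK_BLOCK.findall(response)]
    answer = THINK_BLOCK.sub("", response).strip()
    return "\n".join(traces), answer


raw = "<think>2 + 2 is 4.</think>The answer is 4."
trace, answer = split_reasoning(raw)
print(trace)   # 2 + 2 is 4.
print(answer)  # The answer is 4.
```

Scoring only the final answer while keeping the trace separately is one plausible reason to parse the tags before a benchmark compares outputs.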

Interactive example

You: Evaluate NVIDIA Nemotron-3-Nano-30B-A3B on standard benchmarks using vLLM locally. Export to Weights & Biases.

The agent detects your environment (NeMo Evaluator 26.01) and asks five quick questions. Then it reports the detected parameters:

temperature=0.6, top_p=0.95, context=128K
Optimal TP=8 for your 2x H100
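The sampling parameters reported above come from model card extraction. A rough Python sketch of regex-based extraction follows. The patterns and the sample card text are assumptions for illustration, not the skill's real heuristics.

```python
import re

# Hypothetical patterns; real model cards vary widely in wording.
PATTERNS = {
    "temperature": re.compile(r"temperature\s*[=:]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE),
    "top_p": re.compile(r"top[_-]?p\s*[=:]\s*([0-9]*\.?[0-9]+)", re.IGNORECASE),
}


def extract_sampling_params(card_text: str) -> dict:
    """Return whichever sampling parameters can be found in the card text."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(card_text)
        if match:
            found[name] = float(match.group(1))
    return found


# Invented card text for the example.
card = "We recommend temperature=0.6 and top_p=0.95 for reasoning tasks."
params = extract_sampling_params(card)
print(params)  # {'temperature': 0.6, 'top_p': 0.95}
```

Any key missing from the result marks a value the agent would need to confirm with you manually.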

It generates Nemotron-3-Nano-30B-A3B.yaml and gives rollout options: dry run, smoke test and full run.

Recommended flow and useful commands

Validate without running:

nel run --config nemotron-3-nano.yaml --dry-run

Smoke test (10 samples):

nel run --config nemotron-3-nano.yaml \
  -o ++evaluation.nemo_evaluator_config.config.params.limit_samples=10

Full run:

nel run --config nemotron-3-nano.yaml

Monitor from your IDE/agent:

nel status nemotron-3-nano-20260212-143022 && nel info nemotron-3-nano-20260212-143022

Example progress output:

mmlu: 65.2% accuracy (5 hours)
hellaswag: 78.4% accuracy (2 hours)
arc_challenge: 53.8% accuracy (1 hour)

The agent also lets you tweak per-task settings (e.g., temperature=0 for HumanEval, 0.7 for MMLU), configure advanced scaling (for example, HAProxy load balancing for multi-node deployments of models larger than 120B) and add reasoning interceptors.
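The smoke-test command above passes a dotted-path override with `-o`. The `++` prefix matches Hydra's "add or override" syntax, though whether `nel` resolves it the same way is an assumption here. The Python sketch below shows how such an override could be applied to a nested config; the helper names and the starting config are hypothetical.

```python
def parse_value(raw: str):
    """Best-effort conversion of an override value to int or float, else str."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def apply_override(config: dict, override: str) -> dict:
    """Apply one 'a.b.c=value' override in place, creating missing levels."""
    path, _, raw_value = override.lstrip("+").partition("=")
    keys = path.split(".")
    node = config
    for key in keys[:-1]:
        # Walk down the nested dicts, adding empty ones where needed.
        node = node.setdefault(key, {})
    node[keys[-1]] = parse_value(raw_value)
    return config


# Hypothetical starting config; only the dotted path comes from the article.
config = {
    "evaluation": {
        "nemo_evaluator_config": {"config": {"params": {"temperature": 0.6}}}
    }
}
apply_override(
    config, "++evaluation.nemo_evaluator_config.config.params.limit_samples=10"
)
print(config["evaluation"]["nemo_evaluator_config"]["config"]["params"])
```

The override adds `limit_samples` next to the existing `temperature` value. This is why a smoke test can reuse the full config unchanged.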

Practical benefits

Time saved: producing a working config takes minutes instead of hours or days.
Fewer errors: pre-validated templates reduce the chance of invalid syntax or invented flags.
Reproducibility: configs are composed from tested fragments, making experiments easy to repeat.
Pipeline integration: export to CSV, Weights & Biases or MLflow without you hand-crafting URIs or environment variables (the skill asks and inserts them where needed).

Limitations and considerations

Dependence on model cards and web searches: if a model card is incomplete or outdated, the skill may ask for manual confirmations.
Environment variables and permissions: details like SLURM accounts, partition names and API keys remain your responsibility; the skill will ask and help inject them, but it can’t create them for you.
Edge cases in exotic deployments: highly customized integrations or uncommon flags may need human intervention.
Audit requirements: if you need strict control over every YAML line for compliance, review the generated config before running in production.

Impact for teams and developers

For ML infra teams and developers, nel-assistant reduces the friction between researching parameters and running evaluations. For researchers and product managers, it speeds up iterations and model comparisons, freeing time to analyze results instead of debugging configs.

If you work with LLMs and have ever faced an impossible YAML, this skill changes the game: from writing, debugging and searching to conversing, validating and executing.

Original source

https://huggingface.co/blog/nvidia/model-evaluation-skill
