NVIDIA AI-Q leads DeepResearch Bench I and II | Keryc
NVIDIA AI-Q reached first place in DeepResearch Bench I (55.95) and DeepResearch Bench II (54.50). Why does this matter? Because it shows that an open, configurable, and reproducible stack can compete in complex automated-research tasks: retrieving evidence, synthesizing analysis, and producing high-quality cited reports.
What NVIDIA AI-Q achieved
AI-Q is not just a model: it's an open blueprint to build research agents that work over enterprise and web data and deliver answers with verifiable citations. With a single configurable stack, NVIDIA achieved top performance on two complementary benchmarks that measure both narrative quality and fine-grained factual correctness.
DeepResearch Bench I rewards final-report quality: comprehensiveness, depth of insight, instruction-following, and readability. DeepResearch Bench II uses more than 70 binary rubrics per task to evaluate information retrieval, analysis, and presentation. Leading both means AI-Q doesn't just write well: it also finds and analyzes the right evidence.
Core architecture: multi-agent and modular
The AI-Q deep researcher architecture is made of three main roles: Orchestrator, Planner, and Researcher. Each can use a different LLM and operate in its own context window, which prevents long, noisy responses from degrading planning.
Orchestrator: coordinates the research loop, calls the Planner, dispatches tasks to the Researcher, fills gaps and produces the long report.
Planner: in two phases (Scout and Architect) it maps the information landscape and designs a research plan with queries and quality constraints.
Researcher: launches specialist subagents in parallel (Evidence Gatherer, Mechanism Explorer, Comparator, Critic and Horizon Scanner) and synthesizes findings into a cited brief.
Optionally, an ensemble layer runs multiple pipelines in parallel and a post-hoc refiner polishes the final report.
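The role separation above can be pictured as plain Python objects, each with its own context. All class and method names here are illustrative stand-ins, not the actual AI-Q API:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of the Orchestrator/Planner/Researcher split.
# None of these names come from the AI-Q codebase.

SUBAGENTS = ["evidence_gatherer", "mechanism_explorer",
             "comparator", "critic", "horizon_scanner"]

class Planner:
    def plan(self, question):
        # Phase 1 (Scout): map the information landscape.
        # Phase 2 (Architect): emit queries plus quality constraints.
        return {"queries": [f"{question} overview", f"{question} evidence"],
                "constraints": {"min_citations": 3}}

class Researcher:
    def run_subagent(self, role, query):
        return f"[{role}] finding for '{query}'"

    def research(self, plan):
        # Specialist subagents run in parallel, then merge into a brief.
        with ThreadPoolExecutor() as pool:
            jobs = [pool.submit(self.run_subagent, role, q)
                    for q in plan["queries"] for role in SUBAGENTS]
            return [j.result() for j in jobs]

class Orchestrator:
    def __init__(self):
        self.planner, self.researcher = Planner(), Researcher()

    def answer(self, question):
        plan = self.planner.plan(question)          # Planner's own context
        findings = self.researcher.research(plan)   # Researcher's own context
        return "\n".join(findings)                  # gap-filling + long report

report = Orchestrator().answer("solid-state batteries")
```

Because each role is a separate object with its own inputs and outputs, a long, noisy research transcript never enters the Planner's context, which is the isolation property the article describes.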
Open and reproducible stack
The competition implementation relies on available, configurable components:
NeMo Agent Toolkit for wiring workflows, function logging and evaluation. It enables composition via YAML.
LangChain DeepAgents for the planner–researcher–orchestrator flow with middleware for subagents.
NVIDIA Nemotron 3 models fine-tuned for synthesis and tool calling.
Search tools: Tavily for the web and Serper for academic papers.
That flexibility means you can swap LLMs, tools, and agent graphs depending on your use case.
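That swap-ability can be sketched as a registry keyed by role. The component identifiers below are placeholders, not NeMo Agent Toolkit's actual YAML schema:

```python
# Hypothetical component registry illustrating the swappable stack.
# Real AI-Q wiring happens via NeMo Agent Toolkit YAML; this only
# shows the idea of replacing one component without touching the rest.

DEFAULT_STACK = {
    "planner_llm": "nemotron-3",      # placeholder model ids
    "researcher_llm": "nemotron-3",
    "web_search": "tavily",
    "paper_search": "serper",
}

def build_stack(**overrides):
    """Return a stack config with any component swapped out."""
    unknown = set(overrides) - set(DEFAULT_STACK)
    if unknown:
        raise ValueError(f"unknown components: {unknown}")
    return {**DEFAULT_STACK, **overrides}

# Swap in an internal search engine in place of the web search tool.
stack = build_stack(web_search="internal-search")
```

The point of the dict-merge pattern is that every unlisted component keeps its default, so a single override is a one-line change.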
Data, trajectory generation and fine-tuning
NVIDIA built the training base with these stages:
Question collection: ~17k from OpenScholar, 21k from ResearchQA and 2,457 from Fathom-DeepResearch-SFT.
Trajectory generation: ~80k full-workflow trajectories using GPT-OSS-120B as generator. These trajectories include real search results via Tavily and Serper.
Principle-based filtering: completed trajectories were judged with nvidia/Qwen3-Nemotron-32B-GenRM-Principle and ~67k high-quality ones were retained.
Supervised fine-tuning (SFT): Nemotron-3-Super-120B-A12B, 1 epoch, 5,615 steps, ~25 hours on 16×8 NVIDIA H100 GPUs.
That trajectory dataset teaches the model to plan, perform multi-step searches and synthesize with real citations.
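The filtering stage above can be sketched as a score-and-threshold pass. The judge here is a random stub standing in for the real reward model (nvidia/Qwen3-Nemotron-32B-GenRM-Principle), and the threshold is an assumed value chosen to mirror the reported ~80k → ~67k reduction:

```python
import random

# Sketch of principle-based trajectory filtering. judge_score is a
# stub; AI-Q used a 32B generative reward model as the actual judge.

def judge_score(trajectory):
    """Stand-in for a principle-based reward model (returns 0..1)."""
    return random.random()

def filter_trajectories(trajectories, threshold=0.2):
    # Keep only trajectories the judge rates above the quality bar.
    return [t for t in trajectories if judge_score(t) >= threshold]

random.seed(0)
raw = [f"trajectory_{i}" for i in range(1000)]   # stands in for ~80k
kept = filter_trajectories(raw)
# With a uniform stub judge and threshold 0.2, roughly 80% survive,
# similar in spirit to the ~80k -> ~67k cut described above.
```

In the real pipeline the judge evaluates full workflow trajectories (plans, searches, syntheses) against quality principles, not a random score, but the keep/drop structure is the same.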
Middleware for reliability on long horizons
Long runs (32+ tool calls) expose failures that don't show up in short interactions. NVIDIA added specific middleware:
Tool-name sanitization: cleaning, alias resolution and fuzzy matching when the LLM invents names.
Reasoning-aware retry: detects 'thoughts' without a final answer and preserves context for retries.
Budget enforcement: per-agent limits that force synthesis when tool calls run out.
Report validation: minimum checks on length and structure; if they fail, a continuation prompt triggers another attempt.
Each component addresses real failure patterns observed in agent traces.
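Tool-name sanitization is the most self-contained of these to illustrate. A minimal sketch, assuming a small tool registry and alias table (both invented here for illustration), with `difflib` providing the fuzzy match:

```python
import difflib

# Sketch of tool-name sanitization middleware: clean the raw string,
# resolve known aliases, and fuzzy-match when the LLM emits a
# near-miss name. Tool names and aliases are illustrative.

TOOLS = {"web_search", "paper_search", "fetch_page"}
ALIASES = {"search_web": "web_search", "google": "web_search"}

def sanitize_tool_name(raw):
    name = raw.strip().lower().replace("-", "_")    # cleaning
    if name in TOOLS:
        return name
    if name in ALIASES:                             # alias resolution
        return ALIASES[name]
    close = difflib.get_close_matches(name, TOOLS, n=1, cutoff=0.7)
    if close:                                       # fuzzy matching
        return close[0]
    raise ValueError(f"unknown tool: {raw!r}")

# " Web-Search " -> cleaning; "google" -> alias; "web_serch" -> fuzzy.
resolved = [sanitize_tool_name(n)
            for n in (" Web-Search ", "google", "web_serch")]
```

A middleware like this sits between the LLM's tool call and the executor, so a hallucinated name degrades into a recoverable correction instead of a failed run.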
Ensemble and refiner: boosting coverage and polish
The ensemble runs N independent pipelines; an LLM then merges outputs, choosing structure and adding unique content to broaden evidence coverage. The refiner pass rewrites to quantify vague claims, improve entity coverage, cut scaffolding and strengthen causal reasoning.
Practical result: higher information recall and better coherence in the final report without sacrificing readability.
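The ensemble step can be sketched as N independent runs merged with deduplication. In AI-Q an LLM performs the merge, choosing structure and integrating unique findings; the set-based merge below is a deliberately simplified stand-in:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the ensemble layer: run N independent pipelines in
# parallel, then merge drafts, keeping each unique claim once.
# pipeline() is a stub standing in for a full research run.

def pipeline(seed):
    shared = ["finding A", "finding B"]       # claims every run recovers
    unique = [f"finding from run {seed}"]     # run-specific coverage
    return shared + unique

def ensemble(n):
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(pipeline, range(n)))
    merged, seen = [], set()
    for draft in drafts:
        for claim in draft:
            if claim not in seen:             # deduplicate across drafts
                seen.add(claim)
                merged.append(claim)
    return merged

report = ensemble(3)
```

The broadened coverage comes from the run-specific claims: overlapping findings collapse to one copy, while each run's unique evidence survives into the merged report.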
Why this approach matters for companies and developers
Transparency and control: the stack is open and configurable, so companies can inspect, audit and adapt every component.
Modularity: you can plug your own LLM into the Planner or Researcher, or connect internal search engines instead of Tavily/Serper.
Reliability for real tasks: the middleware and multi-agent strategy are designed for long, complex runs typical in deep research.
If you work in applied AI or product, this isn't just a future promise: it's a reproducible pattern that already performs well on demanding benchmarks.
NVIDIA AI-Q shows that the path to robust research agents goes through combining a multi-agent architecture, fine-tuning on real trajectories, practical middleware and optional ensemble and refiner steps. What's the takeaway for you? It's not always a single bigger LLM: orchestration, trajectory quality and robustness engineering make the difference when tasks are long and demanding.