NVIDIA NeMo Retriever introduces a generalizable agentic pipeline | Keryc
NVIDIA NeMo Retriever announces an agentic pipeline that prioritizes generalizability over dataset-specific tricks. The payoff: the same design hit #1 on ViDoRe v3 and #2 on the demanding BRIGHT benchmark, showing that a single agentic architecture can handle both visual search and deep reasoning without changing the core of the system.
What it is and why it matters
The core idea is simple but powerful: combine the best of two worlds. Large language models reason and plan well, but they can’t scan millions of documents at once. Retrievers sweep large corpora fast but lack iterative reasoning. The solution? An active loop between the LLM and the retriever: the agent thinks, generates better queries, retrieves, evaluates and repeats until it converges.
Is this just about improving semantic similarity? Not at all. When documents are visually complex or questions require multi-step logic, you need iterative search, persistent reformulation and query decomposition. That’s exactly what the NeMo pipeline implements: an agent that acts, re-evaluates and synthesizes results.
How it works (agentic architecture)
The pipeline follows a ReACT-style architecture: rather than a single query, the agent iterates. It uses internal tools like think to plan, retrieve(query, top_k) to explore the corpus and final_results to return the most relevant documents.
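The think/retrieve/finalize loop can be sketched in a few lines. Everything here is illustrative, not NeMo's actual API: ToyRetriever, run_agent and the MAX_STEPS budget are hypothetical stand-ins, and the reformulate callback plays the LLM's role.

```python
# Minimal sketch of a ReACT-style retrieval loop; all names are
# illustrative stand-ins, not NeMo's real interfaces.

MAX_STEPS = 5  # hypothetical step budget before the fallback kicks in


class ToyRetriever:
    """Stand-in for a dense retriever over a small corpus."""

    def __init__(self, corpus):
        self.corpus = corpus

    def retrieve(self, query, top_k=3):
        # Rank documents by naive term overlap with the query.
        terms = set(query.lower().split())
        scored = sorted(
            self.corpus,
            key=lambda doc: len(terms & set(doc.lower().split())),
            reverse=True,
        )
        return scored[:top_k]


def run_agent(question, retriever, reformulate):
    """Think -> retrieve -> evaluate loop, repeated up to MAX_STEPS.

    `reformulate` plays the LLM's role: given the question and what has
    been retrieved so far, it proposes the next query (or None to stop).
    """
    seen, query = [], question
    for _ in range(MAX_STEPS):
        results = retriever.retrieve(query)      # the retrieve(query, top_k) tool
        seen.extend(d for d in results if d not in seen)
        query = reformulate(question, seen)      # the "think" step
        if query is None:                        # agent decides it has converged
            break
    return seen                                  # final_results


corpus = [
    "NeMo Retriever agentic pipeline overview",
    "Reciprocal Rank Fusion combines ranked lists",
    "Dense retrieval with ColBERT-style embeddings",
]
retriever = ToyRetriever(corpus)
# A trivial policy: keep querying until three documents have been seen.
docs = run_agent(
    "agentic pipeline",
    retriever,
    reformulate=lambda q, seen: "retrieval pipeline" if len(seen) < 3 else None,
)
print(len(docs))
```

The real pipeline replaces the toy overlap scorer with a dense embedding model and the lambda with an actual LLM, but the control flow is the same shape.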
Emergent patterns they observed:
The agent dynamically generates better queries as it discovers new information.
It persistently rephrases until it finds useful signals.
It breaks multi-step queries into clear, manageable subqueries.
As a safety measure, when step or context limits are reached, the pipeline falls back to Reciprocal Rank Fusion (RRF) to combine the ranking lists from all agent calls.
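RRF itself is simple: each document gets a score of 1/(k + rank) summed across every ranked list it appears in, with k = 60 as the commonly used smoothing constant. A minimal sketch (the function name and the example lists are mine, not NeMo's):

```python
# Sketch of Reciprocal Rank Fusion over the ranked lists produced by
# several retrieval calls; k=60 is the standard smoothing constant.

from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Three agent calls that partially agree: "b" appears high in all of them.
fused = rrf([["a", "b", "c"], ["b", "d"], ["c", "b", "a"]])
print(fused[0])  # "b" wins by appearing consistently across lists
```

The appeal for a fallback is that RRF needs only ranks, not scores, so lists produced by different queries at different agent steps fuse cleanly.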
Practical optimization: MCP vs in-process retriever singleton
At first they used an MCP server to give the LLM access to external tools. That sounds fine in theory, but in practice it caused friction: every experiment required spinning up an MCP server, loading the corpus onto the GPU, orchestrating processes and paying the latency of network round-trips. That slowed experimentation and increased the chance of silent failures.
The clever fix was to replace the server with a thread-safe retriever singleton that lives in the same process. This singleton loads the model and embeddings once, protects access with a reentrant lock and exposes the same retrieve() interface to multiple concurrent tasks. Benefits:
Eliminates network serialization and reduces latency.
Improves GPU utilization and experiment turnaround time.
Reduces deployment errors and operational complexity.
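The singleton pattern described above can be sketched as follows. This is an assumption-laden illustration, not NeMo's code: the class name, the toy in-memory index (standing in for GPU embeddings) and the overlap scorer are all hypothetical; only the shape (load once, guard with a reentrant lock, expose retrieve()) comes from the article.

```python
# Sketch of a thread-safe, in-process retriever singleton.
# The index and scorer are toy stand-ins for the model + GPU embeddings.

import threading


class RetrieverSingleton:
    _instance = None
    _init_lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so the heavy load happens exactly once.
        if cls._instance is None:
            with cls._init_lock:
                if cls._instance is None:
                    inst = super().__new__(cls)
                    inst._lock = threading.RLock()  # reentrant, as described
                    inst._index = {}                # stand-in for embeddings
                    cls._instance = inst
        return cls._instance

    def add(self, doc_id, text):
        with self._lock:
            self._index[doc_id] = text

    def retrieve(self, query, top_k=5):
        # Same retrieve() interface the MCP server exposed, minus the network.
        with self._lock:
            terms = set(query.lower().split())
            scored = sorted(
                self._index,
                key=lambda d: len(terms & set(self._index[d].lower().split())),
                reverse=True,
            )
            return scored[:top_k]


r1, r2 = RetrieverSingleton(), RetrieverSingleton()
assert r1 is r2  # every concurrent task shares the one loaded instance
```

Because every caller gets the same object, concurrent agent tasks share one loaded model, and the reentrant lock lets a task that already holds the lock call back into the retriever without deadlocking.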
Key results on ViDoRe v3 and BRIGHT
NeMo Agentic Retrieval (Opus 4.5 + nemotron-colembed-vl-8b-v2) reached NDCG@10 = 69.22 and placed #1 on ViDoRe v3. On BRIGHT, which emphasizes reasoning more, the same architecture placed #2 with NDCG@10 = 50.90.
BRIGHT: INF-X-Retriever leads with 63.40; NeMo agentic sits second with 50.90.
Operational data (measured examples):
On ViDoRe, Opus 4.5 averaged ~136.3 seconds per query and about 9.2 retrieval calls per query.
On BRIGHT, Opus 4.5 averaged ~148.2 seconds per query and ~11.8 calls.
Yes, the agent is much slower than a dense retriever, but it delivers a quality jump on complex tasks.
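To make the headline numbers concrete, NDCG@10 rewards putting relevant documents near the top of the first ten results, discounting each position logarithmically. A minimal sketch with binary relevance labels (the function is my own illustration of the standard metric, not the benchmark's evaluation code):

```python
# Sketch of NDCG@10, the metric behind the leaderboard numbers,
# using binary relevance labels for simplicity.

import math


def ndcg_at_k(ranked_relevances, k=10):
    """ranked_relevances: relevance of each retrieved doc, in rank order."""
    def dcg(rels):
        # Position i contributes rel / log2(i + 2): rank 1 divides by 1.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    if not any(ranked_relevances):
        return 0.0
    ideal = sorted(ranked_relevances, reverse=True)
    return dcg(ranked_relevances) / dcg(ideal)


# Perfect ordering scores 1.0; pushing the relevant doc down lowers it.
print(ndcg_at_k([1, 0, 0]))  # 1.0
print(round(ndcg_at_k([0, 1, 0]), 3))
```

So a jump like 66 to 69 NDCG@10 means relevant documents land measurably closer to the top across the benchmark's queries.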
Ablations and technical lessons
Several practical takeaways from their experiments:
Model choice: swapping Opus 4.5 for the open gpt-oss-120b on ViDoRe causes a moderate drop (69.22 -> 66.38 NDCG@10) and reduces the number of retrieval calls. On BRIGHT the gap is larger: deep-reasoning tasks still benefit significantly from frontier models like Opus.
Embeddings: using specialized embeddings (nemotron-colembed-vl-8b-v2 for ViDoRe and llama-embed-nemotron-reasoning-3b for BRIGHT) raises the performance ceiling. A strong retriever gives the agent more room to shine.
Cross-domain robustness: highly tuned domain-specific solutions (for example INF-Query-Aligner) don’t always beat a dense baseline in other domains. The agentic loop tends to adapt better without domain-specific heuristics.
The agent narrows the gap between strong and weak embeddings. In tests, the agent reduced differences of 8–19 points down to about 4–7 points depending on the case.
Cost, latency and when to use it
There’s no free lunch: agentic retrieval consumes more tokens and time. On ViDoRe they report numbers on the order of 760k input tokens and 6.3k output tokens per query, measured on an A100 with one concurrent Claude call to reflect real timings.
So, should you use it? Think about it this way:
If your query is simple and the corpus is well aligned, dense retrieval is fast and sufficient.
If the query is complex, multi-step or the documents are visually rich, the agentic approach offers accuracy that can justify the cost.
Where they’re headed: distillation and lightweight agents
The team plans to cut costs: they’re working to distill agentic reasoning patterns into smaller, open models. The idea is to train smaller models to natively orchestrate the think + retrieve loop, achieving Opus-like accuracy with much lower latency and cost.
Also, the architecture is modular: you can pair your preferred agent with the embedding you choose. For production, they recommend trying llama-nemotron-embed-vl-1b-v2 as a practical starting point.
Final thoughts
NeMo Retriever shows that building multi-step, data-adaptive pipelines is worthwhile — not just chasing isolated benchmarks. Interested in a system that understands visual documents and reasons deeply? If your problem is high-value and complex, the agentic approach deserves your attention.