Do you have stacks of PDFs, reports with charts, scanned contracts or presentations, and wonder why search systems keep failing? It's not magic: many systems search only the text and lose the visual information and layout. NVIDIA introduces two small, practical Nemotron models that improve accuracy and reduce latency in multimodal search over visual documents.
What NVIDIA released and why it matters
NVIDIA has published two models designed for multimodal Retrieval-Augmented Generation (RAG) that work with standard vector DBs and are small enough for common GPUs:
- llama-nemotron-embed-vl-1b-v2: dense image + text embedding per page (single vector, 2048 dimensions), built for page-level search with millisecond latency.
- llama-nemotron-rerank-vl-1b-v2: a cross-encoder reranker that reorders the top-k candidates to improve relevance before the context is passed to a VLM.
Why does this change practice? Because the multimodal embeddings decide which pages reach the language model, and the reranker decides which pages actually influence the answer. If either step fails, the VLM can confidently invent an answer. Using image+text embeddings plus a multimodal reranker reduces those hallucinations without inflating prompts.
Architecture and technical details
- Size and family: both models have roughly 1.7B parameters and are fine-tunes of the NVIDIA Eagle family, using Llama 3.2 1B as the text backbone and a 400M-parameter SigLIP2 visual encoder.
- llama-nemotron-embed-vl-1b-v2:
  - Bi-encoder architecture: encodes query and document separately.
  - Pooling: mean pooling over the LM's final tokens to produce a single 2048-dimension vector.
  - Training: contrastive learning to pull queries closer to relevant documents and push negatives away.
  - Format: a single dense vector per page, for compatibility with any vector DB.
- llama-nemotron-rerank-vl-1b-v2:
  - Cross-encoder: encodes query and page together for fine-grained scoring.
  - Output: aggregation via mean pooling plus a binary classification head.
  - Loss: cross-entropy, trained on public datasets and synthetic examples.
- Multimodal ingestion: the Image+Text modality feeds the encoder with the page image plus the extracted text (for example via NV-Ingest), yielding representations that are truer to the real document. A sketch of these mechanics follows below.
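To make the bi-encoder vs. cross-encoder distinction concrete, here is a minimal PyTorch sketch of the mechanics described above. The `encoder`, `reranker`, and `relevance_head` objects are hypothetical Hugging Face-style stand-ins, not the actual Nemotron checkpoints, so treat this as an illustration of the scoring and training patterns rather than reference code.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean pooling over the LM's final token states, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()            # (batch, seq_len, 1)
    return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# Bi-encoder (embedding model): query and page are encoded separately,
# each reduced to a single dense vector and compared by similarity.
def embed(encoder, inputs) -> torch.Tensor:
    out = encoder(**inputs)                                # hypothetical multimodal encoder
    return F.normalize(mean_pool(out.last_hidden_state, inputs["attention_mask"]), dim=-1)

def contrastive_loss(query_vecs, page_vecs, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive objective: pull each query toward its page, push the rest away."""
    logits = query_vecs @ page_vecs.T / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Cross-encoder (reranker): query and page are scored together, so the model
# can attend across both before a binary relevance head produces the score.
def rerank_score(reranker, relevance_head, joint_inputs) -> torch.Tensor:
    out = reranker(**joint_inputs)                         # hypothetical cross-encoder backbone
    pooled = mean_pool(out.last_hidden_state, joint_inputs["attention_mask"])
    return relevance_head(pooled).squeeze(-1)              # e.g. nn.Linear(hidden_size, 1)
```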
Benchmark results (Recall@5)
NVIDIA evaluated both models on five visual document retrieval sets: ViDoRe V1/V2/V3, DigitalCorpora-10k and Earnings V2. Here are the average numbers (Recall@5) focused on commercially viable dense models:
| Model | Text | Image | Image + Text |
|---|---|---|---|
| llama-nemotron-embed-1b-v2 | 69.35% | - | - |
| llama-3.2-nemoretriever-1b-vlm-embed-v1 | 71.07% | 70.46% | 71.71% |
| llama-nemotron-embed-vl-1b-v2 | 71.04% | 71.20% | 73.24% |
| llama-nemotron-embed-vl-1b-v2 + llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64% |
The reranker brings a clear improvement: adding the reordering step raises Recall@5 by several percentage points, which usually translates into more correct answers and fewer inventions by the VLM.
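For reference, Recall@5 here measures how often a relevant page shows up among the top 5 retrieved results. A toy computation under one common definition (at least one relevant page in the top-k, averaged over queries; NVIDIA's exact evaluation protocol may differ) could look like this:

```python
def recall_at_k(ranked_ids, relevant_ids, k: int = 5) -> float:
    """Fraction of queries with at least one relevant page in the top-k results."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(page_id in relevant for page_id in ranked[:k])
    )
    return hits / len(ranked_ids)

# Example: 2 of 3 queries find a relevant page in the top 5 -> ~0.67
print(recall_at_k(
    ranked_ids=[["p3", "p9", "p1", "p7", "p2"], ["p5", "p6", "p8", "p4", "p0"], ["p2", "p3"]],
    relevant_ids=[{"p1"}, {"p9"}, {"p2"}],
))
```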
Comparison of the reranker versus public alternatives:
| Model | Text | Image | Image+Text |
|---|---|---|---|
| llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64% |
| jina-reranker-m0 | 69.31% | 78.33% | NA |
| MonoQwen2-VL-v0.1 | 74.70% | 75.80% | 75.98% |
Note: jina-reranker-m0 performs well on image-only but its public license is non-commercial (CC-BY-NC). llama-nemotron-rerank-vl-1b-v2 offers better coverage on Text and Image+Text with a permissive license for enterprise use.
Concrete use cases (how companies apply it today)
- Cadence: models design and verification documents as connected multimodal documents. An engineer can ask for specific sections of a spec and get back the pages with the related diagrams and requirements.
- IBM Storage: indexes pages from manuals and guides, prioritizing pages where domain terms appear in the correct context before sending them to LLMs. This improves interpretation and reasoning about complex infrastructures.
- ServiceNow: in 'Chat with PDF' experiences, they use multimodal embeddings for indexing and the reranker to prioritize pages by query, keeping conversations coherent over large PDF collections.
How to integrate these models into your RAG stack
- Ingest: extract the text and save the image of each page (NV-Ingest is one option).
- Indexing: run llama-nemotron-embed-vl-1b-v2 and store one dense vector per page in your preferred vector DB.
- Retrieval: do a top-k similarity search (milliseconds at enterprise scale).
- Reranking: apply llama-nemotron-rerank-vl-1b-v2 over the top-k to reorder candidates without changing your index.
- Generation: concatenate the top reranked pages as context for your VLM and generate more grounded answers. A sketch of this flow follows below.
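Put together, the pipeline can be sketched roughly as below. The `embed_page`, `embed_query`, `rerank_score`, and `generate_answer` stubs are placeholders you would wire to your serving stack (NIM microservices, local checkpoints, etc.), and the in-memory index stands in for a real vector DB; treat it as the shape of the flow, not a drop-in implementation.

```python
import numpy as np

# Hypothetical stand-ins for model calls; wire these to your serving stack.
def embed_page(image, text) -> np.ndarray: ...   # llama-nemotron-embed-vl-1b-v2 (image + extracted text)
def embed_query(text) -> np.ndarray: ...         # same embedding model, query side
def rerank_score(query, page) -> float: ...      # llama-nemotron-rerank-vl-1b-v2 on the (query, page) pair
def generate_answer(query, pages) -> str: ...    # your VLM of choice

class InMemoryIndex:
    """Toy stand-in for a vector DB: one normalized dense vector per page."""
    def __init__(self):
        self.vectors, self.pages = [], []

    def upsert(self, vector, page):
        self.vectors.append(vector / np.linalg.norm(vector))
        self.pages.append(page)

    def search(self, query_vector, limit):
        sims = np.stack(self.vectors) @ (query_vector / np.linalg.norm(query_vector))
        return [self.pages[i] for i in np.argsort(-sims)[:limit]]

def index_documents(pages, index: InMemoryIndex):
    # Ingest + indexing: one dense vector per page (image + extracted text).
    for page in pages:
        index.upsert(embed_page(page["image"], page["text"]), page)

def answer(query: str, index: InMemoryIndex, top_k: int = 50, keep: int = 5) -> str:
    # Retrieval: fast top-k by similarity against the page vectors.
    candidates = index.search(embed_query(query), limit=top_k)
    # Reranking: joint (query, page) scoring, reordering without touching the index.
    candidates.sort(key=lambda page: rerank_score(query, page), reverse=True)
    # Generation: only the best pages reach the VLM as context.
    return generate_answer(query, candidates[:keep])
```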
Practical tip: don't try to compensate for poor embeddings with huge prompts. It's better to invest in improving retrieval and reranking: that's where you reduce fabricated answers.
Final thoughts
The most interesting part here is the pragmatism: relatively small models (1.7B) that fit on common GPUs can turn a search over PDFs into a real multimodal experience. Lower latency, compatibility with standard vector DBs, and a clear Recall@5 improvement mean you don't need a massive model to deliver useful results in enterprise apps. Ready for your agents to understand images and layout, not just the text?
