Do you have stacks of PDFs, reports with charts, scanned contracts or presentations, and wonder why search systems keep failing? It's not magic: many systems search only the text and lose the visual information and layout. NVIDIA introduces two small, practical Nemotron models that improve accuracy and reduce latency in multimodal search over visual documents.
What NVIDIA released and why it matters
NVIDIA has published two models designed for multimodal Retrieval-Augmented Generation (RAG) that work with standard vector DBs and are small enough for common GPUs:
- llama-nemotron-embed-vl-1b-v2: dense image + text embedding per page (single vector, 2048 dimensions), built for page-level search with millisecond latency.
- llama-nemotron-rerank-vl-1b-v2: a cross-encoder reranker that reorders the top-k candidates to improve relevance before the context is passed to a VLM.
Why does this change practice? Because the multimodal embeddings decide which pages reach the language model, and the reranker decides which pages actually influence the answer. If either step fails, the VLM can confidently invent an answer. Using image+text embeddings plus a multimodal reranker reduces those hallucinations without inflating prompts.
Architecture and technical details
- Size and family: both models have roughly 1.7B parameters and are fine-tunes of the NVIDIA Eagle family, using Llama 3.2 1B as the text backbone and a 400M-parameter SigLIP2 visual encoder.
- llama-nemotron-embed-vl-1b-v2:
  - Bi-encoder architecture: encodes query and document separately.
  - Pooling: mean pooling over the LM's final tokens to produce a single 2048-dimension vector.
  - Training: contrastive learning to pull queries closer to relevant documents and push negatives away.
  - Format: a single dense vector per page, for compatibility with any vector DB.
- llama-nemotron-rerank-vl-1b-v2:
  - Cross-encoder: encodes query and page together for fine-grained scoring.
  - Output: aggregation via mean pooling plus a binary classification head.
  - Loss: cross-entropy, trained on public datasets and synthetic examples.
- Multimodal ingestion: the Image+Text modality feeds the encoder with the page image plus the extracted text (for example via NV-Ingest), yielding representations that are truer to the real document. A sketch of these mechanics follows below.
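To make the bi-encoder vs. cross-encoder distinction concrete, here is a minimal PyTorch sketch of the mechanics described above. The `encoder`, `reranker`, and `relevance_head` objects are hypothetical Hugging Face-style stand-ins, not the actual Nemotron checkpoints, so treat this as an illustration of the scoring and training patterns rather than reference code.

```python
import torch
import torch.nn.functional as F

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean pooling over the LM's final token states, ignoring padding."""
    mask = attention_mask.unsqueeze(-1).float()            # (batch, seq_len, 1)
    return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# Bi-encoder (embedding model): query and page are encoded separately,
# each reduced to a single dense vector and compared by similarity.
def embed(encoder, inputs) -> torch.Tensor:
    out = encoder(**inputs)                                # hypothetical multimodal encoder
    return F.normalize(mean_pool(out.last_hidden_state, inputs["attention_mask"]), dim=-1)

def contrastive_loss(query_vecs, page_vecs, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive objective: pull each query toward its page, push the rest away."""
    logits = query_vecs @ page_vecs.T / temperature        # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Cross-encoder (reranker): query and page are scored together, so the model
# can attend across both before a binary relevance head produces the score.
def rerank_score(reranker, relevance_head, joint_inputs) -> torch.Tensor:
    out = reranker(**joint_inputs)                         # hypothetical cross-encoder backbone
    pooled = mean_pool(out.last_hidden_state, joint_inputs["attention_mask"])
    return relevance_head(pooled).squeeze(-1)              # e.g. nn.Linear(hidden_size, 1)
```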
Benchmark results (Recall@5)
NVIDIA evaluated both models on five visual document retrieval sets: ViDoRe V1/V2/V3, DigitalCorpora-10k and Earnings V2. Here are the average numbers (Recall@5) focused on commercially viable dense models:
| Model | Text | Image | Image + Text |
|---|---|---|---|
| llama-nemotron-embed-1b-v2 | 69.35% | - | - |
| llama-3.2-nemoretriever-1b-vlm-embed-v1 | 71.07% | 70.46% | 71.71% |
| llama-nemotron-embed-vl-1b-v2 | 71.04% | 71.20% | 73.24% |
| llama-nemotron-embed-vl-1b-v2 + llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64% |
The reranker brings a clear improvement: adding the reordering step raises Recall@5 by several percentage points, which usually translates into more correct answers and fewer inventions by the VLM.
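For reference, Recall@5 here measures how often a relevant page shows up among the top 5 retrieved results. A toy computation under one common definition (at least one relevant page in the top-k, averaged over queries; NVIDIA's exact evaluation protocol may differ) could look like this:

```python
def recall_at_k(ranked_ids, relevant_ids, k: int = 5) -> float:
    """Fraction of queries with at least one relevant page in the top-k results."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(page_id in relevant for page_id in ranked[:k])
    )
    return hits / len(ranked_ids)

# Example: 2 of 3 queries find a relevant page in the top 5 -> ~0.67
print(recall_at_k(
    ranked_ids=[["p3", "p9", "p1", "p7", "p2"], ["p5", "p6", "p8", "p4", "p0"], ["p2", "p3"]],
    relevant_ids=[{"p1"}, {"p9"}, {"p2"}],
))
```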
Comparison of the reranker versus public alternatives:
| Model | Text | Image | Image+Text |
|---|---|---|---|
| llama-nemotron-rerank-vl-1b-v2 | 76.12% | 76.12% | 77.64% |
| jina-reranker-m0 | 69.31% | 78.33% | NA |
| MonoQwen2-VL-v0.1 | 74.70% | 75.80% | 75.98% |
Note: jina-reranker-m0 performs well on image-only but its public license is non-commercial (CC-BY-NC). llama-nemotron-rerank-vl-1b-v2 offers better coverage on Text and Image+Text with a permissive license for enterprise use.
Concrete use cases (how companies apply it today)
- Cadence: models design and verification documents as connected multimodal documents. An engineer can ask for specific sections of a spec and get back the pages with the related diagrams and requirements.
- IBM Storage: indexes pages from manuals and guides, prioritizing pages where domain terms appear in the correct context before sending them to LLMs. This improves interpretation and reasoning about complex infrastructures.
- ServiceNow: in 'Chat with PDF' experiences, they use multimodal embeddings for indexing and the reranker to prioritize pages by query, keeping conversations coherent over large PDF collections.
How to integrate these models into your RAG stack
- Ingest: extract the text and save the image of each page (NV-Ingest is one option).
- Indexing: run llama-nemotron-embed-vl-1b-v2 and store one dense vector per page in your preferred vector DB.
- Retrieval: do a top-k similarity search (milliseconds at enterprise scale).
- Reranking: apply llama-nemotron-rerank-vl-1b-v2 over the top-k to reorder candidates without changing your index.
- Generation: concatenate the top reranked pages as context for your VLM and generate more grounded answers. A sketch of this flow follows below.
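Put together, the pipeline can be sketched roughly as below. The `embed_page`, `embed_query`, `rerank_score`, and `generate_answer` stubs are placeholders you would wire to your serving stack (NIM microservices, local checkpoints, etc.), and the in-memory index stands in for a real vector DB; treat it as the shape of the flow, not a drop-in implementation.

```python
import numpy as np

# Hypothetical stand-ins for model calls; wire these to your serving stack.
def embed_page(image, text) -> np.ndarray: ...   # llama-nemotron-embed-vl-1b-v2 (image + extracted text)
def embed_query(text) -> np.ndarray: ...         # same embedding model, query side
def rerank_score(query, page) -> float: ...      # llama-nemotron-rerank-vl-1b-v2 on the (query, page) pair
def generate_answer(query, pages) -> str: ...    # your VLM of choice

class InMemoryIndex:
    """Toy stand-in for a vector DB: one normalized dense vector per page."""
    def __init__(self):
        self.vectors, self.pages = [], []

    def upsert(self, vector, page):
        self.vectors.append(vector / np.linalg.norm(vector))
        self.pages.append(page)

    def search(self, query_vector, limit):
        sims = np.stack(self.vectors) @ (query_vector / np.linalg.norm(query_vector))
        return [self.pages[i] for i in np.argsort(-sims)[:limit]]

def index_documents(pages, index: InMemoryIndex):
    # Ingest + indexing: one dense vector per page (image + extracted text).
    for page in pages:
        index.upsert(embed_page(page["image"], page["text"]), page)

def answer(query: str, index: InMemoryIndex, top_k: int = 50, keep: int = 5) -> str:
    # Retrieval: fast top-k by similarity against the page vectors.
    candidates = index.search(embed_query(query), limit=top_k)
    # Reranking: joint (query, page) scoring, reordering without touching the index.
    candidates.sort(key=lambda page: rerank_score(query, page), reverse=True)
    # Generation: only the best pages reach the VLM as context.
    return generate_answer(query, candidates[:keep])
```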
Practical tip: don't try to compensate for poor embeddings with huge prompts. It's better to invest in improving retrieval and reranking: that's where you reduce fabricated answers.
Final thoughts
The most interesting part here is the pragmatism: relatively small models (1.7B) that fit on common GPUs can turn a search over PDFs into a real multimodal experience. Lower latency, compatibility with standard vector DBs, and a clear Recall@5 improvement mean you don't need a massive model to deliver useful results in enterprise apps. Ready for your agents to understand images and layout, not just the text?
