NVIDIA introduces Nemotron ColEmbed V2, a family of late-interaction models designed for search over complex visual documents. If you work with pages that mix text, tables, charts and images, this is for you: it improves accuracy when retrieving multimodal information in enterprise-style and RAG scenarios.
What is Nemotron ColEmbed V2
They are multi-vector (late-interaction) embedding models available in 3B, 4B and 8B parameter sizes. Instead of producing a single vector per document, they produce one embedding for every document token. During search, each query token embedding is compared with all document token embeddings using the MaxSim operation: each query token keeps its highest similarity, and these per-token maxima are summed to get the final score.
Why does that matter? Because it enables fine-grained matches: a table cell, the text inside a figure, or a small label can influence the result. These details get diluted when the whole document is reduced to a single vector.
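To make the scoring rule concrete, here is a minimal pure-Python sketch of MaxSim. It assumes cosine similarity between token embeddings, which is a common choice for late-interaction models but is not confirmed here for Nemotron ColEmbed V2. The function names and the toy 2-D vectors are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token, keep its best match
    among the document tokens, then sum those maxima."""
    return sum(
        max(cosine(q, d) for d in doc_tokens)
        for q in query_tokens
    )

# Toy embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a match for both query tokens
doc_b = [[1.0, 0.0], [-1.0, 0.0]]             # has a match for the first token only

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```

Because each query token picks its own best document token, a single well-matching region (for example one table cell) can raise the score even when the rest of the page is unrelated.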
Architecture and training (technical)
- Models: llama-nemotron-colembed-vl-3b-v2 (3B), nemotron-colembed-vl-4b-v2 (4B) and nemotron-colembed-vl-8b-v2 (8B).
- Base models: built on components such as the google/siglip2-giant-opt-patch16-384 vision encoder, the meta-llama/Llama-3.2-3B language model and the Qwen3-VL-8B/4B VLMs.
- Bidirectional self-attention: they replace the causal attention of decoder LLMs with bidirectional self-attention, which lets the model learn richer representations of the full context.
- Bi-encoder with contrastive learning: queries and documents are encoded separately; training maximizes similarity for positive pairs and minimizes it for negative pairs, using hard negative mining to improve discrimination.
- Training pipeline: the 3B model was trained in two stages, first fine-tuning on 12.5M text QA pairs and then on text-image pairs. The 4B and 8B models were fine-tuned on text-image pairs.
- V2 improvements: post-training model merging to combine checkpoints like an ensemble without extra latency, and enrichment of the training set with synthetic multilingual data.
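The contrastive objective in the list above can be sketched in a few lines. The exact loss used for these models is not stated here; the example assumes an InfoNCE-style cross-entropy, a common choice for bi-encoders. The function name, the temperature of 0.05 and the similarity scores are all hypothetical.

```python
import math

def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """Cross-entropy over [positive, negatives]: small when the positive
    score is well above every negative score."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]

# Same positive score, different negatives (made-up similarity values).
easy = info_nce_loss(0.9, [0.1, 0.2])    # negatives far from the positive
hard = info_nce_loss(0.9, [0.85, 0.8])   # hard negatives close to the positive

print(f"easy negatives: {easy:.6f}")
print(f"hard negatives: {hard:.6f}")
```

Hard negatives produce a larger loss for the same positive score, which is why mining them gives the model a stronger training signal than random negatives would.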
MaxSim mechanism and storage cost
The idea is ColBERT-style late interaction adapted to multimodal inputs: for each query token you take the maximum similarity with the document tokens (MaxSim) and sum those maxima. The approach is very effective, but it requires storing per-token embeddings for your whole corpus, which increases storage needs and the complexity of the search pipeline.
In short: higher precision at the cost of more space and extra design work for indexing and retrieval infrastructure. Worried about storage? That’s a common concern, and you’ll likely need to balance precision against engineering constraints.
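A back-of-the-envelope estimate shows how quickly per-token storage grows. This sketch counts raw embedding bytes only, with no index overhead, compression or quantization. The corpus size, the 1024 vectors per page and the 2-byte (fp16) values are assumptions for illustration; the dimension of 3072 is the Emb_dim listed for the 3B model in the results table.

```python
def index_size_gib(num_pages, vectors_per_page, dim, bytes_per_value=2):
    """Approximate raw embedding storage in GiB (no index overhead)."""
    return num_pages * vectors_per_page * dim * bytes_per_value / 2**30

pages = 1_000_000  # hypothetical corpus size

# Multi-vector: one embedding per token (1024 per page is an assumption).
multi = index_size_gib(pages, vectors_per_page=1024, dim=3072)
# Single-vector: one embedding per page.
single = index_size_gib(pages, vectors_per_page=1, dim=3072)

print(f"multi-vector: {multi:.0f} GiB, single-vector: {single:.2f} GiB")
# prints: multi-vector: 5859 GiB, single-vector: 5.72 GiB
```

Under these assumptions the multi-vector index is about a thousand times larger, since the ratio is simply the number of vectors stored per page.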
Performance on ViDoRe V3
ViDoRe V3 is the reference benchmark for visual document retrieval in enterprise settings. Averaged over the public and private sets, the models achieved the following NDCG@10 scores:
| Model | Emb_dim | Parameters | ViDoRe V3 NDCG@10 |
|---|---|---|---|
| nemotron-colembed-vl-8b-v2 | 4096 | 8.8B | 63.42 |
| nemotron-colembed-vl-4b-v2 | 2560 | 4.8B | 61.54 |
| llama-nemotron-colembed-vl-3b-v2 | 3072 | 4.4B | 59.79 |
| llama-nemoretriever-colembed-3b-v1 | 3072 | 4.4B | 57.26 |
Key takeaway: the 8B model placed first on ViDoRe V3, while the 4B and 3B models ranked 3rd and 6th in their weight ranges. This suggests that late interaction is paying off in retrieval accuracy for multimodal scenarios.
Use cases and tradeoffs
- Ideal cases: multimedia search engines, RAG systems that need to read pages with tables and charts, chatbots with visual understanding, and compliance tools that look for specific information in scanned documents.
- Tradeoffs: better retrieval quality comes with higher storage needs and more indexing design work. If latency and storage cost are critical in your environment, a single-vector model remains competitive in throughput and operational cost.
How to get started
The nemotron-colembed-vl-8b-v2, nemotron-colembed-vl-4b-v2 and llama-nemotron-colembed-vl-3b-v2 models are available to download on Hugging Face. They’re a good base if you want to experiment with high-precision multimodal retrieval, integrate multimodal RAG, or evaluate the impact of late-interaction on your real collections.
So what now? If your priority is accuracy on visually complex documents, these models are a solid bet. If you need cheap deployment and high query rates, consider hybrids: use a single-vector model for fast filtering and then late-interaction for detailed re-ranking.
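The hybrid approach can be sketched as a two-stage search. This is a minimal illustration under stated assumptions, not a production design: each document is assumed to carry both a single vector and per-token vectors, similarity is a plain dot product, and the function names, corpus layout and toy values are all hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def two_stage_search(query_vec, query_tokens, corpus, shortlist_size=2, top_k=1):
    """corpus: list of dicts with 'id', 'vec' (single vector) and 'tokens' (per-token vectors)."""
    # Stage 1: cheap single-vector filtering over the whole corpus.
    shortlist = sorted(
        corpus, key=lambda d: dot(query_vec, d["vec"]), reverse=True
    )[:shortlist_size]
    # Stage 2: late-interaction re-ranking, applied to the shortlist only.
    reranked = sorted(
        shortlist, key=lambda d: maxsim(query_tokens, d["tokens"]), reverse=True
    )
    return [d["id"] for d in reranked[:top_k]]

# Toy data (made-up vectors).
query_vec = [0.5, 0.5]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
corpus = [
    {"id": "a", "vec": [0.9, 0.3], "tokens": [[1.0, 0.0], [1.0, 0.0]]},
    {"id": "b", "vec": [0.5, 0.5], "tokens": [[1.0, 0.0], [0.0, 1.0]]},
    {"id": "c", "vec": [-1.0, 0.0], "tokens": [[-1.0, 0.0]]},
]

print(two_stage_search(query_vec, query_tokens, corpus))  # ['b']
```

In this toy case the single-vector stage ranks "a" first, but re-ranking promotes "b" because it matches both query tokens. The per-token comparison only runs on the shortlist, so its cost depends on `shortlist_size` rather than on the corpus size.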
