Nemotron ColEmbed V2: the new standard in multimodal retrieval

NVIDIA introduces Nemotron ColEmbed V2, a family of late-interaction models designed for search over complex visual documents. If you work with pages that mix text, tables, charts and images, this is for you: it improves accuracy when retrieving multimodal information in enterprise-style and RAG scenarios.

What is Nemotron ColEmbed V2

Nemotron ColEmbed V2 is a family of multi-vector (late-interaction) embedding models available in 3B, 4B and 8B parameter sizes. Instead of producing a single vector per document, the model produces one embedding for every token in the document. During search, each query token embedding is compared against all document token embeddings using the MaxSim operation: for each query token, the maximum similarity over the document tokens is kept, and these per-token maxima are summed to get the final score.
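The scoring rule described above can be sketched in a few lines of NumPy. This is a minimal illustration, not NVIDIA's implementation: the function name `maxsim_score`, the toy vectors, and the choice of a dot product over L2-normalized embeddings as the similarity measure are all assumptions made for the example.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction score between one query and one document.

    query_emb: (num_query_tokens, dim) array, one row per query token.
    doc_emb:   (num_doc_tokens, dim) array, one row per document token.
    Assumption: rows are L2-normalized, so a dot product acts as cosine similarity.
    """
    # Similarity of every query token to every document token.
    sim = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    # For each query token, keep only its best-matching document token.
    best_per_query_token = sim.max(axis=1)
    # Sum the per-token maxima to get the final relevance score.
    return float(best_per_query_token.sum())

# Toy data (hypothetical): 2 query tokens, 3 document tokens, 2 dimensions.
query = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
doc = np.array([[1.0, 0.0],
                [0.6, 0.8],
                [0.0, -1.0]])

score = maxsim_score(query, doc)
print(f"MaxSim score: {score:.2f}")
```

Here the first query token matches the first document token exactly (similarity 1.0) and the second query token is best matched by the second document token (similarity 0.8), so the two maxima sum to about 1.8.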

Why does that matter? Because it enables fine-grained matches: a table cell, the text inside a figure, or a small label can influence the result—things that get diluted when the whole document is reduced to a single vector.
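A toy comparison can make the dilution effect concrete. The sketch below is an illustration under stated assumptions, not a description of how any real single-vector model works: it uses mean pooling as a stand-in for "reducing the document to a single vector", and the helper names (`maxsim`, `pooled_cosine`) and the vectors for `page_a` and `page_b` are hypothetical.

```python
import numpy as np

def maxsim(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # Multi-vector score: best document token per query token, summed.
    return float((query_emb @ doc_emb.T).max(axis=1).sum())

def pooled_cosine(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    # Single-vector stand-in (assumption): mean-pool all tokens, then cosine similarity.
    q = query_emb.mean(axis=0)
    d = doc_emb.mean(axis=0)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

# One query token, e.g. the embedding of a specific label the user searches for.
query = np.array([[1.0, 0.0, 0.0]])

# page_a: one token matches the query exactly (a small label), the rest is unrelated.
page_a = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0]])

# page_b: every token is only loosely related to the query.
page_b = np.array([[0.6, 0.8, 0.0]] * 4)

for name, page in [("page_a", page_a), ("page_b", page_b)]:
    print(f"{name}: maxsim={maxsim(query, page):.3f}  "
          f"pooled={pooled_cosine(query, page):.3f}")
```

With these toy vectors, MaxSim ranks `page_a` first because its single exact match survives the max operation, while the mean-pooled score ranks `page_b` first because the matching token in `page_a` is averaged away by the unrelated ones.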

Model	Embedding dim	Parameters	ViDoRe V3 NDCG@10
nemotron-colembed-vl-8b-v2	4096	8.8B	63.42
nemotron-colembed-vl-4b-v2	2560	4.8B	61.54
llama-nemotron-colembed-vl-3b-v2	3072	4.4B	59.79
llama-nemoretriever-colembed-3b-v1	3072	4.4B	57.26
