Multimodal finetuning with Sentence Transformers for VDR
In this article I walk you through, step by step and with practical examples, how to finetune multimodal models in Sentence Transformers for Visual Document Retrieval (VDR).
Interested in improving retrieval over document screenshots where charts, tables and layout carry the meaning? Then this is for you.
Why finetune a multimodal model
General multimodal models, like Qwen/Qwen3-VL-Embedding-2B, are trained to serve many tasks: image-text matching, VQA, document understanding, and more. But generality comes at a cost: they're rarely optimal for a specific task.
And VDR isn't the same as searching for a photo of sneakers, right? Exactly. For VDR you need to understand layouts, tables and charts in screenshots. By finetuning with domain data, the model learns specialized patterns and improves substantially.
In the experiment we review, the finetuned model tomaarsen/Qwen3-VL-Embedding-2B-vdr rises from NDCG@10 0.888 to 0.947 on the evaluation set. That puts it ahead of much larger models. The lesson? Finetuning on your domain usually beats picking a bigger, generic model.
Components of multimodal training
Training multimodal models with Sentence Transformers follows the same recipe as for text, but with some practical nuances:
Model: multimodal backbone or a combination of encoders.
Dataset: inputs mixing text and images (or video, audio if applicable).
Loss function: guides optimization for retrieval or rerank tasks.
Training arguments: batch size, precision, logging, etc.
Evaluator: retrieval metrics like NDCG@10, MAP and Recall@k.
Trainer: coordinates all of the above.
Below I break down each piece with concrete examples you can run or adapt.
Model: VLM finetune vs Router
Most often you finetune an existing multimodal model (for example Qwen3-VL-Embedding-2B). The library automatically detects the modalities supported by the processor and configures the forward pass and pooling accordingly.
Example of loading with precision parameters and image resolution control:
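A minimal loading sketch. `model_kwargs` is forwarded to the underlying Hugging Face model; the `processor_kwargs` name and the `max_pixels` option are assumptions based on how Qwen-VL processors cap image resolution, so adapt them to your backbone:

```python
import torch
from sentence_transformers import SentenceTransformer

# Load the multimodal backbone in bfloat16 to roughly halve memory vs. fp32.
# max_pixels caps the image resolution the processor will feed the model
# (a Qwen-VL processor option; an assumption here -- check your processor).
model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    processor_kwargs={"max_pixels": 1280 * 28 * 28},
)
```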
Alternative: build a multimodal model by composition with Router, combining lightweight encoders per modality. It's useful when you want fine control over each encoder or to save resources.
Dataset: the practical example
For the tutorial we used tomaarsen/llamaindex-vdr-en-train-preprocessed, a filtered English version with ~53k examples. The training format was triplets (query, image, negative_0) and for evaluation we kept 4 hard negatives per query.
Practical rule: if your loss requires a label or score column, make sure that column exists. Multimodal entries can be PIL images, file paths, URLs or arrays.
Loss function: CachedMultipleNegativesRankingLoss and Matryoshka
For retrieval we used CachedMultipleNegativesRankingLoss, which combines explicit hard negatives (extra dataset columns) with in-batch negatives. Its cached variant enables large effective batch sizes with less memory.
Mini example:
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=1)
loss = MatryoshkaLoss(model, loss, matryoshka_dims=[2048,1536,1024,512,256,128,64])
Why Matryoshka? It trains the model so that truncating the embedding dimension still performs well. Practical result: you can deploy embeddings of 512 or 256 dimensions to save storage and latency with minimal quality loss.
Training arguments: practical notes
bf16=True is preferable for numerical stability in vision-language models.
batch_sampler=NO_DUPLICATES ensures no duplicate samples land in the same batch, where they would act as false in-batch negatives for multiple-negatives type losses.
per_device_train_batch_size may look large for a 2B VLM, but mini_batch_size=1 in the loss and caching make memory manageable.
Using fractions for eval_steps and save_steps makes evaluation and saving happen during the epoch, useful for monitoring.
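The notes above can be sketched as a training-arguments configuration. Values like the batch size and output directory are assumptions, not the tutorial's exact settings:

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="models/qwen3-vl-vdr",        # illustrative path
    num_train_epochs=1,
    per_device_train_batch_size=256,         # large effective batch; loss caching keeps memory flat
    bf16=True,                               # preferred over fp16 for VLM stability
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    eval_strategy="steps",
    eval_steps=0.25,                         # fractional: evaluate 4x per epoch
    save_steps=0.25,
    logging_steps=0.05,
)
```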
Evaluator: measuring retrieval correctly
To track progress we used InformationRetrievalEvaluator with text queries and an image corpus that includes hard negatives. Example building corpus and evaluator:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
# build eval_queries and eval_corpus, add negatives with offsets
eval_evaluator = InformationRetrievalEvaluator(
queries=eval_queries,
corpus=eval_corpus,
relevant_docs=eval_relevant_docs,
batch_size=1,
show_progress_bar=True,
name="vdr-eval-hard",
)
Using batch_size=1 here avoids OOMs on large models.
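A hedged sketch of the corpus construction hinted at in the comment above, using integer offsets so positives and hard negatives get distinct corpus ids (the function and variable names are illustrative, not the tutorial's exact code):

```python
def build_ir_inputs(eval_dataset, num_negatives=4):
    """Build queries, corpus and relevance mapping for InformationRetrievalEvaluator.

    Assumes eval_dataset has columns: query, image, negative_0..negative_{k-1}.
    Positives get ids [0, n); negatives in slot k get ids [(k+1)*n, (k+2)*n),
    so every document receives a unique corpus id.
    """
    n = len(eval_dataset["query"])
    queries = {str(i): q for i, q in enumerate(eval_dataset["query"])}
    corpus = {str(i): img for i, img in enumerate(eval_dataset["image"])}
    for k in range(num_negatives):
        for i, img in enumerate(eval_dataset[f"negative_{k}"]):
            corpus[str((k + 1) * n + i)] = img
    # Only the positive image is relevant for each query.
    relevant_docs = {str(i): {str(i)} for i in range(n)}
    return queries, corpus, relevant_docs
```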
Trainer: bring it all together
The final script assembles model, data, loss, args and evaluator into SentenceTransformerTrainer and launches training with trainer.train().
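The wiring looks roughly like this, with model, args, train_dataset, loss and eval_evaluator coming from the previous sections (the save path is illustrative):

```python
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=eval_evaluator,
)
trainer.train()

# Persist the finetuned model for later loading or upload.
model.save_pretrained("models/qwen3-vl-vdr/final")
```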
Results: what you can expect
In just one epoch of finetuning, the model tomaarsen/Qwen3-VL-Embedding-2B-vdr reached NDCG@10 = 0.947 on the evaluation set (300 queries, 1500 documents, cosine similarity). The base model Qwen/Qwen3-VL-Embedding-2B scored 0.888.
The comparison table showed the finetuned model outperformed 20 tested models, including some 8B-parameter models. With full 2048-dimensional embeddings the peak was 0.948, but thanks to Matryoshka performance stays almost intact even truncating to 512 or 1024 dimensions.
Robustness summary by dimension:
2048d: finetuned 0.948
1024d: finetuned 0.946
512d: finetuned 0.945
256d: finetuned 0.937
64d: finetuned 0.876
Practical decision: the author saved the model with truncate_dim=1024 as a balance between quality and storage.
Multimodal rerankers: another path
You can also train multimodal rerankers (Cross Encoders) with the same infrastructure, using CrossEncoderTrainer and losses like BinaryCrossEntropyLoss.
Two common architectures:
Any-to-Any + LogitScore: uses the LM head to generate tokens and computes log-odds.
Feature Extraction + Pooling + Dense: extracts features and projects to a score, more cost-efficient at inference.
The flow is similar: build datasets per direction (image_to_text and text_to_image), create task prompts and train with positive and negative examples.
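The trainer wiring is a rough sketch only: the checkpoint name is a placeholder, and how a multimodal backbone is loaded into CrossEncoder depends on the architecture you pick, so treat this as the shape of the code rather than a recipe:

```python
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# BinaryCrossEntropyLoss expects (query, document) pairs with a 0/1 label
# column: 1 for relevant pairs, 0 for negatives.
model = CrossEncoder("your-multimodal-reranker-checkpoint", num_labels=1)  # placeholder
loss = BinaryCrossEntropyLoss(model)

trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,  # columns: query, document, label
    loss=loss,
)
trainer.train()
```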
Practical tips for your project
If your corpus has consistent layouts or chart types, finetuning will give large gains.
Start with one epoch and evaluate; sometimes 1-2 epochs are enough for VDR with synthetic or well-curated data.
Use Matryoshka if you plan to deploy under storage or latency constraints.
Prefer BF16 over FP16 in VLMs when your hardware supports it.
If memory is a problem, CachedMultipleNegativesRankingLoss with mini_batch_size=1 is your friend.
Don't want to touch a giant VLM? Try the Router approach with lighter encoders and then train them to align embedding spaces.
Final reflection
Finetuning multimodal models isn't just for labs with GPU clusters anymore. With tools like Sentence Transformers you can adapt a VLM to your domain and achieve significant quality jumps in real tasks like VDR. Investing in domain data and a smart training setup often pays off more than just increasing model size.