Multimodal finetuning with Sentence Transformers for VDR
In this article I walk you through, step by step and with practical examples, how to finetune multimodal models in Sentence Transformers for Visual Document Retrieval (VDR).
Interested in improving retrieval over document screenshots where charts, tables and layout carry the meaning? Then this is for you.
Why finetune a multimodal model
General multimodal models, like Qwen/Qwen3-VL-Embedding-2B, are trained to serve many tasks: image-text matching, VQA, document understanding, and more. But generality comes at a cost: they're rarely optimal for a specific task.
And VDR isn't the same as searching for a photo of sneakers, right? Exactly. For VDR you need to understand layouts, tables and charts in screenshots. By finetuning with domain data, the model learns specialized patterns and improves substantially.
In the experiment we review, the finetuned model tomaarsen/Qwen3-VL-Embedding-2B-vdr rises from NDCG@10 0.888 to 0.947 on the evaluation set. That puts it ahead of much larger models. The lesson? Finetuning on your domain usually beats picking a bigger, generic model.
Components of multimodal training
Training multimodal models with Sentence Transformers follows the same recipe as for text, but with some practical nuances:
Model: multimodal backbone or a combination of encoders.
Dataset: inputs mixing text and images (or video, audio if applicable).
Loss function: guides optimization for retrieval or rerank tasks.
Training arguments: batch size, precision, logging, etc.
Evaluator: retrieval metrics like NDCG@10, MAP and Recall@k.
Trainer: coordinates all of the above.
Below I break down each piece with concrete examples you can run or adapt.
Model: VLM finetune vs Router
Most often you finetune an existing multimodal model (for example Qwen3-VL-Embedding-2B). The library automatically detects the modalities supported by the processor and configures the forward pass and pooling accordingly.
Example of loading with precision parameters and image resolution control:
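A minimal loading sketch. `model_kwargs` is forwarded to the underlying Hugging Face model; the `processor_kwargs` name and the `max_pixels` option are assumptions based on how Qwen-VL processors cap image resolution, so adapt them to your backbone:

```python
import torch
from sentence_transformers import SentenceTransformer

# Load the multimodal backbone in bfloat16 to roughly halve memory vs. fp32.
# max_pixels caps the image resolution the processor will feed the model
# (a Qwen-VL processor option; an assumption here -- check your processor).
model = SentenceTransformer(
    "Qwen/Qwen3-VL-Embedding-2B",
    model_kwargs={"torch_dtype": torch.bfloat16},
    processor_kwargs={"max_pixels": 1280 * 28 * 28},
)
```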
Alternative: build a multimodal model by composition with Router, combining lightweight encoders per modality. It's useful when you want fine control over each encoder or to save resources.
Dataset: the practical example
For the tutorial we used tomaarsen/llamaindex-vdr-en-train-preprocessed, a filtered English version with ~53k examples. The training format was triplets (query, image, negative_0) and for evaluation we kept 4 hard negatives per query.
Practical rule: if your loss requires a label or score column, make sure that column exists. Multimodal entries can be PIL images, file paths, URLs or arrays.
Loss function: CachedMultipleNegativesRankingLoss and Matryoshka
For retrieval we used CachedMultipleNegativesRankingLoss, which combines explicit hard negatives (extra dataset columns) with in-batch negatives. Its cached variant enables large effective batch sizes with less memory.
Mini example:
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss, MatryoshkaLoss
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=1)
loss = MatryoshkaLoss(model, loss, matryoshka_dims=[2048,1536,1024,512,256,128,64])
Why Matryoshka? It trains the model so that truncating the embedding dimension still performs well. Practical result: you can deploy embeddings of 512 or 256 dimensions to save storage and latency with minimal quality loss.
Training arguments: practical notes
bf16=True is preferable for numerical stability in vision-language models.
batch_sampler=NO_DUPLICATES ensures no duplicate samples land in the same batch, where they would act as false in-batch negatives for multiple-negatives type losses.
per_device_train_batch_size may look large for a 2B VLM, but mini_batch_size=1 in the loss and caching make memory manageable.
Using fractions for eval_steps and save_steps makes evaluation and saving happen during the epoch, useful for monitoring.
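The notes above can be sketched as a training-arguments configuration. Values like the batch size and output directory are assumptions, not the tutorial's exact settings:

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="models/qwen3-vl-vdr",        # illustrative path
    num_train_epochs=1,
    per_device_train_batch_size=256,         # large effective batch; loss caching keeps memory flat
    bf16=True,                               # preferred over fp16 for VLM stability
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    eval_strategy="steps",
    eval_steps=0.25,                         # fractional: evaluate 4x per epoch
    save_steps=0.25,
    logging_steps=0.05,
)
```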
Evaluator: measuring retrieval correctly
To track progress we used InformationRetrievalEvaluator with text queries and an image corpus that includes hard negatives. Example building corpus and evaluator:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
# build eval_queries and eval_corpus, add negatives with offsets
eval_evaluator = InformationRetrievalEvaluator(
queries=eval_queries,
corpus=eval_corpus,
relevant_docs=eval_relevant_docs,
batch_size=1,
show_progress_bar=True,
name="vdr-eval-hard",
)
Using batch_size=1 here avoids OOMs on large models.
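A hedged sketch of the corpus construction hinted at in the comment above, using integer offsets so positives and hard negatives get distinct corpus ids (the function and variable names are illustrative, not the tutorial's exact code):

```python
def build_ir_inputs(eval_dataset, num_negatives=4):
    """Build queries, corpus and relevance mapping for InformationRetrievalEvaluator.

    Assumes eval_dataset has columns: query, image, negative_0..negative_{k-1}.
    Positives get ids [0, n); negatives in slot k get ids [(k+1)*n, (k+2)*n),
    so every document receives a unique corpus id.
    """
    n = len(eval_dataset["query"])
    queries = {str(i): q for i, q in enumerate(eval_dataset["query"])}
    corpus = {str(i): img for i, img in enumerate(eval_dataset["image"])}
    for k in range(num_negatives):
        for i, img in enumerate(eval_dataset[f"negative_{k}"]):
            corpus[str((k + 1) * n + i)] = img
    # Only the positive image is relevant for each query.
    relevant_docs = {str(i): {str(i)} for i in range(n)}
    return queries, corpus, relevant_docs
```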
Trainer: bring it all together
The final script assembles model, data, loss, args and evaluator into SentenceTransformerTrainer and launches training with trainer.train().
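The wiring looks roughly like this, with model, args, train_dataset, loss and eval_evaluator coming from the previous sections (the save path is illustrative):

```python
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
    evaluator=eval_evaluator,
)
trainer.train()

# Persist the finetuned model for later loading or upload.
model.save_pretrained("models/qwen3-vl-vdr/final")
```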
Results: what you can expect
In just one epoch of finetuning, the model tomaarsen/Qwen3-VL-Embedding-2B-vdr reached NDCG@10 = 0.947 on the evaluation set (300 queries, 1500 documents, cosine similarity). The base model Qwen/Qwen3-VL-Embedding-2B scored 0.888.
The comparison table showed the finetuned model outperformed 20 tested models, including some 8B-parameter models. With full 2048-dimensional embeddings the peak was 0.948, but thanks to Matryoshka performance stays almost intact even truncating to 512 or 1024 dimensions.
Robustness summary by dimension:
2048d: finetuned 0.948
1024d: finetuned 0.946
512d: finetuned 0.945
256d: finetuned 0.937
64d: finetuned 0.876
Practical decision: the author saved the model with truncate_dim=1024 as a balance between quality and storage.
Multimodal rerankers: another path
You can also train multimodal rerankers (Cross Encoders) with the same infrastructure, using CrossEncoderTrainer and losses like BinaryCrossEntropyLoss.
Two common architectures:
Any-to-Any + LogitScore: uses the LM head to generate tokens and computes log-odds.
Feature Extraction + Pooling + Dense: extracts features and projects to a score, more cost-efficient at inference.
The flow is similar: build datasets per direction (image_to_text and text_to_image), create task prompts and train with positive and negative examples.
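The trainer wiring is a rough sketch only: the checkpoint name is a placeholder, and how a multimodal backbone is loaded into CrossEncoder depends on the architecture you pick, so treat this as the shape of the code rather than a recipe:

```python
from sentence_transformers.cross_encoder import CrossEncoder, CrossEncoderTrainer
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss

# BinaryCrossEntropyLoss expects (query, document) pairs with a 0/1 label
# column: 1 for relevant pairs, 0 for negatives.
model = CrossEncoder("your-multimodal-reranker-checkpoint", num_labels=1)  # placeholder
loss = BinaryCrossEntropyLoss(model)

trainer = CrossEncoderTrainer(
    model=model,
    train_dataset=train_dataset,  # columns: query, document, label
    loss=loss,
)
trainer.train()
```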
Practical tips for your project
If your corpus has consistent layouts or chart types, finetuning will give large gains.
Start with one epoch and evaluate; sometimes 1-2 epochs are enough for VDR with synthetic or well-curated data.
Use Matryoshka if you plan to deploy under storage or latency constraints.
Prefer BF16 over FP16 in VLMs when your hardware supports it.
If memory is a problem, CachedMultipleNegativesRankingLoss with mini_batch_size=1 is your friend.
Don't want to touch a giant VLM? Try the Router approach with lighter encoders and then train them to align embedding spaces.
Final reflection
Finetuning multimodal models isn't just for labs with GPU clusters anymore. With tools like Sentence Transformers you can adapt a VLM to your domain and achieve significant quality jumps in real tasks like VDR. Investing in domain data and a smart training setup often pays off more than just increasing model size.