Sentence Transformers integrates multimodal embeddings and rerankers | Keryc
The era of search and retrieval that mixes text, images, audio and video is already here. Imagine searching images with a phrase, then refining the results with a reranker that understands text and images together. Hugging Face and Sentence Transformers introduce multimodal support that makes this practical and accessible.
What are multimodal models?
Traditional embedding models turn text into fixed vectors. Multimodal models extend that idea: they map inputs from different modalities (text, image, audio, video) into the same embedding space. The result? You can directly compare a text query with documents that are images or screenshots using the same similarity functions you already know.
Multimodal rerankers (CrossEncoders) work differently: they receive mixed pairs (text+image, image+image, etc.) and return a relevance score per pair. They usually deliver higher quality than embedding similarity, but are slower because they process each pair individually.
Practical note: cross-modal similarities are often lower in magnitude than text-to-text. Don’t be alarmed if the top scores don’t approach 1.0; what matters is the relative ordering.
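Both points (comparing across modalities in one space, and trusting ordering over magnitude) can be illustrated with a toy numpy sketch. The vectors below are invented for illustration, not real model outputs:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings in a shared 4-dim space (real models use hundreds of dims)
text_query = np.array([0.9, 0.1, 0.0, 0.1])  # "a green car"
img_car    = np.array([0.5, 0.5, 0.5, 0.1])  # image of a car
img_bee    = np.array([0.1, 0.2, 0.9, 0.3])  # image of a bee

sims = {name: cos_sim(text_query, v)
        for name, v in [("car", img_car), ("bee", img_bee)]}

# The car image ranks first even though its score is far from 1.0:
# the relative ordering, not the absolute magnitude, is what matters.
best = max(sims, key=sims.get)
print(best, sims)
```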
Tip: large VLMs (e.g. Qwen3-VL-2B) require a GPU with at least ~8 GB VRAM; 8B variants need ~20 GB. If you don't have a local GPU, use a cloud GPU or Google Colab. On CPU they will be very slow: for CPU, CLIP or text-only models perform better.
Multimodal embedding models: loading and use
Loading a multimodal model is as simple as with a text model:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Qwen/Qwen3-VL-Embedding-2B', revision='refs/pr/23')
```
For now, some models require the revision argument until the corresponding PRs are merged. The model automatically detects its supported modalities, and model.encode() accepts images (URLs, local paths or PIL objects), text, and multimodal dicts.
Example of encoding and cross-modal similarity:
```python
# Encode images (URLs elided here; local paths or PIL images also work)
img_embeddings = model.encode([
    'https://.../car.jpg',
    'https://.../bee.jpg',
])

# Encode text
text_embeddings = model.encode([
    'A green car parked in front of a yellow building',
    'A red car driving on a highway',
    'A bee on a pink flower',
    'A wasp on a wooden table',
])

# Cross-modal similarities (one row per text, one column per image)
similarities = model.similarity(text_embeddings, img_embeddings)
print(similarities)
```
You'll see that the best matches still come out on top, even though the absolute values are dampened by the modality gap.
encode_query / encode_document
For retrieval tasks it's recommended to use encode_query() and encode_document(). Many models include different prompts for queries and documents; these methods automatically apply the correct prompt before generating embeddings.
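Conceptually, these methods prepend a model-specific prompt to the input before encoding it. A rough sketch of the idea (the prompt strings below are hypothetical placeholders, not the actual prompts any particular model ships with):

```python
# Hypothetical prompts illustrating what encode_query()/encode_document() do:
# real models define their own prompt strings in their configuration.
PROMPTS = {
    "query": "Represent this query for retrieval: ",
    "document": "Represent this document for retrieval: ",
}

def apply_prompt(text: str, task: str) -> str:
    """Prepend the task-specific prompt before the text gets embedded."""
    return PROMPTS[task] + text

print(apply_prompt("A bee on a pink flower", "query"))
```

Using the wrong prompt (or none) can noticeably hurt retrieval quality, which is why letting encode_query() and encode_document() handle it automatically is the safer default.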
Multimodal rerankers (CrossEncoder)
Multimodal rerankers score mixed pairs. They're ideal for refining short candidate lists. Example usage with rank():
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('Qwen/Qwen3-VL-Reranker-2B', revision='refs/pr/11')

query = 'A green car parked in front of a yellow building'
documents = [
    'https://.../car.jpg',
    'https://.../bee.jpg',
    'A vintage Volkswagen Beetle painted in bright green sits in a driveway.',
    {'text': 'A car in a European city', 'image': 'https://.../car.jpg'},
]

rankings = model.rank(query, documents)
```
The reranker usually orders correctly (car image on top, bee below), but remember score ranges can vary depending on whether the pair is text-image or text-text.
You can also use predict() to get raw scores for specific pairs.
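Unlike rank(), predict() expects explicit (query, document) pairs rather than one query fanned out over a document list. A small sketch of building those pairs (reusing the query and documents from the example above; the actual scoring still requires the loaded model):

```python
query = 'A green car parked in front of a yellow building'
documents = [
    'https://.../car.jpg',  # image URL, elided as in the example above
    'A vintage Volkswagen Beetle painted in bright green sits in a driveway.',
]

# predict() takes a list of (query, document) pairs and returns one score each
pairs = [(query, doc) for doc in documents]
print(pairs)
# With a loaded CrossEncoder: scores = model.predict(pairs)
```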
Retrieve and rerank: recommended pattern
An effective and scalable pattern:
1. Initial retrieval with an embedding model (fast, indexable). Precompute embeddings for the corpus.
2. Rerank the top-k with a multimodal CrossEncoder for precision.
processor_kwargs adjusts image resolution and quality (a higher max_pixels means more memory and time), while model_kwargs controls precision, attention implementation and other model loading parameters.
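As a loading sketch using the two parameters just described (the values are illustrative assumptions; check the model card for what a given model actually supports):

```python
import torch
from sentence_transformers import SentenceTransformer

# Illustrative values: a lower max_pixels trades image detail for speed and
# memory, while torch_dtype sets the numerical precision at load time.
model = SentenceTransformer(
    'Qwen/Qwen3-VL-Embedding-2B',
    revision='refs/pr/23',
    model_kwargs={'torch_dtype': torch.bfloat16},
    processor_kwargs={'max_pixels': 768 * 768},
)
```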
Supported models (v5.4) and lightweight alternatives
If you have limited hardware, CLIP models remain a solid option for CPU:
sentence-transformers/clip-ViT-L-14
sentence-transformers/clip-ViT-B-16
Best practices and technical considerations
Precompute corpus embeddings and use indexes (FAISS, Milvus) to scale.
Use batching and FP16/bfloat16 when hardware allows to reduce memory.
Keep the modality gap in mind: normalize expectations about score magnitudes, but trust the ordering.
Check what modalities a model supports with model.modalities and model.supports('image').
If you need fine control over chat-style messages, pass raw message dicts to avoid automatic conversion.
If you plan to train or finetune, Hugging Face has announced that a dedicated guide on multimodal training is coming soon.
The arrival of multimodal embeddings and rerankers in Sentence Transformers makes building RAG pipelines and cross-modal search much more direct. The takeaway? You can start experimenting today without reinventing the wheel: try CLIP on CPU, move to VLMs on GPU, and combine fast retrieval with reranking for production.