Today Google launches Gemini Embedding 2 in Public Preview, its first fully multimodal embeddings model built on the Gemini architecture. What does that mean for you or anyone working with diverse data? Basically, you can now map text, images, video, audio and documents into the same semantic space, in over 100 languages, without stitching together a bunch of complex pipelines.
What is Gemini Embedding 2?
Gemini Embedding 2 turns different kinds of data into vectors that capture intent and meaning. Instead of one model for text, another for images and another for audio, everything lives in a single embedding space. Why is that useful? Because it makes tasks like semantic search, RAG (Retrieval-Augmented Generation), sentiment analysis and clustering with multimodal data much simpler.
The model is available in Public Preview via the Gemini API and Vertex AI, and works with popular tools like LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB and other vector search engines.
Key capabilities and supported formats
- Text: accepts up to 8,192 tokens of input, enough for long documents or wide retrieval contexts.
- Images: can process up to 6 images per request; PNG and JPEG formats.
- Video: supports clips up to 120 seconds in MP4 and MOV.
- Audio: ingests and embeds audio natively, without needing to transcribe first.
- Documents: can embed PDFs directly, up to 6 pages per file.
The model also understands interleaved inputs: you can send text and images together in the same request and the embedding will capture the relationships between them. This matters when the relevant information isn't confined to a single format.
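As a rough sketch, an interleaved request body could be assembled like this. The part names (`text`, `inline_data`, `mime_type`) mirror the Gemini API's content format for generation; whether the embedding endpoint accepts exactly this shape is an assumption to verify against the official docs.

```python
import base64

def build_interleaved_content(text, image_bytes, mime_type="image/png"):
    """Pack text and an image into one interleaved 'parts' list.

    Field names mirror the Gemini API content format; treat the
    exact embedding request shape as an assumption to confirm.
    """
    return {
        "parts": [
            {"text": text},
            {"inline_data": {
                "mime_type": mime_type,
                "data": base64.b64encode(image_bytes).decode("ascii"),
            }},
        ]
    }

# In practice image_bytes would be real PNG/JPEG bytes read from disk.
content = build_interleaved_content(
    "Product photo showing a scratch on the lid",
    b"\x89PNG\r\n...",
)
```

The point is that both modalities travel in a single ordered parts list, so the model can relate the caption to the image rather than embedding each in isolation.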
Performance and output configuration
Gemini Embedding 2 includes Matryoshka Representation Learning (MRL), a technique that "nests" information and allows flexible output dimensions. The default vector dimension is 3072, but you can scale it down depending on your performance and storage needs. Google recommends using 3072, 1536 or 768 based on the trade-off between quality and cost.
The model also improves voice capabilities and outperforms leading models on text, image and video tasks, according to the metrics Google shares. In practice, that means more accurate and representative embeddings for multimodal applications.
Practical use cases
- Customer support: combine call transcriptions, photos users send and ticket text to retrieve more relevant solutions.
- Multimedia archives: index podcasts, videos and articles and search them by meaning, not just keywords.
- Research and analysis: group documents, images and audio by topic without converting everything to text first.
- Products with RAG: improve context selection for generative models using a single multimodal semantic space.
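Whatever the original format, retrieval downstream looks the same: nearest-neighbor search in the shared vector space. A toy sketch, with short precomputed vectors standing in for real multimodal embeddings:

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k most cosine-similar rows in doc_vecs."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k]

# Toy 4-dim vectors standing in for real embeddings of mixed media.
docs = np.array([
    [0.90, 0.10, 0.00, 0.00],  # e.g. a support-ticket text
    [0.00, 0.90, 0.10, 0.00],  # e.g. a product photo
    [0.88, 0.15, 0.00, 0.10],  # e.g. a call transcript on the same issue
])
query = np.array([1.0, 0.1, 0.0, 0.0])
print(top_k(query, docs))  # indices of the two closest items
```

Because text, images and audio share one space, the ticket text and the call transcript rank together for the same query without any format-specific plumbing.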
Sounds like magic? Not really. It's a practical, incremental improvement on what we could already do, only without fragmenting the pipeline across formats.
How to try it
If you want to experiment, Google offers interactive Colab notebooks for both the Gemini API and Vertex AI that show multimodal embedding examples. You can also integrate it with vector DB libraries and context orchestration frameworks like LangChain and LlamaIndex to build proofs of concept quickly.
Practical tip: start with smaller dimensions (768 or 1536) to explore cost and latency; when you have evidence of impact, move up to 3072 for maximum quality.
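To put that tip in numbers: at float32 precision a 3072-dim vector takes four times the raw storage (and index memory) of a 768-dim one. A quick back-of-the-envelope check:

```python
def storage_gb(n_vectors, dim, bytes_per_value=4):
    """Approximate raw storage for n float32 vectors of the given dimension."""
    return n_vectors * dim * bytes_per_value / 1e9

# Raw vector storage for one million embeddings at each recommended size.
for dim in (768, 1536, 3072):
    print(f"{dim:4d} dims, 1M vectors: {storage_gb(1_000_000, dim):.2f} GB")
```

Real indexes add overhead on top of this (graph links, quantization codebooks), but the 4x ratio between 768 and 3072 dimensions carries through to query latency and memory as well.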
Final thought
Gemini Embedding 2 isn't just another model: it's a piece that simplifies architectures and makes working with data that isn't only text more accessible. Are you working with multimedia, voice or documents? This model can reduce friction and open new possibilities for search, generation and semantic analysis.
Original source
https://blog.google/innovation-and-ai/technology/developers-tools/gemini-embedding-2