T5Gemma 2 arrives as the next generation of encoder-decoder models built on Gemma 3, bringing efficiency and new multimodal and long-context capabilities. Why does this matter? Because now you have compact models ready for quick experimentation, deployment on devices, and tasks that mix text and images over huge context windows.
Architectural innovations
T5Gemma 2 is not just a retrain. It includes structural changes designed to cut parameters and improve inference, especially at smaller scales.
Tied embeddings: the encoder and decoder share a single embedding table. This lowers the total parameter count and packs more useful capacity into the same memory footprint, which matters when you target lightweight on-device models.
Merged attention: the decoder folds self-attention and cross-attention into a single unified layer. The result: fewer parameters, a simpler architecture, and better parallelism during inference.
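To make both ideas concrete, here is a minimal PyTorch sketch, not the official implementation: one embedding table shared by encoder, decoder, and output head, and a decoder layer whose single attention call attends jointly over the decoder's own states and the encoder's outputs. The class and the exact way the key/value streams are combined are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TinyEncoderDecoder(nn.Module):
    """Sketch: tied embeddings + a merged-attention decoder layer."""

    def __init__(self, vocab_size=32000, d_model=256, n_heads=4):
        super().__init__()
        # Tied embeddings: one table serves encoder inputs, decoder inputs,
        # and (below) the output projection.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.merged_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src_ids, tgt_ids):
        device = tgt_ids.device
        enc = self.encoder_layer(self.embed(src_ids))  # encoder output
        dec = self.embed(tgt_ids)                      # decoder input

        # Merged attention: instead of separate self- and cross-attention blocks,
        # queries come from the decoder while keys/values are the concatenation of
        # decoder states and encoder outputs, all in a single attention call.
        kv = torch.cat([dec, enc], dim=1)
        causal_part = torch.triu(
            torch.ones(dec.size(1), dec.size(1), dtype=torch.bool, device=device), diagonal=1)
        cross_part = torch.zeros(dec.size(1), enc.size(1), dtype=torch.bool, device=device)
        attn_mask = torch.cat([causal_part, cross_part], dim=1)  # causal over decoder, open to encoder
        attended, _ = self.merged_attn(self.norm1(dec), kv, kv, attn_mask=attn_mask)
        dec = dec + attended
        dec = dec + self.ffn(self.norm2(dec))

        # Tied output head: reuse the embedding matrix as the LM head.
        return dec @ self.embed.weight.T
```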
The family also offers compact configurations meant for fast iteration: 270M-270M (~370M total parameters, excluding the vision encoder), 1B-1B (~1.7B), and 4B-4B (~7B).
New capabilities: multimodality and long context
T5Gemma 2 inherits key design features from Gemma 3 that make it more versatile:
Multimodality: the model now understands images together with text thanks to an efficient vision encoder. Think assistants that answer questions about a photo or combine text instructions with diagrams.
Extremely long context: it adopts Gemma 3's alternating local/global attention mechanism, allowing context windows of up to 128K tokens (a simplified sketch of the idea follows at the end of this section). Need to analyze legal files, long technical manuals, or extended conversations? This makes it much more feasible.
Massive multilingualism: trained on more diverse data, it supports more than 140 languages out of the box.
Why does a separate encoder help with long context? Because the encoder can build stable global representations of very long input, while the decoder can focus on generation with efficient access to that information.
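As a rough, framework-agnostic illustration of the local/global idea, the sketch below builds the two kinds of attention masks and an interleaved layer pattern. The 5:1 local-to-global ratio and the window size are illustrative assumptions, not confirmed configuration values for T5Gemma 2.

```python
import torch


def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Local attention: each token only sees the previous `window` tokens."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)   # True = may attend


def causal_mask(seq_len: int) -> torch.Tensor:
    """Global attention: each token sees every earlier token."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return j <= i


def layer_pattern(n_layers: int, local_per_global: int = 5):
    """Illustrative interleaving: several local layers for each global layer."""
    return ["global" if (k + 1) % (local_per_global + 1) == 0 else "local"
            for k in range(n_layers)]


# Local layers keep memory roughly proportional to the window size, so only the
# sparse global layers pay the full cost of a very long context.
print(layer_pattern(12))
print(sliding_window_mask(8, 3).int())
```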
Performance and comparison
According to Google, T5Gemma 2 improves in key areas versus Gemma 3 and T5Gemma:
Better multimodal performance on several benchmarks, even when adapting base text-only models (270M and 1B) to vision-language tasks.
Substantial gains on long-context tasks thanks to the separate encoder.
Overall improvements in coding, reasoning, and multilingual abilities compared to their Gemma 3 counterparts.
Important note: the post-training results shown are illustrative. Post-training/IT checkpoints are not published; the results come from minimal SFT without RL for comparison. Don’t compare pre-training and post-training scores directly if the benchmarks differ.
T5Gemma 2 keeps the practical idea of T5Gemma: start from a strong decoder-only model, initialize the encoder-decoder weights from it, and continue pre-training, which avoids the cost of training from scratch.
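Schematically, that adaptation recipe looks like the sketch below: seed both the encoder and the decoder stacks with the layer weights of a pretrained decoder-only checkpoint, then continue pre-training. This is a simplified illustration of the general idea, not Google's actual conversion script; the module names and layout are invented for the example.

```python
import torch.nn as nn


def init_encoder_decoder_from_decoder_only(enc_dec: nn.Module,
                                           decoder_only: nn.Module) -> nn.Module:
    """Hypothetical sketch: seed an encoder-decoder with decoder-only weights.

    Assumes both models expose `.layers` (a list of transformer blocks) and
    `.embed` (token embeddings) with matching shapes.
    """
    # Shared/tied embeddings come straight from the pretrained model.
    enc_dec.embed.load_state_dict(decoder_only.embed.state_dict())

    # Each pretrained block initializes the corresponding encoder *and* decoder
    # block; continued pre-training then adapts them to their new roles
    # (bidirectional encoding vs. generation with access to the encoder).
    for enc_layer, dec_layer, src_layer in zip(
            enc_dec.encoder.layers, enc_dec.decoder.layers, decoder_only.layers):
        enc_layer.load_state_dict(src_layer.state_dict(), strict=False)
        dec_layer.load_state_dict(src_layer.state_dict(), strict=False)
    return enc_dec
```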
Practical use and technical recommendations
Want to try or deploy it? Here are technical considerations and use cases:
On-device and quick prototypes: the compact sizes (especially 270M-270M) are ideal for fast iteration and for bringing multimodal capabilities to mobile or edge devices, especially when combined with techniques like quantization and pruning (see the loading sketch after this list).
Multimodal tasks: vision + text for VQA, image annotation, visual assistants, and content review tools.
Long-context scenarios: for example, legal assistants that process case files, code analysis across massive repositories, or summarizing long technical books.
Fine-tuning: the release includes pre-trained checkpoints intended as a starting point for developers to apply SFT or RLHF for their own applications. Keep in mind that post-training checkpoints are not distributed.
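As a starting point for experimentation, here is a hedged sketch of loading a checkpoint with 4-bit quantization via transformers and bitsandbytes. The checkpoint name is a placeholder, and whether the official release uses AutoModelForSeq2SeqLM or a dedicated class is not confirmed here; adjust to whatever the model card specifies once the weights are available.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig

# Placeholder checkpoint ID: check the official model card for the real name.
model_id = "google/t5gemma-2-270m-270m"  # hypothetical

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit quantization for edge/prototyping
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Summarize: T5Gemma 2 is an encoder-decoder model ...",
                   return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```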
From a technical perspective, merged attention eases parallelization during generation and reduces inference latency, while tied embeddings save memory and parameter bandwidth because the encoder and decoder share a single embedding table.
Limitations and ethical considerations
No post-training/IT checkpoints released: only the pre-trained models are available.
Adding more languages and multimodality does not remove biases in the training data. If you plan to deploy sensitive applications, consider bias audits and mitigation mechanisms.
The vision encoder adds memory and compute on top of the text parameter counts listed above. Check memory and latency requirements before putting the larger models into production.
T5Gemma 2 is a clear bet on bringing advanced capabilities (multimodality, long context) into more manageable formats. What will you build first: an assistant that understands screenshots or an automatic summarizer for long technical manuals?