DiffusionGemma speeds up text generation 4x

Jun 10, 2026 · Keryc Díaz · 3 minutes

Today Google presents DiffusionGemma, an experimental model that rethinks how text is generated, gaining substantial speed by dropping the old word-by-word “typing” process.

What is DiffusionGemma and why it matters

DiffusionGemma is an open model released under the Apache 2.0 license: a 26B-parameter Mixture of Experts (MoE) model that activates only 3.8B parameters during inference. Instead of producing tokens sequentially, it generates whole blocks of text at once, which lets it reach up to 4x the speed on dedicated GPUs.
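To make "activates only 3.8B parameters" more concrete, here is a minimal sketch of top-k expert routing, the usual mechanism behind Mixture of Experts layers. The number of experts, the value of k, and the router scores are hypothetical; the article does not describe DiffusionGemma's actual routing.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights with a softmax."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for one token over 4 experts (not real model values).
chosen = route_top_k([0.1, 2.0, -1.0, 1.0], k=2)
print(chosen)  # only the selected experts would run for this token

# Share of parameters active during inference, from the figures quoted above.
active_share = 3.8 / 26
print(f"{active_share:.1%} of parameters active")
```

Only the selected experts do any work for a given token, which is why compute tracks the active parameter count rather than the total.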

So what does that mean for you as a developer or creator? Less latency when you run models locally, snappier responses in interactive editors, and better experiences in flows where every millisecond counts.

Practical advantages (and their limits)

Speed: Up to 1000+ tokens per second on an NVIDIA H100 and 700+ tokens per second on a GeForce RTX 5090. That turns tasks that used to feel slow into almost instant interactions.
Accessible hardware footprint: Although it is a 26B model, its MoE design means only 3.8B parameters are active during inference, and the quantized model fits in 18GB of VRAM. Great for high-end consumer GPUs.
Bidirectional attention: It generates 256 tokens in parallel at each step, so each token can attend to every other token. What is that good for? Non-linear tasks like inline editing, code infilling, amino-acid sequences or even solving Sudoku.
Smart self-correction: DiffusionGemma iteratively refines the entire block of text, correcting itself on the fly instead of relying only on left-to-right order.
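For a rough sense of the memory figures above, the sketch below estimates the size of the weights alone at a few bit widths. It assumes all 26B parameters must be resident in memory (typical for MoE models, since any expert may be selected) and ignores activations, caches and quantization metadata. The article does not say which bit width the 18GB figure corresponds to.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

TOTAL_PARAMS = 26e9  # total parameter count quoted in the article

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: about {weight_memory_gb(TOTAL_PARAMS, bits):.0f} GB")
```

Under these assumptions, 4-bit weights come to about 13 GB, which is consistent with an 18GB budget once runtime overhead is added. Treat this as an estimate, not a measurement.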

Important: DiffusionGemma prioritizes speed and parallel generation. As a result, its output quality does not match that of the autoregressive Gemma 4. If your product demands the highest quality, Gemma 4 remains the better choice.

Why use diffusion for text?

If you know image diffusion generators, the idea is similar: start from noise and refine until you get a coherent result. For text it works like this:

The model starts with a “canvas” of random tokens.
It makes iterative passes, locking in the tokens it judges correct and using them as context to improve the rest.
The block converges into readable, polished text.

The key advantage is that the model processes the whole paragraph as a set, enabling patterns that sequential models struggle with, for example properly closing complex Markdown blocks or filling in and running code snippets almost in real time.
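The three steps listed above can be pictured with a toy refinement loop. This is only an illustration of the idea, not DiffusionGemma's algorithm: `toy_denoiser` is a hypothetical stand-in that peeks at the target text, which a real model cannot do, and the vocabulary, probabilities and threshold are made-up values.

```python
import random

VOCAB = "abcdefghijklmnopqrstuvwxyz "
LOCK_THRESHOLD = 0.5  # assumed confidence needed to fix a token in place

def toy_denoiser(canvas, locked, target, rng):
    """Hypothetical stand-in for the model: one (token, confidence) pair per position."""
    proposals = []
    for i, token in enumerate(canvas):
        if locked[i]:
            proposals.append((token, 1.0))  # already fixed, keep as is
        elif rng.random() < 0.7:
            proposals.append((target[i], rng.uniform(0.6, 1.0)))  # good guess, confident
        else:
            proposals.append((rng.choice(VOCAB), rng.uniform(0.0, 0.4)))  # noisy guess, unsure
    return proposals

def refine(target, max_steps=50, seed=0):
    rng = random.Random(seed)
    canvas = [rng.choice(VOCAB) for _ in target]  # step 1: canvas of random tokens
    locked = [False] * len(target)
    history = ["".join(canvas)]
    for _ in range(max_steps):                    # step 2: iterative passes
        if all(locked):
            break
        for i, (token, conf) in enumerate(toy_denoiser(canvas, locked, target, rng)):
            if not locked[i]:
                canvas[i] = token
                locked[i] = conf >= LOCK_THRESHOLD  # fix only confident tokens
        history.append("".join(canvas))
    return "".join(canvas), history               # step 3: converged block

final, history = refine("hello diffusion world")
for snapshot in history:
    print(repr(snapshot))
```

Each printed snapshot shows the whole block at once, with more positions settled on every pass, rather than text growing from left to right.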

Where it shines and where it doesn't

DiffusionGemma is especially useful when you run models locally, have low concurrency and want minimal latency: interactive editors, rapid experiments, prototypes that need immediate answers. On cloud servers handling thousands of simultaneous requests, autoregressive generation remains the more efficient option in both cost and performance.
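To translate the throughput figures quoted earlier into response times, here is a back-of-the-envelope sketch. It treats the "up to" numbers as sustained rates for a single request, which is an optimistic assumption; real latency also depends on prompt length, the number of refinement passes and the runtime used.

```python
import math

BLOCK_SIZE = 256  # tokens generated in parallel per block, as stated in the article

def blocks_needed(num_tokens: int) -> int:
    """Number of parallel blocks required to cover num_tokens."""
    return math.ceil(num_tokens / BLOCK_SIZE)

def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Idealized generation time at a constant throughput."""
    return num_tokens / tokens_per_second

# Peak figures from the article, used here as if they were sustained rates.
for gpu, tps in (("NVIDIA H100", 1000), ("GeForce RTX 5090", 700)):
    secs = generation_seconds(512, tps)
    print(f"{gpu}: 512 tokens in about {secs:.2f} s across {blocks_needed(512)} blocks")
```

At these rates a 512-token answer arrives in well under a second, which is the kind of latency that makes interactive editing feel immediate.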

How to get started and compatible tools

Download the weights on Hugging Face: the model is publicly available under Apache 2.0.
Integrations and frameworks: it works with MLX, vLLM (integration backed by Red Hat) and Hugging Face Transformers. There's also a fine-tuning tutorial with Hackable Diffusion (JAX) and work with Unsloth and NVIDIA NeMo. Support for llama.cpp is coming soon.
Hardware optimizations: Google worked with NVIDIA to support quantization and speed up inference on consumer GPUs (RTX 5090 and 4090) and on enterprise systems like Hopper and Blackwell using NVFP4.
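As background on the quantization mentioned above: NVFP4 is a 4-bit floating-point format in which small blocks of values share a scale factor. The sketch below is a simplified illustration of that idea, not NVIDIA's implementation. It keeps the scale as an ordinary Python float, whereas real kernels store it in low precision and run in hardware.

```python
# Magnitudes representable by a 4-bit float with 2 exponent bits and 1 mantissa bit.
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Scale a block so its largest magnitude maps to 6.0, then snap to the 4-bit grid."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    codes = []
    for x in block:
        nearest = min(FP4_MAGNITUDES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(-nearest if x < 0 else nearest)
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate values from the shared scale and the 4-bit codes."""
    return [c * scale for c in codes]

scale, codes = quantize_block([12.0, -6.0, 1.0, 0.3])
print(scale, codes)
print(dequantize_block(scale, codes))  # close to the input, with rounding error
```

Each value costs only 4 bits plus a share of the block's scale, which is where the memory savings come from; the rounding error on small values is the price paid.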

Interesting and practical use cases

One team fine-tuned DiffusionGemma to solve Sudoku, a task that frustrates autoregressive models because each number depends on future placements. With bidirectional attention, the task becomes a much more natural fit.
Generating non-linear structures: editing a paragraph in the middle without rewriting everything, filling in code with full context, or working with biological sequences are scenarios where parallel generation provides a real advantage.
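The difference between left-to-right and bidirectional context in these cases can be pictured as attention masks. This is a generic sketch of the two masking schemes, not code from DiffusionGemma; the function names are illustrative.

```python
def causal_mask(n):
    """Autoregressive masking: position i may only attend to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Bidirectional masking: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]

def visible_positions(mask, position):
    """Indices that the given position is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[position]) if allowed]

n = 5
middle = 2  # e.g. a gap in the middle of a paragraph or a Sudoku row
print(visible_positions(causal_mask(n), middle))         # [0, 1, 2]
print(visible_positions(bidirectional_mask(n), middle))  # [0, 1, 2, 3, 4]
```

With the causal mask, the middle position cannot see what comes after it, which is exactly the information that infilling and constraint puzzles need.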

Final thoughts

DiffusionGemma doesn't aim to replace Gemma 4 in quality, but to expand the toolkit: if you want fast interaction, inline editing and local experimentation on consumer GPUs, this approach opens interesting doors. Want to reduce latency in your app or try new ways to generate text? This model is an invitation to experiment.

Original source

https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation
