Gemma 4 QAT: models optimized for mobile and laptop

Jun 5, 2026 · Keryc Díaz · 3 minutes

Gemma 4 has been evolving fast over the past couple of months. Now Google is releasing checkpoints trained with Quantization-Aware Training (QAT) so you can run powerful models locally, on your phone or laptop, using much less memory and with little loss of the quality you expect.

What's in this update

The core idea is simple: instead of compressing the model after training (what's called Post-Training Quantization or PTQ), the compression is simulated during training. That helps keep accuracy when the model is converted to smaller formats.

QAT simulates quantization during training to minimize quality loss when the model is compressed.
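
To make the idea concrete, here is a minimal sketch of "simulated" quantization on a toy linear model. It is not Google's training recipe: the helper names (fake_quantize, train), the symmetric 4-bit scheme and the straight-through gradient trick are assumptions chosen for illustration. The forward pass uses rounded weights, while updates are applied to the full-precision copy.

```python
import random

def fake_quantize(weights, bits=4):
    """Quantize then dequantize: values stay floats but carry the
    rounding error a real low-bit format would introduce."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for w in weights:
        q = max(-qmax - 1, min(qmax, round(w / scale)))
        out.append(q * scale)
    return out

def mse(weights, data):
    """Mean squared error of a linear model y = w . x on the data."""
    total = 0.0
    for xs, y in data:
        pred = sum(w * x for w, x in zip(weights, xs))
        total += (pred - y) ** 2
    return total / len(data)

def train(data, n_features, steps=200, lr=0.05, qat=False, bits=4):
    w = [0.0] * n_features                          # full-precision weights
    for _ in range(steps):
        # QAT: the forward pass sees the quantized weights.
        w_fwd = fake_quantize(w, bits) if qat else w
        grad = [0.0] * n_features
        for xs, y in data:
            err = sum(wi * xi for wi, xi in zip(w_fwd, xs)) - y
            for i, xi in enumerate(xs):
                grad[i] += 2 * err * xi / len(data)
        # Straight-through estimator: rounding is treated as identity in
        # the backward pass, so the gradient updates the float weights.
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Toy data from made-up "true" weights (illustrative only).
random.seed(0)
true_w = [0.9, -0.35, 0.12]
data = []
for _ in range(64):
    xs = [random.uniform(-1, 1) for _ in true_w]
    data.append((xs, sum(w * x for w, x in zip(true_w, xs))))

ptq_w = fake_quantize(train(data, 3), bits=4)            # train, then compress
qat_w = fake_quantize(train(data, 3, qat=True), bits=4)  # compress while training
print("PTQ loss:", mse(ptq_w, data))
print("QAT loss:", mse(qat_w, data))
```

On a problem this small the two losses may end up close or even identical; the point is only to show where the rounding happens in each workflow.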

Google now offers QAT checkpoints in the popular Q4_0 format and also a new format specially designed for mobile. With that mobile format, Gemma 4 E2B text-only cuts its memory footprint to around 1 GB, making long conversations possible on consumer devices.
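
For a feel of what Q4_0 costs per weight, the sketch below models the block layout used by llama.cpp: 32 weights share one 16-bit scale, and each weight is stored as a 4-bit code. The quantize_block and dequantize_block helpers are a simplified illustration written for this article, not the library's actual code (which stores the scale as float16 and packs two codes per byte).

```python
BLOCK_SIZE = 32                    # weights per Q4_0 block
SCALE_BYTES = 2                    # one float16 scale per block
PACKED_BYTES = BLOCK_SIZE // 2     # two 4-bit codes per byte

def q4_0_bits_per_weight():
    """Effective storage cost, including the per-block scale."""
    return (SCALE_BYTES + PACKED_BYTES) * 8 / BLOCK_SIZE

def q4_0_size_bytes(n_weights):
    """Bytes needed to store n_weights, rounded up to whole blocks."""
    blocks = -(-n_weights // BLOCK_SIZE)   # ceiling division
    return blocks * (SCALE_BYTES + PACKED_BYTES)

def quantize_block(block):
    """Simplified Q4_0-style block: one scale plus codes in 0..15."""
    extreme = max(block, key=abs)          # largest magnitude, sign kept
    scale = extreme / -8 if extreme != 0 else 1.0
    codes = [min(15, max(0, int(x / scale + 8.5))) for x in block]
    return scale, codes

def dequantize_block(scale, codes):
    """Map each 4-bit code back to an approximate float."""
    return [(c - 8) * scale for c in codes]

print("bits per weight:", q4_0_bits_per_weight())

block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]   # made-up sample values
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
print("max error:", max(abs(a - b) for a, b in zip(block, restored)))
```

The arithmetic works out to 18 bytes per 32 weights, or 4.5 bits per weight. As a purely hypothetical example, q4_0_size_bytes(1_000_000_000) returns 562,500,000 bytes for a one-billion-weight tensor; that number is not Gemma 4's parameter count.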

What did they optimize for mobile? (in plain terms)

Activation pre-tuning: instead of computing activation scaling factors at run time, they are precomputed during training, which saves compute on the phone's chip.
Per-channel quantization: the compressed layout matches how mobile accelerators are designed, so you avoid slow fallback paths.
Focused 2-bit quantization: the parts that generate tokens are compressed aggressively, while reasoning layers keep higher precision.
Optimized embeddings and KV cache: the vocabulary embeddings and the KV cache (the model's short-term memory) are compressed, reducing working memory and allowing long conversations.
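
The per-channel point in the list above is easy to see on a tiny example. This sketch compares one shared scale for a whole weight matrix against one scale per row; treating each row as an output channel, the symmetric 4-bit scheme and the sample values are all assumptions made for illustration, not details of the mobile format.

```python
def scale_for(values, bits=4):
    """Symmetric scale that maps the largest magnitude to the top code."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def quantize_dequantize(values, scale, bits=4):
    """Round to the integer grid defined by scale, then map back."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale
            for v in values]

def per_tensor(matrix, bits=4):
    """One scale shared by every value in the matrix."""
    scale = scale_for([v for row in matrix for v in row], bits)
    return [quantize_dequantize(row, scale, bits) for row in matrix]

def per_channel(matrix, bits=4):
    """One scale per row (each row stands in for an output channel)."""
    return [quantize_dequantize(row, scale_for(row, bits), bits)
            for row in matrix]

def row_error(original, restored):
    """Largest absolute reconstruction error in a row."""
    return max(abs(a - b) for a, b in zip(original, restored))

# Made-up weights: one large-magnitude channel, one small-magnitude channel.
weights = [
    [4.0, -3.0, 2.0, 1.0],
    [0.04, -0.03, 0.02, 0.01],
]

pt = per_tensor(weights)
pc = per_channel(weights)
for i, row in enumerate(weights):
    print(f"channel {i}: per-tensor error {row_error(row, pt[i]):.4f}, "
          f"per-channel error {row_error(row, pc[i]):.4f}")
```

With a shared scale, the large channel dictates the grid and the small channel collapses to zeros; with its own scale, the small channel keeps its shape. The large channel is unaffected either way.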

Also, if you don't use audio or vision, you can deploy text-only mode to cut the footprint even more.
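
Why does KV cache compression matter for long conversations? A rough, generic transformer estimate helps: the cache stores one key and one value vector per layer, per KV head, per token. The layer count, head count and head size below are hypothetical placeholders, not Gemma 4's actual architecture, and the formula ignores tricks such as sliding-window attention or shared KV layers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bits_per_value):
    """Approximate KV cache size for a plain transformer decoder."""
    # Factor 2: one key vector and one value vector per position.
    values = 2 * n_layers * n_kv_heads * head_dim * n_tokens
    return values * bits_per_value // 8

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=30, n_kv_heads=4, head_dim=128)

for tokens in (2_048, 32_768):
    for bits in (16, 8, 4):
        size_mib = kv_cache_bytes(n_tokens=tokens, bits_per_value=bits,
                                  **shape) / 2**20
        print(f"{tokens:>6} tokens at {bits:>2}-bit: {size_mib:8.1f} MiB")
```

The cache grows linearly with the number of tokens, so halving the bits per value roughly doubles the conversation length that fits in the same memory budget.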

Why does this matter to you today?

Privacy: running locally means less of your data goes to the cloud, so your chats stay more private.
Cost savings: less server GPU time, so lower costs if you run inference at scale.
Accessibility: you can try capable models on ordinary laptops or modern phones.
Flexibility: there are checkpoints ready for popular tools, so you don't have to reinvent the wheel.

How to get started (tools and workflows)

Download weights: Q4_0 models and the mobile format are available on Hugging Face. There are GGUF files ready for llama.cpp and compressed tensors for vLLM.
Run locally: tools like llama.cpp, Ollama or LM Studio make it easy to test them on desktop.
Deploy to device: use LiteRT-LM for edge devices or run in the browser with Transformers.js.
Optimization and serving: use vLLM to serve large models and MLX on Apple Silicon; the MTP QAT checkpoints retain Multi-Token Prediction acceleration.
Fine-tuning: you can adapt weights with Hugging Face Transformers and Unsloth if you need to specialize the model.
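
As a starting point for the download-and-run steps above, here is a small sketch that fetches a GGUF file with the huggingface_hub package and hands it to llama.cpp's llama-cli binary. REPO_ID and FILENAME are hypothetical placeholders, so look up the real repository and file names on Hugging Face before running it; the sketch also assumes llama-cli is installed and on your PATH.

```python
import shutil
import subprocess

# Hypothetical placeholders: replace with a real repository and file name.
REPO_ID = "your-org/your-gemma-4-qat-gguf"
FILENAME = "model-q4_0.gguf"

def download_gguf(repo_id=REPO_ID, filename=FILENAME):
    """Fetch one file from the Hugging Face Hub and return its local path."""
    # Imported lazily so the rest of the script works without the package.
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub
    return hf_hub_download(repo_id=repo_id, filename=filename)

def llama_cli_command(model_path, prompt, max_tokens=128):
    """Build the argument list for llama.cpp's llama-cli binary."""
    return ["llama-cli", "-m", model_path, "-p", prompt, "-n", str(max_tokens)]

def run_demo():
    """Download the checkpoint and generate a short completion."""
    if shutil.which("llama-cli") is None:
        print("llama-cli not found on PATH; install llama.cpp first")
        return
    model_path = download_gguf()
    command = llama_cli_command(model_path, "Explain QAT in one sentence.")
    subprocess.run(command, check=True)

# Call run_demo() once REPO_ID and FILENAME point at a real checkpoint.
```

Ollama and LM Studio wrap the same kind of workflow behind their own interfaces, so this script is only one of several ways to try a checkpoint on desktop.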

Small practical tip: if you only need text chat on a phone, try Gemma 4 E2B in text-only mode without per-layer embeddings; it often needs less than 1 GB.

Limitations and points to consider

PTQ is still effective for many tasks; QAT improves quality but requires extra training.
The mobile format trades precision in some parts of the model for space; for high-stakes tasks that demand high accuracy, you might prefer keeping models at higher precision.
If you need multimodal capabilities (audio or vision), remember that those encoders increase the footprint, so budget memory for them.

The novelty here isn't just that the models are smaller: it's that they are now designed to work well on real hardware without you having to be an optimization expert. That opens the door to faster prototypes, offline products and more private experiences.

Original source

https://blog.google/innovation-and-ai/technology/developers-tools/quantization-aware-training-gemma-4

