Gemma 4: open multimodal AI that runs on-device
Gemma 4 arrives as a complete package: an open model under the Apache 2.0 license, multimodal (text, image, audio, video), available in sizes that scale from your laptop up to a server, and with results that in many cases are excellent without any fine-tuning.
What's new with Gemma 4
Gemma 4 combines proven ideas and focused improvements to offer a practical, efficient model family:
Apache 2.0 license and open checkpoints, free to use and deploy.
Multimodal: text + image + video; the smaller variants also handle audio.
Designed to run on many infrastructures: Transformers, llama.cpp, MLX, WebGPU, Rust, ONNX and more.
Four base sizes, all with a base checkpoint and instruction-tuned checkpoint: E2B (2.3B effective), E4B (4.5B effective), 31B dense and 26B A4B (MoE with 4B active).
Long context: 128k for E2B/E4B and 256k for the large models.
Quick takeaway? Models you can try today, even on-device, and designed to be efficient when quantized.
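To get a feel for what "runs on-device" means, weight memory scales roughly with parameter count times bits per weight. A quick back-of-envelope using the sizes above (this sketch ignores KV cache and activation overhead, which are real costs on top):

```python
def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal GB: params * bits / 8 bytes."""
    return params * bits_per_weight / 8 / 1e9

# Parameter counts from the release notes. Note that for the 26B A4B MoE,
# all 26B weights must be stored even though only 4B are active per token
# (active parameters drive compute, not storage).
models = {"E2B": 2.3e9, "E4B": 4.5e9, "26B A4B": 26e9, "31B dense": 31e9}
for name, params in models.items():
    print(f"{name}: ~{weight_memory_gb(params, 4):.1f} GB at 4-bit")
```

At 4-bit quantization, E2B fits comfortably in a couple of GB, which is why the smaller variants are laptop-friendly.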
Architecture and technical details
Gemma 4 brings together known components tuned for multimodality and long context. Here are the essentials, straight to the point:
Mixed attention: alternating layers of local window (sliding-window) and full-context global attention. Typical local window: 512 tokens on smaller models, 1024 on large ones.
Dual RoPE: standard RoPE in window layers and proportional RoPE in global layers to extend context in a stable way.
Per-Layer Embeddings (PLE): a second embedding table that creates a reduced vector per token for each layer. This lets each layer receive token-specific information when it needs it, instead of forcing the initial embedding to hold everything. It’s a low-parameter cost specialization per layer.
Shared KV Cache: the last N layers reuse keys and values from a previous layer of the same type, saving memory and compute during long-context inference.
Vision encoder: learned 2D positions and multidimensional RoPE; it preserves aspect ratios and supports several visual token budgets (70, 140, 280, 560, 1120) to trade latency against quality.
Audio encoder: USM-style conformer, sharing the same base as Gemma-3n for compatibility.
These pieces make Gemma 4 ideal for quantization and for running with very long contexts without breaking the user experience.
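The mixed-attention pattern above can be sketched with simple boolean masks. The window sizes (512/1024) come from the post, but the 3-local-to-1-global alternation below is an assumption for illustration; the real layer ratio is not stated:

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Full-context global attention: token i sees every token j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, window: int) -> np.ndarray:
    # Local attention: causal, but only the last `window` tokens are visible.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

def layer_masks(n_layers: int, seq_len: int, window: int = 512) -> list:
    # Hypothetical pattern: every 4th layer is global, the rest are local.
    return [causal_mask(seq_len) if (l + 1) % 4 == 0
            else sliding_window_mask(seq_len, window)
            for l in range(n_layers)]
```

The point of the alternation: local layers keep compute and KV memory bounded by the window, while the occasional global layer propagates information across the full 128k-256k context.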
Performance and metrics
LMArena (text): 31B dense ≈ 1452; 26B MoE (4B active) ≈ 1441. That places the models in a league similar to GLM-5 or Kimi K2.5, but with a much lower effective parameter count.
In informal tests, multimodal operation (image/audio + text) approaches the quality of pure-text for practical tasks like captioning, OCR and detection.
Important: the numbers come from the release report and are estimates for text context; interpretation always needs nuance depending on the task.
Multimodal capabilities and practical examples
Gemma 4 works well out-of-the-box for real tasks:
OCR and structured extraction (returns JSON with bounding boxes without requiring rigid prompt instructions).
Detection and pointing in GUI interfaces (natively generates coordinates relative to the image).
Captioning and description of complex scenes.
Transcription and description of spoken audio (it is not trained to interpret music or non-verbal sounds).
For more advanced cases (video with audio, tool-calling, or fine-tuning) Hugging Face publishes examples using AutoModelForMultimodalLM, AutoProcessor and the integrated chat template.
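The JSON-with-bounding-boxes output mentioned above is easy to consume downstream. A minimal sketch, where the schema (a `label` plus a `[x0, y0, x1, y1]` box) is an assumed example and not a documented Gemma 4 format:

```python
import json
from dataclasses import dataclass

@dataclass
class Box:
    label: str
    x0: float
    y0: float
    x1: float
    y1: float

def parse_boxes(model_output: str) -> list[Box]:
    """Parse a JSON list of detections into typed boxes.
    Assumes each entry looks like {"label": ..., "box": [x0, y0, x1, y1]}."""
    return [Box(d["label"], *d["box"]) for d in json.loads(model_output)]

sample = '[{"label": "button", "box": [0.12, 0.30, 0.25, 0.38]}]'
boxes = parse_boxes(sample)
print(boxes[0].label)  # button
```

In practice you would validate the model's JSON before parsing, since generation can occasionally produce malformed output.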
Deployment: where and how to run it
Gemma 4 has day-0 support on many infrastructures:
Transformers (with bitsandbytes, PEFT, TRL).
llama.cpp / llama-server and GGUF formats for local apps and agents like Pi, openclaw or hermes.
transformers.js and WebGPU for browser execution.
ONNX and checkpoints for hardware backends.
mistral.rs for a Rust engine with agentic features.
MLX for optimized multimodal pipelines.
Practical tips:
To reduce KV-cache memory on Apple Silicon, use TurboQuant (example: --kv-bits 3.5 --kv-quant-scheme turboquant).
The E2B/E4B variants are ideal for prototypes on a laptop or a Raspberry Pi; reserve 26B A4B or 31B for servers or large GPUs.
Quick install of llama.cpp server (example):
# macOS
brew install llama.cpp
# Windows
winget install llama.cpp
# start server with a GGUF
llama-server -hf ggml-org/gemma-4-E2B-it-GGUF:Q4_K_M
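Once the server is up, llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint (by default on localhost:8080; the port is configurable). A minimal sketch of the request payload you would POST to it:

```python
import json

def build_chat_request(prompt: str, model: str = "gemma-4-E2B-it") -> dict:
    """Build an OpenAI-style chat payload for llama-server's
    /v1/chat/completions endpoint. Model name here is illustrative;
    llama-server serves whatever GGUF it was launched with."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

payload = build_chat_request("Describe this screenshot in one sentence.")
print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client library will also work by pointing its base URL at the local server.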
Fine-tuning, training and demos
Gemma 4 is built to be extended:
TRL now supports multimodal tool responses during training, opening the door to training agents that receive images from the environment in real time.
Practical example: training with CARLA where the model learns to drive by watching the camera and acting; after training the agent reliably avoids pedestrians and changes lanes.
Integration with Vertex AI: Hugging Face documents how to build containers and launch training jobs with H100 GPUs.
Short snippet to launch a job on Vertex AI (skeleton):
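A hedged sketch of what that skeleton could look like with the google-cloud-aiplatform SDK; project, region, container image, and machine shape below are all placeholders you would replace with your own:

```python
from google.cloud import aiplatform

# Placeholder project/region; requires valid GCP credentials to actually run.
aiplatform.init(project="my-project", location="us-central1")

# Training logic lives inside the container image you built beforehand.
job = aiplatform.CustomContainerTrainingJob(
    display_name="gemma4-finetune",
    container_uri="us-docker.pkg.dev/my-project/trainers/gemma4-trl:latest",
)

job.run(
    replica_count=1,
    machine_type="a3-highgpu-8g",          # H100 machine family
    accelerator_type="NVIDIA_H100_80GB",
    accelerator_count=8,
)
```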
If you prefer a UI to experiment, Unsloth Studio lets you load models from the hub and try local fine-tuning or in Colab.
Practical reflection: what can you do today?
If you're a developer or researcher, Gemma 4 lets you iterate fast: local tests, quantization and deploying multimodal agents without relying on proprietary APIs. If you're a product maker, try the smaller variants for app features that need vision and speech. And if you're just curious, try the demos in-browser or spin up a local server and see how well it understands your own images and audio.
Limitations? Yes. We don't know the exact data mix or full training recipe, and interpreting musical audio or non-verbal sounds isn't guaranteed. Always validate with your dataset and test robustness in production.
Gemma 4 is a strong demonstration that powerful multimodal AI can be open and usable across many environments. Ready to try it on your laptop or project? Share your results with the community.