Just weeks ago Google released Gemma 4, their most capable open model family, and it has already reached millions of downloads. Now the practical news: the new Multi-Token Prediction (MTP) drafters promise to speed up inference by up to 3x without losing quality or logical consistency in responses.
What problem are they solving?
Have you ever had an app take forever to reply right when you need it most? That happens because large models spend a lot of time shuttling parameters between memory and processor to generate a single token. The CPU or GPU ends up waiting, underused, and latency skyrockets—especially on consumer hardware.
Developers see this as the bottleneck for putting models into production or running powerful assistants locally. The consequence? Less fluid experiences and fewer apps that actually work well on the edge or on your laptop.
What is speculative decoding and what does MTP do?
Speculative decoding separates the proposal of tokens from their verification. In plain terms: a lightweight model (the drafter) suggests several tokens at once, using compute time that would otherwise sit idle, and the main model (for example Gemma 4 31B) verifies those suggestions in parallel.
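To make the propose-then-verify loop concrete, here is a minimal sketch of greedy speculative decoding. The two toy "models" below are stand-ins of my own invention (in practice the drafter is a small LM and the verifier a large one); the point is the control flow: the drafter proposes k tokens cheaply, the target checks them in one pass, and the longest correct prefix is kept.

```python
# Toy "models": each maps a context (list of ints) to a next token.
# Both functions are illustrative assumptions, not a real API.

def target_model(context):
    # Deterministic toy rule standing in for the large (verifier) model.
    return (sum(context) * 31 + 7) % 100

def drafter_model(context):
    # A cheaper model that agrees with the target most of the time.
    if len(context) % 5 == 4:  # occasional disagreement
        return (sum(context) * 17 + 3) % 100
    return target_model(context)

def speculative_decode(prompt, num_tokens, k=4):
    """Greedy speculative decoding: the drafter proposes k tokens,
    the target verifies them and keeps the longest correct prefix."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # 1) Drafter proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(out)
        for _ in range(k):
            t = drafter_model(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target verifies all k positions (in a real system this is
        #    a single parallel forward pass; here we simulate it).
        accepted, ctx = [], list(out)
        for t in draft:
            if target_model(ctx) == t:
                accepted.append(t)
                ctx.append(t)
            else:
                # First mismatch: keep the target's own token instead.
                accepted.append(target_model(ctx))
                break
        out.extend(accepted)
    return out[len(prompt):][:num_tokens]

def autoregressive(prompt, num_tokens):
    # Baseline: one token per step with the target model alone.
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(target_model(out))
    return out[len(prompt):]

prompt = [1, 2, 3]
# Greedy speculative decoding reproduces the target's output exactly;
# it only changes how many target passes were needed to get there.
assert speculative_decode(prompt, 12) == autoregressive(prompt, 12)
```

The key property this sketch demonstrates is the "without losing quality" claim: with greedy acceptance, the output is token-for-token identical to running the large model alone, so the drafter can only buy speed, never change the answer.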
