Google AI Releases Multi-Token Prediction (MTP) Drafters for Gemma 4: Delivers Up to 3x Faster Inference with No Quality Loss

Large language models are incredibly powerful, but let’s be honest: inference speed is still a huge headache for anyone trying to run them in production. Google has just released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family, a purpose-built speculative decoding architecture that can deliver up to 3x faster inference with no loss in output quality. The release comes a few weeks after Gemma 4 surpassed 60 million downloads and directly addresses one persistent pain point in deploying large language models: the memory bandwidth bottleneck that slows down token generation regardless of how powerful the hardware is.
Why Is LLM Inference Slow?
Most of today’s LLMs are autoregressive: they produce exactly one token at a time, one after the other. Generating each token requires loading billions of model parameters from VRAM (video RAM) into the compute units. This makes the process memory-bandwidth bound: the bottleneck is not the raw compute power of the GPU or accelerator, but the speed at which data can be transferred from memory to the compute units.
The result is a significant latency bottleneck: the compute units sit idle while the system shuttles data back and forth. What makes this especially inefficient is that the model spends exactly as much computation on a trivially predictable token, such as the word that completes “Actions speak louder than…”, as it does on a token that requires complex logical inference. Standard autoregressive decoding has no way to exploit how easy or hard the next token is to predict.
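To see why bandwidth, not compute, sets the ceiling, a back-of-the-envelope calculation helps. The sketch below uses illustrative assumptions (parameter count, precision, and bandwidth figures), not measured numbers from Google’s release:

```python
# Back-of-the-envelope estimate of the memory-bandwidth ceiling on
# autoregressive decoding. All numbers are illustrative assumptions.

params = 31e9            # parameters in a hypothetical 31B target model
bytes_per_param = 2      # bf16/fp16 weights
weight_bytes = params * bytes_per_param  # ~62 GB streamed per token

hbm_bandwidth = 2.0e12   # ~2 TB/s, roughly an A100-class accelerator

# Every decoded token must stream (nearly) all weights from memory once,
# so bandwidth alone caps the token rate, regardless of available FLOPs.
max_tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"Upper bound: ~{max_tokens_per_sec:.0f} tokens/sec")  # ~32 tokens/sec
```

Under these assumptions, no amount of extra compute pushes a single sequence past roughly 32 tokens per second; the only ways out are reading the weights less often, or generating more than one token per read, which is exactly what speculative decoding does.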
What is Speculative Decoding?
Speculative decoding is the basic mechanism on which Gemma 4’s MTP drafters are built. The strategy separates token generation from verification by pairing two models: a lightweight drafter and a heavyweight target model.
Here’s how the pipeline works. A small, fast draft model proposes several future tokens in quick succession, a “draft” sequence, in less time than it takes the large target model (e.g., Gemma 4 31B) to process even one token. The target model then verifies all of these proposed tokens in parallel in a single forward pass. Wherever the target model agrees with the draft, it accepts those tokens, and it generates one additional token in the same pass. This means an application can produce a full draft sequence plus one bonus token in roughly the same wall-clock time it would normally take to generate a single token.
Because the target model still performs the final verification step, the output is identical, token for token, to what the target model would produce on its own. There is no quality tradeoff: the acceleration is lossless.
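The draft-then-verify loop is easier to see in code. Below is a minimal greedy sketch of one speculative decoding round, not Google’s implementation; `draft_model` and `target_model` are hypothetical stand-ins for anything that returns next-token predictions, and real systems use a probabilistic acceptance rule rather than exact matching:

```python
def speculative_decode_step(target_model, draft_model, tokens, k=4):
    """One round of draft-then-verify speculative decoding (greedy sketch).

    Assumes hypothetical interfaces: draft_model.next_token(seq) returns one
    token; target_model.predict_all(seq) returns the target's prediction at
    every position of seq in a single forward pass.
    """
    n = len(tokens)

    # 1) The small drafter proposes k future tokens autoregressively.
    draft = list(tokens)
    for _ in range(k):
        draft.append(draft_model.next_token(draft))
    proposals = draft[n:]

    # 2) The big target model scores the entire draft in ONE forward pass,
    #    yielding its own next-token prediction at every position.
    target_preds = target_model.predict_all(draft)

    # 3) Accept the longest prefix where the target agrees with the draft.
    accepted = []
    for i, tok in enumerate(proposals):
        if target_preds[n + i - 1] == tok:
            accepted.append(tok)
        else:
            break

    # 4) The target's own prediction after the accepted prefix is a free
    #    bonus token, so even a fully rejected draft still makes progress.
    bonus = target_preds[n + len(accepted) - 1]
    return tokens + accepted + [bonus]
```

Note the worst case in step 4: even if every draft token is rejected, the round still yields one verified token, so speculative decoding is never slower than plain decoding by more than the (cheap) drafting overhead.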
MTP: What’s New in the Gemma 4 Drafter Architecture
Google has introduced several architectural improvements that make the Gemma 4 MTP drafters more efficient. The draft models reuse the target model’s activations and share its KV cache (key-value cache). The KV cache is a standard optimization in transformer inference that stores intermediate attention computations so they do not need to be recalculated at every step. By sharing this cache, the drafter avoids wasting time re-encoding context the large target model has already processed.
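Conceptually, cache sharing means the drafter’s forward pass starts from attention state the target has already paid for. A minimal sketch of the idea, with an entirely hypothetical interface (the class and method names below are illustrative, not Google’s API):

```python
# Hypothetical sketch of activation/KV-cache sharing between a target
# model and an attached MTP drafter head. Names are illustrative only.

class SharedCacheDrafter:
    def __init__(self, target_model, draft_head):
        self.target = target_model    # large model that does verification
        self.draft_head = draft_head  # small MTP head riding on the target

    def propose(self, tokens, k=4):
        # The target's most recent verification pass already produced
        # hidden activations and populated the KV cache for `tokens`...
        hidden, kv_cache = self.target.forward_with_cache(tokens)

        # ...so the drafter never re-encodes the prompt. It predicts k
        # future tokens directly from the shared state, which is what
        # makes drafting nearly free compared with a standalone model.
        return self.draft_head.predict_k(hidden, kv_cache, k=k)
```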
In addition, for the E2B and E4B edge models, the much smaller Gemma 4 variants designed to run on mobile and edge devices, Google applied an efficient approach to the embedding layer. This directly addresses a prominent bottleneck on edge hardware: the final logit computation, which maps internal model representations to vocabulary probabilities. The approach accelerates this step, improving end-to-end generation speed on memory-constrained devices.
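Why the logit step looms so large on small models: the output projection scales with vocabulary size, which stays large even when the rest of the network shrinks. The numbers below (hidden size, vocabulary size, layer count) are illustrative assumptions for a small edge model, not Gemma 4’s actual dimensions:

```python
# Rough per-token cost comparison for a small edge model.
# All dimensions below are illustrative assumptions.
hidden_size = 2048
vocab_size = 256_000   # large multilingual vocabularies are common
num_layers = 30
ffn_mult = 4

# Matmul work per token in the transformer stack (attention + MLP, roughly):
stack_macs = num_layers * (4 * hidden_size**2 + 2 * ffn_mult * hidden_size**2)

# Matmul work per token in the final logit projection (hidden -> vocab):
logit_macs = hidden_size * vocab_size

print(f"stack:  {stack_macs/1e6:.0f} M ops/token")
print(f"logits: {logit_macs/1e6:.0f} M ops/token "
      f"({100*logit_macs/stack_macs:.0f}% of the whole stack)")  # ~35%
```

Under these assumptions the single logit matmul costs about a third as much as all transformer layers combined, so speeding it up has an outsized effect on small models.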
On the hardware side, the Gemma 4 26B mixture-of-experts (MoE) model faces a unique challenge on Apple Silicon: expert routing is inefficient at batch size 1. Increasing the batch size to between 4 and 8, however, unlocks up to ~2.2x local speedup. Similar batch-size-dependent gains are observed on NVIDIA A100 hardware.
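The batch-size effect falls out of the same bandwidth arithmetic as before: expert weights streamed from memory are amortized across every sequence in the batch. A deliberately simplified toy model, with all figures assumed:

```python
# Toy model of why batch size 4-8 helps a memory-bound MoE decoder.
# Simplification: assumes the same expert weights are read once per step
# regardless of batch size (real routing varies per token). All numbers
# are illustrative assumptions.

expert_bytes = 4e9      # active expert weights streamed per decode step
bandwidth = 2.0e12      # ~2 TB/s memory bandwidth

for batch in (1, 4, 8):
    step_time = expert_bytes / bandwidth   # dominated by weight reads
    tokens_per_sec = batch / step_time     # one token per sequence per step
    print(f"batch={batch}: ~{tokens_per_sec:,.0f} tokens/sec")
```

In this idealized model throughput scales linearly with batch size; in practice routing overhead and divergent expert activation eat into that, which is consistent with the ~2.2x figure reported rather than a clean 4-8x.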
Key Takeaways
- Google has released Multi-Token Prediction (MTP) drafters for the Gemma 4 model family, delivering up to 3x faster inference with no degradation in output quality.
- MTP drafters use a speculative decoding architecture that pairs a lightweight draft model with a heavyweight target model: the drafter proposes several tokens at once, and the target model verifies them all in a single forward pass, breaking the one-token-at-a-time bottleneck.
- The draft models share the target model’s KV cache and activations, and for the E2B and E4B edge models, an efficient embedding-layer method addresses the final logit computation bottleneck, enabling fast generation even on memory-constrained devices.
- MTP drafters are now available under the Apache 2.0 license, with model weights on Hugging Face and Kaggle; a minimal usage sketch follows below.
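For readers who want to try speculative decoding with the released weights, the Hugging Face transformers library exposes it through the `assistant_model` argument to `generate`. The checkpoint names below are placeholders, not the actual Gemma 4 model IDs; substitute the real target and drafter IDs from Hugging Face or Kaggle:

```python
# Speculative (assisted) decoding with Hugging Face transformers.
# NOTE: "google/gemma-4-target" and "google/gemma-4-drafter" are
# placeholder IDs; swap in the real checkpoint names from the release.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-target")
target = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-target", device_map="auto")
drafter = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-drafter", device_map="auto")

inputs = tokenizer("Explain speculative decoding in one sentence.",
                   return_tensors="pt").to(target.device)

# Passing `assistant_model` turns on assisted generation: the drafter
# proposes tokens and the target verifies them, so the output matches
# what target-only decoding would produce.
out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```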
Check out the Model weights and Technical details.


