The Speedup Is in the Plumbing: Reading Gemma 4's MTP Drafters Carefully

The Bottleneck That Doesn't Show Up in FLOPs

Standard autoregressive decoding has a structural problem that FLOP counts hide: it spends almost all of its time moving weights, not computing with them. For a 31B-parameter dense model generating a single token on a consumer GPU, the device shuttles tens of gigabytes from VRAM to compute units to produce a few bytes of output. The compute units sit idle for most of that round trip.

This is memory-bandwidth-bound execution, and it is why upgrading from an A100 to a B200 buys less for small-batch inference than the jump in peak FLOPs suggests: at a given model size, decode speed tracks memory bandwidth, not compute. You are not waiting for math. You are waiting for memory.
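The bandwidth ceiling can be put into rough numbers. The sketch below assumes the simplest possible model: every weight is read from memory exactly once per generated token, and nothing else costs time. The 16-bit weight size and the 1000 GB/s bandwidth are illustrative assumptions, not measurements of any particular GPU.

```python
def decode_latency_floor_ms(params_billion: float,
                            bytes_per_param: float,
                            bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per token.

    Ignores KV-cache reads, kernel launch overhead, and compute time, so a
    real system will be slower than this floor.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Illustrative inputs only: 31B parameters, 2 bytes each, 1000 GB/s of bandwidth.
floor_ms = decode_latency_floor_ms(31, 2.0, 1000.0)
print(f"floor: {floor_ms:.0f} ms/token, at most {1000.0 / floor_ms:.1f} tokens/s")
```

Under this model, doubling memory bandwidth halves the floor, while adding compute changes nothing. That gap is the idle capacity speculative decoding spends.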

Speculative decoding — the technique behind Google's newly released Multi-Token Prediction (MTP) drafters for the Gemma 4 family — attacks this from an unusual angle. It stops trying to make memory faster and starts spending the idle compute that the bottleneck creates. The drafters ship under Apache 2.0 on Hugging Face and Kaggle, with day-one support across transformers, MLX, vLLM, SGLang and Ollama. Google reports up to a 3x speedup with zero output degradation.

The "up to" is doing a lot of work, and the interesting story is in why.

What MTP Drafters Actually Do

The architecture pairs a heavy target model (Gemma 4 31B, say) with a much smaller drafter that proposes candidate tokens cheaply. The target model verifies the whole run of candidates in a single forward pass, accepting the draft tokens that are consistent with its own output distribution and discarding everything from the first rejected token onward. The verification step itself generates one bonus token on top of any accepted draft tokens.

When the drafter is accurate, the system produces the full drafted sequence plus one token in roughly the latency budget of a single target-model token. When it isn't, throughput falls back toward standard decoding. Crucially, the output follows the target model's own distribution: under greedy decoding, every accepted token is exactly the token the target would have produced anyway. The technique is mathematically lossless, not a distillation tradeoff.
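The draft-then-verify loop is easier to see in code. What follows is a toy sketch under greedy decoding, not Gemma 4's implementation: the "models" are hypothetical plain functions from a token list to the next token, and the verification loop calls the target once per position where a real system would score all positions in one batched forward pass. With sampling instead of greedy decoding, the accept test becomes a rejection-sampling rule that preserves the target's distribution; that variant is not shown here.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # toy greedy "model": context -> next token


def speculative_step(target: NextToken, drafter: NextToken,
                     context: List[int], k: int) -> List[int]:
    """One draft-then-verify round. Returns the tokens added this round."""
    # 1. The drafter proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    # 2. The target checks each drafted position in order.
    accepted: List[int] = []
    for tok in draft:
        expected = target(context + accepted)
        if tok != expected:
            # First mismatch: keep the target's token, drop the rest of the draft.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft token matched, so the pass also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def generate(target: NextToken, drafter: NextToken, prompt: List[int],
             n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens; also return how many verification rounds it took."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[: len(prompt) + n_tokens], rounds


# Hypothetical toy models: the target counts upward mod 10; the drafter agrees
# except right after a 7, where it guesses wrong.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10


tokens, rounds = generate(toy_target, toy_drafter, [0], n_tokens=12, k=4)
print(tokens, "in", rounds, "verification rounds")
```

Every round emits at least one token chosen by the target, so the output matches plain greedy decoding with the target alone. Only the number of target passes changes.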

That much is the textbook explanation. It is also not where the 3x comes from.

The Speedup Lives in the Implementation, Not the Idea

Speculative decoding has been in the literature since 2022. Naive implementations exist, and they do not produce a 3x speedup. The Gemma 4 release earns its number through two specific engineering decisions buried in the announcement.

Activation and KV cache sharing. The drafter does not maintain its own key-value cache or recompute the prompt's context from scratch. It reuses the target model's activations and KV cache directly. This sounds like a minor optimization; in practice it is the optimization. Recomputing context for two models in lockstep would erase most of the speedup the small drafter buys you, because cache rebuilds are themselves memory-bandwidth-bound. By piggybacking on the target's state, the drafter's marginal cost stays close to the cost of its own forward pass and nothing more.

The side effect is tight coupling. You cannot swap the drafter for an unrelated small model, and you cannot independently upgrade either component without recalibrating against the other. Custom fine-tunes of Gemma 4 likely need their own custom-trained drafter to keep acceptance rates intact — the released drafter was trained against the base model's distribution, and the further your fine-tune moves that distribution, the more often the target will reject draft tokens.
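Drift between a fine-tune and the stock drafter is measurable before it shows up as lost throughput. The helper below is a minimal sketch that assumes your serving stack can log, for each verification pass, how many tokens were drafted and how many were accepted. The input format is hypothetical, and the names of such counters vary by framework.

```python
from typing import Iterable, Tuple


def acceptance_rate(rounds: Iterable[Tuple[int, int]]) -> float:
    """Fraction of drafted tokens the target accepted.

    rounds: one (tokens_drafted, tokens_accepted) pair per verification pass.
    """
    drafted = 0
    accepted = 0
    for n_drafted, n_accepted in rounds:
        if not 0 <= n_accepted <= n_drafted:
            raise ValueError("accepted count must be between 0 and drafted count")
        drafted += n_drafted
        accepted += n_accepted
    return accepted / drafted if drafted else 0.0


# Made-up log for illustration: two passes of four drafted tokens each.
print(acceptance_rate([(4, 4), (4, 2)]))
```

Running the same prompt set through the base model and the fine-tune, each paired with the released drafter, gives a direct read on how far the fine-tune has moved.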

Embedder clustering on the edge variants. For the E2B and E4B edge models, the final logit computation — projecting the hidden state into the full vocabulary — becomes the dominant cost per token, not the transformer stack. Google addresses this with a clustering scheme in the embedder that lets the drafter skip the full logit projection. This is the kind of optimization that only matters at the small-model end of the curve, where the embedder-to-transformer cost ratio inverts. Without it, MTP on a 2B-parameter on-device model would be a much less interesting product.

Worth noting: the E4B drafter's own model card on Hugging Face claims "up to 2x" speedup, not the 3x headline figure from the blog post. The 3x is a family-wide upper bound, almost certainly anchored on the larger dense or MoE variants under favorable batch conditions. The edge models, even with the embedder optimization, ship with a lower ceiling.

These are the load-bearing pieces. The high-level "speculative decoding" framing is the wrapper around them.

Where the Acceleration Disappears

Two failure modes are worth understanding before deploying MTP.

The first is the acceptance rate problem. Speedup depends almost entirely on the fraction of drafter-proposed tokens the target accepts, and this is highly content-dependent. Predictable text — boilerplate code, formulaic prose, common phrase completions — yields high acceptance and large speedups. Genuinely novel reasoning, mathematical derivation, or long-tail content does the opposite: the small drafter diverges from the target's distribution exactly where the target is doing its most valuable work, acceptance collapses, and the speedup shrinks toward 1x.

The uncomfortable implication: the 3x figure is most achievable on workloads where inference speed matters least, and least achievable on workloads where it matters most. A coding assistant autocompleting "for i in range(" will see the upper bound. An agent reasoning through a tool-use chain on a problem it has not seen before will not.
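The dependence on acceptance rate can be made concrete with the standard back-of-envelope model from the speculative decoding literature. It assumes each drafted token is accepted independently with probability alpha, which real text does not strictly obey, and it takes the drafter's cost as a fixed fraction of a target pass. The function names and the 0.05 cost ratio are illustrative assumptions, not figures from the Gemma 4 release.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with k drafted tokens.

    Assumes each draft token is accepted independently with probability alpha.
    The count includes the correction or bonus token from the target.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Rough speedup over plain decoding.

    draft_cost_ratio: cost of one drafter step relative to one target pass
    (a hypothetical parameter; measure it on your own hardware).
    """
    cost_per_round = k * draft_cost_ratio + 1.0
    return expected_tokens_per_pass(alpha, k) / cost_per_round


for alpha in (0.9, 0.7, 0.5, 0.3):
    print(f"alpha={alpha:.1f}  speedup={estimated_speedup(alpha, 4, 0.05):.2f}x")
```

Under this model the speedup falls off steeply as alpha drops, and once the drafter's own cost is counted, very low acceptance can land slightly below 1x instead of merely approaching it.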

The second is MoE routing overhead on small batches. Google's own numbers note that the 26B mixture-of-experts variant on Apple Silicon barely benefits from MTP at batch size 1, because expert-routing overhead swamps the savings. The roughly 2.2x speedup only appears at batch sizes of 4 to 8. The same pattern shows up on Nvidia A100.

For a local chatbot running on a Mac — one user, one request at a time — the practical gain on the 26B MoE model lands well below the headline 3x. This is the scenario MTP is most often marketed for and the one where its limits show up most directly. The dense Gemma 4 variants behave better at batch 1, but the MoE-vs-dense decision now has a new axis to weigh: MoE wins on quality-per-parameter, dense wins on single-request latency under MTP.

"Zero Quality Degradation" Is the Wrong Frame

The announcement emphasizes that speculative decoding produces outputs identical to standard decoding. This is true and worth saying — distillation-based fast models have historically made the opposite tradeoff. MTP does not.

But framing the question as quality vs. speed obscures a more practical concern: latency variance. Because acceptance rates are content-dependent, MTP introduces per-request latency variability that standard decoding does not have. Two requests of identical token length can take meaningfully different wall-clock times depending on how predictable their content turned out to be.

For UX-sensitive applications — chat, voice, agentic loops where each step blocks the next — this variability can be more disruptive than a higher but consistent baseline. The pattern shows up reliably with dynamic-throughput techniques: median latency improves on the dashboard, p99 latency gets worse, and the user-visible experience tracks p99.

If you are evaluating MTP for production, measure p50, p95 and p99 separately on a representative traffic sample. The median will look great. The tail is where the decision lives.
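A minimal way to compute those three numbers from a list of per-request latencies, using only the Python standard library. The sample values are synthetic and exist only to show the call. The "inclusive" interpolation method is one reasonable choice among several.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return p50, p95 and p99 of a latency sample (needs at least 2 points)."""
    # quantiles(n=100) returns the 99 cut points between percentiles 1..99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic sample: mostly fast requests plus a few slow outliers.
sample_ms = [120.0] * 90 + [180.0] * 7 + [400.0, 450.0, 900.0]
print(latency_percentiles(sample_ms))
```

Run it once on traffic served with MTP and once without, on the same requests, and compare the three values side by side instead of a single average.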

Practical Conclusions

  1. Enable MTP by default for completion-style workloads — code autocomplete, templated generation, structured extraction. These sit at the upper end of the acceptance-rate curve.
  2. Benchmark carefully for reasoning and agentic workloads. Acceptance rates may be low enough that the architectural complexity is not worth it, especially when tool calls already fragment generations.
  3. If you fine-tune Gemma 4, plan for a custom drafter. The off-the-shelf one will drift out of alignment with your fine-tune's distribution and acceptance rates will decay.
  4. Batch where you can, especially on MoE. Single-request local inference on Apple Silicon will see a fraction of the headline number on the 26B variant.
  5. Track tail latency, not just median. MTP shifts the latency distribution, not just its center.

The release is genuinely useful. It is also a reminder that "3x faster" is a function of the workload and the hardware, not a property of the model.