SPEED-Bench: unified benchmark for speculative decoding | Keryc
SPEED-Bench arrives to put order in how we measure speculation in language models. How well does the idea of using a lightweight model to propose several tokens ahead — then letting the target model verify them in parallel — actually work? It depends a lot on the data, the serving pattern, and the system. SPEED-Bench proposes a standard to measure it realistically.
What is SPEED-Bench
Speculative decoding (SD) uses a lightweight draft model to speculate multiple future tokens, which the target model then verifies in parallel. The neat part: you can improve throughput without changing the final model's exact output distribution.
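The draft-then-verify loop can be sketched in a few lines. This is a minimal greedy version with toy next-token functions standing in for real models (`draft` and `target` are hypothetical stand-ins, not SPEED-Bench code); a real engine would batch the verification into a single forward pass.

```python
def draft(prefix):
    # Toy drafter: cycles through a fixed pattern of 5 token ids.
    return (prefix[-1] + 1) % 5

def target(prefix):
    # Toy target: agrees with the drafter except after token 3.
    return (prefix[-1] + 1) % 5 if prefix[-1] != 3 else 0

def speculative_step(prefix, k=4):
    """Draft k tokens, then verify them against the target.

    Returns the accepted tokens plus one target-supplied token, so the
    output matches what the target would have generated on its own.
    """
    # 1. Drafter proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. Target checks every position (one parallel pass in practice).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            break
    else:
        # All k accepted: the verify pass yields one bonus token.
        accepted.append(target(ctx))
    return accepted
```

With the toy models above, `speculative_step([0], k=4)` accepts three drafted tokens and then emits the target's correction, which is exactly the acceptance-length behavior SPEED-Bench measures.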
SPEED-Bench is a unified benchmark designed to evaluate SD under conditions close to production. It combines datasets with wide semantic variety, realistic context-length buckets, and a measurement framework that integrates production-grade inference engines like TensorRT-LLM, vLLM and SGLang.
Why previous benchmarks fail
Small prompt sets and little semantic diversity.
Short input sequences and batch size = 1, which don’t reflect real loads.
Use of high-level inference stacks that hide system details.
In practice, draft quality and speedup depend on: semantic domain, text entropy, context length, batch size, and whether the load is memory-bound or compute-bound. SPEED-Bench is designed to capture those dependencies.
How SPEED-Bench is designed
SPEED-Bench combines three main elements:
A Qualitative split focused on semantic diversity to measure draft quality: acceptance rates (AR) and acceptance lengths (AL).
A Throughput split to measure system-level speedups across different input sequence lengths (ISL) and high concurrency.
A measurement framework that normalizes tokenization and formatting, and integrates with production engines to get comparable metrics.
Qualitative split
They aggregated examples from 18 public sources and organized them into 11 categories: Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning and QA.
Each category has 80 samples, 880 prompts in total.
To maximize semantic diversity they use openai/text-embedding-3-small embeddings and a selection algorithm that minimizes average cosine similarity between pairs within each category. That reduces redundancy and exposes domain-dependent behaviors.
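The paper doesn't spell out the exact selection algorithm, but a common way to minimize average pairwise similarity is greedy farthest-point selection: repeatedly add the candidate whose maximum cosine similarity to the already-chosen set is smallest. A hedged sketch (the seeding and tie-breaking here are assumptions):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_diverse(embeddings, k):
    """Greedy farthest-point selection over embedding vectors:
    grow the chosen set with the candidate least similar to it."""
    chosen = [0]  # seed with the first item (arbitrary choice)
    while len(chosen) < k:
        best, best_score = None, float("inf")
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            # Worst-case redundancy of candidate i w.r.t. the chosen set.
            score = max(cosine(embeddings[i], embeddings[j]) for j in chosen)
            if score < best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```

Run on near-duplicate and orthogonal vectors, it skips the near-duplicate first, which is the redundancy-reduction effect the benchmark is after.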
Throughput split
ISL buckets from 1k up to 32k tokens, to reflect long-context applications like code assistants and RAG.
For each bucket: 1,536 prompts (512 per difficulty level: low, mixed, high entropy).
Truncation and padding are controlled to keep prefill costs deterministic and avoid using random tokens.
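One way to get deterministic prefill cost without random tokens is to force every prompt to exactly the bucket's ISL: truncate long prompts, and pad short ones with real tokens drawn from other documents. The padding strategy below (prepending filler from a pool) is an illustrative assumption, not the paper's exact recipe:

```python
def fit_to_isl(token_ids, isl, filler_pool):
    """Force a prompt to exactly `isl` tokens using only real tokens:
    truncate if too long, otherwise prepend tokens from `filler_pool`
    (e.g. text from other documents) so prefill cost is deterministic
    and the content stays natural language rather than noise."""
    if len(token_ids) >= isl:
        return token_ids[:isl]
    missing = isl - len(token_ids)
    # Cycle through the pool if it is shorter than the gap.
    filler = (filler_pool * (missing // len(filler_pool) + 1))[:missing]
    return filler + token_ids
```

Every prompt in a bucket then costs the same number of prefill tokens, so throughput differences come from decoding, not from uneven inputs.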
Measurement framework
Tokenization and formatting are done externally: inference engines receive pre-tokenized sequences, removing silent differences between engines.
It captures fine-grained measures from streaming responses: step latency, acceptance behavior, Output TPS and User TPS.
Integration with TensorRT-LLM, vLLM and SGLang to compare under real conditions.
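From per-request streaming timestamps, the headline metrics fall out of simple arithmetic. The definitions below are common conventions and an assumption on my part (the paper may define them slightly differently): Output TPS is total generated tokens over the run's wall time, User TPS is the per-request decode rate averaged across requests, and TTFT is time to first token.

```python
def throughput_metrics(requests):
    """requests: dicts with 'start', 'first_token', 'end' (seconds)
    and 'n_tokens' (generated tokens), one per streamed request."""
    wall = max(r["end"] for r in requests) - min(r["start"] for r in requests)
    # Output TPS: aggregate generation rate of the whole system.
    output_tps = sum(r["n_tokens"] for r in requests) / wall
    # User TPS: decode speed as one user experiences it, averaged.
    user_tps = sum(
        r["n_tokens"] / (r["end"] - r["first_token"]) for r in requests
    ) / len(requests)
    # TTFT: mean time to first token (prefill + queueing).
    ttft = sum(r["first_token"] - r["start"] for r in requests) / len(requests)
    return {"output_tps": output_tps, "user_tps": user_tps, "ttft": ttft}
```

The split between Output TPS and User TPS matters for SD: high concurrency can keep aggregate throughput high while each individual stream gets slower.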
Practical example and tool output
Example run (Llama 3.3 70B Instruct as target, EAGLE3 as drafter, TensorRT-LLM, batch 32):
Average acceptance rate per category (example): overall average AR = 2.4511
Output TPS: 2518.15 (Output TPS/gpu = 314.77)
E2E Request Time (mean): 4.7313 s
TTFT Time (mean): 0.1217 s
They also compare AL and speedups by domain and model. Summary of the table shown in the paper:
| Domain       | Llama 3.3 70B (N-Gram) | GPT OSS 120B (EAGLE3) | Qwen3-Next (MTP) |
|--------------|------------------------|-----------------------|------------------|
| Coding       | 1.54                   | 2.46                  | 3.34             |
| Math         | 1.43                   | 2.46                  | 3.13             |
| Roleplay     | 1.15                   | 1.87                  | 2.09             |
| Writing      | 1.33                   | 1.98                  | 2.46             |
| Mean AL      | 1.41                   | 2.25                  | 2.81             |
| Mean Speedup | 0.88x                  | 1.34x                 | 1.20x            |
These results make a few things clear:
AL and speedups are highly domain-dependent: Coding and Math (low entropy) allow longer acceptances, while Roleplay and Writing (high entropy) are much harder to speculate.
Lightweight methods like N-Gram can even cause slowdowns at moderate batch sizes.
Native MTP (co-trained with the backbone) tends to achieve larger ALs than post-trained drafters like EAGLE3.
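Why can AL > 1 still produce a slowdown, as with N-Gram? A first-order cost model (my assumption, not from the paper) makes it concrete: one SD step emits AL tokens on average but costs drafting plus a verification pass that, at moderate batch sizes, is wider and more expensive than a plain decode step.

```python
def sd_speedup(al, draft_cost, verify_cost):
    """First-order model: one SD step emits `al` tokens on average and
    costs draft_cost + verify_cost, both measured in units of one
    plain (non-speculative) decode step."""
    return al / (draft_cost + verify_cost)

# Illustrative numbers (assumed, chosen to show the mechanism):
# N-Gram-like: drafting is nearly free, but AL is low and the wide
# verification pass is compute-bound at batch 32.
slow = sd_speedup(al=1.41, draft_cost=0.05, verify_cost=1.55)
# EAGLE3-like: drafting costs more, but the higher AL more than pays
# for the overhead.
fast = sd_speedup(al=2.25, draft_cost=0.25, verify_cost=1.35)
```

With these assumed costs the low-AL setup lands below 1x (a net slowdown) while the high-AL one clears it, mirroring the qualitative pattern in the table above.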
Important technical findings
SD evaluation must look at two things: draft quality by domain and system-level performance under real conditions.
Vocabulary pruning in drafters (for example EAGLE3) reduces cost but can degrade AL on the long tail of inputs, especially in Multilingual, RAG and Summarization.
Using random tokens to measure throughput is dangerous: it produces two main failures that distort results for SD and MoE:
Trivial Response: the model detects noise and replies with clarifications, inflating AL.
Topic Latching: the model latches onto keywords and generates coherent but misleading outputs, reducing AL.
Quantitative example: measuring with random tokens can overestimate throughput by about 23% when SD is active.
Practical recommendations if you work with SD
Evaluate both: acceptance length/acceptance rate and real throughput across different ISL and batch sizes.
Avoid using random tokens as a proxy for prompt load.
Use datasets with semantic diversity and avoid categories with too few samples.
Prefer measurements that control tokenization and prompt formatting externally to compare inference engines.
Keep in mind that vocabulary or head-layer optimizations can help in closed domains but harm generality.
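Putting the first recommendation into practice means logging, for every verification step, how many tokens were proposed and how many were accepted. A hedged sketch of the bookkeeping (the +1 for the target's correction/bonus token is a common convention I'm assuming here):

```python
def acceptance_stats(steps):
    """steps: list of (proposed, accepted) counts per verification step.

    AR: fraction of drafted tokens the target accepted, over all steps.
    AL: mean tokens emitted per step; counts the target's correction or
        bonus token (+1), a common convention assumed here.
    """
    proposed = sum(p for p, _ in steps)
    accepted = sum(a for _, a in steps)
    ar = accepted / proposed
    al = sum(a + 1 for _, a in steps) / len(steps)
    return ar, al
```

Aggregating these per category (Coding, Math, Roleplay, ...) rather than globally is what exposes the domain dependence the benchmark is built around.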
SPEED-Bench is not just a dataset: it’s an ecosystem with datasets, a measurement framework and scripts ready to plug into SD pipelines. It serves both researchers who want to compare algorithms and production teams that need to understand real trade-offs between latency, throughput and output fidelity.
Summary: SPEED-Bench is a technical, practical benchmark to evaluate speculative decoding (SD) under production-like conditions, combining semantic diversity and realistic throughput measurements. It includes two splits (Qualitative and Throughput) and a framework that normalizes tokenization and integrates with production inference engines.