SPEED-Bench: unified benchmark for speculative decoding | Keryc
SPEED-Bench arrives to put order in how we measure speculation in language models. How well does the idea of using a lightweight model to propose several tokens ahead — then letting the target model verify them in parallel — actually work? It depends a lot on the data, the serving pattern, and the system. SPEED-Bench proposes a standard to measure it realistically.
What is SPEED-Bench
Speculative decoding (SD) uses a lightweight draft model to speculate multiple future tokens, which the target model then verifies in parallel. The neat part: you can improve throughput without changing the final model's exact output distribution.
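The draft-then-verify loop can be sketched in a few lines. This is a minimal greedy version with toy next-token functions standing in for real models (`draft` and `target` are hypothetical stand-ins, not SPEED-Bench code); a real engine would batch the verification into a single forward pass.

```python
def draft(prefix):
    # Toy drafter: cycles through a fixed pattern of 5 token ids.
    return (prefix[-1] + 1) % 5

def target(prefix):
    # Toy target: agrees with the drafter except after token 3.
    return (prefix[-1] + 1) % 5 if prefix[-1] != 3 else 0

def speculative_step(prefix, k=4):
    """Draft k tokens, then verify them against the target.

    Returns the accepted tokens plus one target-supplied token, so the
    output matches what the target would have generated on its own.
    """
    # 1. Drafter proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. Target checks every position (one parallel pass in practice).
    accepted, ctx = [], list(prefix)
    for t in proposal:
        expected = target(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            break
    else:
        # All k accepted: the verify pass yields one bonus token.
        accepted.append(target(ctx))
    return accepted
```

With the toy models above, `speculative_step([0], k=4)` accepts three drafted tokens and then emits the target's correction, which is exactly the acceptance-length behavior SPEED-Bench measures.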
SPEED-Bench is a unified benchmark designed to evaluate SD under conditions close to production. It combines datasets with wide semantic variety, realistic context-length buckets, and a measurement framework that integrates production-grade inference engines like TensorRT-LLM, vLLM and SGLang.
Why previous benchmarks fail
Small prompt sets and little semantic diversity.
Short input sequences and batch size = 1, which don’t reflect real loads.
Use of high-level inference stacks that hide system details.
In practice, draft quality and speedup depend on: semantic domain, text entropy, context length, batch size, and whether the load is memory-bound or compute-bound. SPEED-Bench is designed to capture those dependencies.
How SPEED-Bench is designed
SPEED-Bench combines three main elements:
A Qualitative split focused on semantic diversity to measure draft quality: acceptance rates (AR) and acceptance lengths (AL).
A Throughput split to measure system-level speedups across different input sequence lengths (ISL) and high concurrency.
A measurement framework that normalizes tokenization and formatting, and integrates with production engines to get comparable metrics.
Qualitative split
They aggregated examples from 18 public sources and organized them into 11 categories: Coding, Math, Humanities, STEM, Writing, Summarization, Roleplay, RAG, Multilingual, Reasoning and QA.
Each category has 80 samples, 880 prompts in total.
To maximize semantic diversity they use openai/text-embedding-3-small embeddings and a selection algorithm that minimizes average cosine similarity between pairs within each category. That reduces redundancy and exposes domain-dependent behaviors.
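The paper doesn't spell out the exact selection algorithm, but a common way to minimize average pairwise similarity is greedy farthest-point selection: repeatedly add the candidate whose maximum cosine similarity to the already-chosen set is smallest. A hedged sketch (the seeding and tie-breaking here are assumptions):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_diverse(embeddings, k):
    """Greedy farthest-point selection over embedding vectors:
    grow the chosen set with the candidate least similar to it."""
    chosen = [0]  # seed with the first item (arbitrary choice)
    while len(chosen) < k:
        best, best_score = None, float("inf")
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            # Worst-case redundancy of candidate i w.r.t. the chosen set.
            score = max(cosine(embeddings[i], embeddings[j]) for j in chosen)
            if score < best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen
```

Run on near-duplicate and orthogonal vectors, it skips the near-duplicate first, which is the redundancy-reduction effect the benchmark is after.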
Throughput split
ISL buckets from 1k up to 32k tokens, to reflect long-context applications like code assistants and RAG.
For each bucket: 1,536 prompts (512 per difficulty level: low, mixed, high entropy).
Truncation and padding are controlled to keep prefill costs deterministic and avoid using random tokens.
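One way to get deterministic prefill cost without random tokens is to force every prompt to exactly the bucket's ISL: truncate long prompts, and pad short ones with real tokens drawn from other documents. The padding strategy below (prepending filler from a pool) is an illustrative assumption, not the paper's exact recipe:

```python
def fit_to_isl(token_ids, isl, filler_pool):
    """Force a prompt to exactly `isl` tokens using only real tokens:
    truncate if too long, otherwise prepend tokens from `filler_pool`
    (e.g. text from other documents) so prefill cost is deterministic
    and the content stays natural language rather than noise."""
    if len(token_ids) >= isl:
        return token_ids[:isl]
    missing = isl - len(token_ids)
    # Cycle through the pool if it is shorter than the gap.
    filler = (filler_pool * (missing // len(filler_pool) + 1))[:missing]
    return filler + token_ids
```

Every prompt in a bucket then costs the same number of prefill tokens, so throughput differences come from decoding, not from uneven inputs.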
Measurement framework
Tokenization and formatting are done externally: inference engines receive pre-tokenized sequences, removing silent differences between engines.
It captures fine-grained measures from streaming responses: step latency, acceptance behavior, Output TPS and User TPS.
Integration with TensorRT-LLM, vLLM and SGLang to compare under real conditions.
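From per-request streaming timestamps, the headline metrics fall out of simple arithmetic. The definitions below are common conventions and an assumption on my part (the paper may define them slightly differently): Output TPS is total generated tokens over the run's wall time, User TPS is the per-request decode rate averaged across requests, and TTFT is time to first token.

```python
def throughput_metrics(requests):
    """requests: dicts with 'start', 'first_token', 'end' (seconds)
    and 'n_tokens' (generated tokens), one per streamed request."""
    wall = max(r["end"] for r in requests) - min(r["start"] for r in requests)
    # Output TPS: aggregate generation rate of the whole system.
    output_tps = sum(r["n_tokens"] for r in requests) / wall
    # User TPS: decode speed as one user experiences it, averaged.
    user_tps = sum(
        r["n_tokens"] / (r["end"] - r["first_token"]) for r in requests
    ) / len(requests)
    # TTFT: mean time to first token (prefill + queueing).
    ttft = sum(r["first_token"] - r["start"] for r in requests) / len(requests)
    return {"output_tps": output_tps, "user_tps": user_tps, "ttft": ttft}
```

The split between Output TPS and User TPS matters for SD: high concurrency can keep aggregate throughput high while each individual stream gets slower.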
Practical example and tool output
Example run (Llama 3.3 70B Instruct as target, EAGLE3 as drafter, TensorRT-LLM, batch 32):
Average acceptance rate per category (example): overall average AR = 2.4511
Output TPS: 2518.15 (Output TPS/gpu = 314.77)
E2E Request Time (mean): 4.7313 s
TTFT Time (mean): 0.1217 s
They also compare AL and speedups by domain and model. Summary of the table shown in the paper:
| Domain       | Llama 3.3 70B (N-Gram) | GPT OSS 120B (EAGLE3) | Qwen3-Next (MTP) |
|--------------|------------------------|-----------------------|------------------|
| Coding       | 1.54                   | 2.46                  | 3.34             |
| Math         | 1.43                   | 2.46                  | 3.13             |
| Roleplay     | 1.15                   | 1.87                  | 2.09             |
| Writing      | 1.33                   | 1.98                  | 2.46             |
| Mean AL      | 1.41                   | 2.25                  | 2.81             |
| Mean Speedup | 0.88x                  | 1.34x                 | 1.20x            |
These results make a few things clear:
AL and speedups are highly domain-dependent: Coding and Math (low entropy) allow longer acceptances, while Roleplay and Writing (high entropy) are much harder to speculate.
Lightweight methods like N-Gram can even cause slowdowns at moderate batch sizes.
Native MTP (co-trained with the backbone) tends to achieve larger ALs than post-trained drafters like EAGLE3.
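Why can AL > 1 still produce a slowdown, as with N-Gram? A first-order cost model (my assumption, not from the paper) makes it concrete: one SD step emits AL tokens on average but costs drafting plus a verification pass that, at moderate batch sizes, is wider and more expensive than a plain decode step.

```python
def sd_speedup(al, draft_cost, verify_cost):
    """First-order model: one SD step emits `al` tokens on average and
    costs draft_cost + verify_cost, both measured in units of one
    plain (non-speculative) decode step."""
    return al / (draft_cost + verify_cost)

# Illustrative numbers (assumed, chosen to show the mechanism):
# N-Gram-like: drafting is nearly free, but AL is low and the wide
# verification pass is compute-bound at batch 32.
slow = sd_speedup(al=1.41, draft_cost=0.05, verify_cost=1.55)
# EAGLE3-like: drafting costs more, but the higher AL more than pays
# for the overhead.
fast = sd_speedup(al=2.25, draft_cost=0.25, verify_cost=1.35)
```

With these assumed costs the low-AL setup lands below 1x (a net slowdown) while the high-AL one clears it, mirroring the qualitative pattern in the table above.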
Important technical findings
SD evaluation must look at two things: draft quality by domain and system-level performance under real conditions.
Vocabulary pruning in drafters (for example EAGLE3) reduces cost but can degrade AL on the long tail of inputs, especially in Multilingual, RAG and Summarization.
Using random tokens to measure throughput is dangerous: it produces two main failures that distort results for SD and MoE:
Trivial Response: the model detects noise and replies with clarifications, inflating AL.
Topic Latching: the model latches onto keywords and generates coherent but misleading outputs, reducing AL.
Quantitative example: measuring with random tokens can overestimate throughput by about 23% when SD is active.
Practical recommendations if you work with SD
Evaluate both: acceptance length/acceptance rate and real throughput across different ISL and batch sizes.
Avoid using random tokens as a proxy for prompt load.
Use datasets with semantic diversity and avoid categories with too few samples.
Prefer measurements that control tokenization and prompt formatting externally to compare inference engines.
Keep in mind that vocabulary or head-layer optimizations can help in closed domains but harm generality.
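Putting the first recommendation into practice means logging, for every verification step, how many tokens were proposed and how many were accepted. A hedged sketch of the bookkeeping (the +1 for the target's correction/bonus token is a common convention I'm assuming here):

```python
def acceptance_stats(steps):
    """steps: list of (proposed, accepted) counts per verification step.

    AR: fraction of drafted tokens the target accepted, over all steps.
    AL: mean tokens emitted per step; counts the target's correction or
        bonus token (+1), a common convention assumed here.
    """
    proposed = sum(p for p, _ in steps)
    accepted = sum(a for _, a in steps)
    ar = accepted / proposed
    al = sum(a + 1 for _, a in steps) / len(steps)
    return ar, al
```

Aggregating these per category (Coding, Math, Roleplay, ...) rather than globally is what exposes the domain dependence the benchmark is built around.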
SPEED-Bench is not just a dataset: it’s an ecosystem with datasets, a measurement framework and scripts ready to plug into SD pipelines. It serves both researchers who want to compare algorithms and production teams that need to understand real trade-offs between latency, throughput and output fidelity.
Summary: SPEED-Bench is a technical, practical benchmark to evaluate speculative decoding (SD) under production-like conditions, combining semantic diversity and realistic throughput measurements. It includes two splits (Qualitative and Throughput) and a framework that normalizes tokenization and integrates with production inference engines.