Behind the Stack, Ep. 11 - How Speculative Decoding Speeds Up Language Models
November 5, 2025


Jamie Dborin
Large language models have transformed how we build AI systems - but they’re still notoriously slow and expensive to run. Over the last few years, one optimization technique has been attracting serious attention for its potential to make inference faster and more efficient: speculative decoding.

In this episode of Behind the Stack, we explore what speculative decoding is, how it accelerates model inference, and what’s still holding it back.

What is Speculative Decoding?

To understand speculative decoding, we first need to understand how language model inference actually works.

Inference can be divided into two phases:

  1. Prefill (or encoding) - when the model processes your entire prompt.
  2. Decoding - when the model starts generating output tokens one by one.

During prefill, everything happens in parallel. You can feed thousands of words into the model at once, and it efficiently processes them to produce the first token.

But once decoding begins, things slow down dramatically. That’s because large language models are auto-regressive - they take each output token, feed it back in as input, and then generate the next one. This happens sequentially, meaning each new token requires another full model pass.

So while prefill benefits from GPU parallelism and amortized compute costs, decoding is bottlenecked by memory bandwidth - every single step requires the model weights to be reloaded into GPU compute units. That’s why output tokens are often far more expensive (and slower) than input tokens in API pricing.
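To make the sequential bottleneck concrete, here's a minimal sketch of the autoregressive decode loop. The `toy_model` function is a stand-in for a real forward pass - the point is simply that each new token requires another full call:

```python
# Toy autoregressive decode loop. `toy_model` stands in for a real
# LLM forward pass: it returns the "next token" for a given context.
def toy_model(tokens):
    # Hypothetical deterministic rule: next token is the sum of the
    # last two tokens, modulo a small vocabulary size.
    return (tokens[-1] + tokens[-2]) % 100

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):           # one full "model pass" per output token
        next_tok = toy_model(tokens)
        tokens.append(next_tok)      # feed the output back in as input
    return tokens

print(generate([1, 2], 4))  # -> [1, 2, 3, 5, 8, 13]
```

In a real deployment each iteration of that loop reloads the full model weights from GPU memory, which is exactly why decoding is bandwidth-bound rather than compute-bound.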

Speculative decoding aims to fix that.

Adding Parallelism Back Into Decoding

Speculative decoding introduces a form of parallelism back into the decoding process.

The idea comes from speculative execution in CPUs - where processors “guess” future operations to save time, only discarding results if the guess was wrong.

Similarly, speculative decoding makes guesses about what the model will generate next. Rather than waiting for each token one by one, we feed in several draft tokens - potential next words - and ask the model to verify them in a single pass.

If the model agrees with those guesses, we can skip multiple decoding steps at once.

In theory, this can make decoding several times faster, since we’re effectively generating multiple tokens per forward pass.

Even though this adds more compute work, most of the time models are bandwidth-bound - so doing more work per pass often costs little extra, while saving significant time overall.
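Here's a minimal sketch of the verify step under greedy decoding. `target_next` is a toy stand-in for the large model; a real system scores every draft position in a single batched forward pass rather than a Python loop, but the accept/reject logic is the same: keep the longest prefix of drafts that matches what the model would have produced.

```python
def target_next(tokens):
    """Toy stand-in for the large model's greedy next-token choice."""
    return (tokens[-1] + tokens[-2]) % 100

def verify_drafts(context, drafts):
    # One "parallel" pass: check each draft against what the target
    # model would have produced given the context plus accepted drafts.
    accepted = []
    for d in drafts:
        expected = target_next(context + accepted)
        if d != expected:
            # First mismatch: discard this draft and all later ones,
            # but keep the target's own token so the pass still
            # produces at least one valid token.
            accepted.append(expected)
            break
        accepted.append(d)
    else:
        # All drafts accepted; the same pass also yields one bonus token.
        accepted.append(target_next(context + accepted))
    return accepted

# Drafts [3, 5, 9]: 3 and 5 match, 9 mismatches (target says 8).
print(verify_drafts([1, 2], [3, 5, 9]))  # -> [3, 5, 8]
```

Note that even on a mismatch the pass isn't wasted - it still emits the one token ordinary decoding would have produced, so correctness is never compromised; only the speedup varies with draft quality.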

The Catch: Where Do Draft Tokens Come From?

Here’s the tricky part: to speculate correctly, we need to guess what the model will say next - before it actually says it.

The tokens we guess are called draft tokens, and finding good ones is the core challenge of speculative decoding.

The original speculative decoding paper solved this by using a smaller model to generate draft tokens. The small model runs faster and produces multiple guesses at once. The large model then checks those guesses, accepts the ones it agrees with, and discards the rest.

This can dramatically speed up inference - if the small model is good enough to make accurate guesses.

However, this approach comes with trade-offs:

  • Smaller models are less capable. They often produce incorrect tokens, meaning the larger model rejects many of them.
  • It adds overhead. The small model must still run before the large one, introducing latency.
  • It consumes GPU memory. The small model has its own weights and KV cache, potentially eating into resources that could have been used for the main model.
  • Tokenizer mismatch. Speculative decoding only works if both models use the exact same tokenizer - which isn’t always the case.

So while the “small + big model” method can work well for certain architectures, it’s not always practical or efficient in real-world deployments.

Alternative Sources of Draft Tokens

Researchers have explored several other approaches to generating draft tokens:

  1. Training additional model heads.
    Some methods add extra “heads” to the language model that predict future tokens (e.g., the next 4 or 5) during a single forward pass. These can then be used as speculative guesses. This approach ensures shared tokenization but requires additional training.
  2. Using past outputs or user suggestions.
    APIs like OpenAI’s support passing in suggested continuations - known phrases or past completions - as speculative guesses. This can work well if you have strong prior data, but performance depends on how good your guesses are.

  3. Prefix trees of past outputs (Doubleword’s approach).
    At Doubleword, we’ve developed our own speculative decoding technique that uses a weighted prefix tree built from previous model outputs.

    As the model generates text, it walks down this tree to find past completions that share a common prefix - and speculatively decodes from those continuations.

    Over time, as the system observes more real-world requests, the prefix tree adapts - prioritizing the most frequent or probable completions. This means the more you use it, the faster it gets, automatically tuning itself to your workload distribution.
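This isn't Doubleword's production implementation, but the core idea can be sketched with a small weighted trie: record past outputs with counts, then walk the most frequent branch from the current prefix to propose draft tokens. All class and method names here are illustrative:

```python
from collections import defaultdict

class PrefixTree:
    """Illustrative weighted prefix tree over token sequences: counts how
    often each continuation followed a given prefix in past outputs."""
    def __init__(self):
        self.children = defaultdict(PrefixTree)
        self.count = 0

    def record(self, tokens):
        # Observe one past output, incrementing counts along its path.
        node = self
        for t in tokens:
            node = node.children[t]
            node.count += 1

    def draft(self, prefix, max_len=4):
        # Walk down to the node matching `prefix`, then greedily follow
        # the most frequent child to propose draft tokens.
        node = self
        for t in prefix:
            if t not in node.children:
                return []            # unseen prefix: no drafts to offer
            node = node.children[t]
        drafts = []
        while node.children and len(drafts) < max_len:
            t, node = max(node.children.items(), key=lambda kv: kv[1].count)
            drafts.append(t)
        return drafts

tree = PrefixTree()
tree.record(["the", "quick", "brown", "fox"])
tree.record(["the", "quick", "brown", "dog"])
tree.record(["the", "quick", "brown", "fox"])
print(tree.draft(["the", "quick"]))  # -> ['brown', 'fox']
```

Because the counts grow with real traffic, the most frequent continuations win more often over time - which is the adaptive, self-tuning behavior described above.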

The Future of Speculative Decoding

Speculative decoding is one of the most promising directions for accelerating LLM inference. By introducing controlled parallelism into the decoding phase, it can unlock significant speed gains - especially when combined with smart draft-token strategies that adapt over time.

However, it’s not a one-size-fits-all solution. For certain workloads or architectures, the overhead may outweigh the benefits. For others - especially those with repetitive, predictable output distributions - it can deliver major efficiency wins.

At Doubleword, we’re continuing to explore these trade-offs as part of our broader mission to optimize and govern inference at scale - across any model, on any infrastructure.

