Speculative Decoding Explained: Faster Inference Without Quality Loss
By Allan Witt | Updated: October 17, 2025
Unlock significant speed gains for large language models on your own hardware without sacrificing quality. Here’s how it works and how to set it up in popular inference engines.
Why Local LLMs Run Slow
If you run large language models on your own hardware, you know the biggest challenge is inference speed. Getting a high-quality model, such as a 70B-parameter LLM, to run at a comfortable pace on consumer or prosumer hardware is a constant battle against GPU VRAM requirements, RAM limits, and memory bandwidth bottlenecks. The trade-off between model quality and token throughput (tokens per second) is one we all have to make when chasing real-time inference speed.
Speculative decoding is a technique that changes this equation[1]. It accelerates token generation without reducing output quality: every token in the final text is checked by the large base model, so the result is what that model would have produced on its own, just faster. How much faster depends on the model pair, the prompt, and the hardware. At the high end, some users report 4x to 5x gains, such as one who went from around 30 tokens per second to over 160 t/s on an RTX 3090.
This guide will break down what speculative decoding is, how it works, what hardware you need, and how to enable it in common inference tools like llama.cpp and LM Studio.
Speculative Decoding Explained
To understand speculative decoding, think of it as a collaboration between an experienced manager and a fast intern.
The large base model (main) is your powerful, accurate LLM. This is the manager who is knowledgeable but takes time to think through each word.
The small helper or auxiliary model (draft) is a much smaller, faster, but less accurate LLM. This is the eager intern who can type up ideas very quickly.
Speculative decoding: A smaller draft model proposes candidate tokens, while the larger model verifies them. Accepted tokens are kept; if rejected, the larger model generates replacements.
The process works like this. Instead of the manager writing one word at a time, the intern quickly drafts a short sequence of the next 5 to 16 words. The manager then looks at this entire draft in a single, parallel step. It verifies each word in the draft. As long as the intern’s predictions match what the manager would have written, the words are accepted instantly. The moment the manager finds a word it would have chosen differently, it corrects that single word and discards the rest of the intern’s draft. The process then repeats from that corrected point.
Put simply: speculative decoding uses parallel token verification between a small draft model and a large base model.
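The draft-and-verify loop described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how llama.cpp or LM Studio implement it: target_next and draft_next are made-up stand-ins for the two models, sampling is greedy, and the verification step is written as a plain loop where a real engine would run one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to its greedy next token.
Model = Callable[[List[int]], int]

def target_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the large main model (the manager)."""
    return (tokens[-1] * 3 + len(tokens)) % 11

def draft_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the draft model (the intern).
    It agrees with the target except when the context length is a multiple of 4."""
    guess = target_next(tokens)
    return (guess + 1) % 11 if len(tokens) % 4 == 0 else guess

def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, draft_len: int = 5) -> Tuple[List[int], int]:
    """Return the generated sequence and the number of verification passes used."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1) The draft model proposes draft_len tokens, one after another.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))
        # 2) The target checks every drafted position. A real engine does this
        #    in a single batched forward pass; the loop is only for clarity.
        target_passes += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # the target's token is always what is kept
            if guess != correct:
                break  # first mismatch: keep the correction, drop the rest of the draft
        else:
            # Every drafted token matched, so the same pass yields one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes

prompt = [1, 2, 3]
reference = greedy_decode(target_next, prompt, 20)
output, passes = speculative_decode(target_next, draft_next, prompt, 20)
print(output == reference, passes)
```

Because the target's own token is what gets appended at every position, the output matches plain greedy decoding while needing fewer passes over the large model than the number of tokens generated.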
This method is faster because LLM inference is often limited by memory bandwidth, which is the speed at which you can load the model’s weights from VRAM. Generating one token at a time means streaming the full set of weights through the GPU for every token. By verifying a batch of drafted tokens in a single forward pass, the main model reads its weights once for the whole batch, so your hardware’s parallel processing power is used much more efficiently.
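A rough back-of-the-envelope estimate makes the bandwidth argument concrete. The sketch below assumes decoding is purely bandwidth-bound, that every pass reads all of the main model's weights exactly once, and that the draft model's cost is negligible; the function names and the example numbers (a 19.8 GB model, 936 GB/s of memory bandwidth, 3 tokens kept per pass) are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound for plain decoding: one full weight read per generated token."""
    return bandwidth_gb_s / model_gb

def speculative_tokens_per_second(model_gb: float, bandwidth_gb_s: float,
                                  tokens_per_pass: float) -> float:
    """Same bound when one weight read (one verification pass) yields several tokens.
    Ignores the draft model's own cost, so it is optimistic."""
    return tokens_per_pass * bandwidth_gb_s / model_gb

# Illustrative numbers only.
baseline = decode_tokens_per_second(model_gb=19.8, bandwidth_gb_s=936.0)
with_draft = speculative_tokens_per_second(19.8, 936.0, tokens_per_pass=3.0)
print(f"baseline bound: {baseline:.1f} t/s, speculative bound: {with_draft:.1f} t/s")
```

Real throughput lands below these bounds, but the ratio shows where the gain comes from: the weights are read once per pass regardless of how many tokens that pass confirms.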
Hardware Requirements for Speculative Decoding
Before you can enable this feature, you need a few things in place. The main hardware constraint is having enough memory to load both models at the same time.
First, you need a large base model (main). This is the large, high-quality model you want to accelerate, such as Llama-3.3-70B-Instruct-Q4_K_M.gguf.
Second, you need a small helper model (draft). This should be a much smaller model, ideally from the same family. Using a model from the same family is important because they share the same vocabulary and training style, which makes the draft model’s predictions more likely to be correct.
Good pairings include using Llama-3.2-1B as a draft for a Llama-3.1-70B main model, or Qwen2.5-0.5B for a Qwen2.5-32B main model. Draft models are also commonly called helper or secondary models.
Third, you need sufficient GPU VRAM or system RAM to load both models simultaneously. This is the primary hardware cost of using the technique. Multi-GPU setups can also help balance the load.
Finally, you need a supported inference engine. This guide will cover how to set it up in the most common tools used by local hardware enthusiasts.
Table: VRAM requirements for speculative decoding setups (main model + draft model)
| Main model | Quantization | Model size | Draft model | Draft quantization | Total size |
|---|---|---|---|---|---|
| Llama 3.3 70B | Q4_K_M | 42.5 GB | Llama 3.2 3B | Q4_K_M | 47.7 GB |
| Qwen3 32B | Q4_K_M | 19.8 GB | Qwen3 0.6B | Q4_K_M | 20.2 GB |
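To check whether a pairing from the table fits your hardware, you can compare its total size against your available memory with some headroom. The helper below is a minimal sketch: the function name and the 2 GB overhead_gb default (standing in for the KV cache and compute buffers, which grow with context length) are assumptions to adjust for your own setup.

```python
def fits_in_memory(total_model_gb: float, available_gb: float,
                   overhead_gb: float = 2.0) -> bool:
    """Return True if main + draft weights plus an assumed overhead fit in memory.

    total_model_gb: combined size of main and draft model files (the "Total size" column).
    overhead_gb: hypothetical allowance for KV cache and buffers, not a measured value.
    """
    return total_model_gb + overhead_gb <= available_gb

# Qwen3 32B + Qwen3 0.6B (20.2 GB total) on a single 24 GB GPU.
print(fits_in_memory(20.2, available_gb=24.0))
# Llama 3.3 70B + Llama 3.2 3B (47.7 GB total) on two 24 GB GPUs.
print(fits_in_memory(47.7, available_gb=48.0))
```

With the assumed 2 GB of headroom, the Qwen3 pairing fits on a 24 GB card while the Llama pairing does not fit in 48 GB, which is the kind of case where a smaller quantization or a shorter context becomes necessary.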
Enable Speculative Decoding Step-by-Step
This is the practical part of the guide where we configure our software. The principle is the same across different tools: you load your primary model and then specify a secondary, smaller draft model.
Speculative Decoding in llama.cpp
For users of llama.cpp and its server, enabling speculative decoding is done with a few command-line flags.
The key flags you need to know are -m for your main model path, -md for your draft model path, --draft-max for the maximum number of tokens the draft model proposes per round, and --draft-min for the minimum number of draft tokens to use.
Here is an example command to run the llama-server with a 70B main model and a 1B draft model:
./llama-server \
-m ./models/Llama-3.3-70B-Instruct-Q4_K_M.gguf \
-md ./models/Llama-3.2-1B-Instruct-Q4_K_M.gguf \
-ngl 99 -ngld 99 -fa --port 9999 -c 8192 \
--draft-max 16 --draft-min 5
If you are using a multi-GPU setup, a useful tip is to place the smaller draft model on a specific GPU. This reserves the VRAM and bandwidth of your primary GPUs for the large main model. Note that -ngld only sets how many of the draft model's layers are offloaded to GPU (99 offloads all of them); it does not choose which GPU they land on. Recent llama.cpp builds list a separate --device-draft (-devd) option for that, so check llama-server --help for the device options available in your build.
Enable Speculative Decoding in LM Studio
Configuring speculative decoding in LM Studio is straightforward, as the user interface integrates the feature directly into the model loading workflow. You begin by loading your main model just as you normally would, using the primary model selection dropdown at the top of the application.
Model selection dropdown in LM Studio — choose and load the main model for inference.
Once your large model is loaded, direct your attention to the right sidebar and ensure the “Model” tab is selected. At the bottom of this sidebar, you will find a section for speculative decoding with a “Draft Model” dropdown menu. LM Studio automatically scans your model library and populates this dropdown with compatible draft models.
If you have an appropriate small model from the same family as your main model, it will appear as an option here.
LM Studio right sidebar — Model tab with draft model options, where speculative decoding can be enabled.
If the dropdown is empty, it means you do not have a suitable draft model downloaded for the currently loaded main model. To help with this, LM Studio includes a helpful hyperlink that says “Read how it works.” Clicking this link will provide you with suggestions for potential draft models you can download that are known to pair well with your main model.
After selecting your draft model, you can proceed to start the server or use the chat interface as usual, now with the speed benefits of speculative decoding enabled.
Optimize Speculative Decoding Performance
Getting the best results from speculative decoding requires some fine-tuning based on your models and hardware.
For the highest acceptance rate, use greedy sampling by setting the temperature to 0 or Top K to 1. The goal is to make the draft model’s most likely token perfectly match the main model’s most likely token.
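As a sketch of what a greedy request could look like from a script, the snippet below builds a payload for llama-server's native /completion endpoint with temperature 0 and top_k 1. It assumes the server from the earlier example is listening on port 9999; the helper names are made up for illustration, and send_completion is defined but not called because it needs a running server.

```python
import json
import urllib.request

def build_greedy_request(prompt: str, n_predict: int = 128) -> dict:
    """Build a request body that forces greedy sampling."""
    return {
        "prompt": prompt,
        "n_predict": n_predict,  # maximum number of tokens to generate
        "temperature": 0.0,      # greedy: always take the most likely token
        "top_k": 1,              # keep only the single best candidate
    }

def send_completion(payload: dict, base_url: str = "http://localhost:9999") -> dict:
    """POST the payload to a running llama-server (assumed address) and return the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

payload = build_greedy_request("Explain speculative decoding in one sentence.")
print(json.dumps(payload, indent=2))
```

Setting both values is redundant in principle, since either one alone should give greedy behavior, but it makes the intent obvious to anyone reading the request.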
Finding the right draft length is also a key tuning parameter. A draft that is too short won’t provide much of a speedup. A draft that is too long will likely contain an error somewhere, and the main model discards everything after the first wrong token, so the time spent drafting those tokens is wasted. A good starting range is between 5 and 16 tokens.
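The trade-off can be estimated with the standard analysis from the speculative decoding literature, under the simplifying assumption that each drafted token is accepted independently with the same probability. With acceptance rate α, draft length γ, and a draft model that costs a fraction c of a main-model pass per token, the expected number of tokens per verification pass is (1 - α^(γ+1)) / (1 - α), and the estimated speedup is that value divided by (γc + 1). The function names and the example values below are illustrative assumptions; real acceptance rates vary by prompt and model pair.

```python
def expected_tokens_per_pass(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per main-model pass, assuming independent acceptance."""
    if accept_rate >= 1.0:
        return draft_len + 1.0  # every draft accepted, plus the bonus token
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)

def estimated_speedup(accept_rate: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding; draft_cost_ratio is draft time / main time per token."""
    return expected_tokens_per_pass(accept_rate, draft_len) / (draft_len * draft_cost_ratio + 1)

def best_draft_len(accept_rate: float, draft_cost_ratio: float, max_len: int = 32) -> int:
    """Draft length in 1..max_len with the highest estimated speedup."""
    return max(range(1, max_len + 1),
               key=lambda n: estimated_speedup(accept_rate, n, draft_cost_ratio))

# Hypothetical draft model costing 5% of a main-model pass per token.
for rate in (0.5, 0.7, 0.9):
    n = best_draft_len(rate, draft_cost_ratio=0.05)
    print(f"accept rate {rate}: best draft length {n}, "
          f"estimated speedup {estimated_speedup(rate, n, 0.05):.2f}x")
```

The pattern to take away is that longer drafts only pay off when the acceptance rate is high, which is why a well-matched draft model matters more than the exact --draft-max value.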
Your hardware can also be a factor. One user on Apple Silicon hardware initially saw no performance gains. They discovered that adjusting platform-specific settings, like the number of parallel sequences (-np in llama.cpp), was necessary to unlock the speed benefits. Expect to experiment a little to find the best settings for your specific rig.
Speculative Decoding Mistakes to Avoid
There are a couple of common points of confusion when first using speculative decoding.
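First, many users wonder if it changes the output. The answer is no. The large base model verifies every token before it is accepted, so with greedy sampling the generated text matches what the main model would have produced on its own, aside from the small floating-point differences that batched execution can introduce. With sampling at a nonzero temperature, a correct implementation produces output that follows the main model's distribution rather than reproducing one particular run. This is purely a speed optimization, not a method for mixing model styles.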
Second, a simple but common mistake is getting the model order wrong. You must specify the large model as the main model (-m) and the small model as the draft (-md). Reversing this will cause a massive slowdown, as you will be using a large, slow model to guess tokens for a small, fast model to verify.
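If you launch the server from a script, a small guard can catch a swapped model order before anything loads. The sketch below uses file size as a heuristic for which model is larger, which is an assumption that works for typical main and draft pairings but is not a guarantee; the helper names are hypothetical, and the argument list only uses the flags covered in this guide.

```python
import os
from typing import List

def check_model_order(main_path: str, draft_path: str) -> None:
    """Raise if the draft file is not smaller than the main file (size heuristic)."""
    main_size = os.path.getsize(main_path)
    draft_size = os.path.getsize(draft_path)
    if draft_size >= main_size:
        raise ValueError(
            f"Draft model ({draft_size} bytes) is not smaller than the main model "
            f"({main_size} bytes); the -m and -md paths may be swapped."
        )

def build_server_args(main_path: str, draft_path: str,
                      draft_max: int = 16, draft_min: int = 5) -> List[str]:
    """Build a llama-server argument list after validating the model order."""
    check_model_order(main_path, draft_path)
    return [
        "./llama-server",
        "-m", main_path,    # large main model
        "-md", draft_path,  # small draft model
        "--draft-max", str(draft_max),
        "--draft-min", str(draft_min),
    ]
```

The returned list can be passed to subprocess.run, which keeps the check and the launch in one place.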
Final Thoughts on Faster LLM Inference
Speculative decoding is a powerful and accessible technique for any local LLM user looking to get more performance out of their hardware. It can provide a significant boost in token throughput, with community reports of 4x to 5x faster inference and 160+ tokens per second on consumer GPUs at the high end, without compromising the quality of your output.
All it requires is a pair of compatible models, enough VRAM to hold them, and a supported inference engine such as llama.cpp or LM Studio.
Download a small helper draft model for your favorite LLM, fire up your inference server, and see how much faster your local AI experience can be.
