How Multi-Token Prediction Makes Local LLMs Faster – Without Extra VRAM.
By Allan Witt | Updated: October 17, 2025
For anyone running LLMs locally, the goal is always more performance for less cost. We obsess over VRAM, memory bandwidth, and squeezing every last token per second out of our hardware. While prompt processing, which determines time to first token (TTFT), is often fast, the token generation that follows can be a bottleneck, especially on memory-bandwidth-limited systems. This one-token-at-a-time process, called autoregressive generation, is the fundamental speed limit we all run into.
A technique called Multi-Token Prediction (MTP) is emerging as a powerful way to break this barrier. It’s a software-level optimization that can significantly boost your tokens per second with almost no extra VRAM cost. Let’s break down how it works and what it means for your local setup.
What is Multi-Token Prediction?
Multi-Token Prediction is a form of self-speculative decoding[1]. In simple terms, instead of the model running a full pass to generate just one token, it is trained to predict a whole block of future tokens at once. It essentially makes a series of educated guesses about what it’s going to say next.
This isn’t just wishful thinking. A research paper from Apple, “Your LLM Knows the Future”, demonstrated that existing models already have a surprising amount of internal knowledge about upcoming tokens. MTP fine-tunes a model to harness this ability explicitly, resulting in generation speedups of roughly 2.5x for chat and up to 5x for predictable tasks like coding, all without any loss in output quality.
How Multi-Token Prediction Works in LLMs
At first, this might sound impossible. A 70B model has a fixed number of weights. Moving those weights from VRAM to the GPU’s compute units is the main bottleneck. So, how can the same 70B model do more work in roughly the same amount of time?
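The secret is that a forward pass during generation is limited mostly by how fast the weights can be read from memory, not by the arithmetic itself. Once the weights are loaded, the Transformer can evaluate several token positions in parallel for little extra time, so MTP puts that otherwise idle compute to work.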
Standard Autoregressive Token Generation Explained
To generate 5 new tokens, the process is strictly sequential and latency-bound. The GPU spends a lot of time waiting.
- Forward Pass #1: The model processes your prompt and generates Token 1.
- Forward Pass #2: The model processes [Prompt + Token 1] and generates Token 2.
- Forward Pass #3: The model processes [Prompt + Token 1 + Token 2] and generates Token 3.
- This continues step by step until 5 tokens are produced.
The total cost for 5 tokens is therefore 5 separate, sequential forward passes.
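The loop above can be sketched in a few lines of Python. This is only an illustration: `toy_next_token` is a hypothetical stand-in for a real model, where one call represents one full forward pass. Real engines also keep a KV cache so each pass only computes the newest position, but every pass still has to read the full set of weights.

```python
from typing import Callable, List, Tuple


def toy_next_token(context: List[int]) -> int:
    """Hypothetical toy 'model': the next token depends on the whole context."""
    return (sum(context) * 31 + len(context)) % 50


def generate_autoregressive(
    prompt: List[int],
    n_new: int,
    next_token: Callable[[List[int]], int] = toy_next_token,
) -> Tuple[List[int], int]:
    """Generate n_new tokens one at a time; return (new_tokens, forward_passes)."""
    context = list(prompt)
    passes = 0
    for _ in range(n_new):
        token = next_token(context)  # one forward pass over the current context
        passes += 1
        context.append(token)        # the new token becomes input to the next pass
    return context[len(prompt):], passes


tokens, passes = generate_autoregressive([3, 1, 4], n_new=5)
print(f"{len(tokens)} tokens took {passes} sequential forward passes")
```

Each pass needs the token from the previous pass as input, so no iteration of the loop can start before the last one finishes.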
Parallel Token Prediction: How MTP Boosts Throughput
MTP converts this sequential problem into a parallel one. It’s a two-step process that gives the GPU more work to do at once, maximizing throughput.
1. Speculation Pass (First Forward Pass)
The model processes your prompt and simultaneously generates a block of guesses for the future.
The output consists of one confirmed token (Token 1) plus a block of guesses for the next tokens — for example, [Token 2_guess, Token 3_guess, Token 4_guess, Token 5_guess].
2. Verification Pass (Second Forward Pass)
Instead of checking each guess one by one, the model verifies them all in a single, parallel pass. It takes the sequence [Prompt + Token 1 + Token 2_guess + …] as input. Because Transformers are parallel, it calculates the “correct” next token for every position simultaneously. The system then compares the model’s new predictions to the guesses.
If the prediction for position 2 matches Token 2_guess, it’s accepted. If position 3 matches Token 3_guess, it’s accepted, and this continues until a mismatch is found.
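The acceptance rule can be sketched as a comparison that stops at the first mismatch. The token IDs below are made up for illustration, and `count_accepted` is a hypothetical helper, not part of any inference engine. The sketch also assumes the common speculative-decoding convention of keeping the verifier's own token at the first mismatch.

```python
from typing import List


def count_accepted(guesses: List[int], verified: List[int]) -> int:
    """Return how many leading guesses match the verifier's predictions."""
    accepted = 0
    for guess, target in zip(guesses, verified):
        if guess != target:
            break  # first mismatch: this guess and everything after it is discarded
        accepted += 1
    return accepted


guesses = [17, 42, 8, 23]    # hypothetical guesses for Tokens 2..5
verified = [17, 42, 9, 23]   # hypothetical verifier predictions for the same positions

n = count_accepted(guesses, verified)
# The verifier's token at the mismatch position replaces the bad guess.
kept = guesses[:n] + verified[n:n + 1]
print(n, kept)  # 2 [17, 42, 9]
```

The last guess (23) is thrown away even though it matches, because the verifier computed its prediction for that position on top of the wrong guess (8) and so it can no longer be trusted.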
If all four guesses are correct, the total cost for producing five tokens is just two forward passes, achieving an approximate 2.5x speedup.
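The arithmetic behind that figure can be written out as a small sketch. It follows the simplified accounting used above (one confirmed token plus the accepted guesses, for two passes per round) and the function name is only illustrative. Real engines may do better than this, for example by also keeping the verifier's own token at the first mismatch.

```python
def speedup(accepted_guesses: int, passes_per_round: int = 2) -> float:
    """Tokens per forward pass, relative to the baseline of 1 token per pass."""
    tokens_per_round = 1 + accepted_guesses  # confirmed token + accepted guesses
    return tokens_per_round / passes_per_round


for accepted in range(5):
    print(f"{accepted} accepted guesses -> {speedup(accepted):.1f}x")
```

Under this accounting, one accepted guess only breaks even (two tokens for two passes), so the gain depends on how often several guesses in a row are accepted.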
Verification in MTP: Why Accuracy Still Matters
You might wonder—if the model can predict multiple tokens at once, why not just use them directly? Why perform a verification pass at all?
The answer is reliability. Even though the model can predict multiple tokens ahead, those are still probabilistic guesses. The verification step accepts a guess only when it matches what the model would have generated anyway, so the output keeps the same quality as standard generation. Without verification, small prediction errors could quickly compound.
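A toy experiment makes the point concrete. In the sketch below, `target_next` and `draft_block` are hypothetical stand-ins for the full model and its MTP heads, and the drafter is deliberately wrong at some positions. The verification is written as a loop for readability, whereas a real Transformer would score all guessed positions in one parallel pass. Greedy decoding is assumed.

```python
from typing import List


def target_next(context: List[int]) -> int:
    """Toy stand-in for the full model's greedy next-token choice."""
    return (sum(context) * 31 + len(context)) % 50


def draft_block(context: List[int], k: int) -> List[int]:
    """Toy stand-in for MTP heads: k guesses, wrong at every third position."""
    guesses, ctx = [], list(context)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 3 == 0:
            tok = (tok + 1) % 50  # inject an error to exercise rejection
        guesses.append(tok)
        ctx.append(tok)
    return guesses


def greedy(prompt: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time generation."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]


def speculative(prompt: List[int], n: int, k: int = 4) -> List[int]:
    """Guess k tokens, verify them, keep the matching prefix plus one correction."""
    ctx = list(prompt)
    while len(ctx) - len(prompt) < n:
        for guess in draft_block(ctx, k):
            verified = target_next(ctx)  # what the model itself would emit here
            ctx.append(verified)         # equals the guess when the guess is accepted
            if verified != guess:
                break                    # later guesses were built on a wrong token
    return ctx[len(prompt):len(prompt) + n]


prompt = [3, 1, 4]
print(speculative(prompt, 20) == greedy(prompt, 20))  # True
```

However bad the guesses are, the verified output matches plain generation token for token. Poor guesses only cost speed, not quality.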
Why Fine-Tuning is Needed
You may also ask—why fine-tune or train the model with MTP at all? The reason is that standard models are not optimized to predict several future tokens accurately in one step. Fine-tuning with MTP teaches the model how to make coherent multi-token predictions that remain valid across different contexts. This specialized training enables the model to balance speed and precision effectively.
MTP vs. Standard Speculative Decoding: The VRAM Advantage
If you’re familiar with speculative decoding, you might be thinking this sounds similar. It is, but with one critical difference that matters deeply to hardware enthusiasts: VRAM usage.
Standard Speculative Decoding
Standard speculative decoding requires running a second, smaller “draft model” alongside your main model. This draft model guesses tokens, and the main model verifies them. The downside is that the draft model consumes its own VRAM—VRAM that you could have used for a larger context window or a bigger, more capable main model.
Multi-Token Prediction
With MTP, the main model is trained to be its own draft model. There is no separate model to load. The only overhead comes from tiny, specialized prediction “heads” or LoRA adapters that are part of the main model, adding a negligible amount to the VRAM footprint—often less than 1%.
This is the key takeaway for local LLM users: MTP offers the speed benefits of speculative decoding without the significant VRAM penalty, making it a far more efficient solution for VRAM-constrained systems.
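A rough back-of-the-envelope comparison shows the scale of the difference. Every number here is an assumption chosen for illustration, not a measurement: a 70B main model, a hypothetical 8B draft model, 4-bit weights for both, and the "less than 1%" figure above taken as an upper bound for the MTP heads. KV cache and runtime overhead are ignored.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


main_gb = weights_gb(70, 4)   # assumed 70B main model at 4 bits per weight
draft_gb = weights_gb(8, 4)   # hypothetical 8B draft model at 4 bits per weight
mtp_gb = main_gb * 0.01       # assumed upper bound: 1% of the main model

print(f"main model:  {main_gb:.2f} GB")
print(f"draft model: {draft_gb:.2f} GB extra")
print(f"MTP heads:   {mtp_gb:.2f} GB extra (at most, under the 1% assumption)")
```

With these assumed sizes, the draft model costs several gigabytes that could otherwise hold context, while the MTP overhead stays well under one gigabyte.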
The Catch: Models Must Be Trained for MTP
This is the most important thing to understand: you can’t just turn on MTP for any model. A model must be specifically trained or fine-tuned with MTP heads to support this feature.
This means that our existing library of popular GGUF models, like the standard Llama 3 or Mistral, won’t work with MTP out of the box. For the community to benefit, model creators either need to release versions with MTP baked in, or existing models need to be fine-tuned to add this capability. This would, of course, require regenerating and re-downloading quantized model files.
How to Use MTP with Current Software and Models
As with any new optimization, adoption takes time. The ecosystem for MTP is still developing, but it’s moving quickly.
Supported Inference Engines
vLLM currently has robust support for MTP heads. If you are running models like DeepSeek V3/R1 or Qwen3-Next, you can enable it with a simple command-line flag.
For DeepSeek V3/R1, use:
vllm serve deepseek-ai/DeepSeek-R1 --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}'
For Qwen3-Next, use:
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'
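If you launch the server from a script, hand-quoting the JSON argument is easy to get wrong. The sketch below builds the same argument list in Python. `build_vllm_command` is a hypothetical helper name, and it uses only the model names, method names, and flag shown in the commands above.

```python
import json
import shlex
from typing import List


def build_vllm_command(model: str, method: str, num_speculative_tokens: int) -> List[str]:
    """Build the argv list for `vllm serve` with a speculative config."""
    config = {"method": method, "num_speculative_tokens": num_speculative_tokens}
    # json.dumps produces valid JSON, so no manual quoting or escaping is needed.
    return ["vllm", "serve", model, "--speculative-config", json.dumps(config)]


cmd = build_vllm_command("Qwen/Qwen3-Next-80B-A3B-Instruct", "qwen3_next_mtp", 2)
print(shlex.join(cmd))  # shell-quoted form, suitable for copy and paste
```

The list form can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting altogether.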
llama.cpp does not yet support MTP, though development is active. When a model with MTP heads is loaded, these layers are currently skipped. Given the potential performance gains, especially for CPU and GPU/CPU hybrid inference, MTP support is one of the most anticipated features for the project.
Supported Models
A growing number of models are being released with MTP capabilities built in, including DeepSeek V3 / R1, Qwen3-Next, and GLM-4.5. As more model creators see the performance benefits, it’s likely this will become a standard feature for high-performance open-source models.
Final Thoughts: What MTP Means for Your Hardware
Multi-Token Prediction is a pure software win that lets you get more out of your existing hardware. For anyone running larger models where token generation speed is the primary bottleneck—especially on multi-GPU setups or systems with slower memory—MTP can transform a “too slow to be usable” model into a perfectly viable one.
While broad support in tools like llama.cpp isn’t here yet, the technology is proven. As MTP becomes more common, it will be a key optimization to look for when choosing your next model, offering a nearly “free” performance boost that makes running powerful LLMs locally faster and more practical than ever before.
