📚 Related: Qwen 3.6 Local Guide · Ollama Troubleshooting · Ollama on Mac (0.30 / MLX) · VRAM Requirements
Three tools dominate local LLM inference: llama.cpp, Ollama, and vLLM. Every benchmark post gives you a different “winner.” The honest answer is that you can pick correctly without reading any of them, because the decision pivots almost entirely on one question.
Are you one developer at a keyboard, or are you serving many concurrent requests?
That single fork resolves most of the debate. Single-user, the three are much closer than the comparison posts admit, and most of the tok/s gap you see traces to quantization (Q4_K_M vs FP16), not the engine. Concurrent, vLLM wins by 10-20x and it’s not close. The category framing (Ollama is an experience layer, llama.cpp is an engine, vLLM is a serving system) is the reason, not the lead.
Here’s the practical decision walked end-to-end, with the sourced numbers, the vLLM VRAM gotcha that catches 24GB 3090 owners on day one, and what changed in mid-2026 worth knowing about.
The Single-User Path
If you’re the only user — running models on your own machine, building a local app for yourself, prototyping, doing one conversation at a time — the three engines are close. Closer than benchmark headlines suggest, because most of the headline gap is the quantization format, not the runtime.
A representative single-stream comparison, Llama 3.1 8B on an RTX 4090, from community benchmarks compiled in early 2026 (codersera roundup, Markaicode benchmark):
| Engine | Format | Throughput (one request) |
|---|---|---|
| Ollama | Q4_K_M GGUF | ~62 tok/s |
| llama.cpp (direct) | Q4_K_M GGUF | ~65 tok/s (Ollama wrapper adds ~5-15% overhead in most reports; up to ~30% in some) |
| vLLM | FP16 | ~71 tok/s |
vLLM is faster, but it’s running FP16, while Ollama is running a 4-bit quant. Equalize the format and the gap collapses. The runtime is not the determining factor at this scale; quantization choice is. A Q4_K_M GGUF in Ollama and an AWQ-Int4 in vLLM finish much closer than the headline 62-vs-71 comparison implies.
Pick by ergonomics, not throughput:
- Ollama — easiest.
ollama run qwen3.6:27band you’re chatting. Handles downloads, versioning, model swaps, REST API. The cost is a 5-15% wrapper overhead (sometimes higher) over running llama.cpp directly, which is invisible for desktop use. - llama.cpp direct — control. You pick the quant, the flags, the build, the offload layout. Worth it when you need a feature Ollama hasn’t exposed (recent example: MTP speculative decoding landed in mainline llama.cpp via PR #22673 on May 16, 2026; Ollama tracks upstream features but on a lag).
- vLLM — overkill for single-user. The serving infrastructure does nothing for one developer at one keyboard, and the VRAM pre-allocation behavior (covered below) actively gets in the way.
For solo desktop use, the answer is Ollama unless you have a specific reason to drop down to llama.cpp.
The Concurrent Serving Path
The picture inverts the moment you have more than one user.
Multiple community benchmarks through May and June 2026 report vLLM at roughly 10-20× Ollama’s throughput once concurrent requests are in play. A representative snapshot:
| Concurrent requests | Ollama (aggregate tok/s) | vLLM (aggregate tok/s) | Ratio |
|---|---|---|---|
| 1 (single user) | ~62 | ~71 | ~1.1x |
| 8 | ~82 | ~187 | ~2.3x |
| 50 | ~155 | ~920 | ~5.9x |
| 100+ (stress) | plateaus, requests queue and serialize | continues to scale with batch size | ~15-20x |
Sources: Markaicode throughput benchmark, SitePoint Ollama vs vLLM 2026, codersera May 2026 runtime update. Exact numbers vary by model, GPU, and quantization, but the shape is consistent across reports: vLLM scales with concurrency, Ollama plateaus.
The reason is architectural, not implementation polish:
- PagedAttention — vLLM stores the KV cache in fixed-size, non-contiguous blocks, the way an OS handles virtual memory pages. Wasted KV cache space drops dramatically, so the same VRAM holds many more concurrent sequences. Ollama (and llama.cpp underneath) allocates KV cache contiguously per request; fragmentation kills concurrent capacity.
- Continuous batching — vLLM forms a new batch every iteration rather than waiting for all in-flight requests to finish. A short prompt doesn’t wait behind a long one. Ollama queues; under load, it serializes.
For multi-user serving, vLLM isn’t 23% faster than the alternatives. It’s a different category of system. Anything resembling a production API endpoint, anything serving more than a handful of simultaneous users, the answer is vLLM (or SGLang or TGI on the same architectural family). Ollama for a production API is the most common expensive mistake in this space.
The vLLM VRAM Gotcha (Read This Before Day One)
This is the one that catches 24GB 3090 owners on the first try, and it’s a real surprise if no one warns you.
vLLM pre-allocates roughly 90% of GPU VRAM at startup for its KV cache pool. That’s not a bug, it’s the design — the gpu_memory_utilization flag defaults to 0.90 (vLLM docs). The KV cache pool is what makes PagedAttention efficient: vLLM grabs a big contiguous reservation up front so it can manage paged blocks inside it without fighting the CUDA allocator.
The practical consequence on a 24GB RTX 3090:
- You load a 7B model in AWQ-Int4. Weights: ~5 GB.
- vLLM grabs another ~17-18 GB for the KV cache pool.
- Total claim on the card: ~22-23 GB out of 24.
- Try to run a second GPU process (your Stable Diffusion UI, a Whisper container, anything) and you OOM immediately.
vLLM assumes the GPU belongs to one model server. That assumption is correct on a datacenter A100 or H100. It’s an architectural mismatch on a consumer 24GB card you’re trying to share between an LLM and other work.
By contrast, llama.cpp and Ollama are conservative. They load model weights plus only the actual context window’s KV cache. The same 7B AWQ-equivalent in Ollama on the same 3090 leaves you with 18+ GB free for other GPU processes.
If you must use vLLM on a 24GB card alongside other GPU work, drop --gpu-memory-utilization to 0.5 or lower. You lose some concurrent-batching headroom, but the card stops being monogamous. For single-user serving on a 24GB card with other GPU tasks running, this gotcha alone often pushes the answer back to Ollama or llama.cpp.
What Each Tool Actually Is (Categories, Briefly)
The reason the decision splits along one-user-vs-many is that the three tools aren’t competing for the same job. They’re different categories of software.
llama.cpp — the engine
Georgi Gerganov’s C++ inference engine. Reads GGUF, runs on basically anything — NVIDIA, AMD, Intel, Apple Silicon (Metal), WebGPU, OpenCL, pure CPU. Maintained under ggml-org; current mainline builds at b9670+ as of mid-June 2026 (per the releases page). It’s not a user-facing application; it’s the thing every other tool either wraps or competes with.
Recent additions worth knowing: MTP speculative decoding landed in mainline (PR #22673, merged 2026-05-16) — the older llama-mtp fork workaround is no longer needed for new builds. MCP client support, an autoparser for structured output across model templates, and ongoing speed work on Qwen 3.5/3.6 / linear-attention architectures all shipped earlier in the spring.
Ollama — the experience layer
A wrapper around llama.cpp that handles the parts users don’t want to think about: model downloads, versioning, automatic model swapping, a built-in REST API, sensible defaults. Currently at 0.30.x (mid-June 2026).
What’s specifically true about Ollama on Apple Silicon is worth a flag, because the inference backend changed twice in three months:
- Ollama 0.19 (March 31, 2026) swapped llama.cpp Metal for MLX on Apple Silicon as a preview, reportedly nearly doubling decode speed for safetensors models on qualifying hardware (Yage.ai analysis, Sebastian Gingter writeup).
- Ollama 0.30 layered llama.cpp Metal back in alongside MLX, auto-routing by format — MLX for safetensors, llama.cpp Metal for GGUF. The result: GGUF compatibility restored, MLX speed preserved for safetensors. The Ollama Mac setup guide has the full picture.
On non-Apple platforms (Linux, Windows), Ollama still wraps llama.cpp directly. The wrapper overhead lands somewhere in the 5-15% range under most measurements; some configurations report up to ~30% overhead. Invisible for desktop use, occasionally noticeable at the edges.
vLLM — the serving system
A production inference engine built around PagedAttention and continuous batching. Written in Python, runs on NVIDIA CUDA (first-class) and AMD ROCm (datacenter cards, MI300/MI350 — not consumer RDNA). Current version 0.23.0 (released June 13, 2026; ships CUDA 13.0 binaries by default per the vLLM releases page).
vLLM is what’s running behind a lot of production AI endpoints. It’s not built for one developer at one keyboard. It’s built for “many users, one server,” which is the opposite of the desktop use case Ollama targets.
Brief mentions: MLX and LM Studio
Two adjacent tools that come up in the same conversation but aren’t co-equal options here:
- MLX is Apple’s array framework. It’s an inference engine for Apple Silicon (safetensors models), sitting in the same category slot as llama.cpp does on other platforms. If you’re on a Mac, Ollama 0.30 picks it for safetensors models for you. If you want to run MLX directly, that’s a different guide. For the “which engine do I pick” question, MLX is “what Ollama uses on Apple Silicon for safetensors.”
- LM Studio is a desktop GUI experience layer in the same slot as Ollama, with a heavier focus on the graphical interface and model browsing. Same category-of-tool as Ollama; different ergonomic preference.
Neither is in scope for “which of the three should I run,” since they’re either inside Ollama (MLX) or a different ergonomic pick (LM Studio).
Platform Reality
The cross-platform story is uneven and worth naming directly:
| Platform | llama.cpp | Ollama | vLLM |
|---|---|---|---|
| Linux | Native, fully supported | Native, fully supported | Native, fully supported (CUDA + ROCm) |
| Windows | Native | Native | WSL2 only — no native Windows build |
| macOS / Apple Silicon | Native (Metal) | Native (MLX + llama.cpp Metal in 0.30) | Not production-ready — experimental, datacenter focus |
If you’re on Windows or Mac, the “many users” answer changes shape. On Windows, vLLM works through WSL2 fine, but it’s a layer of indirection. On a Mac, vLLM isn’t realistically an option for production serving — the Mac inference story is llama.cpp + Metal or MLX, full stop, and Ollama wraps both.
For solo Mac users, the Best Local LLMs for Mac 2026 guide covers the picks; for setup specifics, the Mac setup guide covers the 0.30 MLX + Metal split.
Model Swap Behavior
A subtle difference that matters when you’re running more than one model:
- Ollama hot-swaps. When you ask for a different model, it unloads the current one and loads the next one. Two requests to two models will queue — but you can switch freely without restarting.
- vLLM locks one model in VRAM. A vLLM server serves the model it started with. Switching means tearing down and restarting the server with new arguments. The KV cache pool is allocated for that model’s geometry; you can’t just hot-swap.
- llama.cpp does whatever you script it to do. By default
llama-serverruns one model; tools like llama-swap provide Ollama-like hot-swapping with raw llama.cpp control.
For development and experimentation, hot-swap matters a lot. For production, a single locked model is usually what you want.
What’s Different in Mid-2026 Worth Knowing About
A few updates from spring/early summer that change the recommendations vs older comparison guides:
- llama.cpp MTP merged in mainline (PR #22673, May 16). The older
llama-mtpfork workaround is obsolete; the--spec-type draft-mtppath now lives in mainline. Speculative decoding for MoE models is generally usable from any recent build. - Ollama 0.30 added llama.cpp Metal alongside MLX on Apple Silicon, restoring GGUF compatibility that the 0.19 MLX-only switch had removed. Auto-routes by model format.
- vLLM 0.23.0 shipped CUDA 13.0 binaries by default (June 13). The PyTorch 2.10 / FlashAttention 4 work from 0.17 carries forward; recent point releases focused on Qwen 3.5/3.6 maturity, FP8 quantization, and pipeline-parallel improvements.
- The vLLM Mac story still isn’t production-ready as of mid-June 2026. If you see “vllm-mlx” references in comparison posts, treat them as experimental.
Quick Decision Tree
Walk this from the top; first match wins.
- Are you serving multiple concurrent users (more than ~3-5)? → vLLM. Mind the VRAM gotcha; on consumer cards, drop
gpu-memory-utilizationif you need the GPU for other work. - Are you on a Mac? → Ollama (it handles the MLX vs Metal split for you), or llama.cpp direct if you want control. vLLM is not the answer here.
- Are you building or experimenting solo? → Ollama. The wrapper overhead is invisible and the ergonomics save real time.
- Are you hitting an Ollama limitation (specific quant, specific flag, brand-new mainline feature)? → drop down to llama.cpp direct.
- Do you need CPU inference, heavy CPU offload, or unusual hardware? → llama.cpp is the only real option.
- Are you running MoE on a single consumer GPU with hybrid CPU+GPU expert offload? → mainline llama.cpp handles this well now; the
ik_llama.cppfork is the power-user route if you want the last 20% of MoE throughput.
Bottom Line
The “llama.cpp vs Ollama vs vLLM” question is really two questions in a trench coat.
For one developer at one keyboard, the three engines are close. The 62-vs-71 tok/s gap that benchmark posts lead with is mostly the quantization choice, not the runtime. Pick by ergonomics: Ollama for ease, llama.cpp for control, vLLM is overkill for solo use and its 90% VRAM pre-allocation actively hurts on consumer cards.
For serving many users, the answer is vLLM, and it’s not close. PagedAttention plus continuous batching gets you 10-20× the concurrent throughput. The architecture is built for “many users, one server” — that’s the job vLLM was designed for, and Ollama wasn’t.
Cross-platform reality colors both answers. On Linux, all three are first-class. On Windows, vLLM is WSL2-only. On Mac, vLLM isn’t realistically in the picture; Ollama wraps MLX + Metal and handles the split intelligently in 0.30.
The most common mistake: using Ollama for what should be a vLLM job. The second most common: trying to use vLLM as a personal local-AI engine on a 24GB card you also need for other GPU work, then wondering why everything else OOMs. The third: chasing single-stream tok/s benchmarks across runtimes when the quantization choice was driving most of the variance.
# Solo desktop use — start here
ollama run qwen3.6:27b
# Solo + need a specific flag or brand-new feature
./llama-server -m model.gguf -ngl 99 -c 8192
# Serving concurrent users
vllm serve Qwen/Qwen3.6-27B --quantization awq
# Serving on a shared 24GB GPU — back off the VRAM grab
vllm serve Qwen/Qwen3.6-27B --quantization awq --gpu-memory-utilization 0.5
For the broader hardware context that drives a lot of these decisions, the VRAM requirements guide and the GPU buying guide cover the cards. For Ollama-specific issues, the troubleshooting guide covers the common ones. If you’re running Qwen 3.6 specifically, the local guide and the 3090-specific tuning piece are the next reads.
Related Guides
- Qwen 3.6 Local Guide
- Qwen Models Guide
- Ollama Troubleshooting Guide
- Ollama on Mac (0.30 / MLX setup)
- Ollama Not Using GPU — Fix
- VRAM Requirements for Local LLMs
- GPU Buying Guide for Local AI
- Best Local LLMs for Mac (2026)
- Is Multi-GPU Worth It?
- Fix Slow Qwen 3.6-27B on RTX 3090
- Function Calling with Local LLMs
- Open WebUI Setup Guide
Get notified when we publish new guides.
Subscribe — free, no spam