If you've ever looked into running AI models on your own hardware, you've almost certainly come across Ollama. It's the default recommendation in many different places, it's what most YouTube tutorials point you toward, and it's the tool that a lot of self-hosting guides treat as the starting point for local inference. For good reason, too, because getting a model running with Ollama is as easy as typing "ollama run gpt-oss-20b" and waiting. It's the Docker of local LLMs, and that comparison isn't a coincidence, given that some of the people behind Ollama came from the Docker world.
Ollama's convenience comes at a cost, though, and once you understand what's happening under the hood, it's hard to justify using it over the alternatives. It's slower than it needs to be, it makes choices you can't easily override, and the project itself has been moving in a direction that should concern anyone who cares about open-source software. I've been running local LLMs for a while now, and I switched away from Ollama a long time ago for most of my projects.
Ollama is slower than the tools it's built on
And it hides the settings that would fix it
The most immediate problem with Ollama is performance. Multiple community benchmarks and developer reports have shown that running the same model through Ollama produces fewer tokens per second compared to running it through llama.cpp directly, and this is a problem that keeps recurring. Sometimes it's a pretty big difference, and it's a tangible gap that you can feel when you're waiting on output while using it.
Part of the problem comes down to Ollama's defaults. The context window, for example, will default to 4,096 tokens for most people, and it used to be even lower before. It dynamically sets the default context length depending on how much VRAM is available, but those dynamically set context windows only kick in on GPUs with more than 24GB of VRAM. Even Ollama's own documentation states that you should use 64,000 tokens of context at minimum for "tasks which require large context like web search, agents, and coding tools."
That's incredibly small when looking at modern models that use significantly less VRAM for the KV cache. Gemma 4 supports up to 128K or 256K context depending on the model, and newer architectures can reduce the memory burden of long context enough that a 4K default now feels badly out of step with how these models are actually used.
If you don't manually set num_ctx to something higher either via an environment variable, a command, or through the Ollama API, Ollama can become a bottleneck in long-context workloads, especially once prompts start exceeding the narrow defaults it ships with. The problem is that this isn't especially obvious when Ollama is sitting behind another tool, especially for the beginners Ollama is supposed to help.
On top of that, Ollama's abstraction layer adds overhead that raw llama.cpp doesn't have. The nullmirror team documented their switch from Ollama to llama.cpp and found consistent throughput improvements across every model they tested, with no quality tradeoff. Their conclusion was put pretty bluntly, as they stated that throughput and control mattered more than the convenience Ollama offered.
Ollama & Local LLMs
Trivia Challenge
Think you know your way around running AI models locally — put your Ollama expertise to the test!
What is the primary purpose of Ollama?
Which Ollama CLI command is used to download and run a model for the first time?
Which of the following model families is NOT natively available through Ollama's model library?
When running an LLM with Ollama, what hardware component has the biggest impact on inference speed?
Ollama exposes a local REST API by default. What is the default port it listens on?
In Ollama, what does the model tag ':7b' typically indicate?
After the initial model download, what data does Ollama send to external servers during a chat session by default?
What is a 'Modelfile' in Ollama used for?
Your Score
Thanks for playing!
The trust problem is harder to ignore
Naming things is hard, but this was a choice
Performance is a tradeoff you can choose to accept, but trust is different, and Ollama has been slowly losing it over time.
When DeepSeek released its R1 model family in early 2025, Ollama listed the smaller distilled versions, models like DeepSeek-R1-Distill-Qwen-32B, simply as "DeepSeek-R1" in their library. This created massive confusion. Social media was flooded with people claiming they were running "the" DeepSeek-R1 on consumer hardware, when in reality they were running much smaller distilled variants that behave nothing like the full 671-billion-parameter model. Ollama knew the difference, and they chose to obscure it anyway, presumably because "DeepSeek-R1" drives more downloads than "DeepSeek-R1-Distill-Qwen-32B" does. Even now, "ollama run deepseek-r1" will launch the 8B Qwen3-derived distilled variant.
Then there's the infrastructure itself. Ollama stores models using hashed filenames in its own registry format, which makes it surprisingly difficult to take your downloaded models and use them with another inference engine. If you've been pulling models through Ollama for months, you can't just point LM Studio or llama.cpp at those files without extra work. It's a form of vendor lock-in that most people won't even notice until they try to leave. You can bring your own GGUFs to Ollama by creating a Modelfile, but it's not as easy to bring your Ollama models to other platforms.
About a year ago, Ollama also moved away from using llama.cpp as its inference backend. They built a custom implementation on top of ggml, the lower-level library that llama.cpp itself uses. Their stated reason was stability: llama.cpp moves fast and breaks things, and Ollama's enterprise partners need reliability. That's a fair argument on paper. In practice, though, their custom backend has reintroduced bugs that llama.cpp solved years ago. Community members flagged broken structured output support and other regressions that simply don't exist in upstream llama.cpp.
There have also been complaints about MIT license attribution, where Ollama's binary distributions are accused of not properly crediting the llama.cpp authors whose work the project was built on. To make matters worse, its GUI application was not a part of the main GitHub repository at launch, its license was unclear, and the source code wasn't made available. The app code is in the repository now, but that only makes the earlier rollout look worse, not better. If your project trades on being open source, you do not get to be vague about what is and is not open at launch.
Ollama, to their credit, have tried to publicly give acknowledgement where it's due, and they stuck a "Thank you" note to the end of a blog post published at the same time when that controversy first unfolded.
Ollama is a Y Combinator-backed startup with venture capital funding and a growing team. None of that is inherently bad, but it does mean the project's incentives aren't purely community-driven. The confusingly launched desktop app, the friction-filled model registry, and the move away from llama.cpp all point in the same direction.
The alternatives are easier than you think
You don't need Ollama to keep things simple
The tools Ollama was built on top of are directly accessible, and in most cases, they're not much harder to set up, and llama.cpp is the obvious starting point. It's the go-to C++ inference engine that most of the local LLM world depends on, and it gives you direct control over everything Ollama abstracts away. You get an OpenAI-compatible API server, full control over context windows and sampling parameters, and consistently better throughput. If you want even more performance, ik_llama.cpp is a fork that pushes CPU and multi-GPU performance further, with three or four times speed improvements in some multi-GPU configurations.
Plus, if you're running multiple models and want Ollama-style automatic swapping, llama-swap handles that with a single YAML config file. It sits in front of llama.cpp and routes requests to the right model, spinning models up and down as needed. llama.cpp also has its own web GUI that you can access through your browser to interact with it.
If you want a GUI, LM Studio supports any GGUF model and exposes all of llama.cpp's optimization options through a clean interface. It doesn't have a proprietary model format, nor does it have a registry lock-in, and you can use the same model files with any other tool. On top of that, koboldcpp is another llama.cpp-based option with granular control over every sampling parameter and a built-in web UI.
vLLM is another popular option, and it's what I use with Claude Code on my ThinkStation PGX. It's the best option if you're serving models to multiple users or running agentic workflows, as it handles continuous batching and PagedAttention, which make a real difference in intense workloads. And if you want a polished frontend that works with any of these backends, Open WebUI plugs into all of them with ease.
None of these tools require more than a few minutes to set up. The idea that Ollama is the only beginner-friendly option doesn't hold up once you've actually tried the alternatives, as many have caught up with the ease-of-use that Ollama pioneered years ago.
Ollama served a purpose when local LLMs were new and the tooling around them was rough. That's just not the case anymore. The alternatives are faster, more transparent, and don't come with the baggage of a company making product decisions that increasingly prioritize control and product packaging over the clarity and interoperability that made the project popular. If you're still using it out of habit, it might be time to pull the plug.
