If you've ever looked into running AI models on your own hardware, you've almost certainly come across Ollama. It's the default recommendation in many different places, it's what most YouTube tutorials point you toward, and it's the tool that a lot of self-hosting guides treat as the starting point for local inference. For good reason, too, because getting a model running with Ollama is as easy as typing "ollama run gpt-oss-20b" and waiting. It's the Docker of local LLMs, and that comparison isn't a coincidence, given that some of the people behind Ollama came from the Docker world.

Ollama's convenience comes at a cost, though, and once you understand what's happening under the hood, it's hard to justify using it over the alternatives. It's slower than it needs to be, it makes choices you can't easily override, and the project itself has been moving in a direction that should concern anyone who cares about open-source software. I've been running local LLMs for a while now, and I switched away from Ollama a long time ago for most of my projects.

Ollama is slower than the tools it's built on

And it hides the settings that would fix it

The most immediate problem with Ollama is performance. Multiple community benchmarks and developer reports have shown that running the same model through Ollama produces fewer tokens per second compared to running it through llama.cpp directly, and this is a problem that keeps recurring. Sometimes it's a pretty big difference, and it's a tangible gap that you can feel when you're waiting on output while using it.

Part of the problem comes down to Ollama's defaults. The context window, for example, will default to 4,096 tokens for most people, and it used to be even lower before. It dynamically sets the default context length depending on how much VRAM is available, but those dynamically set context windows only kick in on GPUs with more than 24GB of VRAM. Even Ollama's own documentation states that you should use 64,000 tokens of context at minimum for "tasks which require large context like web search, agents, and coding tools."

That's incredibly small when looking at modern models that use significantly less VRAM for the KV cache. Gemma 4 supports up to 128K or 256K context depending on the model, and newer architectures can reduce the memory burden of long context enough that a 4K default now feels badly out of step with how these models are actually used.

If you don't manually set num_ctx to something higher either via an environment variable, a command, or through the Ollama API, Ollama can become a bottleneck in long-context workloads, especially once prompts start exceeding the narrow defaults it ships with. The problem is that this isn't especially obvious when Ollama is sitting behind another tool, especially for the beginners Ollama is supposed to help.

On top of that, Ollama's abstraction layer adds overhead that raw llama.cpp doesn't have. The nullmirror team documented their switch from Ollama to llama.cpp and found consistent throughput improvements across every model they tested, with no quality tradeoff. Their conclusion was put pretty bluntly, as they stated that throughput and control mattered more than the convenience Ollama offered.

👁 XDA
Quiz
8 Questions · Test Your Knowledge

Ollama & Local LLMs
Trivia Challenge

Think you know your way around running AI models locally — put your Ollama expertise to the test!

SetupModelsCommandsPrivacyHardware
01 / 8Basics

What is the primary purpose of Ollama?

That's right! Ollama is designed to let you download and run LLMs entirely on your own machine, with no internet connection required after setup. This means your conversations and data never leave your device.
Not quite — Ollama's whole mission is local inference. It lets you pull and run models like Llama, Mistral, and Gemma directly on your hardware, keeping everything private and offline after the initial download.
02 / 8Commands

Which Ollama CLI command is used to download and run a model for the first time?

Correct! 'ollama run' is the all-in-one command that pulls a model if it isn't already downloaded, then launches an interactive chat session immediately. It's one of the most commonly used commands in the Ollama CLI.
The correct command is 'ollama run'. It automatically handles downloading the model if needed and then drops you straight into a chat session — no separate pull step required on first use, though 'ollama pull' also exists.
03 / 8Models

Which of the following model families is NOT natively available through Ollama's model library?

Exactly right! GPT-4o is a proprietary OpenAI model only accessible via the OpenAI API — it cannot be run locally through Ollama. Ollama supports open-weight models like Llama 3, Mistral, Gemma, and many others.
The answer is GPT-4o. It's a closed, proprietary model owned by OpenAI and only accessible through their paid API. Ollama specializes in open-weight models like Llama 3, Mistral, and Gemma that can be legally downloaded and run locally.
04 / 8Hardware

When running an LLM with Ollama, what hardware component has the biggest impact on inference speed?

Spot on! VRAM is the key bottleneck for local LLM inference. If a model fits entirely in your GPU's VRAM, it runs dramatically faster than when it falls back to system RAM or CPU processing. More VRAM means you can run larger models at full speed.
The correct answer is GPU VRAM. While CPU speed matters for CPU-only inference, having enough VRAM to load the model onto your GPU is the single biggest factor in how fast responses generate. A GPU with 16GB VRAM can run models that would crawl on a CPU alone.
05 / 8Setup

Ollama exposes a local REST API by default. What is the default port it listens on?

Correct! Ollama's API server runs on port 11434 by default. This allows other local applications, scripts, and tools like Open WebUI to communicate with Ollama using standard HTTP requests, making it easy to build integrations.
The default port is 11434, which is a distinctive choice that avoids conflicts with common developer ports like 8080 or 3000. You can send requests to http://localhost:11434/api/generate to interact with models programmatically.
06 / 8Models

In Ollama, what does the model tag ':7b' typically indicate?

You got it! The 'b' stands for billion, so ':7b' means the model has roughly 7 billion parameters. More parameters generally means better reasoning ability but requires more VRAM. Common sizes include 3b, 7b, 13b, 34b, and 70b.
The ':7b' tag refers to 7 billion parameters — the 'b' is short for billion. Parameters are the learned weights that define a model's capabilities. Larger parameter counts typically produce smarter outputs but demand significantly more memory to run.
07 / 8Privacy

After the initial model download, what data does Ollama send to external servers during a chat session by default?

Exactly! Once a model is downloaded, Ollama operates entirely offline. Your prompts, responses, and conversation history never leave your machine. This makes it ideal for sensitive work, confidential documents, or simply those who value data privacy.
The correct answer is that no data is sent externally. Ollama's core value proposition is full local inference — after downloading the model, everything runs on your own hardware with zero telemetry or cloud communication during chat sessions.
08 / 8Advanced

What is a 'Modelfile' in Ollama used for?

That's correct! A Modelfile is Ollama's way of letting you customize and create your own model variants. You can set a system prompt to define a persona, adjust temperature, set context length, and even layer on top of existing base models — similar in concept to Docker's Dockerfile.
A Modelfile is a configuration script — similar to a Dockerfile — that lets you define a custom model persona. You can bake in a system prompt, tweak sampling parameters like temperature, and build a named custom model using 'ollama create' with your Modelfile.
Challenge Complete

Your Score

/ 8

Thanks for playing!

The trust problem is harder to ignore

Naming things is hard, but this was a choice

Credit: 

Performance is a tradeoff you can choose to accept, but trust is different, and Ollama has been slowly losing it over time.

When DeepSeek released its R1 model family in early 2025, Ollama listed the smaller distilled versions, models like DeepSeek-R1-Distill-Qwen-32B, simply as "DeepSeek-R1" in their library. This created massive confusion. Social media was flooded with people claiming they were running "the" DeepSeek-R1 on consumer hardware, when in reality they were running much smaller distilled variants that behave nothing like the full 671-billion-parameter model. Ollama knew the difference, and they chose to obscure it anyway, presumably because "DeepSeek-R1" drives more downloads than "DeepSeek-R1-Distill-Qwen-32B" does. Even now, "ollama run deepseek-r1" will launch the 8B Qwen3-derived distilled variant.

Then there's the infrastructure itself. Ollama stores models using hashed filenames in its own registry format, which makes it surprisingly difficult to take your downloaded models and use them with another inference engine. If you've been pulling models through Ollama for months, you can't just point LM Studio or llama.cpp at those files without extra work. It's a form of vendor lock-in that most people won't even notice until they try to leave. You can bring your own GGUFs to Ollama by creating a Modelfile, but it's not as easy to bring your Ollama models to other platforms.

About a year ago, Ollama also moved away from using llama.cpp as its inference backend. They built a custom implementation on top of ggml, the lower-level library that llama.cpp itself uses. Their stated reason was stability: llama.cpp moves fast and breaks things, and Ollama's enterprise partners need reliability. That's a fair argument on paper. In practice, though, their custom backend has reintroduced bugs that llama.cpp solved years ago. Community members flagged broken structured output support and other regressions that simply don't exist in upstream llama.cpp.

There have also been complaints about MIT license attribution, where Ollama's binary distributions are accused of not properly crediting the llama.cpp authors whose work the project was built on. To make matters worse, its GUI application was not a part of the main GitHub repository at launch, its license was unclear, and the source code wasn't made available. The app code is in the repository now, but that only makes the earlier rollout look worse, not better. If your project trades on being open source, you do not get to be vague about what is and is not open at launch.

Ollama, to their credit, have tried to publicly give acknowledgement where it's due, and they stuck a "Thank you" note to the end of a blog post published at the same time when that controversy first unfolded.

Ollama is a Y Combinator-backed startup with venture capital funding and a growing team. None of that is inherently bad, but it does mean the project's incentives aren't purely community-driven. The confusingly launched desktop app, the friction-filled model registry, and the move away from llama.cpp all point in the same direction.

The alternatives are easier than you think

You don't need Ollama to keep things simple

Credit: Shekhar Vaidya/XDA

The tools Ollama was built on top of are directly accessible, and in most cases, they're not much harder to set up, and llama.cpp is the obvious starting point. It's the go-to C++ inference engine that most of the local LLM world depends on, and it gives you direct control over everything Ollama abstracts away. You get an OpenAI-compatible API server, full control over context windows and sampling parameters, and consistently better throughput. If you want even more performance, ik_llama.cpp is a fork that pushes CPU and multi-GPU performance further, with three or four times speed improvements in some multi-GPU configurations.

Plus, if you're running multiple models and want Ollama-style automatic swapping, llama-swap handles that with a single YAML config file. It sits in front of llama.cpp and routes requests to the right model, spinning models up and down as needed. llama.cpp also has its own web GUI that you can access through your browser to interact with it.

If you want a GUI, LM Studio supports any GGUF model and exposes all of llama.cpp's optimization options through a clean interface. It doesn't have a proprietary model format, nor does it have a registry lock-in, and you can use the same model files with any other tool. On top of that, koboldcpp is another llama.cpp-based option with granular control over every sampling parameter and a built-in web UI.

vLLM is another popular option, and it's what I use with Claude Code on my ThinkStation PGX. It's the best option if you're serving models to multiple users or running agentic workflows, as it handles continuous batching and PagedAttention, which make a real difference in intense workloads. And if you want a polished frontend that works with any of these backends, Open WebUI plugs into all of them with ease.

None of these tools require more than a few minutes to set up. The idea that Ollama is the only beginner-friendly option doesn't hold up once you've actually tried the alternatives, as many have caught up with the ease-of-use that Ollama pioneered years ago.

Ollama served a purpose when local LLMs were new and the tooling around them was rough. That's just not the case anymore. The alternatives are faster, more transparent, and don't come with the baggage of a company making product decisions that increasingly prioritize control and product packaging over the clarity and interoperability that made the project popular. If you're still using it out of habit, it might be time to pull the plug.