LM Studio has been my default runner for as long as I've been running local LLMs, which is more than long enough now to call it part of my daily flow rather than just something I'm experimenting with anymore. The appeal of LM Studio is pretty much that it has a GUI, one-click installs, and no command-line stuff to wade through. Development is not my domain, even using the terminal isn't, so a user-friendly runner was important to me and is what got me comfortable with self-hosting AI in the first place.
But the more I've leaned on local models for the stuff I actually don't want a cloud chatbot to touch, the more I've started running into the limits of what LM Studio can actually do. Some models don't quite work properly in it, so some flagship features in newer models go untouched because my runner just doesn't support them yet. A coworker mentioned llama.cpp to me a while ago and at first I filed it as the developer option, but when I finally caved, it became clear I've been gatekeeping myself out of something a lot more approachable than I assumed.
Want to stay in the loop with the latest in AI? The XDA AI Insider newsletter drops weekly with deep dives, tool recommendations, and hands-on coverage you won't find anywhere else on the site. Subscribe by modifying your newsletter.
The terminal-based runner I avoided for no real reason
I got it running in five minutes
For the longest time, llama.cpp lived in my head as the option you graduate to if you really know what you're doing. Every setup guide I'd scroll past would open with something about installing a compiler, and I'd close the tab before the page finished loading. None of that turned out to be true for my use case though. The GitHub releases page has prebuilt binaries for Windows, Mac, and Linux, with separate builds depending on your hardware. Literally all it took was downloading, unzipping, running one command in the terminal, and that's it.
llama.cpp is an open-source C++ runtime for running large language models locally, built by Georgi Gerganov in March 2023 right after Meta dropped the LLaMA weights. And actually, llama.cpp is the core backend engine for LM Studio, Ollama, and most other local AI apps you've heard of. They're essentially wrappers built around it, so going direct cuts out the middleman. It also ships with llama-server and a built-in web UI you access through your browser, so the actual chatting can happen in a clean GUI.
There are real reasons to use it over a GUI runner. Wrappers add overhead, so the same model on the same hardware runs noticeably faster in llama.cpp than in LM Studio, somewhere in the 5-20% range depending on your setup. llama.cpp also tends to support new models first because it's the upstream project everything else is built on, whereas LM Studio and Ollama have to wait for an update cycle. So you're not waiting on your runner to catch up to use the newest open weights.
Ollama is still the easiest way to start local LLMs, but it's the worst way to keep running them
Ollama is great for getting you started... just don't stick around.
What pushed me out of LM Studio
I'm kind of starting to outgrow it
I'm not actually going to completely stop using LM Studio,and it's not a bad tool. But the more I've leaned on local models for actual workflows, the more I've started bumping into things LM Studio just doesn't support yet - and the model I wanted to use was one of them.
That model was Gemma 4 E4B. Its standout feature is native audio input - automatic speech recognition and speech-to-translated-text across multiple languages - and that capability only exists on the two edge variants, E2B and E4B. None of the larger Gemma 4 models have it. So if your runner doesn't do audio, you're using a multimodal model with one of its modes turned off. LM Studio doesn't support audio at all, meaning I couldn't actually use Gemma to its full capacity, and that's what finally pushed me to give llama.cpp a spin.
LM Studio
What actually changed since switching to llama.cpp
A few unexpected wins along the way
The first thing I noticed was something I wasn't even trying to fix and actually forgot about: Gemma 4's reasoning bleed in LM Studio, where it combines its thinking with the response. It wasn't there anymore. llama.cpp's WebUI puts reasoning in a separate collapsible box, so you see the response and can optionally expand the thinking if you want to. I'd resigned myself to living with that bug in LM Studio, but llama.cpp fixed it.
The audio side was the actual reason I switched, and it works well. You can't hit a record button without doing some workarounds, but uploading WAV files is a bit quicker than setting that up anyway. Its audio analysis is near-perfect based on my tests of voicing the same prompts I sent in text - it interpreted them the same. Image analysis is also meeting my expectations. It goes beyond text and can also interpret organic objects and understand the context of a photo.
But the value I was actually getting was from the built-in functions. It has individual session system prompts, a cleaner way to hook up MCP servers, and way more parameter controls than LM Studio. Since I don't use llama.cpp for development but just research and daily tasks, these controls are where most of the upgrade was for me.
Most of the parameters beyond the regulars (temp, min-p, and repeat and presence penalty) aren't as applicable to my workflow, but I found one that does actually make a noticeable difference. DRY (Don't Repeat Yourself) is a smarter version of repeat penalty - it only kicks in when whole phrases start repeating, not individual tokens, so I'm not getting the side effect of the model avoiding common words like "the" unnaturally.
Mirostat and Dynamic Temperature are also worth a mention. Both adjust how the model picks tokens in real-time based on the actual response, instead of you locking everything to a fixed value upfront. This gives you more consistent prose within longer sessions. But these are more suited for workflows that lean more creative than technical.
7 things I wish I knew when I started self-hosting LLMs
I've been self-hosting LLMs for quite a while now, and these are all of the things I learned over time that I wish I knew at the start.
I should have switched sooner
I switched because the model I wanted to use had moved ahead of the tool I was running it in, and llama.cpp could actually keep up. That's probably going to keep happening as open models get more capable, so it's worth having a runner that doesn't lag behind. Also, despite llama.cpp primarily being aimed at devs, you don't have to be one to use it. Once the GUI is running, it's basically regular chat but just faster, cleaner, and with way more controls. LM Studio still has its place for casual use, but llama.cpp has truly been an upgrade.
