Local LLMs have this annoying middle ground problem. They're good enough that you can see the potential, but just slow enough to get in the way. You really feel the value in them, but the experience never quite clicks.
That's usually where the upgrade cycle starts. You assume the model is the weak link, so you go hunting for something bigger, smarter, or less compromised. I did the same thing.
What took me longer than it should have is realizing the model wasn't really the issue. It was how it was running. Fix that, and the same model that felt frustrating suddenly became a model that I really wanted to use.
Running a local LLM is easy until you actually try to use it every day
Five minutes to set up, five hours to realize you don't actually want to use it
Getting a local LLM running is the easy part. You install LM Studio (I did, at least), pull a model, type a question, and it answers. The first few interactions feel impressive enough that it's easy to think you're done. Then you try to use it like a normal tool, and that's when the cracks start to show.
The answers are fine, sometimes even great, but the experience often is not. Responses take just a little too long. Not broken slow, just enough to break your flow. Conversations lose momentum. Small delays start to add up, and once you notice it, you can't unnotice it.
This is usually where the hobby begins. Instead of using the model, you start tuning it. You compare different quants, maybe switch front ends, adjust context size, and keep an eye on tokens per second. Eventually, the whole project starts to feel like something you're trying to fix instead of something you're trying to use.
None of this means local LLMs are bad. In fact, they're often close to being genuinely useful. Close enough that you can see the version of this that works perfectly. And that stretch, the part between "almost usable" and "actually useful," is where most of the time goes.
I thought the answer was a better model
When in doubt, download something bigger and hope it fixes everything
When a local LLM feels off, the natural reaction is to blame the model. Maybe it's too small, too heavily quantized, or just the wrong one. So you start upgrading. You try a larger model, switch to a higher precision quant, or jump to a different family entirely. The goal then becomes finding the smartest model that still fits your hardware.
That logic sounds right, but the model isn't always the real issue. Weak output often gets blamed on the model's intelligence when the bottleneck is actually elsewhere. Setup, context limits, and inference speed can matter just as much as the model itself.
This is why the same model on the same hardware can feel completely different depending on how it's run. Model quality matters, but it's not always the thing holding you back.
The tweak that changed everything was speculative decoding
Turns out the breakthrough wasn't more power, just less wasted effort
The breakthrough didn't come from downloading yet another model with a bigger number attached to it. It came from using a feature that changes how the same model does its job.
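Instead of one model grinding through every token, you bring in a second, smaller model to sprint ahead and guess the next several tokens. The larger model then checks those guesses together in one pass, which costs far less than generating them one at a time. It keeps the guesses that match what it would have written anyway, and at the first miss it substitutes its own token and the drafting starts again from there.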
This is called speculative decoding, but it's easier to think of it as a draft-and-verify combo. The smaller model drafts. The larger model approves. When I started using speculative decoding, I immediately noticed the difference. You're no longer watching the model slowly assemble a sentence like it's thinking out loud, one word at a time. The responses come back much faster.
Nothing about the model's intelligence changed. It just stopped wasting effort. That alone was enough to turn what had felt like a project into something I actually wanted to use, and enjoyed using.
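If you want to see the draft-and-verify loop in code, here's a minimal Python sketch of the greedy version. Everything in it is an assumption made for illustration: `target_model` and `draft_model` are toy stand-in functions, not real LLMs, and `speculative_generate` is a hypothetical name, not how LM Studio or any other inference engine exposes the feature. A real engine checks all drafted tokens in one batched forward pass and uses a sampling-aware acceptance rule. This sketch simulates that pass with a plain loop and only handles "pick the single most likely token."

```python
from typing import Callable, List, Tuple

Token = int
# A "model" here is any function mapping a context to its greedy next token.
Model = Callable[[List[Token]], Token]


def target_model(context: List[Token]) -> Token:
    """Toy stand-in for the large model (deterministic, purely illustrative)."""
    return (sum(context) * 31 + len(context)) % 50


def draft_model(context: List[Token]) -> Token:
    """Toy stand-in for the small model: agrees with the target most of the time."""
    guess = target_model(context)
    # Deliberately guess wrong whenever the context length is a multiple of 4.
    return guess if len(context) % 4 else (guess + 1) % 50


def plain_generate(target: Model, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: the target model produces every token itself, one call per token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_generate(
    target: Model, draft: Model, prompt: List[Token], max_new_tokens: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns the tokens and the number of verify passes."""
    tokens = list(prompt)
    verify_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model sprints ahead and proposes k tokens.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. The target checks the proposal. A real engine scores all k
        #    positions in one forward pass; this loop only simulates that.
        verify_passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(tokens + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])  # guess matches: keep it
            else:
                accepted.append(expected)  # first miss: use the target's token
                break
        else:
            # Every guess matched, so the same pass also yields one extra token.
            accepted.append(target(tokens + proposal))
        tokens.extend(accepted)

    return tokens[: len(prompt) + max_new_tokens], verify_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    spec_tokens, passes = speculative_generate(target_model, draft_model, prompt, 20)
    same = spec_tokens == plain_generate(target_model, prompt, 20)
    print(f"same output as target alone: {same}, verify passes: {passes}")
```

The thing to notice is that the speculative version produces exactly the same tokens as the large model running alone, while the large model is consulted far fewer times than there are tokens. How much that saves in practice depends on how often the draft model guesses right, which this toy setup fixes artificially.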
Speculative decoding matters more than most tuning settings
You can tweak personality all day, but speed is what decides if you come back
If you've spent time with local LLMs, you've probably adjusted a few settings here and there: temperature, context size, penalties, maybe a different quantization. The model does behave a little differently when you make these tweaks, but they mostly change behavior or output quality. They don't change how the model actually feels to use.
Speculative decoding is different. It changes speed, which changes everything. The answers arrive much faster, which makes the entire experience smoother. This matters more than people think, because other common tweaks usually keep you in a loop of constant adjustment. You're always testing, comparing, and trying to squeeze out slightly better results.
This is one of the few changes that improves usability right away.
I stopped looking for a better model and started using the one I had
At some point, it hit me that the model wasn't the issue. It was already good enough. It just felt like using it required patience I didn't have. Once that changed, my urge to keep searching for better models disappeared.
That was the real lesson for me. Local LLM performance is about how you run it, not just about what you run. You don't always need a better model. Sometimes you just need to stop making the one you have work so hard.
