Voozh

There's a version of local LLMs that lives in my head from how they were a couple of years ago. That they’re slow, clunky, need expensive hardware to run anything worth using, and outputs that feel like a worse version of what you already have in your browser. That mental model made sense at the time because local models really were like that for a while, and the barrier to entry was high enough to write them off if you weren't a serious tinkerer.

I only actually tried one about four months ago, and I was pretty wrong about most of it. Not wrong in the sense that those limitations never existed - they did - but wrong in the sense that I was still treating them as dealbreakers long after that had mostly stopped being true. The hardware bar isn’t as high anymore, the interfaces got way more approachable, and the models themselves got genuinely capable. I'm still figuring them out, honestly, but that's kind of the point.

👁 A MacBook air connected to a monitor running DeepSeek-R1 locally

7 things I wish I knew when I started self-hosting LLMs

I've been self-hosting LLMs for quite a while now, and these are all of the things I learned over time that I wish I knew at the start.

By Adam Conway

The hardware bar is lower than you think

A gaming PC from a couple of years ago is apparently all you need

Credit:

The hardware thing was my biggest assumption, and also the one that fell apart fastest. I had my PC built a couple of years ago as a decent gaming rig - it’s nothing exotic, just something that could handle modern games without struggling. I don’t really speak hardware, but just for reference, I’m working with RTX 3070 and 8GB VRAM. I definitely didn't get it with local AI in mind because that wasn't even on my radar.

It runs local LLMs fine. I can push a 20B model with GPU offloading without it falling apart, and the model I actually use day-to-day is Qwen 3.5 9B, which I can run at a 60k context window in my runner. That last part is partly down to the model's architecture - Qwen 3.5 uses something called GDN, which handles long contexts without the usual memory blowup you'd get from a standard transformer model. So instead of VRAM usage climbing as context grows, it stays flat. A 9B model holding 60k tokens of context on 8GB VRAM is not something I would have believed was possible before actually trying it myself. Most of the Qwen models use GDN, so if long context on limited VRAM is what you're after, that's probably where you're looking.

Latency was another thing I had wrong. I expected it to feel sluggish - and it kind of did with my first batch of models. But Qwen 3.5 9B on my setup runs at somewhere around 40-50 tokens per second, which in practice just means it feels responsive. Not identical to a cloud model, but not the painful crawl I was expecting either.

What a local LLM looks like in practice right now

It covers more of the same ground as cloud AI than you’d expect

I come across a lot of local LLM content in the context of coding, and I’ll just say up front that’s not my wheelhouse at all - I’m just a regular user who uses AI the way most people do. My runner of choice is LM Studio, which has a clean and easy interface that never felt like I was doing anything particularly technical. It's basically just a chat window with some controls.

The use case I keep coming back to is building study materials. I'm doing design coursework at the moment, and what I've been doing is feeding Qwen my official course docs and having it generate structured course content from them - short weekend study guides and exercises. But I also just prompt it on the spot when I need to work through something I don't have material for yet. Qwen 3.5 9B punches above its size because it was trained on knowledge distilled from a much larger 397B model - so you're getting more capability than the parameter count suggests. And I also have Brave Search MCP hooked up so it can pull from the web if it doesn’t know something.

Whenever Qwen gives me something good enough to work with, I convert the entire conversation to a PDF file using this LM Studio converter tool, so I can treat it like a real document as part of my course material and open it anywhere.

Image analysis is another thing I didn’t expect it would be so good at. On the MMMU-Pro visual reasoning benchmark, Qwen3.5-9B scored 70.1, which is not far behind leading models like Gemini 3 Pro and GPT-5.4 that score in the 80% range, and explains why it reads my screenshots with such high accuracy. I upload screenshots of UI designs and ask it to flag inconsistencies or improve the design - basically the same way you'd use Claude for design feedback, but locally. It also handles real-life images with organic subjects just as well as screenshots and digital content, which opens it up beyond the obvious design and document use cases.

In fact, Qwen 3.5 9B does lead on a lot of AI benchmarking. However, as this article discusses, while benchmarks don’t mean nothing, they’re not how you should pick your model - it depends on what you actually use AI for. All I know is that it’s great at handling long context, which is why I use it just about as much as Gemini and Claude.

👁 gemma-4-feature-image

Google's Gemma 4 isn't the smartest local LLM I've run, but it's the one I reach for most

Google's newest Gemma 4 models are both powerful and useful.

By Adam Conway

The privacy thing is actually a real benefit

Some data is better kept local

The one thing I do genuinely appreciate that I didn't expect to is the privacy side of it. Nothing I type into LM Studio leaves my machine - no training data, server logs, or terms of service clause about how my inputs might be used to improve the product. For most of what I use AI for, that doesn't matter much. But there are topics I'd rather not run through a cloud model - health stuff, finances, anything where I'd have to hand over personal context to get a useful answer. For that kind of thing, local just makes more sense, and I don't have to think twice about what I'm sharing.

Several months in and I'm still finding new things to do with it. The assumptions I had about local LLMs being slow, hardware-hungry, and only worth it if you're technical - most of them didn’t hold up. If you've been putting it off for the same reasons I was, it's probably worth an afternoon to set one up and find out how wrong you might be too.

URL: https://www.xda-developers.com/local-llms-are-good-now-wasted-months-not-realizing-it/

⇱ Local LLMs are actually good now, and I wasted months not realizing it