The first thing many people look at when picking a local model is the parameter count. The logic is usually that a bigger number equals a better model. I thought that too once getting started with self-hosting LLMs, and it’s not entirely wrong. But after running my go-to 20B model into a wall with longer prompts, I switched to a 9B model and got better results. Not because it’s inherently better than 20B, but because the one I switched to was built for a much larger context window, and that ended up mattering more for how I actually use LLMs.
Context window is basically your model’s working memory. Everything in your conversation, your prompt, and the response it generates has to fit inside it. If your model has a massive parameter count but a tiny context window, it’s going to choke on anything longer than a few paragraphs. Another thing I hadn’t considered was how badly the default settings were getting in my way, and how much of the “limitation” I’d been blaming on my hardware was actually just a settings problem I hadn’t looked at yet.
What I’m working with
I switched to something smaller but more capable
My setup is pretty basic and modest. I’m running a GPU with 8GB of VRAM, and I’m running everything through LM Studio - it was the first runner I tried and I liked the GUI, so it just stuck. Up until recently, OpenAI’s gpt-oss 20B was my go-to model. With STEM and general knowledge as its strong suit, 20 billion parameters, and up to 128k tokens, I figured it was a strong middle ground between something that could actually run on my hardware and something capable enough to be useful. It ran smoothly on my setup through GPU offloading, even though it’s officially designed for 16GB of VRAM.
For most things, it was fine. I would primarily prompt it for quick bursts of information or a bit of brainstorming. But the problem showed up when I started throwing longer prompts at it. Its context limits as well as my limited VRAM became more apparent when I put it up against Claude, tasked with creating a short UX design self-study curriculum - it kept hitting the context wall.
So a colleague pointed me towards the Qwen family of models specifically because of GDN (Gated DeltaNet), a hybrid architecture that handles context very differently to a standard transformer like gpt-oss. Basically, standard transformers grow a KV (key value) cache for every token in your context - the longer a conversation, the more VRAM it eats. GDN replaces most of those layers with a fixed-size memory state, so VRAM usage stays mostly constant even with longer context length.
Now, I’m working with Qwen 3.5 9B (q4_k_m), which is significantly smaller than my gpt-oss 20B, about half the size. But it has a much larger context window (up to 262k) and uses context more efficiently without eating up VRAM, thanks to GDN. So whereas I could previously bump up context length to around 30K for gpt, my PC could barely handle it, but I can go far above that with Qwen and my PC is fine.
Qwen3.5-9B tops every AI benchmark right now, but that's not how you should pick a model
There's a lot more to a model than just benchmarks.
Qwen 3.5 9B has a much larger context window
It didn’t work well at first, but that was on me
I first attempted Qwen with the same long prompt I wrote about in this LLM comparison article - a prompt for a UX research study guide. And it didn’t go much better than it did with gpt-oss as it couldn’t generate the full course. Keep in mind that the context length was set to 16k at this point. My first instinct was that Thinking Mode was the culprit - Qwen runs Thinking by default, so it burns through a chunk of your token budget just reasoning before it even gets started on the response. So I turned it off… and it still failed.
At this point I went into LM Studio’s settings. I noticed Limit Response Length was ticked on and capped at 1,643 tokens, which meant Qwen was being cut off mid-response regardless of anything else. I messed around with the sliders the night before when writing this article about prompting tips and completely forgot I turned it on! After turning that off, it was smooth sailing, with one thing worth noting: Qwen has a tendency to overthink and over-explain, even with Thinking turned off. But a system prompt can keep that in check - instruct it to be concise, skip the preamble, stick to what you actually asked for, and not narrate its reasoning process.
Beyond the system prompt, it’s also worth tweaking a few parameters in your runner. The ones that matter most for reining in Qwen’s verbosity are the presence and repetition penalties (nudged up), and min-p (keep low). Temperature depends on what you’re doing - lower for precision, higher for general use. With context length bumped up to 30k, Thinking turned off, and my parameters tweaked, Qwen generated a really practical and actionable study guide. It turned out to be much more comprehensive than anything gpt-oss ever gave me.
And then I started to push the context limits
After seeing what Qwen was capable of with the same context length, and my fans weren’t even ramping up, I wanted to see how far I could push it. It didn’t necessarily generate better results with a higher context length apart from being a bit more detail-oriented. So the real test was going to be how much of the context it actually remembers as the chat got longer, which is also more realistic for how I use AI - lots of back-and-forth.
So I decided to try the needle in a haystack test. Basically, you take a massive wall of text (haystack), hide a specific piece of information inside it (needle), and ask the model to retrieve the needle. I generated a long wall of text that would equal to around 50k tokens, and hid some key phrases in there. When Qwen was set to 30k context length, it couldn’t find it, but at 60k, it was able to find it. This pretty much confirmed that the context window was working as expected. It wasn't just a setting I'd bumped up and hoped for the best, the model was genuinely attending to content across the full 60k tokens, early, middle, and late in the text. For a 9B model running on 8GB of VRAM, that's pretty solid.
Based on the needle test, I can run much longer sessions than I ever could with gpt-oss without the model losing the thread. I’ve been using it for UX and design search queries, study sessions, and general back-and-forth where context tends to accumulate. I’m sitting comfortably at 60k and nothing has cut off or gone sideways yet - though it puts me at 7.6 out of 8GB dedicated VRAM used, so that's close to the ceiling. For a 9B model on modest hardware, this headroom feels genuinely different to what I was working with before.
7 things I wish I knew when I started self-hosting LLMs
I've been self-hosting LLMs for quite a while now, and these are all of the things I learned over time that I wish I knew at the start.
The model size isn’t the whole story
I honestly avoided the Qwen family for a while because every time I saw it come up, it was in the context of coding benchmarks - and I don’t code. So I swore it off as a developer tool and stuck to general-purpose models. Turns out that reputation sells it badly for everything else it’s capable of. If context window matters to your workflow - studying, research, coursework, and so on - architecture is worth paying attention to before parameter count. The B number isn’t nothing, but it’s not the whole story either.
