When I first got into self-hosting LLMs, I went straight for a general-purpose model. OpenAI's gpt-oss 20B seemed like a safe pick - it was well-rounded, and it didn't lean too hard into any one domain. I ran it through LM Studio and used it for everything from quick queries to longer research sessions. It was fine for a while, but the cracks started showing with longer, more detailed prompts.
I'd seen the Qwen family come up a lot, but every time I looked into it, the conversation was always about coding benchmarks and developer workflows. So I kind of just wrote it off as a coder's model. But after my gpt-oss started hitting walls, I was pointed toward Qwen 3.5 9B specifically because of how it handles context. I gave it a shot and honestly, it ended up becoming the model I reach for every day now - for things that have nothing to do with code.
I replaced my local LLM with a model half its size and got better results — And it wasn't about the parameters
I switched from a 20B model to a 9B one, and it was better
A 9B model has no right being this good
And it runs on basically anything
Before Qwen, I had this assumption that anything under 12B-ish was going to feel subpar. And that wasn't completely baseless - I'd tried Google's Gemma-3n-E4B before, which runs on a completely different architecture, and it kind of confirmed that bias.The context window caps at 32K, the responses lack depth on anything past surface-level queries, and you can just tell the model is working with less. So when I went from a 20B model down to Qwen 3.5 9B, I expected more of the same. But it was leagues ahead of gpt-oss - not just in context, but in reasoning and multimodal tasks. The responses were more structured, more detailed, and I wasn't re-prompting nearly as much, which honestly matters more than anything when you're going back and forth for longer sessions.
Want to stay in the loop with the latest in AI? The XDA AI Insider newsletter drops weekly with deep dives, tool recommendations, and hands-on coverage you won't find anywhere else on the site. Subscribe by modifying your newsletter preferences!
The reason it punches above its weight comes down to how it's built. Qwen 3.5 9B uses a hybrid architecture called Gated DeltaNet (GDN) that handles context very differently from standard transformers. Instead of growing its memory usage the longer a conversation gets, it maintains a mostly fixed memory state, which means it doesn't eat up your VRAM the way a traditional model would with longer sessions. It supports up to 262k tokens natively - for a 9B model, that's pretty wild. It comes in at around 6.6GB, it comfortably fits on an 8GB GPU, and at 4-bit quantization (Q4_K_M), it requires 5.1 - 5.7 GB VRAM. So when I say it runs on basically anything, I mean you don't need expensive hardware to get genuinely useful output from it - I can push the context length up to 60k on 8GB VRAM without issue.
The other thing that caught me off guard was the multimodal support. Qwen 3.5 9B handles text, images, and even video natively from the same model weights - there's no separate vision components you need to download or configure on top of it. You just throw a screenshot or a document image at it and it processes it directly. I didn't even know this was a thing for a model this size until I tried it - most of the sub-10B models are text-only.
What I actually use it for every day
And none of it involves a terminal
The biggest irony of this whole thing is that I avoided the Qwen family for months because I thought it was a coding-first model. And it is, sort of - the Qwen3-Coder-Next variant is genuinely built for developers. But the 3.5 9B is actually a general-purpose model perfect for users like me, and it slotted into my workflow pretty naturally. I use it for understanding concepts, breaking down dense material, getting structured explanations on topics I'm researching, and just having a back-and-forth when I need to think something through. It's the same kind of use I get out of Claude or Gemini, just offline and without a usage cap.
What keeps me coming back is that it gives me thoughtful responses without needing a lot of hand-holding. I create short weekend courses with it - and even when I don’t give it structured instructions, it still gives me a reasonable and actionable guide. It also reads my PDF files in an instant and can easily sum up a 160-page document. I’m mostly impressed with its image analysis abilities, though. To demonstrate, I added an image without any text or clear shapes, just the fur of my cat on a blanket, and it described the scene accurately.
I finally found a local LLM I actually want to use for coding
Qwen3-Coder-Next is a great model, and it's even better with Claude Code as a harness.
It's not plug-and-play though
A few settings make a huge difference
Qwen 3.5 9B doesn't behave perfectly out of the box. Thinking Mode is on by default, but I’d turn it off if you want to preserve tokens - it can burn through them just with one long thinking process before it even gets to the response. I’d also bump up the context length as high as your VRAM allows since the default in most runners is pretty conservative and you're leaving the best part of this model on the table if you don't. Presence penalty at 1.5 and repeat penalty at 1 helps with the over-explaining for general tasks. And a system prompt telling it to be concise and skip the preamble makes more difference than any single slider.
Free, local, and I'm not complaining
I didn't expect a 9B model to become my daily driver, and I definitely didn't expect it to come from a family I'd written off as a developer tool. But Qwen 3.5 9B does what I need it to do, runs on my modest hardware without complaint, and costs me absolutely nothing. It's not going to replace Claude for everything, but for offline, day-to-day use, it's earned its spot in my rotation.
