As a home labber, I’ve always been an advocate for running services on my local hardware instead of relying on cloud platforms. And my stance on FOSS tools hasn’t changed even after I started running large language models. If anything, I’ve grown to love the 9B and 12B LLMs I’ve deployed on my server nodes, as I don’t have to worry about external firms gaining access to all the documents and log files I upload to my models.
But no matter how much I adore my local models, they can’t hold a candle to the hundreds of billions of parameters that Claude, Perplexity, ChatGPT, and other cloud-based models can crunch, especially for demanding coding tasks requiring massive context windows. Or at least, that’s what I thought until I began using Qwen3.6-35B-A3B on my gaming PC. And let me tell you, this powerhouse of an LLM can walk toe-to-toe with pricey cloud models for development workloads – all while running on my weak hardware.
Getting Qwen3.6-35B-A3B to run on my outdated GPU was a bit of a challenge
But with a few tweaks, I got this beast of a model to generate 20+ tokens/s
Before I talk about my experience with Qwen3.6, let me go over the hardware aspect of my LLM-hosting setup. As a broke home labber, the RTX 3080 Ti is the fastest GPU in my arsenal, and it has held up pretty decently for running 12B models, and even GPT-OSS-20B with tweaked parameters. However, it only has 12GB of VRAM and, to be honest, it’s terribly outdated for LLM-powered tasks in 2026. In theory, it would be difficult to run a 27B model on this card, let alone something as bulky as a 35B LLM. But as it turns out, it’s not only possible to load Qwen3.6-35B-A3B on my old system, but it can even drive this LLM at a respectable token generation rate of 24t/s.
Since it’s a mixture of experts model, I can use the --n-cpu-moe flag to offload some expert weights on the CPU instead of forcing them on my graphics card, while -ngl 999 ensures my GPU gets utilized for the KV cache and attention layers. Increasing the CPU threads via -t increases its computational prowess, but since I wanted enough context size for my coding tasks, I set -c to 65536. After asking our resident LLM maestro, Adam Conway, for some tips, I used the following command to get my llama-server instance running with Qwen3.6:
llama-server.exe -m "C:\Users\Ayush\.lmstudio\models\lmstudio-community\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" -c 65536 -ngl 999 --n-cpu-moe 30 -fa on -t 20 -b 2048 -ub 2048 --no-mmap --jinja
Just to test things out, I logged into the web UI created by llama-server and ran a quick query about XDA-Developers. To my surprise, the LLM was able to hit above 20 tokens/s, so I switched to llama-bench to see how far I could go. The max my system could handle was a 16K prompt length with -ctk set to q8_0‑style quantization (though setting -ctv to q8_0‑style would cause it to crash).
llama-bench.exe -m "C:\Users\Ayush\Downloads\LLMs\Qwen3.6-35B-A3B-Q4_K_M.gguf" -p 16000 -n 256 -ngl 999 --n-cpu-moe 32 -fa on -t 20 -b 2096 -ub 2096 -ctk q8_0
In fact, one look at the system resources confirmed that the 32GB RAM on my PC was the bottleneck with these commands. So, if I could track two more 16GB sticks (which may be plausible once the RAM apocalypse ends), I could push the context window and quantization settings even further. But since I wanted to check whether Qwen3.6 could replace its cloud counterparts for my specific tasks, it was time to pair it with my code editors, agentic tools, and typical FOSS apps.
Qwen3.6-35B-A3B is a beast of an LLM
It’s a rock-solid coding companion
Besides my productivity tasks (which I’ll get to in a moment), I extensively use LLMs as my coding companions. And no, I’m not talking about vibe-coding tasks, either. Rather than letting clankers create apps, I use them to troubleshoot tasks, rearrange syntax, and autocomplete my code. Up until now, I’d cycle between Qwen2.5-Coder and DeepSeek R1 for my coding tasks, and while they were decent companions, I’d often have to run them a couple of times and specify the context in great detail before they’d start dishing out helpful suggestions.
How much do you know about Claude?
Trivia challenge
Think you know Anthropic's AI assistant? Put your knowledge of Claude to the test.
Which company created Claude?
What is the name of the safety and values framework Anthropic developed to guide Claude's behavior?
What is the name most commonly associated with inspiring Claude's name?
Which of the following best describes Claude's context window capability in its more advanced versions?
Which of the following principles is NOT part of Anthropic's core goal for Claude?
What was a key distinguishing feature of Claude 2 when it launched compared to many rival models at the time?
Anthropic describes itself primarily as which type of company?
Which of the following tasks is Claude specifically designed to handle well?
Your Score
Thanks for playing!
Qwen3.6, on the other hand, gives solid troubleshooting tips from the very first prompt – to the point where it correctly identifies where my Home Assistant automation was malfunctioning after I paired it with Claude Code and uploaded the trigger-action YAML file and HASS logs. Heck, I’ve even run it through a couple of other terminal logs, and aside from two configs (which, in all fairness, didn’t have enough details about the errors), it pinpointed the faulty packages and configs accurately. Likewise, I paired it with the Continue extension on VS Code, and it’s great at autofilling my variable names and modifying syntax to fit different languages.
I’ve also paired it with a handful of self-hosted FOSS tools
If you’ve read my articles at XDA, you’re probably aware that I use LLMs with certain LXCs and Docker containers. Well, Qwen3.6 is just as effective for my productivity-centric toolkit. Thanks to llama-server’s OpenAPI-compliant nature, I can pair this LLM with everything from my Paperless-ngx companion apps to good ol’ Open WebUI. In fact, I’ve paired it with OpenNotebook, and it’s significantly better than every other local model I’ve used for aggregating my research notes. Likewise, I’ve also added it to Blinko, where it goes through my notes and answers all my queries.
The best part? Not only do I get to keep my notes off the prying eyes of cloud-based AI models, but I also don’t have to spend extra bucks on subscription fees. For example, I’ve been feeding Claude Code error logs every time my home lab breaks, and without Qwen3.6-35B-A3B, I’d have to pay for API usage. And since this 35B behemoth is good enough for research, I can just pair it with Open WebUI + SearXNG instead of relying on ChatGPT.
llama.cpp
Llama.cpp is an open-source framework that runs large language models locally on your computer.
