Voozh

As a home labber, I’ve always been an advocate for running services on my local hardware instead of relying on cloud platforms. And my stance on FOSS tools hasn’t changed even after I started running large language models. If anything, I’ve grown to love the 9B and 12B LLMs I’ve deployed on my server nodes, as I don’t have to worry about external firms gaining access to all the documents and log files I upload to my models.

But no matter how much I adore my local models, they can’t hold a candle to the hundreds of billions of parameters that Claude, Perplexity, ChatGPT, and other cloud-based models can crunch, especially for demanding coding tasks requiring massive context windows. Or at least, that’s what I thought until I began using Qwen3.6-35B-A3B on my gaming PC. And let me tell you, this powerhouse of an LLM can walk toe-to-toe with pricey cloud models for development workloads – all while running on my weak hardware.

Getting Qwen3.6-35B-A3B to run on my outdated GPU was a bit of a challenge

But with a few tweaks, I got this beast of a model to generate 20+ tokens/s

Before I talk about my experience with Qwen3.6, let me go over the hardware aspect of my LLM-hosting setup. As a broke home labber, the RTX 3080 Ti is the fastest GPU in my arsenal, and it has held up pretty decently for running 12B models, and even GPT-OSS-20B with tweaked parameters. However, it only has 12GB of VRAM and, to be honest, it’s terribly outdated for LLM-powered tasks in 2026. In theory, it would be difficult to run a 27B model on this card, let alone something as bulky as a 35B LLM. But as it turns out, it’s not only possible to load Qwen3.6-35B-A3B on my old system, but it can even drive this LLM at a respectable token generation rate of 24t/s.

Since it’s a mixture of experts model, I can use the --n-cpu-moe flag to offload some expert weights on the CPU instead of forcing them on my graphics card, while -ngl 999 ensures my GPU gets utilized for the KV cache and attention layers. Increasing the CPU threads via -t increases its computational prowess, but since I wanted enough context size for my coding tasks, I set -c to 65536. After asking our resident LLM maestro, Adam Conway, for some tips, I used the following command to get my llama-server instance running with Qwen3.6:

llama-server.exe -m "C:\Users\Ayush\.lmstudio\models\lmstudio-community\Qwen3.6-35B-A3B-GGUF\Qwen3.6-35B-A3B-Q4_K_M.gguf" -c 65536 -ngl 999 --n-cpu-moe 30 -fa on -t 20 -b 2048 -ub 2048 --no-mmap --jinja

Just to test things out, I logged into the web UI created by llama-server and ran a quick query about XDA-Developers. To my surprise, the LLM was able to hit above 20 tokens/s, so I switched to llama-bench to see how far I could go. The max my system could handle was a 16K prompt length with -ctk set to q8_0‑style quantization (though setting -ctv to q8_0‑style would cause it to crash).

llama-bench.exe -m "C:\Users\Ayush\Downloads\LLMs\Qwen3.6-35B-A3B-Q4_K_M.gguf" -p 16000 -n 256 -ngl 999 --n-cpu-moe 32 -fa on -t 20 -b 2096 -ub 2096 -ctk q8_0

In fact, one look at the system resources confirmed that the 32GB RAM on my PC was the bottleneck with these commands. So, if I could track two more 16GB sticks (which may be plausible once the RAM apocalypse ends), I could push the context window and quantization settings even further. But since I wanted to check whether Qwen3.6 could replace its cloud counterparts for my specific tasks, it was time to pair it with my code editors, agentic tools, and typical FOSS apps.

Qwen3.6-35B-A3B is a beast of an LLM

It’s a rock-solid coding companion

Besides my productivity tasks (which I’ll get to in a moment), I extensively use LLMs as my coding companions. And no, I’m not talking about vibe-coding tasks, either. Rather than letting clankers create apps, I use them to troubleshoot tasks, rearrange syntax, and autocomplete my code. Up until now, I’d cycle between Qwen2.5-Coder and DeepSeek R1 for my coding tasks, and while they were decent companions, I’d often have to run them a couple of times and specify the context in great detail before they’d start dishing out helpful suggestions.

👁 XDA
Quiz

8 Questions · Test Your Knowledge

How much do you know about Claude?
Trivia challenge

Think you know Anthropic's AI assistant? Put your knowledge of Claude to the test.

OriginsCapabilitiesSafetyFeaturesDesign

01 / 8Origins

Which company created Claude?

Correct! Claude was created by Anthropic, an AI safety company founded in 2021. Anthropic was co-founded by Dario Amodei and Daniela Amodei, among others who previously worked at OpenAI.

Not quite. Claude is made by Anthropic, not to be confused with OpenAI, which makes ChatGPT. Anthropic was founded in 2021 with a strong focus on AI safety research.

02 / 8Safety

What is the name of the safety and values framework Anthropic developed to guide Claude's behavior?

Correct! Anthropic developed Constitutional AI (CAI), a technique that trains Claude using a set of principles — a 'constitution' — to guide its responses toward being helpful, harmless, and honest.

Not quite. The framework is called Constitutional AI (CAI). It is a novel training approach pioneered by Anthropic that uses a written set of principles to help the model self-critique and improve its own outputs.

03 / 8Origins

What is the name most commonly associated with inspiring Claude's name?

Correct! Claude Shannon is widely cited as the inspiration behind the name. Shannon founded information theory, which is foundational to all modern computing and digital communication — a fitting namesake for an AI.

Not quite. The name Claude is most commonly associated with Claude Shannon, the mathematician and electrical engineer who founded information theory. His pioneering work laid the groundwork for the digital age.

04 / 8Capabilities

Which of the following best describes Claude's context window capability in its more advanced versions?

Correct! Advanced versions of Claude support context windows of 100,000 tokens or more, allowing it to process entire books, lengthy codebases, or large documents in a single conversation — a standout feature at the time of its release.

Not quite. Claude's advanced versions support context windows of 100,000 tokens or more. This was a significant leap beyond many contemporaries and allows Claude to reason over very large amounts of text in one session.

05 / 8Design

Which of the following principles is NOT part of Anthropic's core goal for Claude?

Correct! Anthropic's guiding principles for Claude are to be Helpful, Harmless, and Honest — often called the 'three H's.' Hierarchical is not part of this framework. The goal is to make AI that is safe and beneficial for everyone.

Not quite. Anthropic's three guiding principles for Claude are Helpful, Harmless, and Honest. 'Hierarchical' is not one of them. These three H's shape how Claude is trained to interact with users responsibly.

06 / 8Features

What was a key distinguishing feature of Claude 2 when it launched compared to many rival models at the time?

Correct! Claude 2 launched with a 100,000-token context window, which was remarkable at the time. This allowed users to feed in entire books or massive codebases for analysis, setting Claude apart from many competing models.

Not quite. The standout feature of Claude 2 was its 100,000-token context window. Claude does not natively generate images, and real-time browsing and built-in voice were not launch features of Claude 2.

07 / 8Safety

Anthropic describes itself primarily as which type of company?

Correct! Anthropic describes itself as an AI safety and research company. Unlike some competitors who lead with products or platforms, Anthropic's founding mission centers on building AI systems that are safe, interpretable, and steerable.

Not quite. Anthropic is primarily an AI safety and research company. Its founding mission is rooted in making AI that is safe and understandable, which is why safety-focused training methods like Constitutional AI are central to its work.

08 / 8Features

Which of the following tasks is Claude specifically designed to handle well?

Correct! Claude excels at long-form writing, summarization, coding assistance, and complex reasoning tasks. Its large context window and nuanced language understanding make it particularly well suited for handling detailed, multi-step text-based work.

Not quite. Claude is designed for text-based tasks like writing, summarization, analysis, and reasoning. It does not render graphics, autonomously execute system commands, or perform live video analysis — it is a large language model at its core.

Challenge Complete

Your Score

/ 8

Thanks for playing!

Qwen3.6, on the other hand, gives solid troubleshooting tips from the very first prompt – to the point where it correctly identifies where my Home Assistant automation was malfunctioning after I paired it with Claude Code and uploaded the trigger-action YAML file and HASS logs. Heck, I’ve even run it through a couple of other terminal logs, and aside from two configs (which, in all fairness, didn’t have enough details about the errors), it pinpointed the faulty packages and configs accurately. Likewise, I paired it with the Continue extension on VS Code, and it’s great at autofilling my variable names and modifying syntax to fit different languages.

I’ve also paired it with a handful of self-hosted FOSS tools

If you’ve read my articles at XDA, you’re probably aware that I use LLMs with certain LXCs and Docker containers. Well, Qwen3.6 is just as effective for my productivity-centric toolkit. Thanks to llama-server’s OpenAPI-compliant nature, I can pair this LLM with everything from my Paperless-ngx companion apps to good ol’ Open WebUI. In fact, I’ve paired it with OpenNotebook, and it’s significantly better than every other local model I’ve used for aggregating my research notes. Likewise, I’ve also added it to Blinko, where it goes through my notes and answers all my queries.

The best part? Not only do I get to keep my notes off the prying eyes of cloud-based AI models, but I also don’t have to spend extra bucks on subscription fees. For example, I’ve been feeding Claude Code error logs every time my home lab breaks, and without Qwen3.6-35B-A3B, I’d have to pay for API usage. And since this 35B behemoth is good enough for research, I can just pair it with Open WebUI + SearXNG instead of relying on ChatGPT.

llama.cpp

Llama.cpp is an open-source framework that runs large language models locally on your computer.

See at Official Website

URL: https://www.xda-developers.com/i-replaced-chatgpt-and-claude-with-this-local-llm/

⇱ I replaced ChatGPT and Claude with this powerful local LLM and saved over $20 a month while gaining full control

Getting Qwen3.6-35B-A3B to run on my outdated GPU was a bit of a challenge

But with a few tweaks, I got this beast of a model to generate 20+ tokens/s

Qwen3.6-35B-A3B is a beast of an LLM

It’s a rock-solid coding companion

How much do you know about Claude?
Trivia challenge

Your Score

I’ve also paired it with a handful of self-hosted FOSS tools

llama.cpp

URL: https://www.xda-developers.com/i-replaced-chatgpt-and-claude-with-this-local-llm/

⇱ I replaced ChatGPT and Claude with this powerful local LLM and saved over $20 a month while gaining full control

Getting Qwen3.6-35B-A3B to run on my outdated GPU was a bit of a challenge

But with a few tweaks, I got this beast of a model to generate 20+ tokens/s

Qwen3.6-35B-A3B is a beast of an LLM

It’s a rock-solid coding companion

How much do you know about Claude?Trivia challenge

Your Score

I’ve also paired it with a handful of self-hosted FOSS tools

Subscribe for hands-on guides to running Qwen3.6 locally

llama.cpp

How much do you know about Claude?
Trivia challenge