Voozh

I've been running local LLMs for a long time. I have a Radeon 7900 XTX in my homelab, I've benchmarked the things to death, and I've written before about benchmarks don't necessarily always reflect real-world performance. But for quick, easy queries, or anything where I wanted a fast, sharp answer to a question with some good sources, ChatGPT or Claude were tools that I'd reach for. The cloud still had a quality gap that local models were only starting to close.

Then Alibaba dropped Qwen-3.6-27B in April, and the gap I thought existed turned out to be much smaller than I'd assumed. It's a dense 27-billion-parameter model that fits comfortably on a single consumer GPU, and on day-to-day work, it's been keeping up shockingly well with the cloud assistants people often pay for.

I know what you're thinking, and no, it doesn't keep up on everything. However, it's a lot closer than I thought it would be, and Qwen 3.6 27B even beat ChatGPT on a reverse engineering test that I conducted. It's a dense model and it isn't tiny, so you'll want a decent GPU, and it's not perfect either. However, it runs on my hardware, fully locally, all without rate limits.

A gaming PC is enough hardware now

The 7900 XTX does the heavy lifting

My setup isn't anything exotic, just a Radeon 7900 XTX with 24GB of VRAM, the same card a lot of people already have for gaming. I've got the GPU passed through to a Proxmox LXC using the setup I wrote about previously, and llama.cpp is the inference server doing the work. It's a machine that might even be somewhat similar to what you might have under your desk right now.

Qwen 3.6 27B is a dense model, not a Mixture-of-Experts. A lot of the recent flagship-class open weights have been MoE designs where you get the parameter count on paper but only a small slice is active at any given time. Dense means every parameter is doing work on every token, which tends to give you more consistent quality at the cost of needing both faster memory and more memory. At Q4_K_M quantization the model sits in around 18GB of VRAM, leaving comfortable headroom on the 7900 XTX for a long context window.

llama.cpp gives me two viable AMD paths on this card: ROCm/HIP and Vulkan. I tested both, and the split was interesting. ROCm was better at prompt processing, but Vulkan was roughly 30% faster at generation on the same Qwen3.6-27B model. Since generation speed is what I feel most in a normal chat, Vulkan is the backend I'd use for that experience, while ROCm still makes sense for workloads that involve repeatedly ingesting large prompts.

At around 90,000 tokens of context loaded, llama.cpp's logs when using ROCm show roughly 285 tokens per second on prompt processing and around 15 tokens per second on generation. At smaller contexts the generation rate climbs considerably, to around 25 tokens per second on a fresh chat with a small prompt. Using Vulkan, it generates at 37 tokens per second. That's still a bit slower than what's possible for this card with this model, as there's more speed to be extracted using the likes of vLLM.

You see, Qwen 3.6 27B ships with native multi-token prediction heads that vLLM can use as a speculative drafter, with no separate draft model needed, and ROCm has been a first-class vLLM platform with MTP support since v0.14.0, with Qwen 3.6 support landing in v0.19.0. Reported speedups for MTP sit between 1.2x and just over 2x on comparable setups, which would put real-world generation well above what I'm seeing on llama.cpp today. I'm sticking with llama.cpp for now because it's the stack I know best, but I'll eventually make the switch to vLLM when I find the time to.

It handles most of what I'd send to ChatGPT

And even surpasses it

The bulk of what I'd use ChatGPT Plus for is research, proofreading, summarization, quick coding questions, and the occasional "explain this to me in plain English" prompt. For example, when I was researching for my IPv9 article that I wrote about last year, ChatGPT's research capabilities were phenomenal at discovering Chinese forum posts that put me on the right track and gave me the ability to report on some aspects of that story that I'd never seen in English-speaking media before. However, none of that needs a trillion-parameter cloud model. It needs a basic assistant that responds quickly, follows instructions properly, and doesn't hallucinate too aggressively, and Qwen 3.6 27B is comfortably good enough at all of that. Especially the hallucination part, which I'll get to.

On benchmarks, Qwen's own results have the 27B dense model edging ahead of the previous Qwen3.5-397B-A17B MoE on SWE-bench Verified, at 77.2% versus 76.2%. That's a dense 27B beating a near-400-billion-parameter MoE in software engineering work specifically. I've already talked about how there's more to models than these benchmarks, but that's still an incredibly impressive result.

To give you an idea of just how capable this model is, I pointed Qwen 3.6 27B at a small custom-made TCP authentication service with a couple of planted vulnerabilities, and it caught both of them on a single pass with no hand-holding. It also reconstructed the wire protocol from the disassembly, identified the magic value bytes correctly, and flagged two unplanted issues that genuinely apply alongside the ones I'd planted. When I handed the same binary to GPT-5.4 for comparison, the cloud model caught the hardcoded secret and the single-connection DoS, but it explicitly concluded the compare logic had no bypass available, which meant it missed the timing oracle that Qwen had already named. It also hallucinated the magic value as the wrong four bytes. It's one test on one binary, not a claim that Qwen 3.6 27B is better than GPT-5.4 at reverse engineering in general. However, these local models are a lot more capable than most people give them credit for.

For something harder to find, I asked both models a niche question about llama.cpp speculative decoding with Qwen 3.6 27B's native MTP heads, and whether the Vulkan backend supported them. Both correctly tracked down draft PR #22673 in the llama.cpp repo, flagged that native MTP isn't merged to master yet, and both correctly described the Vulkan situation: garbage output in early testing on a Radeon RX 9700, the author later updating the comment to say Vulkan and Metal had been tested. Qwen's response was a clean, usable summary that included the practical caveats, like the prefill slowdown and the broken --mmproj support, plus the genuinely useful observation that the generic draft-model path can actually outperform native MTP in some configurations. ChatGPT's answer was wordier and more abstract, dropping jargon like "target context does not support partial sequence removal" without explaining what that meant, and missed the practical takeaway about the generic draft path entirely. Both found the right PR; one of them was the more useful read.

For a more typical research question, I gave Qwen 3.6 27B and ChatGPT 5.5 the same beginner prompt about setting up ESPHome from scratch. Qwen pulled fresh information through SearXNG and produced a walkthrough with citations to ESPHome's docs, the Home Assistant community forum, Reddit, and a handful of beginner guides. Its hardware picks were the right ones for a first-time user, namely an ESP32 DevKit V1, which is the most documented board with no PSRAM flags or quirks to deal with, and a DHT22 sensor that costs about five dollars and runs on a single GPIO pin. It also gave me both a pip install and a Docker option. ChatGPT's response went the other way, recommending a newer ESP32-S3 with USB-C and a pricier BME280, padding the answer with warnings about boards no beginner would ever buy in the first place like the Wi-Fi-less ESP32-P4, and skipping the pip install path entirely. On the actual question I asked, Qwen's answer was the more practical of the two.

When I sit down and ask it to summarize a piece of text, clean up a script, or autonomously research a topic I'm interested in, it gives me what I need on the first pass more often than not. For the work that fills my day, the local model is present, fast, and good enough that I don't really need to use ChatGPT. The ESPHome and llama.cpp tests were actually eye-opening as well, and I found myself wondering:

How can a local model consistently give better answers?

Pi and a local SearXNG close the rest of the gap

It all runs on my own network

A harness is half the experience with any LLM, cloud or local. I've used Claude Code with local models before and it works, but this is model dependent. I have my suspicions that Qwen 3 Coder Next is trained on Claude Code tool calling specifically, as in other harnesses, it kept trying to make tool calls that looked like Claude tools, not Pi or OpenCode tools. For the likes of Qwen 3.6 27B, that isn't an issue, so I use Pi, the coding agent from Mario Zechner. It's a minimal terminal harness built to be extensible, with the same core feel as Claude Code or OpenCode.

Pi gives the model four core tools, namely read, write, edit, and bash, and otherwise stays out of the way. Point it at any OpenAI-compatible endpoint and it's happy, so llama.cpp serving Qwen 3.6 27B locally is a one-line config change away, with no proxy layer to maintain, no auth, and no need to convince the harness that the model on the other end isn't from Anthropic. Pi also supports TypeScript extensions, skills, prompt templates, and themes, and you can publish your own packages through npm or git. I've built up a small handful of extensions that fit the way I work, and that's something I'd never get to do with a closed harness. Pi can also handle compaction, and the 200k token context window is the only place where you feel the difference, though being smart about compaction is all you really need. I also use Rust Token Killer, a CLI tool designed for agentic workflows that reduces the token consumption of bash commands. Because Pi is extensible, you can ask the model to build its own extensions, and that's exactly what I did with Qwen 3.6 here: I asked it to build a hook so that all bash calls would be intercepted and piped into rtk's "rewrite" feature, so that the token impact of bash commands was reduced.

The piece Pi doesn't hand the model is the open web, which is where SearXNG comes in. I run it locally and expose it to Qwen 3.6 27B as an MCP server, so the model can search the web, pull in fresh information, and reason over it the same way ChatGPT does with its browsing tool. It's all happening on my own infrastructure, so there's no cap on queries (like you'd get with Brave Search or Tavily) and there's no risk of the feature being silently deprecated. I have Context7 wired up as a second MCP server for library and API documentation, which keeps the model from making up function signatures or fishing in stale docs from its training data.

Combined with the bash and file tools Pi provides, the agentic feel is fully there. A year ago, tools were the one thing that made cloud models feel obviously better than local ones. That was a cloud advantage, but now it isn't.

I rarely use ChatGPT these days

Or any cloud model for that matter

Qwen 3.6 27B isn't a complete replacement for ChatGPT and pretending that it was would be silly. For hard reasoning, long multi-step planning over a large codebase, or anything that needs broad world knowledge, the largest cloud models should still pull noticeably ahead. The top-tier GPT and Claude releases have a depth that no 27B model is going to match, and I don't think the gap closes entirely any time soon. There are also tasks where the cloud model is just easier, like pasting a long document into ChatGPT and getting an answer in five seconds as opposed to a minute, which is a lot more convenient to work with.

Local LLMs have always been great for specific purposes, and even significantly earlier models were still incredibly capable with MCP servers or being used as a processing aid in a wider pipeline. Now, though, local models are extremely good and the cost of using them is surprisingly low. Whenever I've talked about local LLMs, I've often been told that my electricity bill must be through the roof. However, I live in one of the most expensive countries in the world for electricity currently, and local LLMs have barely contributed to that. My peak usage costs $0.62 per kilowatt hour, and I have extensive power monitoring in my home. Trust me, if running a local LLM meaningfully impacted my energy costs, I'd know.

Just a year ago, the things Qwen 3.6 27B can do wouldn't have been possible without a workstation that cost thousands of dollars. Today, it runs on a gaming PC with a single consumer GPU. It's good enough that for a lot of tasks, it genuinely gives better results than the cloud models do. For anyone who's been waiting for local LLMs to feel useful rather than impressive, that wait is more or less over.

URL: https://www.xda-developers.com/replaced-chatgpt-local-model-beating-cloud-didnt-expect/

⇱ I replaced ChatGPT with a local model on my gaming PC, and it's beating the cloud where I didn't expect