Voozh

I've been running local LLMs for a while now on all kinds of devices. I have Ollama and Open WebUI on my home server, with various models running on my AMD Radeon RX 7900 XTX. It's always been functional, but never quite good enough to replace cloud-based coding assistants for real work. The models are typically too dumb to handle anything beyond basic autocomplete, and instead, I've played around with Claude Code a fair amount.

However, my aversion to local LLMs for coding changed when I pointed Claude Code at Qwen3-Coder-Next running on my Lenovo ThinkStation PGX. For the first time, I have a local LLM setup that I want to use, not one I'm using out of principle or stubbornness. It's fast, it handles real coding tasks, and the whole thing runs on hardware sitting on my desk. It's not a perfect replacement for cloud-based models, and I've hit its limitations. But for everyday coding work, it's the closest I've come to a local setup that actually feels like a real coding assistant.

The hardware makes it possible

Connecting Claude Code is incredibly easy

The Lenovo ThinkStation PGX is built around NVIDIA's GB10 Grace Blackwell Superchip. It's Lenovo's take on the DGX Spark, packed into a box roughly the size of a Mac Mini. The spec that matters most is 128GB of unified LPDDR5x memory shared between the CPU and GPU, and it's what makes running a model like Qwen3-Coder-Next practical.

Qwen3-Coder-Next has 80 billion parameters, but it uses an ultra-sparse Mixture-of-Experts architecture, and only 3 billion parameters are active for any given token. At Q4_K_M quantization, the whole thing fits in roughly 46GB. On the PGX, that leaves around 80GB of headroom for context windows, the OS, and whatever else I'm running. I've bumped that up to Q8_0 quantization, using around 85GB of VRAM, and I have a context window of 170,000 tokens.

The unified memory architecture is the reason this works. There's no PCIe bus bottleneck between host RAM and GPU VRAM, and the model sits in memory that the Blackwell GPU can access directly. On a traditional desktop with a discrete GPU, you'd need to fit the entire model in your GPU's VRAM or potentially accept catastrophic slowdowns as it spills into system RAM, unlike here, where all 128GB of VRAM is equally accessible.

Running the model and setting up Claude Code

Both are incredibly easy

The PGX ships with NVIDIA's DGX OS, which comes pre-configured with NVIDIA's AI software stack. Docker is ready to go, CUDA is already there, and the container runtime handles GPU passthrough. My setup is a Docker container running the inference server, with Claude Code pointed at it through environment variables.

The Docker command is pretty simple, pulling a Docker container from Nvidia's own registry and using vLLM:

docker run --rm -it --gpus all --ipc=host --network host \
--ulimit memlock=-1 --ulimit stack=67108864 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
nvcr.io/nvidia/vllm:26.01-py3 \
vllm serve "Qwen/Qwen3-Coder-Next-FP8" \
--served-model-name qwen3-coder-next --port 8000 --max-model-len 170000 --gpu-memory-utilization 0.90 \
--enable-auto-tool-choice --tool-call-parser qwen3_coder --attention-backend flashinfer --enable-prefix-caching \
--kv-cache-dtype fp8 --max-num-seqs 1

From there, I set a few environment variables to tell Claude Code where to find the model:

export ANTHROPIC_BASE_URL=http://192.168.1.179:8000
export ANTHROPIC_AUTH_TOKEN=
export ANTHROPIC_API_KEY=
export ANTHROPIC_MODEL=qwen3-coder-next
export ANTHROPIC_SMALL_FAST_MODEL=qwen3-coder-next
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export API_TIMEOUT_MS=600000

That's it. Claude Code doesn't care where its backend lives as long as the endpoint speaks the Anthropic Messages API, which vLLM does. Ollama v0.14 and later has native compatibility with that API format, too, so there's no translation layer or proxy needed. All you need to do is point Claude Code at your own endpoint and it works.

The whole process is incredibly quick and easy. Pull the model, start the container, export your environment variables, launch Claude Code, and you're ready to go.

Using Qwen3 Coder Next for local coding and reverse engineering

It's the best local LLM I've used

I've tried running other models through Claude Code before. Some worked in a technical sense, but the experience was always compromised. Qwen3-Coder-Next is different, as it was trained from the ground up for agentic coding workflows. It can plan multi-step tasks, call tools, edit files, and recover when things go wrong. And you can feel it in how the model responds, especially as it understands and can use the tools presented to it in the Claude Code harness.

The architecture is worth understanding, too. It uses a hybrid attention system called Gated DeltaNet at a 3:1 ratio. 75% of the layers use linear attention that doesn't grow the KV cache, while 25% use standard full attention. What that means in practice is that the context length doesn't gobble up your memory. The model natively supports 256K tokens, and because of the linear attention layers, that context window is actually usable on local hardware without running out of memory.

It's also a non-reasoning model, which is important to be aware of. It doesn't generate thinking blocks or chain-of-thought reasoning. It just gives you a direct answer, fast. For code generation and file editing, that's exactly what I want. I'm not sitting around waiting for the model to think out loud for 30 seconds, inflating the context before it starts writing the function I asked for.

Local LLMs are great for privacy, but this is where the privacy angle stops being a nice-to-have and is, arguably, a hard requirement. When I'm reverse engineering firmware or analyzing binaries, sending that code to a cloud provider is, in many cases, simply not an option. For those who are contracted to complete that work, NDAs and security policies would likely outright prevent it, but for me, I'd rather not deal with the back and forth of trying to convince a cloud model to allow me to reverse engineer a binary. Qwen3 Coder Next doesn't have a problem with me doing it, but I can't say the same for other models that pushed back out of an abundance of caution.

With 170,000 tokens of context, I can feed Claude Code entire decompiled functions, surrounding context from the binary, and detailed instructions about what I'm looking for, without truncation of those inputs. The model can then analyze control flow, identify patterns, suggest annotations, and even generate test cases for specific code paths. It's work that's tedious and time-consuming when done manually, but it's perfect for an agentic LLM with enough context to understand the full picture.

All of this has cut down what used to be hours of boilerplate into a few minutes of iteration, especially because I have enough experience to quickly analyze a binary, know what I need, and direct the model to do exactly what's necessary. Even better, because the model is running locally, there's no API latency, no rate limits, and no usage-based costs eating into a budget. I can run the same analysis loop dozens of times while refining my approach, and the only real bottleneck is my own ability to formulate better prompts. For anyone doing security research or software analysis professionally, that's a big deal.

It's a fantastic local LLM experience

Purpose built intelligence

Qwen3-Coder-Next still can't match the largest cloud models on the most complex reasoning tasks. It'll occasionally suggest something confidently wrong in ways a more capable model wouldn't, and you need to stay sharp with its output, especially in security-sensitive work. But those caveats apply to every coding assistant I've used, cloud-based or not.

What's changed with this model is the floor. For the first time, I have a local setup that I reach for because it's genuinely useful, not because I'm trying to prove a point about privacy or self-hosting. It's fast, it handles real work, and it runs on a box sitting on my desk. At over $3,000 for the PGX, it's not cheap, but for anyone doing coding or security research professionally, it pays for itself. Claude Code gives the model structure and tools, and Qwen3-Coder-Next brings the capability and intelligence.

You can run Qwen3-Coder-Next on lower VRAM systems with decent results by offloading the MoE layers to system RAM, and I highly recommend trying it out if you're interested. I've been incredibly impressed by how capable it is, and I think you will be too.

URL: https://www.xda-developers.com/finally-found-local-llm-want-use-coding/

⇱ I finally found a local LLM I actually want to use for coding

The hardware makes it possible

Connecting Claude Code is incredibly easy

Running the model and setting up Claude Code

Both are incredibly easy

Using Qwen3 Coder Next for local coding and reverse engineering

It's the best local LLM I've used

It's a fantastic local LLM experience

Purpose built intelligence

URL: https://www.xda-developers.com/finally-found-local-llm-want-use-coding/

⇱ I finally found a local LLM I actually want to use for coding

The hardware makes it possible

Connecting Claude Code is incredibly easy

Running the model and setting up Claude Code

Both are incredibly easy

Using Qwen3 Coder Next for local coding and reverse engineering

It's the best local LLM I've used

Subscribe for hands-on local LLM guides and hacks

It's a fantastic local LLM experience

Purpose built intelligence