Qwen3 Coder Next 80B A3B: what it takes to run it locally
By | Updated: March 4, 2026
👁 qwen3 coder next building pc for local use
Direct answer first: Qwen3 Coder Next 80B A3B is one of the most hardware-friendly 80B-class coding models released so far. Thanks to its MoE design with roughly 3B active parameters, a single high-VRAM GPU can run it at full 256k context, and even dual consumer GPUs can handle the 3-bit version comfortably. VRAM, not raw compute, is the main constraint.
This model is clearly aimed at coding agents and long-running local development workflows. The timing is good, especially if you are experimenting with agent frameworks like OpenClaw . We jumped straight into hardware testing to see what it actually takes to run it without compromises.
All tests below were done on Ubuntu 24.04 using NVIDIA driver 590.48.01, CUDA 12.8, and llama.cpp build 7932.
Test system and environment
The primary test system was an NVIDIA RTX PRO 6000 Blackwell Workstation Edition with 96 GB of VRAM. Driver version was 590.48.01, CUDA 12.8, running on Ubuntu 24.04. llama.cpp was built from the latest source at the time of testing.
This setup represents the current upper bound for single-GPU local inference, but the results scale down better than you might expect.
VRAM requirements across context sizes
The first question every local LLM user asks is simple: how much VRAM do I need, and how badly does context length hurt me. The short answer is that Qwen3 Coder Next behaves unusually well.
The table below shows measured VRAM usage for both Q4_K_XL and Q3_K_XL quantizations across different context lengths.
VRAM usage by context length
| Context (tokens) | Q4_K_XL VRAM (GB) | Q3_K_XL VRAM (GB) |
|---|---|---|
| 4k | 47 | 37 |
| 8k | 47 | 37 |
| 16k | 48 | 37 |
| 32k | 48 | 38 |
| 45k | 48 | 38 |
| 57k | 48 | 38 |
| 65k | 49 | 39 |
| 86k | 49 | 39 |
| 131k | 50 | 40 |
| 256k | 54 | 44 |
The key takeaway is that the VRAM delta from 4k to 256k context is small. For the 4-bit model, the entire jump costs about 7 GB. This is a direct benefit of the hybrid MoE architecture. KV cache growth is present, but it is not the runaway problem we see on dense 70B or 80B models.
From a hardware planning perspective, this changes the equation. Once you can load the model, maxing out context becomes practical rather than aspirational.
What hardware actually makes sense
If you want to run this model seriously, especially for agentic or long-session coding use, you should aim to run it as close to the maximum context as possible.
A single RTX PRO 6000 is the cleanest solution if budget allows. At around $8000, it gives you headroom for the 4-bit model at full 256k context with no tuning pain. The RTX PRO 5000 72 GB version is another viable single-GPU option if you can find one, typically closer to $7000, though context headroom is tighter.
These cards are expensive, but they remove complexity. No PCIe juggling, no inter-GPU latency, no split KV cache issues.
Inference speed on RTX PRO 6000
Speed was measured using llama.cpp with flash attention enabled. Prompt processing and token generation were tested across a wide range of context sizes.
NVIDIA RTX PRO 6000 running Qwen3 Coder Next 80B A3B locally, enabling full 256k context with stable performance for long coding and agent workloads.
To make this easier to read, the table below summarizes performance using context length in kilotokens.
Speed vs context on RTX PRO 6000 (Q4_K)
| Context (k) | Prompt processing (t/s) | Token generation (t/s) |
|---|---|---|
| 4k | 2920 | 85.6 |
| 8k | 2854 | 85.3 |
| 16k | 2806 | 84.4 |
| 32k | 2648 | 82.4 |
| 45k | 2552 | 80.9 |
| 57k | 2438 | 79.4 |
| 65k | 2380 | 78.1 |
| 86k | 2214 | 76.0 |
| 131k | 1960 | 71.7 |
| 256k | 1472 | 61.0 |
The important observation is that performance degradation is gradual. Even at 256k context, prompt processing stays above 1400 tokens per second, which is exceptionally high for an 80B-class model. Token generation drops, but remains usable for real interactive work.
The gap between 4k and 256k context is much smaller than most people expect. This makes long-context workflows practical instead of painful.
Unified memory platforms: Strix Halo, DGX Spark, Apple Silicon
Unified memory systems are a natural fit for this model. Because Qwen3 Coder Next only activates around 3B parameters per step, memory bandwidth and latency matter more than raw VRAM size alone.
Strix Halo Beelink GTR9 Pro
Strix Halo stands out here. The 64 GB configuration is enough to load the model with large context at a far more approachable price point than workstation GPUs. Community testing reports around 37 tokens per second at 32k context, with prompt processing near 500 tokens per second. That is slower than a PRO 6000, but still very usable for development and agent workflows.
DGX Spark and Apple M-series systems also work, especially if you already own them, but Strix Halo currently offers the best balance of price, capacity, and simplicity for this class of model.
Dual GPU setups for budget-conscious builds
If you are optimizing for performance per dollar, dual consumer GPUs are still very relevant.
In our testing, dual RTX 3090 and dual RTX 4090 setups were not able to run the standard llama-bench test in 4-bit (Q4_K_XL). However, this is not the end of the story.
We also tested the same hardware using llama-server with the --fit parameter, and in this scenario the model runs flawlessly and with very good speed, even in 4-bit. The --fit path avoids the hard failure seen in llama-bench and makes Q4_K_XL practical on dual consumer GPUs.
We used the following command to run the benchmark:
./llama-server \
--port 10000 \
--model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--flash-attn on \
--fit on --fit-ctx 131072 \
--fit-target 128 \
--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja
Measured results with --fit enabled were as follows:
Dual RTX 3090 / 4090 – Q4_K_XL with --fit
| Context (k) | Prompt processing (t/s) | Token generation (t/s) |
|---|---|---|
32k (--fit-ctx 32768) |
1165.47 | 33.17 |
64k (--fit-ctx 65536) |
1089.12 | 29.55 |
131k (--fit-ctx 131072) |
987.33 | 24.78 |
As you can see, this is perfectly usable. Prompt processing remains strong, and generation speed is more than sufficient for interactive coding and agentic workloads. If the model quality holds up this is genuinely great news for local use.
For roughly $1600 total for a dual RTX 3090 setup, you can run a long-context, agent-capable local coding model without resorting to workstation-class hardware.
The Q3_K_XL results remain unchanged and continue to be an excellent, lower-friction option if you want maximum headroom with minimal tuning, but it is important to note that 4-bit is absolutely viable on dual consumer GPUs when using the right inference path.
Dual RTX 3090 speed results (Q3_K)
| Context (k) | Prompt processing (t/s) | Token generation (t/s) |
|---|---|---|
| 32k | 1076 | 70.9 |
| 45k | 1011 | 69.3 |
| 57k | 966 | 67.7 |
| 65k | 939 | 66.4 |
| 86k | 865 | 63.9 |
| 131k | 750 | 58.7 |
| 256k | 535 | 47.7 |
Performance scales down predictably with context, but remains usable even at the maximum window. For users willing to deal with multi-GPU setup complexity, this is one of the best value paths today.
Final thoughts
Qwen3 Coder Next 80B A3B is unusually friendly to local inference. Its MoE design keeps VRAM growth under control, making long-context usage realistic on hardware that enthusiasts can actually buy.
If you want the cleanest experience, a single high-VRAM workstation GPU is hard to beat. If you care about value, unified memory systems and dual consumer GPUs make this model accessible without extreme spending.
Most importantly, this is one of the rare large coding models where running at near-maximum context makes sense, both technically and economically. For local coding agents and long-session development, that matters more than raw benchmark numbers.
