VOOZH about

URL: https://www.hardware-corner.net/qwen3-coder-next-hardware-requirements/

⇱ Qwen3 Coder Next 80B A3B: what it takes to run it locally


Qwen3 Coder Next 80B A3B: what it takes to run it locally

By | Updated: March 4, 2026

👁 qwen3 coder next building pc for local use

Direct answer first: Qwen3 Coder Next 80B A3B is one of the most hardware-friendly 80B-class coding models released so far. Thanks to its MoE design with roughly 3B active parameters, a single high-VRAM GPU can run it at full 256k context, and even dual consumer GPUs can handle the 3-bit version comfortably. VRAM, not raw compute, is the main constraint.

This model is clearly aimed at coding agents and long-running local development workflows. The timing is good, especially if you are experimenting with agent frameworks like OpenClaw . We jumped straight into hardware testing to see what it actually takes to run it without compromises.

All tests below were done on Ubuntu 24.04 using NVIDIA driver 590.48.01, CUDA 12.8, and llama.cpp build 7932.

Test system and environment

The primary test system was an NVIDIA RTX PRO 6000 Blackwell Workstation Edition with 96 GB of VRAM. Driver version was 590.48.01, CUDA 12.8, running on Ubuntu 24.04. llama.cpp was built from the latest source at the time of testing.

This setup represents the current upper bound for single-GPU local inference, but the results scale down better than you might expect.

VRAM requirements across context sizes

The first question every local LLM user asks is simple: how much VRAM do I need, and how badly does context length hurt me. The short answer is that Qwen3 Coder Next behaves unusually well.

The table below shows measured VRAM usage for both Q4_K_XL and Q3_K_XL quantizations across different context lengths.

VRAM usage by context length

Context (tokens) Q4_K_XL VRAM (GB) Q3_K_XL VRAM (GB)
4k 47 37
8k 47 37
16k 48 37
32k 48 38
45k 48 38
57k 48 38
65k 49 39
86k 49 39
131k 50 40
256k 54 44

The key takeaway is that the VRAM delta from 4k to 256k context is small. For the 4-bit model, the entire jump costs about 7 GB. This is a direct benefit of the hybrid MoE architecture. KV cache growth is present, but it is not the runaway problem we see on dense 70B or 80B models.

From a hardware planning perspective, this changes the equation. Once you can load the model, maxing out context becomes practical rather than aspirational.

What hardware actually makes sense

If you want to run this model seriously, especially for agentic or long-session coding use, you should aim to run it as close to the maximum context as possible.

A single RTX PRO 6000 is the cleanest solution if budget allows. At around $8000, it gives you headroom for the 4-bit model at full 256k context with no tuning pain. The RTX PRO 5000 72 GB version is another viable single-GPU option if you can find one, typically closer to $7000, though context headroom is tighter.

These cards are expensive, but they remove complexity. No PCIe juggling, no inter-GPU latency, no split KV cache issues.

Inference speed on RTX PRO 6000

Speed was measured using llama.cpp with flash attention enabled. Prompt processing and token generation were tested across a wide range of context sizes.

👁 rtx pro 6000 inside a desktop workstation runnign qwen3 coder next 80b lllm

NVIDIA RTX PRO 6000 running Qwen3 Coder Next 80B A3B locally, enabling full 256k context with stable performance for long coding and agent workloads.

To make this easier to read, the table below summarizes performance using context length in kilotokens.

Speed vs context on RTX PRO 6000 (Q4_K)

Context (k) Prompt processing (t/s) Token generation (t/s)
4k 2920 85.6
8k 2854 85.3
16k 2806 84.4
32k 2648 82.4
45k 2552 80.9
57k 2438 79.4
65k 2380 78.1
86k 2214 76.0
131k 1960 71.7
256k 1472 61.0

The important observation is that performance degradation is gradual. Even at 256k context, prompt processing stays above 1400 tokens per second, which is exceptionally high for an 80B-class model. Token generation drops, but remains usable for real interactive work.

The gap between 4k and 256k context is much smaller than most people expect. This makes long-context workflows practical instead of painful.

Unified memory platforms: Strix Halo, DGX Spark, Apple Silicon

Unified memory systems are a natural fit for this model. Because Qwen3 Coder Next only activates around 3B parameters per step, memory bandwidth and latency matter more than raw VRAM size alone.

Strix Halo stands out here. The 64 GB configuration is enough to load the model with large context at a far more approachable price point than workstation GPUs. Community testing reports around 37 tokens per second at 32k context, with prompt processing near 500 tokens per second. That is slower than a PRO 6000, but still very usable for development and agent workflows.

DGX Spark and Apple M-series systems also work, especially if you already own them, but Strix Halo currently offers the best balance of price, capacity, and simplicity for this class of model.

Dual GPU setups for budget-conscious builds

If you are optimizing for performance per dollar, dual consumer GPUs are still very relevant.

In our testing, dual RTX 3090 and dual RTX 4090 setups were not able to run the standard llama-bench test in 4-bit (Q4_K_XL). However, this is not the end of the story.

We also tested the same hardware using llama-server with the --fit parameter, and in this scenario the model runs flawlessly and with very good speed, even in 4-bit. The --fit path avoids the hard failure seen in llama-bench and makes Q4_K_XL practical on dual consumer GPUs.

We used the following command to run the benchmark:

./llama-server \
--port 10000 \
--model Qwen3-Coder-Next-UD-Q4_K_XL.gguf \
--flash-attn on \
--fit on --fit-ctx 131072 \
--fit-target 128 \
--temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 --jinja

Measured results with --fit enabled were as follows:

Dual RTX 3090 / 4090 – Q4_K_XL with --fit

Context (k) Prompt processing (t/s) Token generation (t/s)
32k (--fit-ctx 32768) 1165.47 33.17
64k (--fit-ctx 65536) 1089.12 29.55
131k (--fit-ctx 131072) 987.33 24.78

As you can see, this is perfectly usable. Prompt processing remains strong, and generation speed is more than sufficient for interactive coding and agentic workloads. If the model quality holds up this is genuinely great news for local use.

For roughly $1600 total for a dual RTX 3090 setup, you can run a long-context, agent-capable local coding model without resorting to workstation-class hardware.

The Q3_K_XL results remain unchanged and continue to be an excellent, lower-friction option if you want maximum headroom with minimal tuning, but it is important to note that 4-bit is absolutely viable on dual consumer GPUs when using the right inference path.

Dual RTX 3090 speed results (Q3_K)

Context (k) Prompt processing (t/s) Token generation (t/s)
32k 1076 70.9
45k 1011 69.3
57k 966 67.7
65k 939 66.4
86k 865 63.9
131k 750 58.7
256k 535 47.7

Performance scales down predictably with context, but remains usable even at the maximum window. For users willing to deal with multi-GPU setup complexity, this is one of the best value paths today.

Final thoughts

Qwen3 Coder Next 80B A3B is unusually friendly to local inference. Its MoE design keeps VRAM growth under control, making long-context usage realistic on hardware that enthusiasts can actually buy.

If you want the cleanest experience, a single high-VRAM workstation GPU is hard to beat. If you care about value, unified memory systems and dual consumer GPUs make this model accessible without extreme spending.

Most importantly, this is one of the rare large coding models where running at near-maximum context makes sense, both technically and economically. For local coding agents and long-session development, that matters more than raw benchmark numbers.

Read more: Run LLMs Locally