📚 Related: Qwen 3.6 Local Guide · Is Qwen Going Closed? Open Weights vs Frontier (2026) · Qwen Models Guide · VRAM Requirements
You’re running Qwen 3.5 in production, or you’ve been about to install it, and the lineup looks like it’s been overtaken. 3.6 dropped in April. 3.7-Max launched in May. The threads on r/LocalLLaMA moved on. Is 3.5 still the right pick, or are you about to deploy something that’s already a generation behind?
The honest answer is: it depends on which tier, and the picture is messier than “just use the newer one.” 3.6 only replaced two of the four 3.5 tiers: the 27B dense and the 35B-A3B MoE. The big MoEs (122B-A10B and the 397B-A17B flagship) have no 3.6 equivalent. The next-generation 3.7 went closed and API-only: Max, Plus, and the VLA robotics model are paid endpoints, not open weights. The 3.6 open mid-tier sits where it landed in April.
That makes 3.5 the stable open workhorse most local-AI rigs are actually running today. This guide is for you whether you’re on 3.5 already, picking between 3.5 and 3.6, or just trying to figure out where your tier sits in the current Qwen split.
The Two-Track Reality
For context on the broader Qwen split: see the Qwen open weights vs closed frontier (2026) breakdown for the strategic picture. The short version that matters here:
- Closed frontier (3.7 generation, API-only): 3.7-Max (text reasoning, May 19), Qwen-VLA (robotics, May 29), 3.7-Plus (multimodal agent, June 1). Three closed releases in under a month. No open weights as of mid-June 2026.
- Open mid-tier (3.5 and 3.6, on Hugging Face under Apache 2.0): 3.6-27B and 3.6-35B-A3B (April), plus the full 3.5 lineup (27B / 35B-A3B / 122B-A10B / 397B-A17B). What you can actually run locally.
So when this guide says “3.5 is the stable open workhorse,” it means within the open mid-tier — the only tier where any of these models are runnable locally. The 3.7-tier paid endpoints are a separate decision about whether to call an API, not about which local model to run.
3.5 vs 3.6: Which Should You Run?
The decision splits cleanly by tier:
| Your tier | The honest pick | Why |
|---|---|---|
| 8GB-24GB GPU, want the MoE (35B-A3B class) | Qwen 3.6-35B-A3B is the upgrade for text | 3.6’s benchmark deltas vs 3.5 at this size are real: SWE-bench Verified 70.0 → 73.4, Terminal-Bench 2.0 40.5 → 51.5, MCPMark (tool use) 27.0 → 37.0 (per Unsloth’s comparison; treat as Qwen-reported, not independent). If you’re starting fresh in this class, 3.6 is the better call. If you’re already on 3.5-35B-A3B and stable, the upgrade urgency is “real but not on-fire.” |
| 24GB GPU, primary task is sustained coding | Qwen 3.5-27B dense is still the strongest open coder Qwen ships | Qwen 3.6’s 27B dense is the same size on a similar architecture and is the better newer model on paper, but as of mid-2026 the 3.5-27B remains the more thoroughly battle-tested option for sustained-coding workloads. Run 3.5-27B if your harness is built around it; consider 3.6-27B for fresh deployments. |
| 48GB+ unified memory / multi-GPU, want the bigger MoE | Qwen 3.5-122B-A10B has no 3.6 replacement | Nothing in the 3.6 lineup sits at this tier yet. Heavy tool-use workflows on Mac Studio (96GB+) or dual-GPU rigs are 3.5-122B-A10B territory by default. |
| 128GB+ unified memory / server hardware | Qwen 3.5-397B-A17B flagship, no 3.6 replacement | Same answer at the top tier: 3.6 didn’t ship a flagship. If you have the hardware to run 397B, 3.5 is what you run. |
| Need vision and you use Ollama specifically | Qwen 3.5 (mature vision path), or 3.6 with llama.cpp/LM Studio/vLLM/SGLang | The 3.6 vision split (separate mmproj file) caused friction in Ollama earlier in the year; the 3.5 vision path is settled. For non-Ollama runtimes, 3.6 vision is fine. |
The simple version:
- If your tier has both a 3.5 and a 3.6 option, prefer 3.6 for fresh deployments, don’t rush an existing 3.5 install you’re happy with.
- If your tier (122B-A10B or 397B-A17B) only exists in 3.5, run 3.5. There’s no other open option from Qwen at that scale right now.
- If you want the frontier, the 3.7 generation is API-only, not a local option.
The Qwen 3.5 Lineup
Four open-weights models, all Apache 2.0, all natively multimodal (text, image, video), all on 262K context (1M via YaRN), all with thinking/non-thinking modes:
| Model | Total Params | Active Params | Architecture | Context | Q4 GGUF Size |
|---|---|---|---|---|---|
| Qwen3.5-27B | 27B | 27B (dense) | Dense + Hybrid Attention | 262K | ~17 GB |
| Qwen3.5-35B-A3B | 35B | 3B | MoE + Hybrid Attention | 262K | ~22 GB |
| Qwen3.5-122B-A10B | 122B | 10B | MoE + Hybrid Attention | 262K | ~70 GB |
| Qwen3.5-397B-A17B | 397B | 17B | MoE + Hybrid Attention | 262K | ~214 GB |
The architecture is the same hybrid of Gated DeltaNet (linear attention) and full attention in a 3:1 ratio — three DeltaNet layers for every full-attention layer. Linear attention scales near-linearly with sequence length, which is why these models hold 256K context without the typical speed cliff.
Base repos all live under the official Qwen Hugging Face org: Qwen3.5-27B, Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B. For GGUF, the established sources are unsloth and bartowski.
Which 3.5 Model on Which GPU
| Your Hardware | VRAM | Best 3.5 Pick | Quantization | Speed expectation |
|---|---|---|---|---|
| RTX 3060 / 4060 (8GB) | 8 GB | 35B-A3B | Q2-Q3 (tight) | Usable but slow |
| RTX 3060 (12GB) | 12 GB | 35B-A3B | Q4_K_M | ~30-40 tok/s |
| RTX 5060 Ti / 5080 (16GB) | 16 GB | 35B-A3B | Q4_K_L or Q4_K_XL | ~40-60 tok/s |
| RTX 4090 / 3090 (24GB) | 24 GB | 27B at Q4 (coder) or 35B-A3B at Q8 (default) | Q4_K_M / Q8 | 20-60 tok/s |
| A6000 / dual GPU (48GB) | 48 GB | 27B at Q8 or 122B-A10B at Q4 | Q8 / Q4_K_M | 15-35 tok/s |
| Mac M4 Max (64GB) | 64 GB | 122B-A10B | Q4_K_M | Mac-bench dependent |
| Mac Ultra / Strix Halo (128GB+) | 128 GB+ | 397B-A17B | Q4 | Server-class |
Speed expectations are community-reported and vary with quant choice, harness, and prompt length. For exact figures on your config, the VRAM Calculator and the VRAM requirements guide are the reference reads.
The interesting tier is 16GB. The MoE architecture means the 35B-A3B activates only 3B parameters per token, so you get 35B-class breadth from inference speeds that look like a small model. The RTX 5060 Ti 16GB and RTX 5080 16GB both run it at Q4 comfortably.
On a 24GB card, the choice is genuine: 27B at Q4_K_M (~17GB, room to spare for other tools and the densest reasoning per token in the family) versus 35B-A3B at Q8 (~22GB, better quant quality, faster generation, better tool use). If you do both sustained coding and agentic work, both fit in Ollama. Hot-swap between them.
The Models, in Detail
The 35B-A3B — the default
The 35B-A3B is the model most people think of when they hear “Qwen 3.5.” 35 billion total parameters, 3 billion active per token. Runs at small-model speeds, draws on large-model knowledge.
Community-reported throughput (Unsloth and r/LocalLLaMA threads):
| GPU / Quant | Token gen | Prompt processing |
|---|---|---|
| RTX 5090, GPTQ-Int4 (vLLM, fp8 KV) | ~194 tok/s | ~7,026 tok/s |
| AMD R9700, Q4_K_XL (Vulkan) | ~127 tok/s | ~2,713 tok/s |
| DGX Spark, Q5 (UD-Q5_K_XL) | ~58.6 tok/s | ~1,861 tok/s |
| Tesla V100 32GB, GGUF | ~38.4 tok/s | ~570 tok/s |
The V100 number is worth highlighting: that’s a used card you can find for $300-400, running a 35B-class MoE at near-30 tok/s. The architecture is the unlock.
Qwen-reported benchmarks vs the comparable competition:
| Benchmark | 35B-A3B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|
| MMLU-Pro | 85.3 | 83.7 | 80.8 |
| GPQA Diamond | 84.2 | 82.8 | 80.1 |
| SWE-bench Verified | 69.2 | 72.0 | 77.2 |
| BFCL-V4 (Tool Use) | 67.3 | 55.5 | 54.8 |
| BrowseComp | 61.0 | 48.1 | 41.1 |
| TAU2-Bench (Agentic) | 81.2 | — | — |
Scores from Qwen’s published model cards. Sonnet 4.5 SWE-bench from Anthropic. Treat all of these as vendor-reported and check your own workload.
The BFCL-V4 number (67.3 vs GPT-5 mini’s 55.5) is what made this the default agentic-coding pick locally. For local AI agents and function calling, the 35B-A3B is the open-weight model most people are pointing harnesses at.
Caveat that the existing community feedback has surfaced: the 35B-A3B is lower than the 27B dense on sustained coding (SWE-bench 69.2 vs 72.4, LiveCodeBench 74.6 vs 80.7) and there are r/LocalLLaMA reports of broken diffs and hallucinated APIs on long-running coding sessions. If sustained coding is your primary use case, the 27B is the better 3.5 pick.
If you’re starting fresh in this size class today, the Qwen 3.6-35B-A3B is the upgrade — same architecture, better benchmarks across the agentic / coding metrics that matter most locally. The 3.5 version stays useful for existing deployments and for vision through Ollama specifically.
The 27B dense — the coder’s pick
The 27B is the dense model in the family. Every parameter is active on every token, which means slower generation than the MoE but deeper reasoning per token.
| Benchmark | 27B | 35B-A3B | GPT-5 mini |
|---|---|---|---|
| SWE-bench Verified | 72.4 | 69.2 | 72.0 |
| LiveCodeBench v6 | 80.7 | 74.6 | 80.5 |
| Terminal-Bench 2 | 41.6 | 40.5 | 31.9 |
| HMMT Feb 2025 (Math) | 92.0 | 89.0 | 89.2 |
72.4 SWE-bench Verified matches GPT-5 mini exactly. Terminal-Bench 2 at 41.6 beats it by 30%. This is a competitive coding model in a footprint that fits a single 24GB GPU at Q4 (~17GB), or A6000 territory at Q8 (~30GB).
For sustained coding work on a 24GB card, the 27B dense is the pick over the 35B-A3B. The 35B-A3B’s speed-and-breadth wins are real, but the diff fidelity gap shows up on long sessions.
The 122B-A10B — the Mac Studio tier
122 billion total parameters, 10 billion active. Built for machines with 48GB+ unified memory.
| Benchmark | 122B-A10B | GPT-5 mini | Claude Sonnet 4.5 |
|---|---|---|---|
| MMLU-Pro | 86.7 | 83.7 | 80.8 |
| SWE-bench Verified | 72.0 | 72.0 | 77.2 |
| Terminal-Bench 2 | 49.4 | 31.9 | 18.7 |
| BFCL-V4 (Tool Use) | 72.2 | 55.5 | 54.8 |
| BrowseComp | 63.8 | 48.1 | 41.1 |
Scores from Qwen’s model card. Sonnet 4.5 SWE-bench from Anthropic.
Terminal-Bench 2 at 49.4 vs GPT-5 mini’s 31.9 isn’t a close race. BFCL-V4 at 72.2 vs 55.5 is a 30% margin on tool use. The 122B-A10B at Q4 (~70 GB) is what justifies a Mac Studio with 96GB+ unified memory, or a dual-GPU rig totaling 80GB+, or DGX-Spark-class server hardware.
There is no 3.6 equivalent at this tier. If you have the hardware, the 122B-A10B is the strongest local open option Qwen ships in mid-2026, full stop.
The 397B-A17B flagship
The top of the open Qwen lineup. ~214 GB at Q4 — Mac Ultra with 192GB+ unified memory, dedicated server hardware, or DGX-Spark / Strix Halo 128GB territory (tight at Q4).
Reportedly runs at 45 tok/s on 8x H100s. Beats GPT-5.2 on IFBench (76.5 vs 75.4, highest of any model tested) and MultiChallenge (67.6 vs 57.9). Trails on AIME 2026 (91.3 vs 96.7) and SWE-bench (76.4 vs 80.0). Competitive with frontier closed systems, but the hardware ask is real.
For most local-AI users, the 122B or 35B-A3B cover the same ground at a fraction of the hardware cost. The 397B is for the people running it for specific frontier-adjacent workloads that justify the rig.
Backend Status (mid-June 2026)
The launch-week rough edges are gone. Current backend reality:
- Ollama (0.30.x) —
ollama run qwen3.5:35bandollama run qwen3.5:27b-q4_K_Mwork out of the box. Multimodal (vision) works on the 3.5 path. The 3.6 vision split (separate mmproj) caused earlier friction in Ollama specifically; the 3.5 vision flow doesn’t have that complication. - llama.cpp (current mainline) — Qwen 3.5 has been stable across builds for months at this point. Speculative decoding for MoE (MTP) landed in mainline via PR #22673 on May 16, 2026, so the older
llama-mtpfork workaround is no longer needed. For multi-GPU 27B configs, the early multi-GPU graph-split issues (#19860 / #19866) have been fixed for months — any recent mainline build is fine. - vLLM (0.23.0+) — full 3.5 support including FP8, GPTQ, MTP speculative decoding, and the GDN (Gated DeltaNet) reasoning parser. The single-model-locked + 90% VRAM pre-allocation behavior of vLLM still applies; for the platform reality across runtimes, see the llama.cpp vs Ollama vs vLLM guide.
- SGLang — Qwen team recommends it for production serving. MTP available; GDN support.
- LM Studio — covers Qwen 3.5 fine for the desktop-GUI use case.
GGUF sources to know:
- unsloth/Qwen3.5-35B-A3B-GGUF and unsloth/Qwen3.5-27B-GGUF — the Unsloth dynamic-quant variants, with the UD-Q4_K_M / UD-Q5_K_XL etc. tags.
- bartowski/Qwen_Qwen3.5-35B-A3B-GGUF and bartowski/Qwen_Qwen3.5-27B-GGUF — bartowski’s imatrix quants.
- mradermacher maintains static and imatrix quant repos across the catalog too.
For the Qwen-org direct paths: only the base repos (e.g. Qwen/Qwen3.5-35B-A3B) exist under the official Qwen org. There is no Qwen/Qwen3.5-35B-A3B-GGUF. The GGUF files live under the quantizer accounts above. If a guide tells you to run llama-server -hf Qwen/Qwen3.5-35B-A3B-GGUF:Q4_K_M, the path is wrong; use the Unsloth or bartowski repo instead.
Quantization Notes
Unsloth’s benchmarks on the 35B-A3B (vendor-attributed):
| Quant | Top-1 Token Agreement | Notes |
|---|---|---|
| Q4_K_L | 89% | Best quality retention at 4-bit |
| MXFP4 | Good (PPL +1.38) | New format, fast |
| UD-Q4_K_XL | 79.4% | Lowest quality at 4-bit |
Q4_K_L retains the highest quality at 4-bit. If your VRAM fits it, prefer Q4_K_L over Q4_K_XL.
For the 397B, Unsloth reports UD-Q4_K_XL stays within 1 point of accuracy on most benchmarks despite cutting file size by ~500GB. At flagship scale, aggressive quantization hurts less.
Other findings worth knowing:
- 8-bit KV cache improves output quality when running 4-bit model quants.
- FP8 weights are available officially for every 3.5 size, giving a clean middle ground between full precision and GGUF quants.
- For the deeper background, the quantization decision frame in the Qwen 3.6 / RTX 3090 piece carries over to 3.5. The trade-off shape is the same.
How to Run It
Ollama (simplest):
ollama run qwen3.5:35b
# or for the 27B dense (coder pick):
ollama run qwen3.5:27b-q4_K_M
Multimodal works out of the box on the 3.5 path.
llama.cpp (most control):
# Pull from unsloth (the GGUF source that actually exists)
./llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
--jinja --reasoning-format deepseek -ngl 99
Any recent mainline build is fine. The multi-GPU graph-split fixes are months old; MTP speculative decoding is in mainline since May 16.
Disable thinking by default (saves tokens on simple tasks):
./llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
--jinja --chat-template-kwargs '{"enable_thinking": false}' -ngl 99
You can also add /think or /nothink to individual messages to toggle per-request.
Recommended sampling parameters (from Qwen):
- General thinking mode: temperature 1.0, top_p 0.95, top_k 20, presence_penalty 1.5
- Coding mode: temperature 0.6, top_p 0.95, top_k 20, presence_penalty 0.0
- Non-thinking instruct: temperature 0.7, top_p 0.8, top_k 20, presence_penalty 1.5
Known Issues, Mid-2026 Status
Most launch-era 3.5 rough edges are resolved. A short list of what’s still worth knowing:
- DeltaNet CUDA throughput. The hybrid DeltaNet architecture launched with measurable CUDA-side overhead vs standard attention — community reports flagged roughly a 35% gap vs older MoE on some configs. Throughput has improved across llama.cpp builds since but isn’t fully optimized; running a recent mainline build matters here more than for simpler architectures.
- Repetition / looping. If you see it, raise
--presence-penaltyto 1.5 (up to 2.0). - Vision via Ollama on 3.5 works. The vision friction reports you may have seen were specifically about 3.6’s separate mmproj file in Ollama; 3.5’s multimodal flow doesn’t have the same split.
- Multi-GPU 27B. Launch-week dual-3090 crashes (#19860, fixed via #19866 in February) are long past; any current mainline build handles this.
Qwen 3.5 vs Qwen 3 — Historical Context
For readers comparing against the older Qwen 3 lineup:
| Qwen 3 | Qwen 3.5 | |
|---|---|---|
| Dense model | 32B | 27B (denser, better benchmarks) |
| Small MoE | 30B-A3B | 35B-A3B (5B more total params) |
| Medium MoE | — | 122B-A10B (new tier) |
| Large MoE | 235B-A22B | 397B-A17B |
| Architecture | Standard attention | Hybrid DeltaNet + attention (3:1) |
| Multimodal | Separate VL models | Native in all models |
| Context | 128K | 262K (1M via YaRN) |
| FP8 weights | Community only | Official |
| Vocabulary | 152K tokens | 250K tokens |
The 35B-A3B beat the previous flagship Qwen3-235B on language, vision, and agent benchmarks despite being about 7× smaller in total parameters. The architectural shift to DeltaNet plus broader vocabulary is the reason.
The broader family tree across versions lives in the Qwen Models Guide.
Bottom Line
For the 35B-A3B / 27B dense class (the two tiers 3.6 also covers):
- Fresh deployment? Start with 3.6 for the benchmark deltas (Terminal-Bench, MCPMark, SWE-bench all moved the right direction).
- Already on stable 3.5? Don’t rush. The upgrade is “worth doing eventually,” not “fix-it-tomorrow.”
- Need vision through Ollama specifically? 3.5 is the smoother path right now.
For the 122B-A10B and 397B-A17B tiers (no 3.6 equivalent):
- Run 3.5. There’s no other open option from Qwen at these scales right now.
- The 122B-A10B remains the strongest local open Qwen for heavy tool-use workflows on 48GB+ unified memory.
- The 397B-A17B is the open frontier if you have server hardware.
Don’t expect a 3.7 open equivalent any time soon. As of mid-June 2026, the 3.7 generation is closed and API-only across Max, Plus, and VLA, with no Qwen3.7-* repository on the official Hugging Face org. The open weights vs closed frontier (2026) breakdown tracks that question.
# 35B-A3B default, still works fine on Ollama
ollama run qwen3.5:35b
# 27B dense, sustained-coding pick
ollama run qwen3.5:27b-q4_K_M
# llama.cpp direct (use the GGUF source that actually exists)
./llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M --jinja -ngl 99
# Upgrading the 35B-A3B class? Step to 3.6:
ollama run qwen3.6
Related Guides
- Qwen 3.6 Local Guide — the upgrade for the 27B and 35B-A3B tiers
- Is Qwen Going Closed? Open Weights vs Frontier (2026) — the two-track strategic picture
- Qwen Models Guide — the family tree across versions
- llama.cpp vs Ollama vs vLLM (2026) — runtime decision for serving
- VRAM Requirements for Local LLMs
- Best Local Models for OpenClaw / agentic harnesses
- Best Local Coding Models (2026)
- Function Calling with Local LLMs
- Fix Slow Qwen 3.6-27B on RTX 3090 — quantization decision frame applies to 3.5 too
- Ollama vs LM Studio
- Best Local LLMs for Mac (2026)
- Ollama Troubleshooting Guide
Get notified when we publish new guides.
Subscribe — free, no spam