VOOZH about

URL: https://insiderllm.com/guides/qwen-3-5-local-ai-guide/

⇱ Best Qwen 3.5 Setup: When to Stay vs Move to 3.6 (2026) | InsiderLLM


📚 Related: Qwen 3.6 Local Guide · Is Qwen Going Closed? Open Weights vs Frontier (2026) · Qwen Models Guide · VRAM Requirements

You’re running Qwen 3.5 in production, or you’ve been about to install it, and the lineup looks like it’s been overtaken. 3.6 dropped in April. 3.7-Max launched in May. The threads on r/LocalLLaMA moved on. Is 3.5 still the right pick, or are you about to deploy something that’s already a generation behind?

The honest answer is: it depends on which tier, and the picture is messier than “just use the newer one.” 3.6 only replaced two of the four 3.5 tiers: the 27B dense and the 35B-A3B MoE. The big MoEs (122B-A10B and the 397B-A17B flagship) have no 3.6 equivalent. The next-generation 3.7 went closed and API-only: Max, Plus, and the VLA robotics model are paid endpoints, not open weights. The 3.6 open mid-tier sits where it landed in April.

That makes 3.5 the stable open workhorse most local-AI rigs are actually running today. This guide is for you whether you’re on 3.5 already, picking between 3.5 and 3.6, or just trying to figure out where your tier sits in the current Qwen split.


The Two-Track Reality

For context on the broader Qwen split: see the Qwen open weights vs closed frontier (2026) breakdown for the strategic picture. The short version that matters here:

  • Closed frontier (3.7 generation, API-only): 3.7-Max (text reasoning, May 19), Qwen-VLA (robotics, May 29), 3.7-Plus (multimodal agent, June 1). Three closed releases in under a month. No open weights as of mid-June 2026.
  • Open mid-tier (3.5 and 3.6, on Hugging Face under Apache 2.0): 3.6-27B and 3.6-35B-A3B (April), plus the full 3.5 lineup (27B / 35B-A3B / 122B-A10B / 397B-A17B). What you can actually run locally.

So when this guide says “3.5 is the stable open workhorse,” it means within the open mid-tier — the only tier where any of these models are runnable locally. The 3.7-tier paid endpoints are a separate decision about whether to call an API, not about which local model to run.


3.5 vs 3.6: Which Should You Run?

The decision splits cleanly by tier:

Your tierThe honest pickWhy
8GB-24GB GPU, want the MoE (35B-A3B class)Qwen 3.6-35B-A3B is the upgrade for text3.6’s benchmark deltas vs 3.5 at this size are real: SWE-bench Verified 70.0 → 73.4, Terminal-Bench 2.0 40.5 → 51.5, MCPMark (tool use) 27.0 → 37.0 (per Unsloth’s comparison; treat as Qwen-reported, not independent). If you’re starting fresh in this class, 3.6 is the better call. If you’re already on 3.5-35B-A3B and stable, the upgrade urgency is “real but not on-fire.”
24GB GPU, primary task is sustained codingQwen 3.5-27B dense is still the strongest open coder Qwen shipsQwen 3.6’s 27B dense is the same size on a similar architecture and is the better newer model on paper, but as of mid-2026 the 3.5-27B remains the more thoroughly battle-tested option for sustained-coding workloads. Run 3.5-27B if your harness is built around it; consider 3.6-27B for fresh deployments.
48GB+ unified memory / multi-GPU, want the bigger MoEQwen 3.5-122B-A10B has no 3.6 replacementNothing in the 3.6 lineup sits at this tier yet. Heavy tool-use workflows on Mac Studio (96GB+) or dual-GPU rigs are 3.5-122B-A10B territory by default.
128GB+ unified memory / server hardwareQwen 3.5-397B-A17B flagship, no 3.6 replacementSame answer at the top tier: 3.6 didn’t ship a flagship. If you have the hardware to run 397B, 3.5 is what you run.
Need vision and you use Ollama specificallyQwen 3.5 (mature vision path), or 3.6 with llama.cpp/LM Studio/vLLM/SGLangThe 3.6 vision split (separate mmproj file) caused friction in Ollama earlier in the year; the 3.5 vision path is settled. For non-Ollama runtimes, 3.6 vision is fine.

The simple version:

  • If your tier has both a 3.5 and a 3.6 option, prefer 3.6 for fresh deployments, don’t rush an existing 3.5 install you’re happy with.
  • If your tier (122B-A10B or 397B-A17B) only exists in 3.5, run 3.5. There’s no other open option from Qwen at that scale right now.
  • If you want the frontier, the 3.7 generation is API-only, not a local option.

The Qwen 3.5 Lineup

Four open-weights models, all Apache 2.0, all natively multimodal (text, image, video), all on 262K context (1M via YaRN), all with thinking/non-thinking modes:

ModelTotal ParamsActive ParamsArchitectureContextQ4 GGUF Size
Qwen3.5-27B27B27B (dense)Dense + Hybrid Attention262K~17 GB
Qwen3.5-35B-A3B35B3BMoE + Hybrid Attention262K~22 GB
Qwen3.5-122B-A10B122B10BMoE + Hybrid Attention262K~70 GB
Qwen3.5-397B-A17B397B17BMoE + Hybrid Attention262K~214 GB

The architecture is the same hybrid of Gated DeltaNet (linear attention) and full attention in a 3:1 ratio — three DeltaNet layers for every full-attention layer. Linear attention scales near-linearly with sequence length, which is why these models hold 256K context without the typical speed cliff.

Base repos all live under the official Qwen Hugging Face org: Qwen3.5-27B, Qwen3.5-35B-A3B, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B. For GGUF, the established sources are unsloth and bartowski.


Which 3.5 Model on Which GPU

Your HardwareVRAMBest 3.5 PickQuantizationSpeed expectation
RTX 3060 / 4060 (8GB)8 GB35B-A3BQ2-Q3 (tight)Usable but slow
RTX 3060 (12GB)12 GB35B-A3BQ4_K_M~30-40 tok/s
RTX 5060 Ti / 5080 (16GB)16 GB35B-A3BQ4_K_L or Q4_K_XL~40-60 tok/s
RTX 4090 / 3090 (24GB)24 GB27B at Q4 (coder) or 35B-A3B at Q8 (default)Q4_K_M / Q820-60 tok/s
A6000 / dual GPU (48GB)48 GB27B at Q8 or 122B-A10B at Q4Q8 / Q4_K_M15-35 tok/s
Mac M4 Max (64GB)64 GB122B-A10BQ4_K_MMac-bench dependent
Mac Ultra / Strix Halo (128GB+)128 GB+397B-A17BQ4Server-class

Speed expectations are community-reported and vary with quant choice, harness, and prompt length. For exact figures on your config, the VRAM Calculator and the VRAM requirements guide are the reference reads.

The interesting tier is 16GB. The MoE architecture means the 35B-A3B activates only 3B parameters per token, so you get 35B-class breadth from inference speeds that look like a small model. The RTX 5060 Ti 16GB and RTX 5080 16GB both run it at Q4 comfortably.

On a 24GB card, the choice is genuine: 27B at Q4_K_M (~17GB, room to spare for other tools and the densest reasoning per token in the family) versus 35B-A3B at Q8 (~22GB, better quant quality, faster generation, better tool use). If you do both sustained coding and agentic work, both fit in Ollama. Hot-swap between them.


The Models, in Detail

The 35B-A3B — the default

The 35B-A3B is the model most people think of when they hear “Qwen 3.5.” 35 billion total parameters, 3 billion active per token. Runs at small-model speeds, draws on large-model knowledge.

Community-reported throughput (Unsloth and r/LocalLLaMA threads):

GPU / QuantToken genPrompt processing
RTX 5090, GPTQ-Int4 (vLLM, fp8 KV)~194 tok/s~7,026 tok/s
AMD R9700, Q4_K_XL (Vulkan)~127 tok/s~2,713 tok/s
DGX Spark, Q5 (UD-Q5_K_XL)~58.6 tok/s~1,861 tok/s
Tesla V100 32GB, GGUF~38.4 tok/s~570 tok/s

The V100 number is worth highlighting: that’s a used card you can find for $300-400, running a 35B-class MoE at near-30 tok/s. The architecture is the unlock.

Qwen-reported benchmarks vs the comparable competition:

Benchmark35B-A3BGPT-5 miniClaude Sonnet 4.5
MMLU-Pro85.383.780.8
GPQA Diamond84.282.880.1
SWE-bench Verified69.272.077.2
BFCL-V4 (Tool Use)67.355.554.8
BrowseComp61.048.141.1
TAU2-Bench (Agentic)81.2

Scores from Qwen’s published model cards. Sonnet 4.5 SWE-bench from Anthropic. Treat all of these as vendor-reported and check your own workload.

The BFCL-V4 number (67.3 vs GPT-5 mini’s 55.5) is what made this the default agentic-coding pick locally. For local AI agents and function calling, the 35B-A3B is the open-weight model most people are pointing harnesses at.

Caveat that the existing community feedback has surfaced: the 35B-A3B is lower than the 27B dense on sustained coding (SWE-bench 69.2 vs 72.4, LiveCodeBench 74.6 vs 80.7) and there are r/LocalLLaMA reports of broken diffs and hallucinated APIs on long-running coding sessions. If sustained coding is your primary use case, the 27B is the better 3.5 pick.

If you’re starting fresh in this size class today, the Qwen 3.6-35B-A3B is the upgrade — same architecture, better benchmarks across the agentic / coding metrics that matter most locally. The 3.5 version stays useful for existing deployments and for vision through Ollama specifically.

The 27B dense — the coder’s pick

The 27B is the dense model in the family. Every parameter is active on every token, which means slower generation than the MoE but deeper reasoning per token.

Benchmark27B35B-A3BGPT-5 mini
SWE-bench Verified72.469.272.0
LiveCodeBench v680.774.680.5
Terminal-Bench 241.640.531.9
HMMT Feb 2025 (Math)92.089.089.2

72.4 SWE-bench Verified matches GPT-5 mini exactly. Terminal-Bench 2 at 41.6 beats it by 30%. This is a competitive coding model in a footprint that fits a single 24GB GPU at Q4 (~17GB), or A6000 territory at Q8 (~30GB).

For sustained coding work on a 24GB card, the 27B dense is the pick over the 35B-A3B. The 35B-A3B’s speed-and-breadth wins are real, but the diff fidelity gap shows up on long sessions.

The 122B-A10B — the Mac Studio tier

122 billion total parameters, 10 billion active. Built for machines with 48GB+ unified memory.

Benchmark122B-A10BGPT-5 miniClaude Sonnet 4.5
MMLU-Pro86.783.780.8
SWE-bench Verified72.072.077.2
Terminal-Bench 249.431.918.7
BFCL-V4 (Tool Use)72.255.554.8
BrowseComp63.848.141.1

Scores from Qwen’s model card. Sonnet 4.5 SWE-bench from Anthropic.

Terminal-Bench 2 at 49.4 vs GPT-5 mini’s 31.9 isn’t a close race. BFCL-V4 at 72.2 vs 55.5 is a 30% margin on tool use. The 122B-A10B at Q4 (~70 GB) is what justifies a Mac Studio with 96GB+ unified memory, or a dual-GPU rig totaling 80GB+, or DGX-Spark-class server hardware.

There is no 3.6 equivalent at this tier. If you have the hardware, the 122B-A10B is the strongest local open option Qwen ships in mid-2026, full stop.

The 397B-A17B flagship

The top of the open Qwen lineup. ~214 GB at Q4 — Mac Ultra with 192GB+ unified memory, dedicated server hardware, or DGX-Spark / Strix Halo 128GB territory (tight at Q4).

Reportedly runs at 45 tok/s on 8x H100s. Beats GPT-5.2 on IFBench (76.5 vs 75.4, highest of any model tested) and MultiChallenge (67.6 vs 57.9). Trails on AIME 2026 (91.3 vs 96.7) and SWE-bench (76.4 vs 80.0). Competitive with frontier closed systems, but the hardware ask is real.

For most local-AI users, the 122B or 35B-A3B cover the same ground at a fraction of the hardware cost. The 397B is for the people running it for specific frontier-adjacent workloads that justify the rig.


Backend Status (mid-June 2026)

The launch-week rough edges are gone. Current backend reality:

  • Ollama (0.30.x)ollama run qwen3.5:35b and ollama run qwen3.5:27b-q4_K_M work out of the box. Multimodal (vision) works on the 3.5 path. The 3.6 vision split (separate mmproj) caused earlier friction in Ollama specifically; the 3.5 vision flow doesn’t have that complication.
  • llama.cpp (current mainline) — Qwen 3.5 has been stable across builds for months at this point. Speculative decoding for MoE (MTP) landed in mainline via PR #22673 on May 16, 2026, so the older llama-mtp fork workaround is no longer needed. For multi-GPU 27B configs, the early multi-GPU graph-split issues (#19860 / #19866) have been fixed for months — any recent mainline build is fine.
  • vLLM (0.23.0+) — full 3.5 support including FP8, GPTQ, MTP speculative decoding, and the GDN (Gated DeltaNet) reasoning parser. The single-model-locked + 90% VRAM pre-allocation behavior of vLLM still applies; for the platform reality across runtimes, see the llama.cpp vs Ollama vs vLLM guide.
  • SGLang — Qwen team recommends it for production serving. MTP available; GDN support.
  • LM Studio — covers Qwen 3.5 fine for the desktop-GUI use case.

GGUF sources to know:

  • unsloth/Qwen3.5-35B-A3B-GGUF and unsloth/Qwen3.5-27B-GGUF — the Unsloth dynamic-quant variants, with the UD-Q4_K_M / UD-Q5_K_XL etc. tags.
  • bartowski/Qwen_Qwen3.5-35B-A3B-GGUF and bartowski/Qwen_Qwen3.5-27B-GGUF — bartowski’s imatrix quants.
  • mradermacher maintains static and imatrix quant repos across the catalog too.

For the Qwen-org direct paths: only the base repos (e.g. Qwen/Qwen3.5-35B-A3B) exist under the official Qwen org. There is no Qwen/Qwen3.5-35B-A3B-GGUF. The GGUF files live under the quantizer accounts above. If a guide tells you to run llama-server -hf Qwen/Qwen3.5-35B-A3B-GGUF:Q4_K_M, the path is wrong; use the Unsloth or bartowski repo instead.


Quantization Notes

Unsloth’s benchmarks on the 35B-A3B (vendor-attributed):

QuantTop-1 Token AgreementNotes
Q4_K_L89%Best quality retention at 4-bit
MXFP4Good (PPL +1.38)New format, fast
UD-Q4_K_XL79.4%Lowest quality at 4-bit

Q4_K_L retains the highest quality at 4-bit. If your VRAM fits it, prefer Q4_K_L over Q4_K_XL.

For the 397B, Unsloth reports UD-Q4_K_XL stays within 1 point of accuracy on most benchmarks despite cutting file size by ~500GB. At flagship scale, aggressive quantization hurts less.

Other findings worth knowing:

  • 8-bit KV cache improves output quality when running 4-bit model quants.
  • FP8 weights are available officially for every 3.5 size, giving a clean middle ground between full precision and GGUF quants.
  • For the deeper background, the quantization decision frame in the Qwen 3.6 / RTX 3090 piece carries over to 3.5. The trade-off shape is the same.

How to Run It

Ollama (simplest):

ollama run qwen3.5:35b
# or for the 27B dense (coder pick):
ollama run qwen3.5:27b-q4_K_M

Multimodal works out of the box on the 3.5 path.

llama.cpp (most control):

# Pull from unsloth (the GGUF source that actually exists)
./llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
 --jinja --reasoning-format deepseek -ngl 99

Any recent mainline build is fine. The multi-GPU graph-split fixes are months old; MTP speculative decoding is in mainline since May 16.

Disable thinking by default (saves tokens on simple tasks):

./llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M \
 --jinja --chat-template-kwargs '{"enable_thinking": false}' -ngl 99

You can also add /think or /nothink to individual messages to toggle per-request.

Recommended sampling parameters (from Qwen):

  • General thinking mode: temperature 1.0, top_p 0.95, top_k 20, presence_penalty 1.5
  • Coding mode: temperature 0.6, top_p 0.95, top_k 20, presence_penalty 0.0
  • Non-thinking instruct: temperature 0.7, top_p 0.8, top_k 20, presence_penalty 1.5

Known Issues, Mid-2026 Status

Most launch-era 3.5 rough edges are resolved. A short list of what’s still worth knowing:

  • DeltaNet CUDA throughput. The hybrid DeltaNet architecture launched with measurable CUDA-side overhead vs standard attention — community reports flagged roughly a 35% gap vs older MoE on some configs. Throughput has improved across llama.cpp builds since but isn’t fully optimized; running a recent mainline build matters here more than for simpler architectures.
  • Repetition / looping. If you see it, raise --presence-penalty to 1.5 (up to 2.0).
  • Vision via Ollama on 3.5 works. The vision friction reports you may have seen were specifically about 3.6’s separate mmproj file in Ollama; 3.5’s multimodal flow doesn’t have the same split.
  • Multi-GPU 27B. Launch-week dual-3090 crashes (#19860, fixed via #19866 in February) are long past; any current mainline build handles this.

Qwen 3.5 vs Qwen 3 — Historical Context

For readers comparing against the older Qwen 3 lineup:

Qwen 3Qwen 3.5
Dense model32B27B (denser, better benchmarks)
Small MoE30B-A3B35B-A3B (5B more total params)
Medium MoE122B-A10B (new tier)
Large MoE235B-A22B397B-A17B
ArchitectureStandard attentionHybrid DeltaNet + attention (3:1)
MultimodalSeparate VL modelsNative in all models
Context128K262K (1M via YaRN)
FP8 weightsCommunity onlyOfficial
Vocabulary152K tokens250K tokens

The 35B-A3B beat the previous flagship Qwen3-235B on language, vision, and agent benchmarks despite being about 7× smaller in total parameters. The architectural shift to DeltaNet plus broader vocabulary is the reason.

The broader family tree across versions lives in the Qwen Models Guide.


Bottom Line

For the 35B-A3B / 27B dense class (the two tiers 3.6 also covers):

  • Fresh deployment? Start with 3.6 for the benchmark deltas (Terminal-Bench, MCPMark, SWE-bench all moved the right direction).
  • Already on stable 3.5? Don’t rush. The upgrade is “worth doing eventually,” not “fix-it-tomorrow.”
  • Need vision through Ollama specifically? 3.5 is the smoother path right now.

For the 122B-A10B and 397B-A17B tiers (no 3.6 equivalent):

  • Run 3.5. There’s no other open option from Qwen at these scales right now.
  • The 122B-A10B remains the strongest local open Qwen for heavy tool-use workflows on 48GB+ unified memory.
  • The 397B-A17B is the open frontier if you have server hardware.

Don’t expect a 3.7 open equivalent any time soon. As of mid-June 2026, the 3.7 generation is closed and API-only across Max, Plus, and VLA, with no Qwen3.7-* repository on the official Hugging Face org. The open weights vs closed frontier (2026) breakdown tracks that question.

# 35B-A3B default, still works fine on Ollama
ollama run qwen3.5:35b
# 27B dense, sustained-coding pick
ollama run qwen3.5:27b-q4_K_M
# llama.cpp direct (use the GGUF source that actually exists)
./llama-server -hf unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_M --jinja -ngl 99
# Upgrading the 35B-A3B class? Step to 3.6:
ollama run qwen3.6

Related Guides

Get notified when we publish new guides.

Subscribe — free, no spam