VOOZH about

URL: https://willitrunai.com/blog/best-local-coding-llms-apple-silicon-24gb

⇱ Best Coding LLMs for Apple Silicon 24GB — Ranked 2026 | Will It Run AI Blog


Best local coding LLMs for 24GB Apple Silicon in 2026 — ranked picks for M4 Pro, M4 Max 36GB, and M3 Pro, with tok/s estimates, recommended quantization, and integration notes for Cursor / Continue.dev / VSCode.

For the ranked model list against your specific hardware, see:

Top coding picks at 24GB unified memory

RankModelVRAM Q4tok/s (M4 Pro)Best for
1Qwen 3 Coder 30B-A3B~17 GB~30-35Overall coding champion; MoE sparsity keeps inference fast
2Qwen 3.5 35B-A3B~21 GB~30Tight but strong general+coding MoE
3Qwen 3 Coder 14B~8 GB~55Fastest respectable coding model; perfect for Cursor-style flows
4Qwen 3.5 27B~16 GB~35Dense alternative; more predictable latency
5DeepSeek Coder V2.5 Lite~14 GB~40Different style, strong on Python/TS
6Qwen 3 14B~8 GB~50Not fine-tuned for code but fast and capable
7Gemma 3 9B~6 GB~60Lightweight fallback; good for quick Q&A

Why Qwen 3 Coder 30B-A3B wins

The MoE architecture (30B total, 3B active per token) gives it the knowledge breadth of a 30B dense model while running at the speed of a 3B dense model. On a 24GB M4 Pro Mac you get:

  • ~17 GB loaded into unified memory
  • ~7 GB headroom for KV cache and macOS/apps
  • ~30-35 tok/s sustained (active-cooled Pro)
  • Full 262K context without extra memory pressure

For repo-level refactors and agentic workflows (where the model generates multiple tool-calls per turn), this combination is unmatched at 24GB.

When to pick Qwen 3.5 35B-A3B instead

If you want the general-purpose MoE (chat + coding + reasoning), Qwen 3.5 35B-A3B edges out Qwen 3 Coder 30B-A3B on non-code tasks. Coding performance is very close. The cost is ~4 GB more VRAM — on 24GB Macs this means fewer open apps during sessions.

When open weights ship, Qwen3.6-35B-A3B will inherit this slot with the added 1M-context capability for agentic coding.

Quantization: why you want Q5_K_M for code

Code is syntax-sensitive. A missing bracket or quote character due to aggressive quantization destroys the output. Q4_K_M is acceptable for chat-style coding assistance but we have seen reliable quality gains moving to Q5_K_M or Q6_K:

Quant30B-A3B VRAMCode quality delta vs FP16
Q4_K_M~17 GB-3 to -5% (occasional syntax slips)
Q5_K_M~20 GB-1 to -2% (effectively identical for most tasks)
Q6_K~24 GB< -1% (near-lossless; won't fit 30B-A3B on 24GB Mac)
Q8_0~32 GBNo measurable delta (requires 32GB+ Mac)

On a 24GB Mac, stick with Q4_K_M for the 30B-A3B class. If you have a 36GB+ Mac, step up to Q5 or Q6.

Integration with coding tools

All of the picks above expose an OpenAI-compatible endpoint via Ollama or LM Studio, so any tool that speaks OpenAI works.

Ollama (recommended):

ollama pull qwen3-coder:30b-a3b
ollama run qwen3-coder:30b-a3b
# endpoint: http://localhost:11434/v1

LM Studio: Search Qwen3-Coder-30B-A3B-Instruct-GGUF, pick Q4_K_M, start server.

Cursor:

  • Settings → Models → Add custom model
  • Base URL: http://localhost:11434/v1
  • Model: qwen3-coder:30b-a3b

Continue.dev (VSCode):

{
 "models": [
 {
 "title": "Qwen 3 Coder 30B-A3B (local)",
 "provider": "ollama",
 "model": "qwen3-coder:30b-a3b"
 }
 ]
}

MLX vs GGUF on Apple Silicon

  • MLX (Apple's native ML framework) delivers ~15-25% faster tok/s than llama.cpp GGUF on M-series chips.
  • GGUF is more mature, has wider tool support (Ollama, LM Studio, Continue.dev out of the box), and the ecosystem is larger.
  • Recommendation for 2026: Start with GGUF via Ollama for ease of use. If you hit bandwidth limits and want the extra tok/s, switch to MLX with mlx-community models — see our Qwen 3.5 MLX guide.

What about coding on smaller Macs (16 GB)?

If you have a 16 GB Mac, the coding LLM roster is different — see Best AI models for a 16GB Mac for the tailored list. Short version: Qwen 3 Coder 14B at Q4_K_M or Gemma 4 E4B at Q8 are the daily drivers.

Related

Frequently Asked Questions