Voozh

Best local coding LLMs for 24GB Apple Silicon in 2026 — ranked picks for M4 Pro, M4 Max 36GB, and M3 Pro, with tok/s estimates, recommended quantization, and integration notes for Cursor / Continue.dev / VSCode.

For the ranked model list against your specific hardware, see:

Top coding picks at 24GB unified memory

Rank	Model	VRAM Q4	tok/s (M4 Pro)	Best for
1	Qwen 3 Coder 30B-A3B	~17 GB	~30-35	Overall coding champion; MoE sparsity keeps inference fast
2	Qwen 3.5 35B-A3B	~21 GB	~30	Tight but strong general+coding MoE
3	Qwen 3 Coder 14B	~8 GB	~55	Fastest respectable coding model; perfect for Cursor-style flows
4	Qwen 3.5 27B	~16 GB	~35	Dense alternative; more predictable latency
5	DeepSeek Coder V2.5 Lite	~14 GB	~40	Different style, strong on Python/TS
6	Qwen 3 14B	~8 GB	~50	Not fine-tuned for code but fast and capable
7	Gemma 3 9B	~6 GB	~60	Lightweight fallback; good for quick Q&A

Why Qwen 3 Coder 30B-A3B wins

The MoE architecture (30B total, 3B active per token) gives it the knowledge breadth of a 30B dense model while running at the speed of a 3B dense model. On a 24GB M4 Pro Mac you get:

~17 GB loaded into unified memory
~7 GB headroom for KV cache and macOS/apps
~30-35 tok/s sustained (active-cooled Pro)
Full 262K context without extra memory pressure

For repo-level refactors and agentic workflows (where the model generates multiple tool-calls per turn), this combination is unmatched at 24GB.

When to pick Qwen 3.5 35B-A3B instead

If you want the general-purpose MoE (chat + coding + reasoning), Qwen 3.5 35B-A3B edges out Qwen 3 Coder 30B-A3B on non-code tasks. Coding performance is very close. The cost is ~4 GB more VRAM — on 24GB Macs this means fewer open apps during sessions.

When open weights ship, Qwen3.6-35B-A3B will inherit this slot with the added 1M-context capability for agentic coding.

Quantization: why you want Q5_K_M for code

Code is syntax-sensitive. A missing bracket or quote character due to aggressive quantization destroys the output. Q4_K_M is acceptable for chat-style coding assistance but we have seen reliable quality gains moving to Q5_K_M or Q6_K:

Quant	30B-A3B VRAM	Code quality delta vs FP16
Q4_K_M	~17 GB	-3 to -5% (occasional syntax slips)
Q5_K_M	~20 GB	-1 to -2% (effectively identical for most tasks)
Q6_K	~24 GB	< -1% (near-lossless; won't fit 30B-A3B on 24GB Mac)
Q8_0	~32 GB	No measurable delta (requires 32GB+ Mac)

On a 24GB Mac, stick with Q4_K_M for the 30B-A3B class. If you have a 36GB+ Mac, step up to Q5 or Q6.

Integration with coding tools

All of the picks above expose an OpenAI-compatible endpoint via Ollama or LM Studio, so any tool that speaks OpenAI works.

Ollama (recommended):

ollama pull qwen3-coder:30b-a3b
ollama run qwen3-coder:30b-a3b
# endpoint: http://localhost:11434/v1

LM Studio: Search Qwen3-Coder-30B-A3B-Instruct-GGUF, pick Q4_K_M, start server.

Cursor:

Settings → Models → Add custom model
Base URL: http://localhost:11434/v1
Model: qwen3-coder:30b-a3b

Continue.dev (VSCode):

{
 "models": [
 {
 "title": "Qwen 3 Coder 30B-A3B (local)",
 "provider": "ollama",
 "model": "qwen3-coder:30b-a3b"
 }
 ]
}

MLX vs GGUF on Apple Silicon

MLX (Apple's native ML framework) delivers ~15-25% faster tok/s than llama.cpp GGUF on M-series chips.
GGUF is more mature, has wider tool support (Ollama, LM Studio, Continue.dev out of the box), and the ecosystem is larger.
Recommendation for 2026: Start with GGUF via Ollama for ease of use. If you hit bandwidth limits and want the extra tok/s, switch to MLX with mlx-community models — see our Qwen 3.5 MLX guide.

What about coding on smaller Macs (16 GB)?

If you have a 16 GB Mac, the coding LLM roster is different — see Best AI models for a 16GB Mac for the tailored list. Short version: Qwen 3 Coder 14B at Q4_K_M or Gemma 4 E4B at Q8 are the daily drivers.

URL: https://willitrunai.com/blog/best-local-coding-llms-apple-silicon-24gb

⇱ Best Coding LLMs for Apple Silicon 24GB — Ranked 2026 | Will It Run AI Blog