⚡ Each donation = another big MoE quantized

I host 30+ free APEX MoE quantizations as independent research. My only local hardware is an NVIDIA DGX Spark (122 GB unified memory) — enough for ~30-50B-class MoEs, but bigger ones (200B+) require rented compute on H100/H200/Blackwell, typically $20-100 per quant.
If APEX quants are useful to you, your support directly funds those bigger runs.

🎉 Patreon (Monthly) | ☕ Buy Me a Coffee | ⭐ GitHub Sponsors

Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — APEX-MTP GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, with the MTP (multi-token prediction) head bundled for in-the-box self-speculative decoding.

Brought to you by the LocalAI team | APEX Project | Technical Report

What's different from the plain APEX repo?

These GGUFs bundle the model's MTP (multi-token prediction) head alongside the trunk in a single file, courtesy of llama.cpp PR #22673. With a recent llama.cpp (>= commit 255582687) you can enable self-speculative decoding using just this one file — no separate draft model needed:

llama-server -m Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-I-Balanced.gguf --draft-mtp

The non-MTP version is still available at mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-GGUF — slightly smaller, but no self-spec.

File sizes

Each quant is ~2.5% larger than its non-MTP counterpart (one extra transformer-block worth of weights, no embedding duplication since MTP shares the trunk's embed_tokens).

MTP draft head precision

The bundled MTP head (blk.40.* including the nextn.* projection + norms) is quantized to Q8_0 (near-lossless) on every tier except I-Nano. I-Nano keeps the trunk-tier precision on the MTP block (Q3_K routed experts, Q4_K attention) but pins blk.40.nextn.eh_proj to Q4_K — see the explainer below.

This keeps draft accuracy high (important for spec-decode acceptance rate) at a modest ~1 GB cost per file vs. trunk-tier precision.

Why the MTP head doesn't use imatrix

llama-imatrix runs normal forward passes that only activate the trunk (blk.0..blk.39). The MTP head only fires during --draft-mtp spec decoding, so its tensors get no imatrix activation data. We work around this by quantizing the MTP head with static K-quant / Q8_0 which doesn't require imatrix.

(A patch to llama-imatrix that records MTP activations during collection is in progress at mudler/llama.cpp#mtp-imatrix — once upstream this will let us push the drafter to lower bit-widths cleanly.)

What is APEX?

APEX is a MoE-aware mixed-precision quantization strategy. Per-tensor-role gradient: routed experts compress hardest, shared experts kept high (always active), attention/Mamba uniform; 5+5 symmetric edge gradient across the 40 trunk layers + MTP layer 40 at edge precision. I-variants use diverse imatrix calibration (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).

See the APEX project for full details.

Architecture

Base: Qwen 3.6 35B-A3B family (Qwen3_5MoeForCausalLM)
Layers: 40 trunk + 1 MTP (bundled)
Experts: 256 routed + 1 shared (8 active per token)
Hidden size: 2048
Calibration: v1.3 diverse dataset

Credits

APEX quantization: LocalAI team
MTP support: llama.cpp PR #22673 by Aman Gupta + ggerganov
Built on llama.cpp

Downloads last month: 39,224

GGUF

Model size

36B params

Architecture

qwen35moe

Hardware compatibility

16-bit

View +8 variants

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF

Base model

Qwen/Qwen3.6-35B-A3B

Adapter

lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled

Quantized

(38)

this model

Collection including mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF

MoE models quantized with the APEX Quantization technique ( https://github.com/mudler/apex-quant ) • 36 items • Updated 19 days ago • 115

URL: https://huggingface.co/mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF

⇱ mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF · Hugging Face

⚡ Each donation = another big MoE quantized

Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled — APEX-MTP GGUF

What's different from the plain APEX repo?

File sizes

MTP draft head precision

Why the MTP head doesn't use imatrix

What is APEX?

Architecture

Credits

Model tree for mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF

Collection including mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF