URL: https://huggingface.co/mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF

mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-GGUF · Hugging Face


⚡ Each donation = another big MoE quantized

I host 30+ free APEX MoE quantizations as independent research. My only local hardware is an NVIDIA DGX Spark (122 GB unified memory), which is enough for ~30-50B-class MoEs, but bigger ones (200B+) require rented compute on H100/H200/Blackwell, typically $20-100 per quant.
If APEX quants are useful to you, your support directly funds those bigger runs.

🎉 Patreon (Monthly)  |  ☕ Buy Me a Coffee  |  ⭐ GitHub Sponsors

Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled: APEX-MTP GGUF

APEX (Adaptive Precision for EXpert Models) quantizations of lordx64/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled, with the MTP (multi-token prediction) head bundled for out-of-the-box self-speculative decoding.

Brought to you by the LocalAI team | APEX Project | Technical Report

What's different from the plain APEX repo?

These GGUFs bundle the model's MTP (multi-token prediction) head alongside the trunk in a single file, courtesy of llama.cpp PR #22673. With a recent llama.cpp (>= commit 255582687), you can enable self-speculative decoding using just this one file, with no separate draft model needed:

llama-server -m Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-MTP-I-Balanced.gguf --draft-mtp

The non-MTP version is still available at mudler/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled-APEX-GGUF. It is slightly smaller but does not support self-speculative decoding.
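As a rough illustration of what self-speculative decoding does, the sketch below runs one greedy draft-and-verify step. It is a toy model of the general technique, not llama.cpp's implementation: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and the "models" are plain functions over integer tokens.

```python
from typing import Callable, List

Token = int


def speculative_step(
    context: List[Token],
    draft_next: Callable[[List[Token]], Token],
    target_next: Callable[[List[Token]], Token],
    num_draft: int,
) -> List[Token]:
    """Run one greedy draft-and-verify step and return the tokens to append."""
    # 1. The cheap drafter (here: a stand-in for the MTP head) proposes tokens.
    proposal: List[Token] = []
    for _ in range(num_draft):
        proposal.append(draft_next(context + proposal))

    # 2. The full model checks each proposed position. A real engine verifies
    #    all positions in one batched pass; the loop is only for readability.
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(context + accepted)
        if tok != expected:
            # First mismatch: keep the full model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every draft matched, so verification also yields one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical full model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical drafter: agrees with the target except right after a 3."""
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(speculative_step([0], toy_draft, toy_target, num_draft=4))
```

In this greedy form the accepted tokens are exactly what the full model would have produced on its own, so a weaker drafter costs speed rather than output quality.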

File sizes

Each quant is ~2.5% larger than its non-MTP counterpart: one extra transformer block's worth of weights, with no embedding duplication because the MTP head shares the trunk's embed_tokens.
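The ~2.5% figure matches a simple back-of-the-envelope estimate: one extra block on top of 40 trunk layers. The sketch below assumes every block has the same size at the same precision and ignores embeddings and the output head, so the real overhead per file can differ; `mtp_overhead_fraction` is a hypothetical helper.

```python
TRUNK_LAYERS = 40  # trunk layer count from the Architecture section
MTP_LAYERS = 1     # one bundled MTP block


def mtp_overhead_fraction(trunk_layers: int, mtp_layers: int) -> float:
    """Relative size increase from adding mtp_layers equally sized blocks."""
    if trunk_layers <= 0:
        raise ValueError("trunk_layers must be positive")
    return mtp_layers / trunk_layers


overhead = mtp_overhead_fraction(TRUNK_LAYERS, MTP_LAYERS)

if __name__ == "__main__":
    print(f"Estimated size overhead: {overhead:.1%}")  # prints 2.5%
```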

MTP draft head precision

The bundled MTP head (blk.40.* including the nextn.* projection + norms) is quantized to Q8_0 (near-lossless) on every tier except I-Nano. I-Nano keeps the trunk-tier precision on the MTP block (Q3_K routed experts, Q4_K attention) but pins blk.40.nextn.eh_proj to Q4_K; see the explainer below.

This keeps draft accuracy high (important for spec-decode acceptance rate) at a modest ~1 GB cost per file vs. trunk-tier precision.
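To see why acceptance rate matters, a common simplification treats each drafted token as accepted independently with probability `accept_rate`. Under that assumption, the expected number of tokens produced per verification pass is (1 - a^(k+1)) / (1 - a) for k drafted tokens. The sketch below only evaluates that formula; the rates and draft lengths are illustrative and not measurements of this model.

```python
def expected_tokens_per_step(accept_rate: float, num_draft: int) -> float:
    """Expected tokens per verification pass under i.i.d. acceptance."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if num_draft < 0:
        raise ValueError("num_draft must be non-negative")
    if accept_rate == 1.0:
        # Every draft is accepted, plus the bonus token from verification.
        return float(num_draft + 1)
    return (1.0 - accept_rate ** (num_draft + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    # Illustrative values only: compare a sharper and a blurrier drafter.
    for rate in (0.6, 0.8, 0.9):
        print(rate, round(expected_tokens_per_step(rate, num_draft=3), 3))
```

Because the payoff grows quickly as the acceptance rate approaches 1, spending extra bits on the drafter can be worthwhile even though it enlarges the file.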

Why the MTP head doesn't use imatrix

llama-imatrix runs normal forward passes that only activate the trunk (blk.0..blk.39). The MTP head only fires during --draft-mtp speculative decoding, so its tensors get no imatrix activation data. We work around this by quantizing the MTP head with static K-quants or Q8_0, which don't require an imatrix.
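Static quantization needs only the weights themselves, which is why it works for tensors with no calibration data. The sketch below shows the idea behind Q8_0: blocks of 32 weights share one scale, and each weight is rounded to a signed 8-bit integer. It is a simplified pure-Python illustration, not ggml's code: the real format stores the scale as fp16, and the function names here are hypothetical.

```python
from typing import List, Tuple

BLOCK_SIZE = 32  # weights per Q8_0 block
# One 16-bit scale plus 32 int8 values, spread over 32 weights.
BITS_PER_WEIGHT = (16 + 8 * BLOCK_SIZE) / BLOCK_SIZE


def quantize_q8_0_block(weights: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block to (scale, int8 values) using only the weights."""
    if len(weights) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} weights")
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * BLOCK_SIZE
    # Python's round() breaks ties to even; close enough for a sketch.
    return scale, [round(w / scale) for w in weights]


def dequantize_q8_0_block(scale: float, quants: List[int]) -> List[float]:
    """Reconstruct approximate weights from a quantized block."""
    return [scale * q for q in quants]


if __name__ == "__main__":
    block = [i / 31 - 0.5 for i in range(BLOCK_SIZE)]
    scale, quants = quantize_q8_0_block(block)
    restored = dequantize_q8_0_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"bits/weight={BITS_PER_WEIGHT}, max abs error={worst:.6f}")
```

No activation statistics appear anywhere in this procedure, in contrast to imatrix-guided quantization, which weights the rounding error by how strongly each input channel is used.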

(A patch to llama-imatrix that records MTP activations during collection is in progress at mudler/llama.cpp#mtp-imatrix. Once it lands upstream, it will let us push the drafter to lower bit-widths cleanly.)

What is APEX?

APEX is a MoE-aware mixed-precision quantization strategy. Precision follows a per-tensor-role gradient: routed experts are compressed hardest, shared experts are kept at high precision (they are always active), and attention/Mamba tensors are quantized uniformly. A 5+5 symmetric edge gradient is applied across the 40 trunk layers, and the MTP layer (layer 40) is kept at edge precision. I-variants use diverse imatrix calibration data (chat, code, reasoning, tool-calling, agentic traces, Wikipedia).

See the APEX project for full details.
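One possible reading of the "5+5 symmetric edge gradient" is that the first five and last five trunk layers get progressively more precision the closer they sit to either end. The sketch below encodes only that reading. The `edge_level` function and its 0-to-5 scale are hypothetical, and the mapping from a level to an actual quant type is not given here; the APEX project documents the real scheme.

```python
TRUNK_LAYERS = 40  # trunk layers blk.0..blk.39
EDGE_WIDTH = 5     # layers treated as "edge" on each side (assumed reading)


def edge_level(
    layer: int,
    trunk_layers: int = TRUNK_LAYERS,
    edge_width: int = EDGE_WIDTH,
) -> int:
    """Return a hypothetical precision boost: edge_width at the ends, 0 mid-trunk."""
    if layer < 0 or layer > trunk_layers:
        raise ValueError("layer index out of range")
    if layer == trunk_layers:
        # The MTP block (blk.40) is described as kept at edge precision.
        return edge_width
    distance_to_edge = min(layer, trunk_layers - 1 - layer)
    return max(0, edge_width - distance_to_edge)


if __name__ == "__main__":
    # Print the level for every trunk layer plus the MTP layer.
    print([edge_level(i) for i in range(TRUNK_LAYERS + 1)])
```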

Architecture

  • Base: Qwen 3.6 35B-A3B family (Qwen3_5MoeForCausalLM)
  • Layers: 40 trunk + 1 MTP (bundled)
  • Experts: 256 routed + 1 shared (8 active per token)
  • Hidden size: 2048
  • Calibration: v1.3 diverse dataset
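The expert counts above hint at why routed experts tolerate the heaviest compression while the shared expert is kept high. The arithmetic below assumes "8 active per token" means 8 of the 256 routed experts fire per token in each MoE layer, with the shared expert always on; the variable names are illustrative.

```python
ROUTED_EXPERTS = 256  # routed experts per MoE layer
ACTIVE_ROUTED = 8     # routed experts selected per token (assumed reading)
SHARED_EXPERTS = 1    # shared expert, active for every token

# Share of routed experts that a single token touches in one layer.
routed_fraction = ACTIVE_ROUTED / ROUTED_EXPERTS

# Share of all experts (routed + shared) evaluated for a single token.
active_fraction = (ACTIVE_ROUTED + SHARED_EXPERTS) / (ROUTED_EXPERTS + SHARED_EXPERTS)

if __name__ == "__main__":
    print(f"routed experts used per token: {routed_fraction:.3%}")
    print(f"all experts used per token:    {active_fraction:.3%}")
```

Any single routed expert contributes to only a small share of tokens, whereas the shared expert affects every token, so quantization error in the shared expert is felt everywhere.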

Credits

  • APEX quantization: LocalAI team
  • MTP support: llama.cpp PR #22673 by Aman Gupta + ggerganov
  • Built on llama.cpp

GGUF

  • Model size: 36B params
  • Architecture: qwen35moe