Paper • 2510.13999 • Published • 20
Qwen3.6-28B-REAP20-A3B — GGUF Quantizations
GGUF quantizations of 0xSero/Qwen3.6-28B-REAP20-A3B, a 20% expert-pruned variant of Qwen/Qwen3.6-35B-A3B using the REAP (Router-weighted Expert Activation Pruning) method.
Available Files
| File | Quant | Size | BPW | Description |
|---|---|---|---|---|
Qwen3.6-28B-REAP20-A3B-BF16.gguf |
BF16 | ~56.5 GB | 16.0 | Full precision, for re-quantization |
Qwen3.6-28B-REAP20-A3B-Q8_0.gguf |
Q8_0 | ~30 GB | 8.0 | Near-lossless, large file |
Qwen3.6-28B-REAP20-A3B-Q6_K.gguf |
Q6_K | ~23 GB | 6.56 | Near-lossless, recommended for high quality |
Qwen3.6-28B-REAP20-A3B-Q5_K_M.gguf |
Q5_K_M | ~20 GB | 5.68 | High quality, larger size |
Qwen3.6-28B-REAP20-A3B-Q5_K_S.gguf |
Q5_K_S | ~19 GB | 5.52 | High quality, slightly smaller |
Qwen3.6-28B-REAP20-A3B-Q4_K_M.gguf |
Q4_K_M | ~17 GB | 4.89 | Recommended — best quality/size balance |
Qwen3.6-28B-REAP20-A3B-Q4_K_S.gguf |
Q4_K_S | ~16 GB | 4.63 | 4-bit small |
Qwen3.6-28B-REAP20-A3B-Q3_K_L.gguf |
Q3_K_L | ~15 GB | 4.27 | 3-bit large |
Qwen3.6-28B-REAP20-A3B-Q3_K_M.gguf |
Q3_K_M | ~14 GB | 3.91 | 3-bit medium |
Qwen3.6-28B-REAP20-A3B-Q3_K_S.gguf |
Q3_K_S | ~13 GB | 3.66 | 3-bit small |
Qwen3.6-28B-REAP20-A3B-IQ3_XXS.gguf |
IQ3_XXS | ~12 GB | 3.06 | Ultra-small, imatrix-based |
Qwen3.6-28B-REAP20-A3B-Q2_K.gguf |
Q2_K | ~11 GB | 2.96 | Smallest size, lowest quality |
Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.6 MoE (hybrid Gated DeltaNet + MoE) |
| Parameters | ~28B total / ~3B active per token |
| Experts | 205 total / 8 active per token (pruned from 256) |
| Context Length | 262,144 tokens |
| Original dtype | BF16 |
| Quantization source | BF16 GGUF from 0xSero/Qwen3.6-28B-REAP20-A3B-GGUF |
| Quantization tool | llama.cpp |
| imatrix | Used for IQ3_XXS (from source repo) |
| License | Apache 2.0 |
Quantization Process
# 1. Download BF16 GGUF from source
huggingface-cli download 0xSero/Qwen3.6-28B-REAP20-A3B-GGUF \
--include "model.bf16.gguf" --local-dir ./
# 2. Download imatrix (for IQ quants)
huggingface-cli download 0xSero/Qwen3.6-28B-REAP20-A3B-GGUF \
--include "imatrix.dat" --local-dir ./
# 3. Quantize (example: Q4_K_M)
llama-quantize model.bf16.gguf Qwen3.6-28B-REAP20-A3B-Q4_K_M.gguf Q4_K_M
# 4. Quantize with imatrix (example: IQ3_XXS)
llama-quantize --imatrix imatrix.dat model.bf16.gguf \
Qwen3.6-28B-REAP20-A3B-IQ3_XXS.gguf IQ3_XXS
Usage
llama.cpp
llama-cli \
-m Qwen3.6-28B-REAP20-A3B-Q4_K_M.gguf \
-ngl 99 -c 4096 \
-p "Your prompt here"
llama-server (OpenAI-compatible API)
llama-server \
-m Qwen3.6-28B-REAP20-A3B-Q4_K_M.gguf \
-ngl 99 -c 4096 \
--port 8080
LM Studio / Jan / Ollama
Download the .gguf file and load it directly in your preferred local inference UI.
Hardware Requirements
| Config | VRAM / RAM |
|---|---|
| Full GPU (Q4_K_M, recommended) | 20+ GB VRAM |
| Hybrid CPU+GPU (Q4_K_M) | 10 GB VRAM + 10 GB RAM |
| CPU only (Q4_K_M) | 24+ GB RAM |
About the Original Model
0xSero/Qwen3.6-28B-REAP20-A3B applies REAP expert pruning (arXiv:2510.13999) to remove 20% of MoE experts (51 of 256 per layer) from Qwen3.6-35B-A3B, while preserving routing behavior via router weight renormalization. Active parameters per token remain unchanged at ~3B. The result is a ~25% smaller model with competitive generation quality across coding, reasoning, and knowledge benchmarks.
License
Apache 2.0 — see Qwen License.
- Downloads last month
- 12,861
GGUF
Model size
28B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware
2-bit
3-bit
4-bit
5-bit
6-bit
8-bit
16-bit
