Qwen3.6-27B Hybrid-Optimized Quantization for 16 GB of VRAM
At ggufbench.com, we are always looking to advance local AI on average consumer hardware. We would love it if you took a minute to submit your performance results and launch arguments for this model, and others, on our website!
Quick Specs
- File Size: 12.576 GiB
- Avg Bits/Weight: 4.01 bpw
- Target VRAM: 16 GB GPUs
Architecture-Aware Quant Strategy
Qwen3.6-27B is a hybrid Mamba/Transformer model. Not all layers serve the same purpose, and not all tensors tolerate quantization equally. This layout respects the architecture by:
- Protecting pure-attention layers (
blk.3,7,11...63) with higher precision for global reasoning and long-range focus. - Compressing SSM-dominated hybrid layers aggressively where the recurrent state carries the sequential load.
- Preserving critical routing & projection tensors at native or near-native precision to prevent error compounding.
- Downgrading resilient tensors (embeddings, FFN gate/up) where KLD sensitivity is flat and quality loss is imperceptible.
Benchmark Summary (WikiText-2, 580 chunks)
| Metric | This | sokann (4.256 bpw) | bartowski Q3_K_M | mradermacher i1.IQ4_XS | bartowski IQ4_XS |
|---|---|---|---|---|---|
| Size (BPW) | 4.01 | 4.256 | 4.270 | 4.483 | 4.556 |
| Size (GiB) | 12.576 | 13.327 | 13.370 | 14.036 | 14.266 |
| Mean PPL(Q) | 7.128552 ± 0.046785 | 7.098696 ± 0.047344 | 6.993009 ± 0.046208 | 7.020660 ± 0.046587 | 6.996323 ± 0.046332 |
| Mean PPL(base) | 6.900925 ± 0.045382 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 | 6.908506 ± 0.045543 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 98.76% | 99.19% | 98.52% | 99.30% | 99.32% |
| Mean KLD | 0.049767 ± 0.000745 | 0.033452 ± 0.000723 | 0.058818 ± 0.000881 | 0.046348 ± 0.000841 | 0.026270 ± 0.000653 |
| Maximum KLD | 22.599598 | 23.255085 | 24.616274 | 24.175169 | 22.992002 |
| 99.9% KLD | 3.240748 | 2.907350 | 3.986622 | 3.614290 | 2.385293 |
| RMS Δp | 6.167 ± 0.054 % | 4.936 ± 0.054 % | 6.690 ± 0.059 % | 5.867 ± 0.060 % | 4.352 ± 0.057 % |
| Same top p | 91.146 ± 0.074 % | 92.427 ± 0.069 % | 90.350 ± 0.077 % | 93.903 ± 0.062 % | 93.888 ± 0.062 % |
- Efficiency-First Parity: Achieves competitive quality at ~6% smaller size — PPL(Q) within 0.4% of
sokann(4.256 bpw) and on par withbartowski Q3_K_Mon KLD, all while saving ~750 MB of VRAM for larger context or higher batch sizes.
Acknowledgments
- Special thanks to unsloth for their 9 TB of Qwen3.5 GGUF Benchmarks, which were instrumental in selecting the optimal quantization strategy for this model.
- Thanks to bartowski for providing the calibration data used in this process.
- Downloads last month
- 3,222
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware
We're not able to determine the quantization variants.
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support
Model tree for ggufbench/Qwen3.6-27B-4bpw-16GB-VRAM
Base model
Qwen/Qwen3.6-27B