Qwen3.5-0.8B-MTP-GGUF
Qwen3.5-0.8B, from Alibaba's Qwen team, is the smallest model in the Qwen3.5 family: an ultra-compact 0.8B-parameter dense multimodal language model for unified text and image understanding. It uses a hybrid Gated DeltaNet + sparse MoE architecture with 24 layers, a hidden dimension of 1024, a 248K-token vocabulary spanning 201 languages, and multi-token prediction. The native context window is 262K tokens, extensible to 1M+ tokens via YaRN.
Designed under the "More Size, Less Waste" philosophy, it scores 10.5 on the Artificial Analysis Intelligence Index, which ranks #383 overall but is exceptional for a sub-1B model. Its reported time-to-first-token is 0.00 s (ranked #12 globally), and it needs roughly 1.6 GB of VRAM in BF16 or about 0.5 GB with 4-bit quantization, which makes it a good fit for Raspberry Pi boards, mobile phones, and embedded IoT devices.
It is released under the Apache 2.0 license and is supported by Ollama, vLLM, and llama.cpp. It is best suited to lightweight OCR, document parsing, multilingual chatbots, visual QA, and basic coding tasks, and it serves as an accessible entry point for on-device multimodal AI with no cloud dependency.
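The memory figures quoted above follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope estimate only: the helper name `weight_memory_gb` is made up for illustration, it uses the nominal 0.8B parameter count, and it counts weights alone. Real usage runs higher because of the KV cache, activations, and the fact that 4-bit K-quants average somewhat more than 4 bits per weight, which is consistent with the ~0.5 GB figure above.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Nominal parameter count taken from the model name (an approximation).
NUM_PARAMS = 0.8e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 2 bytes per weight
int4_gb = weight_memory_gb(NUM_PARAMS, 4)   # 0.5 bytes per weight

print(f"BF16: {bf16_gb:.2f} GB, 4-bit: {int4_gb:.2f} GB")
# prints: BF16: 1.60 GB, 4-bit: 0.40 GB
```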
Multi-Token Prediction (MTP) GGUF files extend the standard GGUF model format with the weights needed for built-in speculative decoding, which can significantly accelerate local inference. Traditional speculative decoding requires a separate, smaller "draft" model. MTP GGUF files instead include additional output heads within the main model architecture that predict multiple future tokens in a single forward pass.
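To make the draft-and-verify idea concrete, here is a minimal toy sketch of one greedy speculative decoding step. It is not how llama.cpp or any particular runtime implements MTP: `speculative_step` and `toy_target_next` are hypothetical names, the "model" is a trivial counting function, and the draft tokens stand in for what the extra MTP heads would guess. A real runtime verifies all guesses in one batched forward pass; this sketch checks them one by one for readability.

```python
from typing import Callable, List, Sequence

Token = int


def speculative_step(
    context: List[Token],
    draft_tokens: Sequence[Token],
    target_next: Callable[[Sequence[Token]], Token],
) -> List[Token]:
    """Keep the longest draft prefix the target agrees with,
    then append one token chosen by the target itself."""
    accepted: List[Token] = []
    for guess in draft_tokens:
        expected = target_next(context + accepted)
        if guess != expected:
            # First mismatch: discard the rest of the draft and
            # use the target's own token instead.
            accepted.append(expected)
            return accepted
        accepted.append(guess)
    # Every guess matched, so the target contributes one bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target_next(tokens: Sequence[Token]) -> Token:
    """Toy stand-in for the main model: always counts up modulo 10."""
    return (tokens[-1] + 1) % 10


# Two correct guesses followed by a wrong one.
print(speculative_step([3], [4, 5, 9], toy_target_next))  # prints: [4, 5, 6]
```

The output always matches what plain greedy decoding with the target would produce. The speedup comes only from accepting several tokens per verification step when the guesses are right.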
Model Files
| File Name | Quant Type | File Size |
|---|---|---|
| Qwen3.5-0.8B.BF16.gguf | BF16 | 1.56 GB |
| Qwen3.5-0.8B.F16.gguf | F16 | 1.56 GB |
| Qwen3.5-0.8B.Q2_K.gguf | Q2_K | 430 MB |
| Qwen3.5-0.8B.Q3_K_L.gguf | Q3_K_L | 502 MB |
| Qwen3.5-0.8B.Q3_K_M.gguf | Q3_K_M | 476 MB |
| Qwen3.5-0.8B.Q3_K_S.gguf | Q3_K_S | 444 MB |
| Qwen3.5-0.8B.Q4_0.gguf | Q4_0 | 513 MB |
| Qwen3.5-0.8B.Q4_K_M.gguf | Q4_K_M | 542 MB |
| Qwen3.5-0.8B.Q4_K_S.gguf | Q4_K_S | 517 MB |
| Qwen3.5-0.8B.Q5_0.gguf | Q5_0 | 578 MB |
| Qwen3.5-0.8B.Q5_K_M.gguf | Q5_K_M | 593 MB |
| Qwen3.5-0.8B.Q5_K_S.gguf | Q5_K_S | 578 MB |
| Qwen3.5-0.8B.Q6_K.gguf | Q6_K | 647 MB |
| Qwen3.5-0.8B.Q8_0.gguf | Q8_0 | 834 MB |
| Qwen3.5-0.8B.mmproj-bf16.gguf | mmproj-bf16 | 207 MB |
| Qwen3.5-0.8B.mmproj-f16.gguf | mmproj-f16 | 207 MB |
| Qwen3.5-0.8B.mmproj-q8_0.gguf | mmproj-q8_0 | 116 MB |
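One simple way to use the table is to pick the largest quant that fits a given memory budget. The sketch below is only an illustration: `largest_quant_within` and the `headroom_mb` parameter are hypothetical, the sizes are copied from the table (with 1.56 GB taken as 1560 MB), and file size is only a rough proxy for runtime memory. If image input is needed, the size of an mmproj file has to be budgeted on top.

```python
from typing import Dict, Optional

# File sizes in MB, copied from the table above (main model files only).
QUANT_SIZES_MB: Dict[str, int] = {
    "Q2_K": 430, "Q3_K_S": 444, "Q3_K_M": 476, "Q3_K_L": 502,
    "Q4_0": 513, "Q4_K_S": 517, "Q4_K_M": 542, "Q5_0": 578,
    "Q5_K_S": 578, "Q5_K_M": 593, "Q6_K": 647, "Q8_0": 834,
    "F16": 1560, "BF16": 1560,
}


def largest_quant_within(budget_mb: int, headroom_mb: int = 0) -> Optional[str]:
    """Return the largest listed quant whose file fits in the budget
    after reserving headroom_mb for context and other overhead."""
    limit = budget_mb - headroom_mb
    fitting = [(size, name) for name, size in QUANT_SIZES_MB.items() if size <= limit]
    if not fitting:
        return None
    # max() compares by size first; equal sizes are broken by name.
    return max(fitting)[1]


print(largest_quant_within(600))                   # prints: Q5_K_M
print(largest_quant_within(600, headroom_mb=100))  # prints: Q3_K_M
```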
Quants Usage
(A larger file is not necessarily higher quality. IQ-quants are often preferable to similar-sized non-IQ quants.)
Graph by ikawrakow comparing some lower-quality quant types (lower is better).
