Qwen3.5-9B-MTP-GGUF
Qwen3.5-9B, from Alibaba's Qwen team, is a 9B-parameter dense multimodal language model. It uses a hybrid Gated DeltaNet + Gated Attention architecture with a 262K-token native context window (extensible to 1M+ tokens via RoPE scaling), a 248K-token vocabulary supporting 201 languages, and early-fusion training for unified text, image, and video understanding. It reports state-of-the-art performance across modalities, including 89.2% on OCRBench, 84.5% on VideoMME, 78.9% on MathVision, and 70.1% on MMMU-Pro. It also delivers production-level agentic capabilities, including native tool calling (66.1% on BFCL-V4 and 79.1% on TAU2-Bench), plus a toggleable thinking mode for step-by-step reasoning on complex tasks. The model is Apache 2.0-licensed and optimized for deployment with vLLM, SGLang, llama.cpp, and Ollama (about 18 GB of VRAM in BF16, about 5 GB at 4-bit). The instruction-tuned variant excels at repository-level coding, frontend development, document/PDF parsing, visual question answering, and multilingual chatbots, making it a scalable foundation for edge-to-server multimodal agents.
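The VRAM figures follow from simple arithmetic: weight memory is roughly the parameter count times the bits per weight, divided by 8. The sketch below is a minimal illustration that assumes the nominal 9 × 10⁹ parameters (not the exact count) and decimal gigabytes. It counts weights only and ignores the KV cache, activations, and the vision projector.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal GB (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 9e9  # assumption: the nominal 9B parameter count, not the exact figure

print(f"BF16 (16 bits/weight): ~{weight_memory_gb(N_PARAMS, 16):.1f} GB")  # ~18.0 GB
print(f"4-bit (4 bits/weight): ~{weight_memory_gb(N_PARAMS, 4):.1f} GB")   # ~4.5 GB
```

Real 4-bit GGUF files come out larger than this estimate (Q4_K_M is 5.78 GB in the table below). They store per-block scale factors and typically keep some tensors at higher precision, so the effective bits per weight exceed 4.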
Multi-Token Prediction (MTP) GGUF is a specialized variant of the GGUF model file format that builds speculative decoding into the model itself to accelerate local inference. Traditional speculative decoding requires a separate, smaller "draft" model. MTP GGUF files instead include additional output heads within the main model architecture that predict multiple future tokens in a single forward pass. The main model then verifies these drafted tokens and accepts those that agree with its own predictions, so one decoding step can emit several tokens.
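The Python sketch below shows the idea of drafting several tokens and checking them against the main model. It is a generic greedy speculative-decoding step, not llama.cpp's actual MTP implementation. `draft_heads` and `main_next_token` are hypothetical stand-ins for the MTP heads and the main model. For readability, verification runs token by token here, whereas a real runtime checks all drafted tokens in one batched forward pass.

```python
from typing import Callable, List

Token = int


def speculative_step(
    context: List[Token],
    draft_heads: Callable[[List[Token], int], List[Token]],
    main_next_token: Callable[[List[Token]], Token],
    k: int,
) -> List[Token]:
    """One greedy draft-and-verify step; returns the tokens accepted (at least one)."""
    draft = draft_heads(context, k)  # k guesses from a single pass of the draft heads
    accepted: List[Token] = []
    for tok in draft:
        target = main_next_token(context + accepted)  # main model's own choice
        if tok != target:
            accepted.append(target)  # first mismatch: take the main model's token, stop
            return accepted
        accepted.append(tok)  # draft agrees with the main model: keep it
    # Every draft was accepted, so append the main model's next prediction as a bonus token.
    accepted.append(main_next_token(context + accepted))
    return accepted


# Toy "main model": always continues an upward count.
def main_next_token(ctx: List[Token]) -> Token:
    return ctx[-1] + 1


# Toy draft heads: correct for the first two positions, wrong afterwards.
def draft_heads(ctx: List[Token], k: int) -> List[Token]:
    return [ctx[-1] + i if i <= 2 else 0 for i in range(1, k + 1)]


print(speculative_step([1, 2, 3], draft_heads, main_next_token, k=4))  # [4, 5, 6]
```

With greedy decoding, the output is identical to running the main model alone. The speed-up depends on how many drafted tokens are accepted per step.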
Model Files
| File Name | Quant Type | File Size | File Link |
|---|---|---|---|
| Qwen3.5-9B.BF16.gguf | BF16 | 18.4 GB | Download |
| Qwen3.5-9B.F16.gguf | F16 | 18.4 GB | Download |
| Qwen3.5-9B.Q2_K.gguf | Q2_K | 3.91 GB | Download |
| Qwen3.5-9B.Q3_K_L.gguf | Q3_K_L | 5.05 GB | Download |
| Qwen3.5-9B.Q3_K_M.gguf | Q3_K_M | 4.74 GB | Download |
| Qwen3.5-9B.Q3_K_S.gguf | Q3_K_S | 4.36 GB | Download |
| Qwen3.5-9B.Q4_0.gguf | Q4_0 | 5.45 GB | Download |
| Qwen3.5-9B.Q4_K_M.gguf | Q4_K_M | 5.78 GB | Download |
| Qwen3.5-9B.Q4_K_S.gguf | Q4_K_S | 5.49 GB | Download |
| Qwen3.5-9B.Q5_0.gguf | Q5_0 | 6.47 GB | Download |
| Qwen3.5-9B.Q5_K_M.gguf | Q5_K_M | 6.64 GB | Download |
| Qwen3.5-9B.Q5_K_S.gguf | Q5_K_S | 6.47 GB | Download |
| Qwen3.5-9B.Q6_K.gguf | Q6_K | 7.56 GB | Download |
| Qwen3.5-9B.Q8_0.gguf | Q8_0 | 9.79 GB | Download |
| Qwen3.5-9B.mmproj-bf16.gguf | mmproj-bf16 | 922 MB | Download |
| Qwen3.5-9B.mmproj-f16.gguf | mmproj-f16 | 922 MB | Download |
| Qwen3.5-9B.mmproj-q8_0.gguf | mmproj-q8_0 | 624 MB | Download |
Quants Usage
(File size is not a strict quality ranking: IQ quants, for example, are often preferable to similarly sized non-IQ quants.)
Graph by ikawrakow comparing some lower-quality quant types (lower is better).
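The helper below sorts the quants by size and picks the largest one that fits a memory budget. It is a minimal sketch: the sizes come from the Model Files table, while the `pick_quant` name and its 1.5 GB `headroom_gb` default are hypothetical. It treats file size as a proxy for weight memory, and a bigger file is not automatically the better choice.

```python
from typing import Optional

# File sizes in GB from the Model Files table (mmproj vision files excluded).
QUANT_SIZES_GB = {
    "Q2_K": 3.91, "Q3_K_S": 4.36, "Q3_K_M": 4.74, "Q3_K_L": 5.05,
    "Q4_0": 5.45, "Q4_K_S": 5.49, "Q4_K_M": 5.78, "Q5_0": 6.47,
    "Q5_K_S": 6.47, "Q5_K_M": 6.64, "Q6_K": 7.56, "Q8_0": 9.79,
    "F16": 18.4, "BF16": 18.4,
}


def pick_quant(budget_gb: float, headroom_gb: float = 1.5) -> Optional[str]:
    """Largest quant whose file fits in budget_gb after reserving headroom_gb.

    headroom_gb is a hypothetical allowance for the KV cache, compute buffers
    and, for image/video input, an mmproj file; adjust it for your setup.
    """
    usable = budget_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= usable]
    # max() over (size, name) tuples picks the largest size; equal sizes fall back to name order.
    return max(fitting)[1] if fitting else None


# List the quants from smallest to largest file.
for name, size in sorted(QUANT_SIZES_GB.items(), key=lambda kv: kv[1]):
    print(f"{name:7} {size:6.2f} GB")

print(pick_quant(7.5))   # Q4_K_M
print(pick_quant(12.0))  # Q8_0
```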
