
Qwen3.5-264B-W4A16

W4A16 (4-bit weights, 16-bit activations) quantization of 0xSero/Qwen3.5-264B, the REAP-pruned variant of Qwen/Qwen3.5-397B-A17B.

At a glance


Base model	Qwen/Qwen3.5-397B-A17B
Format	W4A16
Total params	264B
Active / token	—
Experts / layer	336 (10 active per token)
Layers	60
Hidden size	—
Context	—
On-disk size	282 GB

Which variant should I pick?

Variant	Format	Link
`Qwen3.5-264B`	BF16	link
`Qwen3.5-264B-FP8`	FP8	link
`Qwen3.5-264B-W4A16` (this)	W4A16	link
`Qwen3.5-28B`	BF16	link
`Qwen3.5-35B-EXL3-4bpw`	EXL3-4bpw	link
`Qwen3.5-76B`	BF16	link
`Qwen3.5-76B-GGUF`	GGUF	link
`Qwen3.5-88B`	BF16	link
`Qwen3.5-99B`	BF16	link
`Qwen3.5-99B-GGUF`	GGUF	link

Repository: 0xSero/Qwen3.5-264B-W4A16
Base model: Qwen/Qwen3.5-397B-A17B
Artifact kind: quantized
Compression ratio: 34%
Prune metric: reap
Quantization scheme: W4A16
Quantization format: auto_round:auto_gptq
Parent artifact: 0xSero/Qwen3.5-264B

Details

Maintainer: 0xSero
Organization: Sybil Solutions
Project: REAP PR17
Hub owner: 0xSero
Summary: AutoRound W4A16 GPTQ quantization of Qwen3.5-264B-REAP with vision encoder transplanted from the 262B variant.

Architecture

Hybrid MoE + Linear Attention (GDN/Mamba-style):

60 layers with mixed linear_attention and full_attention layer types
336 experts, 10 active per token
Vision encoder: ViT with 27 blocks, 1152 hidden size, spatial merge, transplanted from atbender/Qwen3.5-REAP-262B-A17B-W4A16
Composite multimodal format: Qwen3_5MoeForConditionalGeneration architecture
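The layer mix and expert counts listed above can be checked against a downloaded config.json. The following is a minimal sketch, not the loader's own logic. It assumes the text-model settings sit under a `text_config` key and use the field names `layer_types`, `num_experts`, and `num_experts_per_tok`; these names are assumptions and should be checked against the actual file. The `example` dict is a made-up four-layer stand-in, not this model's real layer pattern.

```python
from collections import Counter


def summarize_architecture(config: dict) -> dict:
    """Summarize layer types and expert counts from a config.json-style dict."""
    # Assumption: composite multimodal configs nest the language-model
    # settings under "text_config"; otherwise use the top level.
    text_cfg = config.get("text_config", config)
    layer_types = text_cfg.get("layer_types", [])
    return {
        "num_layers": len(layer_types),
        "layer_type_counts": dict(Counter(layer_types)),
        "num_experts": text_cfg.get("num_experts"),
        "experts_per_token": text_cfg.get("num_experts_per_tok"),
    }


# Illustrative stand-in only: four layers instead of the real 60.
example = {
    "text_config": {
        "layer_types": ["linear_attention"] * 3 + ["full_attention"],
        "num_experts": 336,
        "num_experts_per_tok": 10,
    }
}
summary = summarize_architecture(example)
print(summary)
```

To inspect the real checkpoint, load config.json with `json.load` and pass the resulting dict to `summarize_architecture`.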

Vision Encoder

The vision encoder (visual-encoder.safetensors, 870 MB, 333 tensor keys) was transplanted from the 262B variant, atbender/Qwen3.5-REAP-262B-A17B-W4A16. The original 264B model was text-only; the vision weights come from the same Qwen3.5 architecture family and are fully compatible. Image inputs use the standard OpenAI image_url content format.
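One way to confirm the tensor-key count of the vision file is to read the safetensors header directly, without loading any weights. This sketch relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header) and the standard library. The function name `list_safetensors_keys` is illustrative.

```python
import json
import struct


def list_safetensors_keys(path: str) -> list[str]:
    """Return the tensor names stored in a .safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(name for name in header if name != "__metadata__")
```

Calling `len(list_safetensors_keys("visual-encoder.safetensors"))` on the downloaded file should report the 333 keys mentioned above if the file matches this description.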

Provenance

Observer state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-observer-state.raw.pt
Detail state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-detail-state.raw.pt

Benchmarks

Evaluated on 8x RTX 3090 (24 GB each) with vLLM, TP=8, expert parallel, fp8 KV cache.

Benchmark	Samples	Score
HumanEval (coding)	50	100%
MATH-500 (competition math)	54	89%
Reasoning & Logic	2	100%
Terminal/CLI	2	100%
SWE (bug fixing)	2	100%
Cybersecurity	2	100%
Philosophy	2	100%
MMLU (general knowledge)	2	100%

Generation speed: ~62 tokens/s at batch_size=1.

Serving with vLLM

Requirements

Python 3.12
CUDA 12.8
8x GPU with 24+ GB each (tested on RTX 3090)

Exact working dependency versions

vllm==0.19.0
torch==2.10.0+cu128
transformers==4.57.6
flashinfer-python==0.6.6
flashinfer-cubin==0.6.6
quack-kernels==0.3.10
nvidia-cutlass-dsl==4.4.2
nvidia-cutlass-dsl-libs-base==4.4.2
triton==3.6.0
xgrammar==0.1.33
conch-triton-kernels==1.3

Installation

uv venv vllm-env --python 3.12
uv pip install --python vllm-env/bin/python3 'vllm==0.19.0' conch-triton-kernels
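After installation, the environment can be compared against the pins listed under "Exact working dependency versions". This is a small sketch using `importlib.metadata` from the standard library; the helper names `installed_version` and `find_mismatches` are illustrative. It assumes the reported version strings match the pins exactly, including local suffixes such as `+cu128`.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the dependency list in this card.
PINS = {
    "vllm": "0.19.0",
    "torch": "2.10.0+cu128",
    "transformers": "4.57.6",
    "flashinfer-python": "0.6.6",
    "flashinfer-cubin": "0.6.6",
    "quack-kernels": "0.3.10",
    "nvidia-cutlass-dsl": "4.4.2",
    "nvidia-cutlass-dsl-libs-base": "4.4.2",
    "triton": "3.6.0",
    "xgrammar": "0.1.33",
    "conch-triton-kernels": "1.3",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each package whose version differs to (expected, found)."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINS).items():
        print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

Run it with the environment's interpreter (vllm-env/bin/python3) so that the lookup sees the packages installed there. No output means every pin matched.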

Tokenizer fix

The tokenizer_config.json shipped with this model sets "tokenizer_class": "Qwen2Tokenizer". If you encounter tokenizer errors, check this field in your local copy and reset it if it differs:

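import json

# Run from the local model directory that contains tokenizer_config.json.
with open("tokenizer_config.json", encoding="utf-8") as f:
    cfg = json.load(f)

# Restore the tokenizer class this checkpoint expects.
cfg["tokenizer_class"] = "Qwen2Tokenizer"

# ensure_ascii=False keeps any non-ASCII tokens readable in the file.
with open("tokenizer_config.json", "w", encoding="utf-8") as f:
    json.dump(cfg, f, indent=2, ensure_ascii=False)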

Launch command

vllm serve 0xSero/Qwen3.5-264B-W4A16 \
    --tensor-parallel-size 8 \
    --enable-expert-parallel \
    --enable-prefix-caching \
    --max-model-len 262144 \
    --max-num-seqs 4 \
    --gpu-memory-utilization 0.9 \
    --kv-cache-dtype fp8_e4m3 \
    --dtype bfloat16 \
    --trust-remote-code \
    --reasoning-parser qwen3 \
    --tool-call-parser qwen3_coder \
    --enable-auto-tool-choice \
    --served-model-name qwen35-264b
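Loading a checkpoint of this size across eight GPUs takes a while, so client scripts may want to wait until the server is ready. The sketch below polls the `/health` endpoint of the vLLM OpenAI-compatible server using only the standard library. The function name, default port 8000, and the timeout values are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url="http://localhost:8000", timeout_s=1800.0, interval_s=5.0):
    """Poll GET {base_url}/health until it returns HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is not accepting connections yet, or returned an error status.
            pass
        time.sleep(interval_s)
    return False
```

A script would typically call `wait_for_server()` once after starting `vllm serve` and exit with an error if it returns False.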

Known issues

Mamba cache align mode: vLLM auto-enables experimental Mamba cache "align" mode when prefix caching is on. vLLM 0.19.0 includes a fix for Mamba state corruption (PR #37728) that improves stability. If you experience hangs after sustained usage on 0.18.x, upgrade to 0.19.0.
PCIe riser instability: On systems with PCIe risers (e.g., mining rigs repurposed for ML), sustained multi-GPU NCCL traffic can cause AER errors. Mask AER with setpci -s <addr> ECAP_AER+0x08.l=0xFFFFFFFF on affected slots.
CUDA graph memory: If CUDA graph capture fails, add --max-cudagraph-capture-size 256 or --enforce-eager.

Usage

Text generation

from openai import OpenAI

# The API key is not checked by a local vLLM server unless one is configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "Solve: what is the integral of x^2 * e^x dx?"}],
    max_tokens=8192,
)
print(response.choices[0].message.content)
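At the ~62 tokens/s reported above, a reply that uses the full max_tokens=8192 budget would take roughly 8192 / 62 ≈ 132 seconds, so streaming is often more pleasant for interactive use. The following sketch wraps the standard `stream=True` option of the OpenAI client; the helper name `stream_chat` is illustrative, and it only collects the visible `content` fragments.

```python
def stream_chat(client, prompt, model="qwen35-264b", max_tokens=8192):
    """Yield text fragments from a streamed chat completion."""
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices (for example, usage-only chunks).
        if not chunk.choices:
            continue
        text = getattr(chunk.choices[0].delta, "content", None)
        if text:
            yield text


# Usage, with `client` created as in the example above:
#   for piece in stream_chat(client, "Explain REAP pruning in two sentences."):
#       print(piece, end="", flush=True)
```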

Vision

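import base64

# Reuses `client` from the text generation example.
# "image.png" is a placeholder path; point it at your own image.
with open("image.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)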
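The vision example hardcodes the image/png MIME type in the data URL. For other image formats, a small helper can pick the MIME type from the file extension. This is a sketch using the standard library only; the name `image_to_data_url` is illustrative, and which formats the server accepts is not covered here.

```python
import base64
import mimetypes


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a data URL for the image_url content type."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed directly as the "url" value inside the image_url entry.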

Tool calling

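# Reuses `client` from the text generation example.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
}]
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# Parsed tool calls, if the model chose to call a tool.
print(response.choices[0].message.tool_calls)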
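The request above only asks the model for a tool call; the application still has to run the tool and send the result back. Below is a minimal sketch of that step. The `get_weather` implementation is a hypothetical stub, and `LOCAL_TOOLS` and `run_tool_calls` are illustrative names, not part of any library.

```python
import json


def get_weather(city: str) -> str:
    # Hypothetical stub; a real implementation would query a weather service.
    return f"No live data available for {city}"


# Maps tool names advertised to the model onto local Python callables.
LOCAL_TOOLS = {"get_weather": get_weather}


def run_tool_calls(message, registry=LOCAL_TOOLS):
    """Execute each requested tool call and build the 'tool' role replies."""
    replies = []
    for call in message.tool_calls or []:
        func = registry[call.function.name]
        # Arguments arrive as a JSON-encoded string.
        args = json.loads(call.function.arguments or "{}")
        replies.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": str(func(**args)),
        })
    return replies
```

To continue the conversation, append `response.choices[0].message` and then the returned replies to the messages list, and call `client.chat.completions.create` again with the same tools.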

License & citation

License inherited from the base model.

@misc{lasby2025reap,
    title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
    author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
    year = {2025},
    eprint = {2510.13999},
    archivePrefix = {arXiv}
}



URL: https://huggingface.co/0xSero/Qwen3.5-264B-W4A16
