URL: https://huggingface.co/0xSero/Qwen3.5-264B-W4A16

Qwen3.5-264B-W4A16

W4A16 quantization of Qwen/Qwen3.5-397B-A17B.

At a glance

Base model Qwen/Qwen3.5-397B-A17B
Format W4A16
Total params 264B
Active / token
Experts / layer 336 (10 active per token)
Layers 60
Hidden size
Context
On-disk size 282 GB

Which variant should I pick?

Variant Format Link
Qwen3.5-264B BF16 link
Qwen3.5-264B-FP8 FP8 link
Qwen3.5-264B-W4A16 (this) W4A16 link
Qwen3.5-28B BF16 link
Qwen3.5-35B-EXL3-4bpw EXL3-4bpw link
Qwen3.5-76B BF16 link
Qwen3.5-76B-GGUF GGUF link
Qwen3.5-88B BF16 link
Qwen3.5-99B BF16 link
Qwen3.5-99B-GGUF GGUF link
  • Repository: 0xSero/Qwen3.5-264B-W4A16
  • Base model: Qwen/Qwen3.5-397B-A17B
  • Artifact kind: quantized
  • Compression ratio: 34%
  • Prune metric: reap
  • Quantization scheme: W4A16
  • Quantization format: auto_round:auto_gptq
  • Parent artifact: 0xSero/Qwen3.5-264B
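
As a quick sanity check on the compression ratio, the short sketch below compares the total parameter counts implied by the model names. It assumes the ratio means the fraction of total parameters removed by pruning; the card does not define it, and it may instead be measured over experts.

```python
# Parameter counts in billions, taken from the model names on this card.
base_params_b = 397    # Qwen/Qwen3.5-397B-A17B
pruned_params_b = 264  # Qwen3.5-264B

# Fraction of total parameters removed by pruning (assumed definition).
removed_fraction = 1 - pruned_params_b / base_params_b
print(f"{removed_fraction:.1%} of total parameters removed")
```

Under that assumption the result lands close to the listed 34%.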

Details

  • Maintainer: 0xSero
  • Organization: Sybil Solutions
  • Project: REAP PR17
  • Hub owner: 0xSero
  • Summary: AutoRound W4A16 GPTQ quantization of Qwen3.5-264B-REAP with vision encoder transplanted from the 262B variant.

Architecture

Hybrid MoE + Linear Attention (GDN/Mamba-style):

  • 60 layers with mixed linear_attention and full_attention layer types
  • 336 experts, 10 active per token
  • Vision encoder: ViT with 27 blocks, 1152 hidden size, spatial merge, transplanted from atbender/Qwen3.5-REAP-262B-A17B-W4A16
  • Composite multimodal format: Qwen3_5MoeForConditionalGeneration architecture

Vision Encoder

The vision encoder (visual-encoder.safetensors, 870 MB, 333 tensor keys) was transplanted from the 262B variant, atbender/Qwen3.5-REAP-262B-A17B-W4A16. The original 264B model was text-only; the vision weights come from the same Qwen3.5 architecture family and are fully compatible. Images are passed to the model using the standard OpenAI image_url content format.
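
To check the tensor key count on a downloaded copy, one option is to read only the JSON header of the safetensors file, which avoids loading any weights. This is a minimal sketch using the standard library; the function name `list_tensor_keys` is illustrative and not part of any library.

```python
import json
import struct


def list_tensor_keys(path):
    """Return tensor names from a .safetensors file by reading only its JSON header."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian unsigned 64-bit integer.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(key for key in header if key != "__metadata__")


# Example (assumes the file is in the current directory):
# print(len(list_tensor_keys("visual-encoder.safetensors")))
```

If the local file matches the one described above, the printed count should agree with the 333 keys reported here.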

Provenance

  • Observer state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-observer-state.raw.pt
  • Detail state: /home/ubuntu/qwen397-full/observer-calibv1/qwen397-pr17-calibv1-23k-16k-detail-state.raw.pt

Benchmarks

Evaluated on 8x RTX 3090 (24 GB each) with vLLM, TP=8, expert parallel, fp8 KV cache.

Benchmark Samples Score
HumanEval (coding) 50 100%
MATH-500 (competition math) 54 89%
Reasoning & Logic 2 100%
Terminal/CLI 2 100%
SWE (bug fixing) 2 100%
Cybersecurity 2 100%
Philosophy 2 100%
MMLU (general knowledge) 2 100%

Generation speed: ~62 tokens/s at batch_size=1.
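
The sample counts in the table differ a lot, so it can help to look at the range of pass rates each score is consistent with. The sketch below computes a Wilson score interval per row. The correct-answer counts are inferred from the rounded percentages (for example 48 of 54 for MATH-500) and are an assumption, not published figures.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a pass rate (roughly 95% coverage for z=1.96)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


# Sample counts come from the table; correct counts are inferred from the scores.
rows = [("HumanEval", 50, 50), ("MATH-500", 48, 54), ("2-sample suites", 2, 2)]
for name, correct, total in rows:
    low, high = wilson_interval(correct, total)
    print(f"{name}: {correct}/{total}, interval {low:.2f} to {high:.2f}")
```

The interval for a 2-sample suite is far wider than for the 50-sample HumanEval run, even though both report 100%.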

Serving with vLLM

Requirements

Python 3.12
CUDA 12.8
8x GPU with 24+ GB each (tested on RTX 3090)

Exact working dependency versions

vllm==0.19.0
torch==2.10.0+cu128
transformers==4.57.6
flashinfer-python==0.6.6
flashinfer-cubin==0.6.6
quack-kernels==0.3.10
nvidia-cutlass-dsl==4.4.2
nvidia-cutlass-dsl-libs-base==4.4.2
triton==3.6.0
xgrammar==0.1.33
conch-triton-kernels==1.3
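
To compare an existing environment against these pins, a small check with `importlib.metadata` from the standard library may be useful. This is a sketch: `PINS` copies only a subset of the list above, and `find_mismatches` is an illustrative helper name.

```python
from importlib import metadata

# A subset of the pins listed above; extend with the remaining packages as needed.
PINS = {
    "vllm": "0.19.0",
    "torch": "2.10.0+cu128",
    "transformers": "4.57.6",
    "triton": "3.6.0",
}


def find_mismatches(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for name, (expected, installed) in find_mismatches(PINS).items():
    print(f"{name}: expected {expected}, found {installed}")
```

Run it with the interpreter inside the target virtual environment; no output means every listed pin matches.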

Installation

uv venv vllm-env --python 3.12
uv pip install --python vllm-env/bin/python3 'vllm==0.19.0' conch-triton-kernels

Tokenizer fix

The tokenizer_config.json shipped with this model sets "tokenizer_class": "Qwen2Tokenizer". If you encounter tokenizer errors, check that your local copy still has this value and reset it if needed:

import json

# Run from the directory that contains the downloaded model files.
with open("tokenizer_config.json") as f:
    cfg = json.load(f)

# Reset the tokenizer class to the value this model expects.
cfg["tokenizer_class"] = "Qwen2Tokenizer"

with open("tokenizer_config.json", "w") as f:
    json.dump(cfg, f, indent=2)

Launch command

vllm serve 0xSero/Qwen3.5-264B-W4A16 \
 --tensor-parallel-size 8 \
 --enable-expert-parallel \
 --enable-prefix-caching \
 --max-model-len 262144 \
 --max-num-seqs 4 \
 --gpu-memory-utilization 0.9 \
 --kv-cache-dtype fp8_e4m3 \
 --dtype bfloat16 \
 --trust-remote-code \
 --reasoning-parser qwen3 \
 --tool-call-parser qwen3_coder \
 --enable-auto-tool-choice \
 --served-model-name qwen35-264b
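
Loading a model of this size can take a while, so a client script may want to wait for the server before sending requests. The sketch below polls the vLLM `/health` endpoint until it returns 200. The port (8000, matching the usage examples on this card), the attempt count, and the delay are assumptions to adjust for your setup; `http_status` and `wait_until_ready` are illustrative helper names.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_ready(url="http://localhost:8000/health", attempts=120, delay=10.0,
                     probe=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200; return True on success, False if attempts run out."""
    for _ in range(attempts):
        if probe(url) == 200:
            return True
        sleep(delay)
    return False


# Example: call wait_until_ready() after starting `vllm serve` in another shell.
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server.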

Known issues

  • Mamba cache align mode: vLLM auto-enables experimental Mamba cache "align" mode when prefix caching is on. vLLM 0.19.0 includes a fix for Mamba state corruption (PR #37728) that improves stability. If you experience hangs after sustained usage on 0.18.x, upgrade to 0.19.0.
  • PCIe riser instability: On systems with PCIe risers (e.g., mining rigs repurposed for ML), sustained multi-GPU NCCL traffic can cause AER errors. Mask AER with setpci -s <addr> ECAP_AER+0x08.l=0xFFFFFFFF on affected slots.
  • CUDA graph memory: If CUDA graph capture fails, add --max-cudagraph-capture-size 256 or --enforce-eager.

Usage

Text generation

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "Solve: what is the integral of x^2 * e^x dx?"}],
    max_tokens=8192,
)
print(response.choices[0].message.content)
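
Because the launch command enables `--reasoning-parser qwen3`, the server may return the model's reasoning separately from the final answer. The helper below is a sketch for reading both. It assumes the reasoning arrives as an extra message field named either `reasoning` or `reasoning_content`; the exact name depends on the vLLM version, so verify it against your server's responses. `split_reasoning` is an illustrative name.

```python
def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the parsed reasoning is exposed as an extra field called
    `reasoning` or `reasoning_content`, depending on the vLLM version.
    """
    reasoning = getattr(message, "reasoning", None) or getattr(message, "reasoning_content", None)
    return reasoning, message.content


# Example, using `response` from the request above:
# reasoning, answer = split_reasoning(response.choices[0].message)
```

If neither field is present, the first element of the result is `None` and the answer is returned unchanged.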

Vision

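import base64

# "image.png" is a placeholder path; point it at a real PNG file.
with open("image.png", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("ascii")

# Reuses the `client` created in the text generation example.
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)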

Tool calling

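tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
response = client.chat.completions.create(
    model="qwen35-264b",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
)
# With --enable-auto-tool-choice, the model decides whether to call the tool.
print(response.choices[0].message.tool_calls)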
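
The request above only returns the model's tool call; running the tool and sending the result back is up to the client. The sketch below shows one way to do that step with the OpenAI chat format. The `get_weather` implementation and `TOOL_REGISTRY` are hypothetical stand-ins for real tool code.

```python
import json


def get_weather(city):
    """Hypothetical local tool; returns canned data for illustration only."""
    return {"city": city, "forecast": "unknown"}


# Maps tool names declared in `tools` to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(message, registry=TOOL_REGISTRY):
    """Execute each tool call on `message` and return the matching `tool` role messages."""
    results = []
    for call in message.tool_calls or []:
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        output = registry[call.function.name](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(output),
        })
    return results
```

To continue the conversation, append the assistant message and the returned `tool` messages to `messages`, then call `client.chat.completions.create` again.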

License & citation

License inherited from the base model.

@misc{lasby2025reap,
 title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
 author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
 year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
