Qwen3.6-35B-A3B-NVFP4 by IG1
Quantization
This model has been quantized using llm-compressor v0.10.1.dev107+gfdaeb6c4 (just after Qwen3.6 example was merged) and transformers v5.3.0. It is based on the official example with a few modifications (see next section).
Quantization particularities
The sequence length has been increased from 4096 to 8192 and the number of samples from 256 to 1024. The 1024 samples come from 4 differents datasets:
- 256 general conversation samples (UltraChat)
- 256 math reasoning samples (GSM8K)
- 256 code samples (CodeAlpaca)
- 256 multilingual samples (Aya)
You can find the quantization script here.
About FP8 KV cache
In our testing, the Qwen3.6 Mamba hybrid architecture did not play well with FP8 KV cache:
vLLM dynamic FP8 KV cache (
--kv-cache-dtype fp8_e4m3 --calculate-kv-scales) appeared to work initially but quality degraded rapidly into gibberish.Static FP8 scales via llm-compressor (
kv_cache_schemein the recipe) corrupted the NVFP4 weight quantization during calibration. Because FP8 is injected into the forward pass during scale computation, layers with mismatched head dimensions (256 for attention vs 128 for linear attention) produced corrupted activations that propagated through the network, poisoning the weight quantization scales. The resulting model output gibberish even when FP8 KV cache was disabled at inference — the weights themselves were permanently damaged. Note that static FP8 KV scales stored in a checkpoint are passive metadata and still require explicit activation via--kv-cache-dtype fp8_e4m3at vLLM startup to be used; however, the corruption occurred during quantization, not at inference time.
Qwen3.6 Profiles
Alongside support for dynamic thinking and non-thinking modes, the Qwen team offers 4 sampling parameter profiles:
- Thinking General
- Thinking Coding
- Instruct General
- Instruct Reasoning (we prefer to call it Instruct Creative internally)
Manually configuring these parameters for every AI client can be difficult. To solve this, we built a lightweight reverse proxy that exposes the 4 profiles as virtual model names. It handles request transformation on the fly using a single inference server as backend. View the project on our GitHub.
Inference
We run this model with vLLM, here is a sample execution command:
docker run --rm --name 'Qwen3.6-35B-A3B-NVFP4' \
--runtime=nvidia --gpus 'all' --ipc=host \
-e 'HF_TOKEN' \
-v '/srv/cache:/root/.cache' \
-p '127.0.0.1:8000:8000' \
'vllm/vllm-openai:v0.20.0' \
'ig1/Qwen3.6-35B-A3B-NVFP4' \
--served-model-name 'Qwen3.6-35B-A3B' \
--reasoning-parser 'qwen3' \
--enable-auto-tool-choice \
--tool-call-parser 'qwen3_coder' \
--max-model-len 'auto' \
--gpu-memory-utilization '0.66' \
--kv-cache-memory-bytes '21G'
A few notes about some of the parameters:
- Adapt the
/srv/cache:/root/.cachemount point to your liking. It contains files you want to keep between multiples run (dynamo bytecode and AOT with torch compile but most importantly the huggingface folder for the model) --gpu-memory-utilization '0.66' --kv-cache-memory-bytes '21G'was set to get ~1x the max model len (274,560 tokens KV cache for a 262,144 max model len) and check the total VRAM consumption once vLLM has been fully started. The--gpu-memory-utilizationis an upper bound for vLLM start (here on a RTX 6000 Pro Blackwell) before the KV cache size is fixed to 21G.- If you deploy the model into several GPUs using Tensor Parallelism, be sure to check the official recipe as others flags are needed.
With this config, vLLM consumed a total of 50,176MiB on a RTX 6000 Pro Blackwell.
Speculative Decoding (MTP)
The layers responsible for the Multi-Token Prediction has not been quantized and are available separately in the model_mtp.safetensors file. If you want to use speculative decoding simply add the following arguments (vLLM will load the file automatically):
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
This is recommended for latency-focused serving scenarios (not total throughput/massive concurrent requests), check the vLLM recipes website for Qwen 3.5/3.6 for more informations.
RTX 5090 Optimized Deployment
Because of the increase model size from its previous versions (Qwen3-30B-A3B-Thinking-2507 and Qwen3-30B-A3B-Instruct-2507 were 30B not 35B) but also because of its mamba hybrid architecture and its native vision support (layers excluded from the quantization) the final model size is bigger (~ +5GiB) which make its inference by a RTX 5090 with only 32 GiB of RAM challenging.
If you really want/need to, it is still possible by tuning a few parameters (and accepting a lower max model len/kv cache size and requests concurrency).
With turboquant (recommended)
vLLM v0.21.0 landed support of TurbotQuant for hybrid models like Qwen3.5/3.6, drastically enhancing the available KV cache available on a limited VRAM.
With a headless environment (no graphical env) you can push the memory usage to 0.95. Without it is best to set it to 0.875 if you have graphical apps (like Zed) that use the GPU as well but you can try to push it to 0.9 beware that it might not leave the host enought.
With 0.875:
(EngineCore pid=139) INFO 05-15 15:37:22 [gpu_worker.py:462] Available KV cache memory: 1.59 GiB
...
(EngineCore pid=139) INFO 05-15 15:37:22 [kv_cache_utils.py:1871] Auto-fit max_model_len: reduced from 262144 to 210368 to fit in available GPU memory (1.59 GiB available for KV cache)
(EngineCore pid=139) INFO 05-15 15:37:22 [kv_cache_utils.py:1710] GPU KV cache size: 210,368 tokens
(EngineCore pid=139) INFO 05-15 15:37:22 [kv_cache_utils.py:1711] Maximum concurrency for 210,368 tokens per request: 1.00x
With 0.9:
(EngineCore pid=139) INFO 05-15 15:21:12 [gpu_worker.py:462] Available KV cache memory: 2.69 GiB
...
(EngineCore pid=139) INFO 05-15 15:21:12 [kv_cache_utils.py:1863] Auto-fit max_model_len: full model context length 262144 fits in available GPU memory
(EngineCore pid=139) INFO 05-15 15:21:12 [kv_cache_utils.py:1710] GPU KV cache size: 358,441 tokens
(EngineCore pid=139) INFO 05-15 15:21:12 [kv_cache_utils.py:1711] Maximum concurrency for 262,144 tokens per request: 1.37x
Linux
bash command:
docker run --name 'Qwen3.6-35B-A3B-NVFP4-KVTBQ' \
--runtime=nvidia --gpus 'all' --ipc=host \
-e 'HF_TOKEN' \
-v '/srv/cache:/root/.cache' \
-p '127.0.0.1:8000:8000' \
'vllm/vllm-openai:v0.21.0' \
'ig1/Qwen3.6-35B-A3B-NVFP4' \
--served-model-name 'Qwen3.6-35B-A3B' \
--reasoning-parser 'qwen3' \
--enable-auto-tool-choice \
--tool-call-parser 'qwen3_coder' \
--max-model-len 'auto' \
--limit-mm-per-prompt.video 0 \
--max-cudagraph-capture-size 32 \
--max-num-seqs 32 \
--max-num-batched-tokens 2048 \
--kv-cache-dtype 'turboquant_k8v4' \
--gpu-memory-utilization '0.875'
Windows with Docker and WSL
powershell command:
docker run --name 'Qwen3.6-35B-A3B-NVFP4-KVTBQ' `
--runtime=nvidia --gpus 'all' --ipc=host `
-e 'HF_TOKEN' `
-v 'E:\cache:/root/.cache' `
-p '127.0.0.1:8000:8000' `
'vllm/vllm-openai:v0.21.0' `
'ig1/Qwen3.6-35B-A3B-NVFP4' `
--served-model-name 'Qwen3.6-35B-A3B' `
--reasoning-parser 'qwen3' `
--enable-auto-tool-choice `
--tool-call-parser 'qwen3_coder' `
--max-model-len 'auto' `
--limit-mm-per-prompt.video 0 `
--max-cudagraph-capture-size 32 `
--max-num-seqs 32 `
--max-num-batched-tokens 2048 `
--kv-cache-dtype 'turboquant_k8v4' `
--gpu-memory-utilization '0.875'
Without turboquant (old, not recommended)
docker run --rm --name 'Qwen3.6-35B-A3B-NVFP4' \
--runtime=nvidia --gpus 'all' --ipc=host \
-e 'HF_TOKEN' \
-v '/srv/cache:/root/.cache' \
-p '127.0.0.1:8000:8000' \
'vllm/vllm-openai:v0.20.0' \
'ig1/Qwen3.6-35B-A3B-NVFP4' \
--served-model-name 'Qwen3.6-35B-A3B' \
--reasoning-parser 'qwen3' \
--enable-auto-tool-choice \
--tool-call-parser 'qwen3_coder' \
--max-model-len 'auto' \
--limit-mm-per-prompt.video 0 \
--max-cudagraph-capture-size 32 \
--max-num-seqs 32 \
--max-num-batched-tokens 2048 \
--gpu-memory-utilization '0.95'
Important: On a non-headless host (with graphical environment), --gpu-memory-utilization 0.95 may cause instability. Lower it if needed.
| Flag | Value | Purpose | Impact |
|---|---|---|---|
--limit-mm-per-prompt.video |
0 |
Disable video encoder | Saves ~170 MiB |
--max-cudagraph-capture-size |
≤32 |
Limit CUDA graph buffers | Saves ~4,400 MiB |
--max-num-seqs |
≤32 |
Limit concurrent sequences, can not be higher than max-cudagraph-capture-size | Included above |
--max-num-batched-tokens |
2048 |
Reduce activation buffers | Saves ~350 MiB |
About the --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}': this will speed up generation in a low concurrent users/requests scenario but the MTP layers eat ~1.69G of your VRAM decreasing the space available for the KV cache (equivalent to roughly ~26,400 tokens). Because the purpose here was to save VRAM we did not include it. But in the end, it is up to you to decide if you prefer faster inference with a lower KV cache and slower inference with a bigger KV cache.
Windows with Docker for WSL
Powershell:
docker run --name 'Qwen3.6-35B-A3B-NVFP4' `
--runtime=nvidia --gpus 'all' --ipc=host `
-e 'HF_TOKEN' `
-v 'E:\cache:/root/.cache' `
-p '127.0.0.1:8000:8000' `
'vllm/vllm-openai:v0.20.0' `
'ig1/Qwen3.6-35B-A3B-NVFP4' `
--served-model-name 'Qwen3.6-35B-A3B' `
--reasoning-parser 'qwen3' `
--enable-auto-tool-choice `
--tool-call-parser 'qwen3_coder' `
--max-model-len 'auto' `
--limit-mm-per-prompt.video 0 `
--max-cudagraph-capture-size 32 `
--max-num-seqs 32 `
--max-num-batched-tokens 2048 `
--gpu-memory-utilization '0.90'
Depending on your host VRAM usage, 0.90 might be too high. You can lower it to 0.875, but this will impact your KV cache/maximum model length.
- Downloads last month
- 1,166
Model tree for ig1/Qwen3.6-35B-A3B-NVFP4
Base model
Qwen/Qwen3.6-35B-A3B