Qwen3-Next Usage Guide¶
Qwen3-Next is an advanced large language model created by the Qwen team from Alibaba Cloud. It features several key improvements:
- A hybrid attention mechanism
- A highly sparse Mixture-of-Experts (MoE) structure
- Training-stability-friendly optimizations
- A multi-token prediction mechanism for faster inference
Installing vLLM¶
uvvenv
source.venv/bin/activate
uvpipinstall-Uvllm--torch-backendauto
Launching Qwen3-Next with vLLM¶
You can use 4x H200/H20 or 4x A100/A800 GPUs to launch this model.
Basic Multi-GPU Setup¶
vllmserveQwen/Qwen3-Next-80B-A3B-Instruct\
--tensor-parallel-size4\
--served-model-nameqwen3-next\
--enable-prefix-caching
If you encounter torch.AcceleratorError: CUDA error: an illegal memory access was encountered, you can add --compilation_config.cudagraph_mode=PIECEWISE to the startup parameters to resolve this issue. This IMA error may occur in Data Parallel (DP) mode.
For FP8 model¶
For SM90/SM100 machines:
vllmserveQwen/Qwen3-Next-80B-A3B-Instruct-FP8\
--tensor-parallel-size4\
--enable-prefix-caching
We can accelerate the performance on SM100 machines using the FP8 FlashInfer TRTLLM MoE kernel.
VLLM_USE_FLASHINFER_MOE_FP8=1\
VLLM_FLASHINFER_MOE_BACKEND=latency\
VLLM_USE_DEEP_GEMM=0\
VLLM_USE_TRTLLM_ATTENTION=0\
VLLM_ATTENTION_BACKEND=FLASH_ATTN\
vllmserveQwen/Qwen3-Next-80B-A3B-Instruct-FP8\
--tensor-parallel-size4
Advanced Configuration with MTP¶
Qwen3-Next also supports Multi-Token Prediction (MTP in short), you can launch the model server with the following arguments to enable MTP.
vllmserveQwen/Qwen3-Next-80B-A3B-Instruct\
--tokenizer-modeauto--gpu-memory-utilization0.8\
--speculative-config'{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'\
--tensor-parallel-size4--no-enable-chunked-prefill
The speculative-config argument configures speculative decoding settings using a JSON format. The method "qwen3_next_mtp" specifies that the system should use Qwen3-Next's specialized multi-token prediction method. The "num_speculative_tokens": 2 setting means the model will speculate 2 tokens ahead during generation.
Performance Metrics¶
Benchmarking¶
We use the following script to demonstrate how to benchmark Qwen/Qwen3-Next-80B-A3B-Instruct.
vllmbenchserve\
--backendvllm\
--modelQwen/Qwen3-Next-80B-A3B-Instruct\
--served-model-nameqwen3-next\
--endpoint/v1/completions\
--dataset-namerandom\
--random-input2048\
--random-output1024\
--max-concurrency10\
--num-prompt100
Usage Tips¶
Tune MoE kernel¶
When starting the model service, you may encounter the following warning in the server log(Suppose the GPU is NVIDIA_H20-3e):
(VllmWorkerTP2pid=47571)WARNING09-0915:47:25[fused_moe.py:727]UsingdefaultMoEconfig.Performancemightbesub-optimal!Configfilenotfoundat['/vllm_path/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_H20-3e.json']
You can use benchmark_moe to perform MoE Triton kernel tuning for your hardware. Once tuning is complete, a JSON file with a name like E=512,N=128,device_name=NVIDIA_H20-3e.json will be generated. You can specify the directory containing this file for your deployment hardware using the environment variable VLLM_TUNED_CONFIG_FOLDER, like:
VLLM_TUNED_CONFIG_FOLDER=your_moe_tuned_dirvllmserveQwen/Qwen3-Next-80B-A3B-Instruct\
--tensor-parallel-size4\
--served-model-nameqwen3-next
You should see the following information printed in the server log. This indicates that the tuned MoE configuration has been loaded, which will improve the model service performance.
(VllmWorkerTP2pid=60498)INFO09-0916:23:07[fused_moe.py:720]Usingconfigurationfrom/your_moe_tuned_dir/E=512,N=128,device_name=NVIDIA_H20-3e.jsonforMoElayer.
Data Parallel Deployment¶
vLLM supports multi-parallel groups. You can refer to Data Parallel Deployment documentation and try parallel combinations that are more suitable for this model.
Function calling¶
vLLM also supports calling user-defined functions. Make sure to run your Qwen3-Next models with the following arguments.
vllmserve...--tool-call-parserhermes--enable-auto-tool-choice
AMD GPU Support¶
Recommended approaches by hardware type are:
MI300X/MI325X/MI355X
Please follow the steps here to install and run Qwen3-Next models on AMD MI300X/MI325X/MI355X GPU.
Step 1: Installing vLLM (AMD ROCm Backend: MI300X, MI325X, MI355X)¶
Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the documentation.
uvvenv source.venv/bin/activate uvpipinstallvllm--extra-index-urlhttps://wheels.vllm.ai/rocm/0.14.1/rocm700
Step 2: Start the vLLM server¶
Run the vllm online serving
SAFETENSORS_FAST_GPU=1\
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1\
vllmserveQwen/Qwen3-Next-80B-A3B-Instruct\
--tensor-parallel-size4\
--max-model-len32768\
--no-enable-prefix-caching\
--trust-remote-code
Step 3: Run Benchmark¶
Open a new terminal and run the following command to execute the benchmark script inside the container.
vllmbenchserve\
--model"Qwen/Qwen3-Next-80B-A3B-Instruct"\
--dataset-namerandom\
--random-input-len8192\
--random-output-len1024\
--request-rate10000\
--num-prompts16\
--ignore-eos\
--trust-remote-code
