Qwen3-Next Usage Guide¶

Qwen3-Next is an advanced large language model created by the Qwen team from Alibaba Cloud. It features several key improvements:

A hybrid attention mechanism
A highly sparse Mixture-of-Experts (MoE) structure
Training-stability-friendly optimizations
A multi-token prediction mechanism for faster inference

Installing vLLM¶

uvvenv
source.venv/bin/activate
uvpipinstall-Uvllm--torch-backendauto

Launching Qwen3-Next with vLLM¶

You can use 4x H200/H20 or 4x A100/A800 GPUs to launch this model.

Basic Multi-GPU Setup¶

vllmserveQwen/Qwen3-Next-80B-A3B-Instruct\
--tensor-parallel-size4\
--served-model-nameqwen3-next\
--enable-prefix-caching

If you encounter torch.AcceleratorError: CUDA error: an illegal memory access was encountered, you can add --compilation_config.cudagraph_mode=PIECEWISE to the startup parameters to resolve this issue. This IMA error may occur in Data Parallel (DP) mode.

For FP8 model¶

For SM90/SM100 machines:

vllmserveQwen/Qwen3-Next-80B-A3B-Instruct-FP8\
--tensor-parallel-size4\
--enable-prefix-caching

We can accelerate the performance on SM100 machines using the FP8 FlashInfer TRTLLM MoE kernel.

VLLM_USE_FLASHINFER_MOE_FP8=1\
VLLM_FLASHINFER_MOE_BACKEND=latency\
VLLM_USE_DEEP_GEMM=0\
VLLM_USE_TRTLLM_ATTENTION=0\
VLLM_ATTENTION_BACKEND=FLASH_ATTN\
vllmserveQwen/Qwen3-Next-80B-A3B-Instruct-FP8\
--tensor-parallel-size4

Advanced Configuration with MTP¶

Qwen3-Next also supports Multi-Token Prediction (MTP in short), you can launch the model server with the following arguments to enable MTP.

vllmserveQwen/Qwen3-Next-80B-A3B-Instruct\
--tokenizer-modeauto--gpu-memory-utilization0.8\
--speculative-config'{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'\
--tensor-parallel-size4--no-enable-chunked-prefill

The speculative-config argument configures speculative decoding settings using a JSON format. The method "qwen3_next_mtp" specifies that the system should use Qwen3-Next's specialized multi-token prediction method. The "num_speculative_tokens": 2 setting means the model will speculate 2 tokens ahead during generation.

Performance Metrics¶

Benchmarking¶

We use the following script to demonstrate how to benchmark Qwen/Qwen3-Next-80B-A3B-Instruct.

vllmbenchserve\
--backendvllm\
--modelQwen/Qwen3-Next-80B-A3B-Instruct\
--served-model-nameqwen3-next\
--endpoint/v1/completions\
--dataset-namerandom\
--random-input2048\
--random-output1024\
--max-concurrency10\
--num-prompt100

Usage Tips¶

Tune MoE kernel¶

When starting the model service, you may encounter the following warning in the server log(Suppose the GPU is NVIDIA_H20-3e):

(VllmWorkerTP2pid=47571)WARNING09-0915:47:25[fused_moe.py:727]UsingdefaultMoEconfig.Performancemightbesub-optimal!Configfilenotfoundat['/vllm_path/vllm/model_executor/layers/fused_moe/configs/E=512,N=128,device_name=NVIDIA_H20-3e.json']

You can use benchmark_moe to perform MoE Triton kernel tuning for your hardware. Once tuning is complete, a JSON file with a name like E=512,N=128,device_name=NVIDIA_H20-3e.json will be generated. You can specify the directory containing this file for your deployment hardware using the environment variable VLLM_TUNED_CONFIG_FOLDER, like:

VLLM_TUNED_CONFIG_FOLDER=your_moe_tuned_dirvllmserveQwen/Qwen3-Next-80B-A3B-Instruct\
--tensor-parallel-size4\
--served-model-nameqwen3-next

You should see the following information printed in the server log. This indicates that the tuned MoE configuration has been loaded, which will improve the model service performance.

(VllmWorkerTP2pid=60498)INFO09-0916:23:07[fused_moe.py:720]Usingconfigurationfrom/your_moe_tuned_dir/E=512,N=128,device_name=NVIDIA_H20-3e.jsonforMoElayer.

Data Parallel Deployment¶

vLLM supports multi-parallel groups. You can refer to Data Parallel Deployment documentation and try parallel combinations that are more suitable for this model.

Function calling¶

vLLM also supports calling user-defined functions. Make sure to run your Qwen3-Next models with the following arguments.

vllmserve...--tool-call-parserhermes--enable-auto-tool-choice

AMD GPU Support¶

Recommended approaches by hardware type are:

MI300X/MI325X/MI355X

Please follow the steps here to install and run Qwen3-Next models on AMD MI300X/MI325X/MI355X GPU.

Step 1: Installing vLLM (AMD ROCm Backend: MI300X, MI325X, MI355X)¶

Note: The vLLM wheel for ROCm requires Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment does not meet these requirements, please use the Docker-based setup as described in the documentation.
uvvenv
source.venv/bin/activate
uvpipinstallvllm--extra-index-urlhttps://wheels.vllm.ai/rocm/0.14.1/rocm700

Step 2: Start the vLLM server¶

Run the vllm online serving

SAFETENSORS_FAST_GPU=1\
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1\
vllmserveQwen/Qwen3-Next-80B-A3B-Instruct\
--tensor-parallel-size4\
--max-model-len32768\
--no-enable-prefix-caching\
--trust-remote-code

Step 3: Run Benchmark¶

Open a new terminal and run the following command to execute the benchmark script inside the container.

vllmbenchserve\
--model"Qwen/Qwen3-Next-80B-A3B-Instruct"\
--dataset-namerandom\
--random-input-len8192\
--random-output-len1024\
--request-rate10000\
--num-prompts16\
--ignore-eos\
--trust-remote-code

URL: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3-Next.html