DeepSeek-V3.2-Exp Usage Guide
Introduction
DeepSeek-V3.2-Exp is a sparse attention model. Its main architecture is similar to DeepSeek-V3.1, except that it uses a sparse attention mechanism.
Installing vLLM
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
uv pip install git+https://github.com/deepseek-ai/[email protected] --no-build-isolation # Other versions may also work. We recommend using the latest released version from https://github.com/deepseek-ai/DeepGEMM/releases
Note: DeepGEMM is used in two places: the MoE computation and the MQA logits computation. It is required for the MQA logits computation. To disable it for the MoE part only, set the environment variable VLLM_USE_DEEP_GEMM=0. Some users report better performance with VLLM_USE_DEEP_GEMM=0, e.g. on H20 GPUs. Disabling DeepGEMM may also help if you want to skip the long warmup.
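As a minimal sketch of the note above, the snippet below prepares an environment in which DeepGEMM is disabled for the MoE part before the server is launched. The helper name `deepgemm_env` is hypothetical; only the `VLLM_USE_DEEP_GEMM` variable comes from this guide.

```python
import os


def deepgemm_env(disable_moe: bool, base_env=None) -> dict:
    """Return a copy of the environment for launching vLLM.

    Hypothetical helper. When disable_moe is True, VLLM_USE_DEEP_GEMM is set
    to "0", which turns DeepGEMM off for the MoE part. DeepGEMM must still be
    installed, because the MQA logits computation requires it.
    """
    env = dict(os.environ if base_env is None else base_env)
    if disable_moe:
        env["VLLM_USE_DEEP_GEMM"] = "0"
    return env


env = deepgemm_env(disable_moe=True)
print("VLLM_USE_DEEP_GEMM =", env["VLLM_USE_DEEP_GEMM"])
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the server from Python. When `disable_moe` is False the variable is left untouched, so whatever default your vLLM version uses still applies.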
Launching DeepSeek-V3.2-Exp
Serving on 8xH200 (or H20) GPUs (141GB × 8)
Using the recommended EP/DP mode:
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -dp 8 --enable-expert-parallel
Using tensor parallel:
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8
Serving on 8xB200 GPUs
Same as the above.
Only Hopper and Blackwell data center GPUs are supported for now.
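Once the server is up, a quick request can confirm that it responds. The sketch below uses only the Python standard library and assumes the server listens on `http://127.0.0.1:8000` (the address used by the benchmarking commands in this guide) and exposes the OpenAI-compatible completions route. The function names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request

# Assumed endpoint: matches the base_url used with lm-eval in this guide.
BASE_URL = "http://127.0.0.1:8000/v1/completions"
MODEL = "deepseek-ai/DeepSeek-V3.2-Exp"


def build_completion_request(prompt: str, max_tokens: int = 32) -> urllib.request.Request:
    """Build a POST request in the OpenAI-compatible completions format."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 32, timeout: float = 60.0) -> str:
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a running server, `print(complete("The capital of France is"))` should print a short continuation. `complete` needs a live server, so it is not called here.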
Accuracy Benchmarking
lm-eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False
Results:
local-completions (model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|Filter|n-shot|Metric||Value||Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.9591|±|0.0055|
|||strict-match|5|exact_match|↑|0.9591|±|0.0055|
A GSM8K score of 0.9591 is pretty good!
Next, we can set num_fewshot=20 to increase the context length and test whether the model handles longer contexts:
lm-eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False --num_fewshot 20
Results:
local-completions (model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: 20, batch_size: 1
|Tasks|Version|Filter|n-shot|Metric||Value||Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|20|exact_match|↑|0.9538|±|0.0058|
|||strict-match|20|exact_match|↑|0.9530|±|0.0058|
A GSM8K score of 0.9538 is also pretty good!
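To judge whether the drop from 0.9591 (5-shot) to 0.9538 (20-shot) is meaningful, one rough check is to express the gap in units of the combined standard error. The sketch below uses the flexible-extract values from the two tables above. It treats the two runs as independent, which is a simplifying assumption because both runs score the same test questions. The function name `gap_in_stderr` is hypothetical.

```python
import math


def gap_in_stderr(score_a: float, stderr_a: float, score_b: float, stderr_b: float) -> float:
    """Return (score_a - score_b) divided by the combined standard error.

    Assumes the two measurements are independent, so their variances add.
    """
    combined = math.sqrt(stderr_a ** 2 + stderr_b ** 2)
    return (score_a - score_b) / combined


# flexible-extract results reported above: 5-shot vs. 20-shot
z = gap_in_stderr(0.9591, 0.0055, 0.9538, 0.0058)
print(f"gap = {0.9591 - 0.9538:.4f}, z = {z:.2f}")  # gap = 0.0053, z = 0.66
```

A gap of well under one combined standard error suggests the 20-shot score is statistically indistinguishable from the 5-shot score under this assumption.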
Performance Tips
- The kernels are mainly optimized for TP=1, so it is recommended to run this model under EP/DP mode, i.e. DP=8, EP=8, TP=1 as shown above. If you hit any errors or hangs, try tensor parallel instead. Simple tensor parallel works and is more robust, but the performance is not optimal.
- The default config uses a custom fp8 KV cache. You can also use a bfloat16 KV cache by specifying kv_cache_dtype=bfloat16. The default allows more tokens to be cached in the KV cache, but incurs additional quantization/dequantization overhead. In general, we recommend the bfloat16 KV cache for short requests and the fp8 KV cache for long requests.
If you hit errors like CUDA error (flashmla-src/csrc/smxx/mla_combine.cu:201): invalid configuration argument, the batch size may be too large. Try --max-num-seqs 256 or smaller (the default is 1024).
For other usage tips, such as enabling or disabling thinking mode, please refer to the DeepSeek-V3.1 Usage Guide.
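The tips above can be combined into a single launch command. The sketch below assembles one from a few inputs. It assumes the CLI spelling of `kv_cache_dtype` is `--kv-cache-dtype`. The threshold `LONG_REQUEST_TOKENS` is hypothetical, since this guide does not say where "short" ends and "long" begins. The helper name `serve_args` is hypothetical too.

```python
import shlex

# Hypothetical cut-off between "short" and "long" requests; tune for your workload.
LONG_REQUEST_TOKENS = 8192


def serve_args(typical_request_tokens: int, max_num_seqs: int = 256, use_tp: bool = False) -> list:
    """Build a `vllm serve` argument list following the performance tips."""
    args = ["vllm", "serve", "deepseek-ai/DeepSeek-V3.2-Exp"]
    if use_tp:
        # More robust fallback if EP/DP mode errors out or hangs.
        args += ["-tp", "8"]
    else:
        # Recommended mode: DP=8, EP=8, TP=1.
        args += ["-dp", "8", "--enable-expert-parallel"]
    if typical_request_tokens < LONG_REQUEST_TOKENS:
        # Short requests: avoid fp8 quantization/dequantization overhead.
        # Assumed flag spelling for kv_cache_dtype=bfloat16.
        args += ["--kv-cache-dtype", "bfloat16"]
    # Long requests keep the default fp8 KV cache, so no flag is added.
    # A smaller cap than the default of 1024 avoids the flashmla error above.
    args += ["--max-num-seqs", str(max_num_seqs)]
    return args


print(shlex.join(serve_args(typical_request_tokens=2048)))
```

For a short-request workload this prints the EP/DP command with a bfloat16 KV cache and `--max-num-seqs 256`. Passing `use_tp=True` switches to the tensor parallel fallback.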
