
URL: https://docs.vllm.ai/projects/ascend/en/v0.13.0/tutorials/DeepSeek-V4.html


DeepSeek-V4#

Introduction#

DeepSeek-V4 introduces several key upgrades over DeepSeek-V3. (Currently, vllm-ascend supports only DeepSeek-V4-Flash.)

  • Manifold-Constrained Hyper-Connections (mHC), which strengthen conventional residual connections;

  • A hybrid attention architecture, which greatly improves long-context efficiency through Compress-4-Attention and Compress-128-Attention. For the Mixture-of-Experts (MoE) components, it still adopts the DeepSeekMoE architecture, with only minor adjustments.

This document walks through the main verification steps for the model: supported features, feature configuration, environment preparation, single-node and multi-node deployment, and accuracy and performance evaluation.

Environment Preparation#

Model Weight#

  • DeepSeek-V4-Flash-w8a8-mtp (quantized version): requires 1 Atlas 800 A3 (128G × 8) node or 1 Atlas 800 A2 (64G × 8) node. Download model weight

It is recommended to download the model weight to a directory shared by all nodes, such as /root/.cache/.

Verify Multi-node Communication (Optional)#

To deploy a multi-node environment, first verify multi-node communication by following verify multi-node communication environment.

Installation#

You can use our official docker image to run DeepSeek-V4 directly. Currently, DeepSeek-V4 is integrated in image v0.13.0rc3.

Start the docker image on each node. The following command maps 8 NPU devices (davinci0 to davinci7):

export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc3
export NAME=vllm-ascend
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash

For the -a3 image, start the container on each node as follows. This variant maps 16 NPU devices (davinci0 to davinci15):

export IMAGE=quay.io/ascend/vllm-ascend:v0.13.0rc3-a3
export NAME=vllm-ascend
docker run --rm \
--name $NAME \
--net=host \
--shm-size=1g \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci8 \
--device /dev/davinci9 \
--device /dev/davinci10 \
--device /dev/davinci11 \
--device /dev/davinci12 \
--device /dev/davinci13 \
--device /dev/davinci14 \
--device /dev/davinci15 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /etc/hccn.conf:/etc/hccn.conf \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
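
The two docker commands above differ only in the image tag and in how many /dev/davinciN devices they map. As a minimal sketch, the command line could be generated instead of written by hand. The helper name build_docker_command is hypothetical and not part of vllm-ascend; the device names and mounts are copied from the commands above.

```python
import shlex

# Host paths mounted into the container, copied from the docker commands above.
MOUNTS = [
    "/usr/local/dcmi:/usr/local/dcmi",
    "/usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool",
    "/usr/local/bin/npu-smi:/usr/local/bin/npu-smi",
    "/usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/",
    "/usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info",
    "/etc/ascend_install.info:/etc/ascend_install.info",
    "/etc/hccn.conf:/etc/hccn.conf",
    "/mnt/sfs_turbo/.cache:/root/.cache",
]


def build_docker_command(image, num_devices, name="vllm-ascend"):
    """Hypothetical helper: return the docker argv for `num_devices` NPU devices."""
    cmd = ["docker", "run", "--rm", "--name", name, "--net=host", "--shm-size=1g"]
    # One --device flag per NPU device: /dev/davinci0 .. /dev/davinci{N-1}.
    for idx in range(num_devices):
        cmd += ["--device", f"/dev/davinci{idx}"]
    # Management devices that both commands map unconditionally.
    for dev in ("davinci_manager", "devmm_svm", "hisi_hdc"):
        cmd += ["--device", f"/dev/{dev}"]
    for mount in MOUNTS:
        cmd += ["-v", mount]
    cmd += ["-it", image, "bash"]
    return cmd


cmd_8 = build_docker_command("quay.io/ascend/vllm-ascend:v0.13.0rc3", 8)
cmd_16 = build_docker_command("quay.io/ascend/vllm-ascend:v0.13.0rc3-a3", 16)
print(shlex.join(cmd_16))
```

Printing the result with shlex.join gives a single shell-safe line that can be compared against the hand-written command before running it.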

If you don't want to use the docker image above, you can also build everything from source:

  • Install vllm-ascend from source; refer to installation. To deploy a multi-node environment, set up the environment on each node.

Note

Please use the v0.13.0rc3 code to install vllm-ascend.

Deployment#

Note

In this tutorial, we suppose you downloaded the model weight to /root/.cache/. Feel free to change it to your own path.
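
Every serve command in this tutorial points both the model path and --chat-template at the same weight directory, so a wrong path only shows up when the server starts. The following is a minimal sketch of a pre-flight check. The helper missing_files is hypothetical, and chat_template.jinja is the only file name taken from the commands in this document; extend the tuple with other files you expect in your download.

```python
from pathlib import Path

# Model directory used by the serve commands in this tutorial.
MODEL_DIR = Path(
    "/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp"
)


def missing_files(model_dir, required=("chat_template.jinja",)):
    """Hypothetical helper: return the names in `required` absent from `model_dir`."""
    model_dir = Path(model_dir)
    return [name for name in required if not (model_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_files(MODEL_DIR)
    if missing:
        print(f"Missing under {MODEL_DIR}: {', '.join(missing)}")
    else:
        print("Model directory looks complete for the files checked.")
```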

Single-node Deployment#

  • DeepSeek-V4-Flash-w8a8-mtp: can be deployed on 1 Atlas 800 A3 (128G × 8) or 1 Atlas 800 A2 (64G × 8).

Run the script that matches your hardware to start online inference. The first script uses data-parallel size 1 and tensor-parallel size 8 (8 NPU devices):

export USE_MULTI_BLOCK_POOL=1
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp \
--host 0.0.0.0 \
--max_model_len 65536 \
--max-num-batched-tokens 8192 \
--served-model-name ds \
--gpu-memory-utilization 0.9 \
--max-num-seqs 16 \
--data-parallel-size 1 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--quantization ascend \
--port 8006 \
--block-size 128 \
--chat-template /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
--async-scheduling \
--additional-config '{"enable_cpu_binding": "true", "multistream_overlap_shared_expert": true}' \
--speculative-config '{"num_speculative_tokens": 1,"method": "deepseek_mtp"}' \
--compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

The second script sets ASCEND_A3_ENABLE=1 and uses data-parallel size 2 with tensor-parallel size 8 (16 NPU devices):

export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ACL_OP_INIT_MODE=1
export ASCEND_A3_ENABLE=1
export USE_MULTI_BLOCK_POOL=1
export HCCL_BUFFSIZE=1024
export VLLM_ASCEND_ENABLE_FUSED_MC2=1
export VLLM_ASCEND_ENABLE_FLASHCOMM1=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp \
--host 0.0.0.0 \
--max_model_len 65536 \
--max-num-batched-tokens 8192 \
--served-model-name deepseek_v4 \
--gpu-memory-utilization 0.9 \
--max-num-seqs 16 \
--data-parallel-size 2 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--quantization ascend \
--chat-template /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
--port 8005 \
--block-size 128 \
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--speculative-config '{"num_speculative_tokens": 1,"method": "deepseek_mtp"}' \
--additional-config '{"enable_cpu_binding": "true","multistream_overlap_shared_expert": false}'
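
A quick way to sanity-check a parallel configuration before launching is to compare the number of devices it needs with the number mapped into the container. The sketch below assumes each data-parallel replica occupies tensor-parallel-size devices, so the total is their product. The function name required_devices is hypothetical.

```python
def required_devices(data_parallel_size, tensor_parallel_size):
    """Devices needed, assuming each DP replica spans `tensor_parallel_size` devices."""
    return data_parallel_size * tensor_parallel_size


# (DP, TP) pairs taken from the two single-node scripts above.
configs = {
    "first script (DP=1, TP=8)": (1, 8),
    "second script (DP=2, TP=8)": (2, 8),
}

for label, (dp, tp) in configs.items():
    print(f"{label}: {required_devices(dp, tp)} devices")
```

Under that assumption the two scripts need 8 and 16 devices, which matches the davinci0 to davinci7 and davinci0 to davinci15 mappings in the two docker commands.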

Multi-node Deployment#

  • DeepSeek-V4-Flash-w8a8-mtp: requires at least 2 Atlas 800 A2 (64G × 8) nodes. Run the following scripts on the two nodes respectively.

Node0

# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

export HCCL_OP_EXPANSION_MODE="AIV"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
export USE_MULTI_BLOCK_POOL=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp \
--host 0.0.0.0 \
--port 8005 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek-v4-flash \
--enable-expert-parallel \
--max-num-seqs 64 \
--max-model-len 131072 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--async-scheduling \
--no-enable-prefix-caching \
--chat-template /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
--gpu-memory-utilization 0.94 \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"enable_cpu_binding": "true", "multistream_overlap_shared_expert": true}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'

Node1

# nic_name and local_ip can be obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip of the current node
nic_name="xxx"
local_ip="xxx"

# The value of node0_ip must be consistent with the value of local_ip set in node0 (master node)
node0_ip="xxxx"

export HCCL_OP_EXPANSION_MODE="AIV"

export HCCL_IF_IP=$local_ip
export GLOO_SOCKET_IFNAME=$nic_name
export TP_SOCKET_IFNAME=$nic_name
export HCCL_SOCKET_IFNAME=$nic_name
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=10
export HCCL_BUFFSIZE=200
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export HCCL_CONNECT_TIMEOUT=120
export HCCL_INTRA_PCIE_ENABLE=1
export HCCL_INTRA_ROCE_ENABLE=0
export ACL_OP_INIT_MODE=1
export TRITON_ALL_BLOCKS_PARALLEL=1
export USE_MULTI_BLOCK_POOL=1

vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp \
--host 0.0.0.0 \
--port 8005 \
--headless \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--data-parallel-start-rank 1 \
--data-parallel-address $node0_ip \
--data-parallel-rpc-port 13389 \
--tensor-parallel-size 8 \
--quantization ascend \
--seed 1024 \
--served-model-name deepseek-v4-flash \
--enable-expert-parallel \
--max-num-seqs 64 \
--max-model-len 131072 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--async-scheduling \
--no-enable-prefix-caching \
--gpu-memory-utilization 0.94 \
--chat-template /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
--compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY"}' \
--additional-config '{"enable_cpu_binding": "true", "multistream_overlap_shared_expert": true}' \
--speculative-config '{"num_speculative_tokens": 3, "method": "deepseek_mtp"}'
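
The Node0 and Node1 scripts are identical except for the data-parallel flags: Node1 adds --headless and --data-parallel-start-rank 1. As a minimal sketch, those per-node flags can be derived from the node index, assuming every node hosts the same number of local data-parallel ranks. The helper node_flags and the address 10.0.0.1 are hypothetical placeholders.

```python
def node_flags(node_index, dp_size, dp_size_local, master_ip, rpc_port=13389):
    """Hypothetical helper: data-parallel flags for one node of a multi-node launch."""
    flags = [
        "--data-parallel-size", str(dp_size),
        "--data-parallel-size-local", str(dp_size_local),
        "--data-parallel-address", master_ip,
        "--data-parallel-rpc-port", str(rpc_port),
    ]
    if node_index > 0:
        # Non-master nodes run without an API server and start at a later rank.
        start_rank = node_index * dp_size_local
        flags += ["--headless", "--data-parallel-start-rank", str(start_rank)]
    return flags


# Two nodes, one local DP rank each, as in the scripts above.
for node in range(2):
    print(f"node{node}:", " ".join(node_flags(node, 2, 1, "10.0.0.1")))
```

All other flags, including --data-parallel-address, stay the same on both nodes because each node must point at node0.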

Prefill-Decode Disaggregation#

This section shows how to deploy DeepSeek-V4 in a multi-node Atlas 800 A3 (128G × 8) environment with 2P1D (two prefill instances and one decode instance) for better performance.

Before you start:

  1. Prepare the script launch_online_dp.py on each node.

    import argparse
    import multiprocessing
    import os
    import subprocess
    import sys


    def parse_args():
        parser = argparse.ArgumentParser()
        parser.add_argument(
            "--dp-size",
            type=int,
            required=True,
            help="Data parallel size."
        )
        parser.add_argument(
            "--tp-size",
            type=int,
            default=1,
            help="Tensor parallel size."
        )
        parser.add_argument(
            "--dp-size-local",
            type=int,
            default=-1,
            help="Local data parallel size."
        )
        parser.add_argument(
            "--dp-rank-start",
            type=int,
            default=0,
            help="Starting rank for data parallel."
        )
        parser.add_argument(
            "--dp-address",
            type=str,
            required=True,
            help="IP address for data parallel master node."
        )
        parser.add_argument(
            "--dp-rpc-port",
            type=int,
            default=12345,
            help="Port for data parallel master node."
        )
        parser.add_argument(
            "--vllm-start-port",
            type=int,
            default=9000,
            help="Starting port for the engine."
        )
        return parser.parse_args()


    args = parse_args()
    dp_size = args.dp_size
    tp_size = args.tp_size
    dp_size_local = args.dp_size_local
    # -1 means all data-parallel ranks run on this node.
    if dp_size_local == -1:
        dp_size_local = dp_size
    dp_rank_start = args.dp_rank_start
    dp_address = args.dp_address
    dp_rpc_port = args.dp_rpc_port
    vllm_start_port = args.vllm_start_port


    def run_command(visible_devices, dp_rank, vllm_engine_port):
        # The positional arguments map to $1..$7 in run_dp_template.sh.
        # Every element must be a string, so numeric values are converted.
        command = [
            "bash",
            "./run_dp_template.sh",
            visible_devices,
            str(vllm_engine_port),
            str(dp_size),
            str(dp_rank),
            dp_address,
            str(dp_rpc_port),
            str(tp_size),
        ]
        subprocess.run(command, check=True)


    if __name__ == "__main__":
        template_path = "./run_dp_template.sh"
        if not os.path.exists(template_path):
            print(f"Template file {template_path} does not exist.")
            sys.exit(1)

        processes = []
        num_cards = dp_size_local * tp_size
        # Start one process per local data-parallel rank, each with its own
        # port and its own contiguous slice of tp_size devices.
        for i in range(dp_size_local):
            dp_rank = dp_rank_start + i
            vllm_engine_port = vllm_start_port + i
            visible_devices = ",".join(str(x) for x in range(i * tp_size, (i + 1) * tp_size))
            process = multiprocessing.Process(target=run_command,
                                              args=(visible_devices, dp_rank,
                                                    vllm_engine_port))
            processes.append(process)
            process.start()

        for process in processes:
            process.join()
    
  2. Prepare the script run_dp_template.sh on each node.

    1. Prefill node 1

      nic_name="xxxx"  # change to your own nic name
      local_ip=xx.xx.xx.1  # change to your own ip

      export HCCL_OP_EXPANSION_MODE="AIV"

      export HCCL_IF_IP=$local_ip
      export GLOO_SOCKET_IFNAME=$nic_name
      export TP_SOCKET_IFNAME=$nic_name
      export HCCL_SOCKET_IFNAME=$nic_name

      export VLLM_RPC_TIMEOUT=3600000
      export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
      export HCCL_EXEC_TIMEOUT=204
      export HCCL_CONNECT_TIMEOUT=120

      export OMP_PROC_BIND=false
      export OMP_NUM_THREADS=10
      export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
      export HCCL_BUFFSIZE=2560
      export TASK_QUEUE_ENABLE=1

      export ASCEND_BUFFER_POOL=4:8
      export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
      export USE_MULTI_BLOCK_POOL=1

      export ASCEND_RT_VISIBLE_DEVICES=$1

      vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp \
      --host 0.0.0.0 \
      --port $2 \
      --data-parallel-size $3 \
      --data-parallel-rank $4 \
      --data-parallel-address $5 \
      --data-parallel-rpc-port $6 \
      --tensor-parallel-size $7 \
      --enable-expert-parallel \
      --seed 1024 \
      --served-model-name deepseek_v4 \
      --max-model-len 65536 \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 4 \
      --no-disable-hybrid-kv-cache-manager \
      --no-enable-prefix-caching \
      --trust-remote-code \
      --gpu-memory-utilization 0.85 \
      --quantization ascend \
      --chat-template /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
      --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
      --enforce-eager \
      --additional_config '{"enable_cpu_binding": "true"}' \
      --kv-transfer-config \
      '{"kv_connector": "MooncakeConnectorV1",
        "kv_role": "kv_producer",
        "kv_port": "30000",
        "engine_id": "0",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
        "kv_connector_extra_config": {
          "prefill": {
            "dp_size": 16,
            "tp_size": 1
          },
          "decode": {
            "dp_size": 32,
            "tp_size": 1
          }
        }
      }'
      
    2. Prefill node 2

      nic_name="xxxx"  # change to your own nic name
      local_ip=xx.xx.xx.2  # change to your own ip

      export HCCL_OP_EXPANSION_MODE="AIV"

      export HCCL_IF_IP=$local_ip
      export GLOO_SOCKET_IFNAME=$nic_name
      export TP_SOCKET_IFNAME=$nic_name
      export HCCL_SOCKET_IFNAME=$nic_name

      export VLLM_RPC_TIMEOUT=3600000
      export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
      export HCCL_EXEC_TIMEOUT=204
      export HCCL_CONNECT_TIMEOUT=120

      export OMP_PROC_BIND=false
      export OMP_NUM_THREADS=10
      export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
      export HCCL_BUFFSIZE=2560
      export TASK_QUEUE_ENABLE=1

      export ASCEND_BUFFER_POOL=4:8
      export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
      export USE_MULTI_BLOCK_POOL=1

      export ASCEND_RT_VISIBLE_DEVICES=$1

      vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp \
      --host 0.0.0.0 \
      --port $2 \
      --data-parallel-size $3 \
      --data-parallel-rank $4 \
      --data-parallel-address $5 \
      --data-parallel-rpc-port $6 \
      --tensor-parallel-size $7 \
      --enable-expert-parallel \
      --seed 1024 \
      --served-model-name deepseek_v4 \
      --max-model-len 65536 \
      --max-num-batched-tokens 8192 \
      --max-num-seqs 4 \
      --no-disable-hybrid-kv-cache-manager \
      --no-enable-prefix-caching \
      --trust-remote-code \
      --gpu-memory-utilization 0.85 \
      --quantization ascend \
      --chat-template /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
      --speculative-config '{"num_speculative_tokens": 1, "method":"deepseek_mtp"}' \
      --enforce-eager \
      --additional_config '{"enable_cpu_binding": "true"}' \
      --kv-transfer-config \
      '{"kv_connector": "MooncakeConnectorV1",
        "kv_role": "kv_producer",
        "kv_port": "30100",
        "engine_id": "1",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
        "kv_connector_extra_config": {
          "prefill": {
            "dp_size": 16,
            "tp_size": 1
          },
          "decode": {
            "dp_size": 32,
            "tp_size": 1
          }
        }
      }'
      
    3. Decode node (use the same script on the other decode node)

      nic_name="xxxx"  # change to your own nic name
      local_ip=xx.xx.xx.xx  # change to your own ip

      export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOAD
      export HCCL_OP_EXPANSION_MODE="AIV"
      export TASK_QUEUE_ENABLE=1

      export VLLM_RPC_TIMEOUT=3600000
      export VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS=30000
      export HCCL_EXEC_TIMEOUT=2000
      export HCCL_CONNECT_TIMEOUT=1200

      export HCCL_IF_IP=$local_ip
      export GLOO_SOCKET_IFNAME=$nic_name
      export TP_SOCKET_IFNAME=$nic_name
      export HCCL_SOCKET_IFNAME=$nic_name

      export OMP_PROC_BIND=false
      export OMP_NUM_THREADS=10
      export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
      export HCCL_BUFFSIZE=1024
      export ASCEND_BUFFER_POOL=4:8

      export USE_MULTI_BLOCK_POOL=1
      export VLLM_ASCEND_ENABLE_FUSED_MC2=1
      export ASCEND_RT_VISIBLE_DEVICES=$1

      vllm serve /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp \
      --host 0.0.0.0 \
      --port $2 \
      --data-parallel-size $3 \
      --data-parallel-rank $4 \
      --data-parallel-address $5 \
      --data-parallel-rpc-port $6 \
      --tensor-parallel-size $7 \
      --enable-expert-parallel \
      --seed 1024 \
      --served-model-name deepseek_v4 \
      --max-model-len 65536 \
      --max-num-batched-tokens 144 \
      --max-num-seqs 48 \
      --async-scheduling \
      --no-disable-hybrid-kv-cache-manager \
      --no-enable-prefix-caching \
      --trust-remote-code \
      --gpu-memory-utilization 0.88 \
      --quantization ascend \
      --chat-template /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp/chat_template.jinja \
      --speculative-config '{"num_speculative_tokens": 2, "method":"deepseek_mtp"}' \
      --compilation-config '{"cudagraph_mode": "FULL_DECODE_ONLY","cudagraph_capture_sizes":[144]}' \
      --kv-transfer-config \
      '{"kv_connector": "MooncakeConnectorV1",
        "kv_role": "kv_consumer",
        "kv_port": "30200",
        "engine_id": "2",
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
        "kv_connector_extra_config": {
          "prefill": {
            "dp_size": 16,
            "tp_size": 1
          },
          "decode": {
            "dp_size": 32,
            "tp_size": 1
          }
        }
      }' \
      --additional_config '{"enable_cpu_binding": "true", "multistream_overlap_shared_expert": false, "multistream_dsa_preprocess": false}'
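
The --kv-transfer-config values in the three templates share the same connector and the same prefill/decode sizes; only kv_role, kv_port and engine_id change per node. A minimal sketch of generating the JSON keeps those shared parts in one place. The helper name kv_transfer_config is hypothetical; all field names and values are copied from the templates above.

```python
import json


def kv_transfer_config(kv_role, kv_port, engine_id,
                       prefill_dp=16, decode_dp=32, tp_size=1):
    """Hypothetical helper: build the --kv-transfer-config payload for one node."""
    return {
        "kv_connector": "MooncakeConnectorV1",
        "kv_role": kv_role,
        "kv_port": str(kv_port),
        "engine_id": str(engine_id),
        "kv_connector_module_path": "vllm_ascend.distributed.mooncake_connector",
        "kv_connector_extra_config": {
            "prefill": {"dp_size": prefill_dp, "tp_size": tp_size},
            "decode": {"dp_size": decode_dp, "tp_size": tp_size},
        },
    }


# (role, port, engine id) for prefill node 1, prefill node 2 and the decode node.
NODES = [
    ("kv_producer", 30000, 0),
    ("kv_producer", 30100, 1),
    ("kv_consumer", 30200, 2),
]

configs = [kv_transfer_config(*node) for node in NODES]
for cfg in configs:
    # The printed JSON is what would go inside the single quotes on the command line.
    print(json.dumps(cfg))
```

Generating the payload this way makes it harder for the prefill and decode sizes to drift apart between nodes, which all three templates keep identical.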
      

Once the preparation is done, you can start the server with the following command on each node:

  1. Prefill node 1

# change ip to your own
python launch_online_dp.py --dp-size 16 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address xx.xx.xx.1 --dp-rpc-port 12321 --vllm-start-port 7100

  2. Prefill node 2

# change ip to your own
python launch_online_dp.py --dp-size 16 --tp-size 1 --dp-size-local 16 --dp-rank-start 0 --dp-address xx.xx.xx.2 --dp-rpc-port 12321 --vllm-start-port 7100

  3. Decode node 1

# change ip to your own
python launch_online_dp.py --dp-size 32 --dp-size-local 16 --dp-rank-start 0 --dp-address xx.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100

  4. Decode node 2

# change ip to your own
python launch_online_dp.py --dp-size 32 --dp-size-local 16 --dp-rank-start 16 --dp-address xx.xx.xx.3 --dp-rpc-port 12321 --vllm-start-port 7100
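
To see what the launcher arguments above expand to, the sketch below reproduces the per-process assignment that launch_online_dp.py performs: each local rank gets a global data-parallel rank, a port, and a slice of devices. The function name plan_local_ranks is hypothetical; the arithmetic follows the launcher script, and --tp-size defaults to 1 when omitted, as in the decode commands.

```python
def plan_local_ranks(dp_size_local, tp_size, dp_rank_start, vllm_start_port):
    """Hypothetical helper: list the rank, port and devices of each local process."""
    plan = []
    for i in range(dp_size_local):
        # Each process owns tp_size consecutive device ids.
        devices = ",".join(str(d) for d in range(i * tp_size, (i + 1) * tp_size))
        plan.append({
            "dp_rank": dp_rank_start + i,
            "port": vllm_start_port + i,
            "devices": devices,
        })
    return plan


# The two decode launch commands: 16 local ranks each, starting at rank 0 and 16.
decode_a = plan_local_ranks(16, 1, 0, 7100)
decode_b = plan_local_ranks(16, 1, 16, 7100)

for entry in decode_a[:2] + decode_b[:2]:
    print(entry)
```

Together the two decode nodes cover data-parallel ranks 0 to 31, which is why both pass --dp-size 32 and the second one starts at rank 16. Ports and device ids restart on each node because they are local to it.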

Finally, refer to Prefill-Decode Disaggregation (Deepseek) to deploy the P-D disaggregation proxy.

Functional Verification#

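Once your server is started, you can query the model with input prompts. Set "model" to the name passed to --served-model-name (ds, deepseek_v4, or deepseek-v4-flash in the scripts above):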

curl http://<node0_ip>:<port>/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "deepseek_v4",
  "messages": [
    {
      "role": "user",
      "content": "Who are you?"
    }
  ],
  "max_tokens": 256,
  "temperature": 0
}'
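
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server exposes the OpenAI-compatible /v1/chat/completions endpoint used by the curl command; the helper names build_chat_request and ask are hypothetical, and the base URL must be replaced with your own node address and port.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=256):
    """Hypothetical helper: build the POST request for a single user prompt."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, model, prompt):
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server):
# print(ask("http://127.0.0.1:8005", "deepseek_v4", "Who are you?"))
```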

Accuracy Evaluation#

Here are two accuracy evaluation methods.

Using AISBench#

  1. Refer to Using AISBench for details.

  2. After execution, you can get the result.

Using Language Model Evaluation Harness#

Take the gsm8k dataset as an example and run the accuracy evaluation of DeepSeek-V4 in online mode.

  1. Refer to Using lm_eval for lm_eval installation.

  2. Run lm_eval to execute the accuracy evaluation.

lm_eval \
--model local-completions \
--model_args model=/root/.cache/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp,base_url=http://127.0.0.1:8006/v1/completions,tokenized_requests=False,trust_remote_code=True \
--tasks gsm8k \
--output_path ./

  3. After execution, you can get the result.
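
The --model_args value is a single comma-separated list of key=value pairs, which is easy to mistype when changing the port or model path. A minimal sketch of assembling it from a dictionary follows; the helper name build_model_args is hypothetical, and the keys and values are the ones used in the command above.

```python
def build_model_args(options):
    """Hypothetical helper: join options into the key=value,key=value form."""
    return ",".join(f"{key}={value}" for key, value in options.items())


model_args = build_model_args({
    "model": "/root/.cache/Eco-Tech/DeepSeek-V4-Flash-w8a8-mtp",
    "base_url": "http://127.0.0.1:8006/v1/completions",
    "tokenized_requests": False,
    "trust_remote_code": True,
})

# argv list equivalent to the lm_eval command above.
command = [
    "lm_eval",
    "--model", "local-completions",
    "--model_args", model_args,
    "--tasks", "gsm8k",
    "--output_path", "./",
]
print(" ".join(command))
```

The port in base_url must match the --port of the server being evaluated.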

Performance#

Using AISBench#

Refer to Using AISBench for performance evaluation for details.

Using vLLM Benchmark#

Run performance evaluation of DeepSeek-V4-Flash-w8a8-mtp as an example.

Refer to vllm benchmark for more details.

There are three vllm bench subcommands:

  • latency: Benchmark the latency of a single batch of requests.

  • serve: Benchmark the online serving throughput.

  • throughput: Benchmark offline inference throughput.

Take the serve subcommand as an example and run it as follows.

export VLLM_USE_MODELSCOPE=true
vllm bench serve --model /root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V4-Flash-w8a8-mtp --dataset-name random --random-input 200 --num-prompt 200 --request-rate 1 --save-result --result-dir ./
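
Before starting a run it helps to estimate how long it will take and how much input it sends. The sketch below is a rough estimate only: it assumes requests are issued at the average rate given by --request-rate and that --random-input is the per-prompt input length in tokens. The function names are hypothetical, and the real run also includes the time needed to finish the last requests.

```python
import math


def expected_send_window(num_prompts, request_rate):
    """Average seconds needed to issue all requests at `request_rate` per second."""
    if math.isinf(request_rate):
        # An unlimited rate means all requests are issued at once.
        return 0.0
    return num_prompts / request_rate


def total_input_tokens(num_prompts, input_len):
    """Total prompt tokens sent, assuming a fixed input length per prompt."""
    return num_prompts * input_len


# Values from the command above: 200 prompts, 200 input tokens, 1 request per second.
print(expected_send_window(200, 1))
print(total_input_tokens(200, 200))
```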