URL: https://docs.vllm.ai/projects/ascend/en/latest/quick_start.html

⇱ Quickstart — vllm-ascend


Quickstart#

Introduction#

This section guides you through container-based environment setup and large model inference, using the Qwen3-0.6B offline single-NPU inference script as an example.

  • For details on using different models, see the corresponding model tutorial in the “Model Tutorials” directory, for example, Qwen3-30B-A3B.

  • For details on using different functions, see the corresponding function tutorial in the “Function Tutorials” directory, for example, Prefill-Decode Disaggregation (Deepseek).

Prerequisites#

Supported Devices#

  • Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)

  • Atlas 800I A2 inference series (Atlas 800I A2)

  • Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)

  • Atlas 800I A3 inference series (Atlas 800I A3)

  • [Experimental] Atlas 300I inference series (Atlas 300I Duo)

Requirements#

  • OS: Linux

  • Python: >= 3.10, < 3.13

  • Hardware with Ascend NPUs. It’s usually the Atlas 800 A2 series.

  • Software:

    | Software   | Supported version                      | Note |
    |------------|----------------------------------------|------|
    | Ascend HDK | Refer to the CANN 9.0.0 documentation  | Required for CANN |
    | CANN       | == 9.0.0                               | Required for vllm-ascend and torch-npu |
    | torch-npu  | == 2.10.0                              | Required for vllm-ascend. No manual installation needed; it is installed automatically in the steps below |
    | torch      | == 2.10.0                              | Required for torch-npu and vllm. No manual installation needed; it is installed automatically in the steps below |
    | NNAL       | == 9.0.0                               | Required for libatb.so, which enables advanced tensor operations |
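
The Python constraint listed above can be checked before installing anything. The following is a minimal sketch; the helper name `python_version_supported` is illustrative and is not part of vllm-ascend.

```python
import sys

# Bounds taken from the Requirements list: Python >= 3.10, < 3.13.
MIN_VERSION = (3, 10)
MAX_VERSION_EXCLUSIVE = (3, 13)


def python_version_supported(version=None):
    """Return True if the (major, minor) pair falls in the supported range."""
    major_minor = tuple(version or sys.version_info)[:2]
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    status = "supported" if python_version_supported() else "not supported"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} is {status}")
```

Inside the provided container image this check is mostly redundant, since the image ships its own interpreter; it is more useful when preparing a bare-metal environment.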

Setup environment using container#

Before using containers, make sure Docker is installed on your system. If Docker is not installed, please refer to the Docker installation guide for installation instructions.

Ubuntu image:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
# Atlas A2:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1
# Atlas A3:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-a3
export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1
docker run --rm \
    --name vllm-ascend \
    --shm-size=1g \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash
# Install curl (run inside the container)
apt-get update -y && apt-get install -y curl

openEuler image:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
# Update the vllm-ascend image
# Atlas A2:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-openeuler
# Atlas A3:
# export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-a3-openeuler
export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-openeuler
docker run --rm \
    --name vllm-ascend \
    --shm-size=1g \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash
# Install curl (run inside the container)
yum update -y && yum install -y curl

The default workdir is /workspace. The vLLM and vLLM Ascend source code is placed in /vllm-workspace and installed in development mode (pip install -e), so that code changes take effect immediately without reinstalling.
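
The docker run command has many flags, and the DEVICE comment points at eight possible NPU indices. The sketch below assembles the same argument list in Python so that the image and the NPU indices can be varied programmatically. The function name `build_docker_command` is hypothetical, and passing several `--device /dev/davinciN` flags to expose more than one NPU is an assumption based on how Docker handles repeated `--device` options, not something this guide states.

```python
import shlex

# Host paths mounted into the container, copied from the docker run command.
MOUNTS = [
    "/usr/local/dcmi",
    "/usr/local/bin/npu-smi",
    "/usr/local/Ascend/driver/lib64/",
    "/usr/local/Ascend/driver/version.info",
    "/etc/ascend_install.info",
    "/root/.cache",
]
# Management devices that the docker run command always passes through.
COMMON_DEVICES = ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]


def build_docker_command(image, npu_ids=(0,), port=8000):
    """Build the docker run argument list for the given NPU indices."""
    cmd = ["docker", "run", "--rm", "--name", "vllm-ascend", "--shm-size=1g"]
    for npu_id in npu_ids:
        # Assumption: one --device flag per NPU exposes several NPUs.
        cmd += ["--device", f"/dev/davinci{npu_id}"]
    for device in COMMON_DEVICES:
        cmd += ["--device", device]
    for path in MOUNTS:
        cmd += ["-v", f"{path}:{path}"]
    cmd += ["-p", f"{port}:{port}", "-it", image, "bash"]
    return cmd


if __name__ == "__main__":
    command = build_docker_command(
        "quay.io/ascend/vllm-ascend:v0.21.0rc1", npu_ids=(0, 1)
    )
    # Print a shell-quoted command line for review before running it.
    print(shlex.join(command))
```

Printing the command instead of executing it keeps the sketch safe to run on a machine without Docker or Ascend drivers.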

Usage#

You can use the ModelScope mirror to speed up model downloads:

export VLLM_USE_MODELSCOPE=True

There are two ways to run vLLM on Ascend NPUs: offline batched inference and an OpenAI-compatible server.

With vLLM installed, you can generate text for a list of input prompts (i.e., offline batched inference).

Create and run a simple inference test. For example, example.py can look like this:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# The first run will take about 3-5 mins (10 MB/s) to download models
llm = LLM(model="Qwen/Qwen3-0.6B")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Then run:

python example.py

If you encounter a connection error with Hugging Face (e.g., We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.), run the following commands to use ModelScope as an alternative:

export VLLM_USE_MODELSCOPE=True
pip install modelscope
python example.py

The following log lines show that vLLM has detected the Ascend platform plugin:

INFO 05-27 11:40:38 [__init__.py:44] Available plugins for group vllm.platform_plugins:
INFO 05-27 11:40:38 [__init__.py:46] - ascend -> vllm_ascend:register
INFO 05-27 11:40:38 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 05-27 11:40:38 [__init__.py:238] Platform plugin ascend is activated
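
When the startup log is long, the activation message can be located automatically. This is a minimal sketch that searches captured log text for the `Platform plugin ... is activated` line; the helper name `activated_platform` is illustrative, and the pattern is based only on the message wording in the log excerpt, which may differ between vLLM versions.

```python
import re

# Pattern based on the "Platform plugin ascend is activated" log message.
ACTIVATED = re.compile(r"Platform plugin (\S+) is activated")


def activated_platform(log_text):
    """Return the activated platform plugin name, or None if no such line exists."""
    match = ACTIVATED.search(log_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "INFO 05-27 11:40:38 [__init__.py:46] - ascend -> vllm_ascend:register\n"
        "INFO 05-27 11:40:38 [__init__.py:238] Platform plugin ascend is activated\n"
    )
    print(activated_platform(sample))
```

A return value of None suggests that the plugin was not loaded, in which case the VLLM_PLUGINS setting mentioned in the log is a reasonable first thing to inspect.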

The final output looks like the following. Generated text varies between runs because sampling is enabled (temperature=0.8), and this sample comes from a run with four prompts, so it contains more lines than the two-prompt example.py above:

Prompt: 'Hello, my name is', Generated text: ' Lucy and I am an 8 year old who loves to draw and write stories'
Prompt: 'The president of the United States is', Generated text: " a key leader in the federal government, and the president's role in the executive"
Prompt: 'The capital of France is', Generated text: ' a city. What is the capital of France? The capital of France is Paris'
Prompt: 'The future of AI is', Generated text: ' a topic that is being discussed in various contexts. In the business world, AI'

The following log shows the process exiting after offline inference. The ERROR line is emitted during shutdown and does not affect the inference results:

(EngineCore pid=970) INFO 05-12 11:36:00 [core.py:1201] Shutdown initiated (timeout=0)
(EngineCore pid=970) INFO 05-12 11:36:00 [core.py:1224] Shutdown complete
ERROR 05-12 11:36:01 [core_client.py:704] Engine core proc EngineCore died unexpectedly, shutting down client.
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

vLLM can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen3-0.6B model:

# Deploy vLLM server (The first run will take about 3-5 mins (10 MB/s) to download models)
vllm serve Qwen/Qwen3-0.6B &

If you see log output like the following:

INFO:     Started server process [3594]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

Congratulations, you have successfully started the vLLM server!

You can query the list of models:

curl http://localhost:8000/v1/models | python3 -m json.tool

You can also query the model with input prompts:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-0.6B",
        "prompt": "Beijing is a",
        "max_completion_tokens": 5,
        "temperature": 0
    }' | python3 -m json.tool
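
The same completion request can be sent from Python using only the standard library. This is a sketch rather than an official client: the function names `build_completion_request` and `fetch_completion` are illustrative, and the payload fields are copied from the curl command without further validation against the server's schema.

```python
import json
import urllib.request


def build_completion_request(prompt, model="Qwen/Qwen3-0.6B",
                             base_url="http://localhost:8000"):
    """Build a POST request that mirrors the curl completion command."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_completion_tokens": 5,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_completion(prompt, **kwargs):
    """Send the request and return the decoded JSON response.

    Requires a vLLM server listening at the configured base URL.
    """
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the server running, `fetch_completion("Beijing is a")` returns the parsed JSON response as a dictionary, which plays the role of the `python3 -m json.tool` step in the curl pipeline.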

vLLM is now running as a background process. To stop it gracefully, send it SIGINT with kill -2 $VLLM_PID, which is equivalent to pressing Ctrl-C on a foreground vLLM process:

VLLM_PID=$(pgrep -f "vllm serve")
kill -2 "$VLLM_PID"

The output is as follows:

INFO:     Shutting down FastAPI HTTP server.
INFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
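
If the server is launched from a Python script instead of a shell, the same graceful stop can be sketched with the standard library. The helper name `stop_gracefully` and the 30-second default timeout are assumptions made for illustration; the only behavior taken from this guide is that signal 2 (SIGINT) triggers a clean shutdown.

```python
import signal
import subprocess


def stop_gracefully(process, timeout=30.0):
    """Send SIGINT (the signal behind `kill -2` and Ctrl-C) and wait for exit."""
    process.send_signal(signal.SIGINT)
    try:
        return process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Fall back to a forced stop if the process does not exit in time.
        process.kill()
        return process.wait()
```

A typical use would be to keep the object returned by `subprocess.Popen(["vllm", "serve", "Qwen/Qwen3-0.6B"])` and pass it to `stop_gracefully` once the queries are done.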

Finally, you can exit the container by pressing Ctrl-D.
