Quickstart#
Introduction#
This section guides you through container-based environment setup and large model inference, using the Qwen3-0.6B offline single-GPU inference script as an example.
For details on using different models, see the corresponding model tutorial in the “Model Tutorials” directory, for example, Qwen3-30B-A3B.
For details on using different functions, see the corresponding function tutorial in the “Function Tutorials” directory, for example, Prefill-Decode Disaggregation (Deepseek).
Prerequisites#
Supported Devices#
Atlas A2 training series (Atlas 800T A2, Atlas 900 A2 PoD, Atlas 200T A2 Box16, Atlas 300T A2)
Atlas 800I A2 inference series (Atlas 800I A2)
Atlas A3 training series (Atlas 800T A3, Atlas 900 A3 SuperPoD, Atlas 9000 A3 SuperPoD)
Atlas 800I A3 inference series (Atlas 800I A3)
[Experimental] Atlas 300I inference series (Atlas 300I Duo)
Requirements#
OS: Linux
Python: >= 3.10, < 3.13
Hardware: Ascend NPUs, typically the Atlas 800 A2 series.
Software:
| Software | Supported version | Note |
| --- | --- | --- |
| Ascend HDK | Refer to the CANN 9.0.0 documentation | Required for CANN |
| CANN | == 9.0.0 | Required for vllm-ascend and torch-npu |
| torch-npu | == 2.10.0 | Required for vllm-ascend. No manual installation needed; it is installed automatically in the steps below |
| torch | == 2.10.0 | Required for torch-npu and vllm. No manual installation needed; it is installed automatically in the steps below |
| NNAL | == 9.0.0 | Required for libatb.so, which enables advanced tensor operations |
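As a quick sanity check before installing anything, the Python constraint listed above (>= 3.10, < 3.13) can be verified from the interpreter itself. The sketch below is only an illustration; the helper name `python_version_supported` is hypothetical and not part of vllm-ascend.

```python
import sys

# Bounds copied from the Requirements list: Python >= 3.10, < 3.13.
MIN_PYTHON = (3, 10)
MAX_PYTHON_EXCLUSIVE = (3, 13)


def python_version_supported(version=None):
    """Return True if (major, minor) lies inside the supported range.

    `version` is any sequence starting with (major, minor); it defaults
    to the running interpreter's sys.version_info.
    """
    major, minor = (version or sys.version_info)[:2]
    return MIN_PYTHON <= (major, minor) < MAX_PYTHON_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "supported" if python_version_supported() else "NOT supported"
    print(f"Python {current} is {status}")
```

Comparing `(major, minor)` tuples avoids the string-comparison pitfall where "3.9" sorts after "3.10".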
Setup environment using container#
Before using containers, make sure Docker is installed on your system. If Docker is not installed, please refer to the Docker installation guide for installation instructions.
Ubuntu-based image:

    # Update DEVICE according to your device (/dev/davinci[0-7])
    export DEVICE=/dev/davinci0
    # Update the vllm-ascend image
    # Atlas A2:
    # export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1
    # Atlas A3:
    # export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-a3
    export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1
    docker run --rm \
    --name vllm-ascend \
    --shm-size=1g \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash
    # Install curl
    apt-get update -y && apt-get install -y curl
openEuler-based image:

    # Update DEVICE according to your device (/dev/davinci[0-7])
    export DEVICE=/dev/davinci0
    # Update the vllm-ascend image
    # Atlas A2:
    # export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-openeuler
    # Atlas A3:
    # export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-a3-openeuler
    export IMAGE=quay.io/ascend/vllm-ascend:v0.21.0rc1-openeuler
    docker run --rm \
    --name vllm-ascend \
    --shm-size=1g \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -p 8000:8000 \
    -it $IMAGE bash
    # Install curl
    yum update -y && yum install -y curl
The default working directory is /workspace. The vLLM and vLLM Ascend source code is placed in /vllm-workspace and installed in development mode (pip install -e), so code changes take effect immediately without reinstalling.
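The container command varies only in the NPU device index and the image tag (A2 or A3, Ubuntu or openEuler). As a rough sketch of how those pieces combine, the hypothetical helper below assembles the same `docker run` argument list in Python. The function name, its parameters, and the `IMAGES` table are illustrative assumptions; the device paths, mounts and image tags are copied from the shell snippets above.

```python
import shlex

# Image tags as listed in the shell snippets (assumed current for this release).
IMAGES = {
    "a2": "quay.io/ascend/vllm-ascend:v0.21.0rc1",
    "a3": "quay.io/ascend/vllm-ascend:v0.21.0rc1-a3",
}

# Fixed device nodes and host paths from the documented command.
EXTRA_DEVICES = ["/dev/davinci_manager", "/dev/devmm_svm", "/dev/hisi_hdc"]
MOUNTS = [
    "/usr/local/dcmi",
    "/usr/local/bin/npu-smi",
    "/usr/local/Ascend/driver/lib64/",
    "/usr/local/Ascend/driver/version.info",
    "/etc/ascend_install.info",
    "/root/.cache",
]


def build_docker_command(device_index=0, series="a2", openeuler=False):
    """Return the docker run argument list for one NPU (hypothetical helper)."""
    if not 0 <= device_index <= 7:
        raise ValueError("device_index must be in 0..7 (/dev/davinci[0-7])")
    image = IMAGES[series] + ("-openeuler" if openeuler else "")
    cmd = ["docker", "run", "--rm", "--name", "vllm-ascend", "--shm-size=1g",
           "--device", f"/dev/davinci{device_index}"]
    for device in EXTRA_DEVICES:
        cmd += ["--device", device]
    for path in MOUNTS:
        cmd += ["-v", f"{path}:{path}"]  # same path on host and in container
    cmd += ["-p", "8000:8000", "-it", image, "bash"]
    return cmd


if __name__ == "__main__":
    # Print a copy-pasteable command instead of executing it.
    print(shlex.join(build_docker_command(device_index=0, series="a2")))
```

The sketch prints the command rather than running it, because `-it` needs an interactive terminal.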
Usage#
You can use the ModelScope mirror to speed up model downloads:

    export VLLM_USE_MODELSCOPE=True
There are two ways to run vLLM on Ascend NPUs: offline batch inference and an OpenAI-compatible API server.
With vLLM installed, you can generate text for a list of input prompts (that is, offline batch inference).
Create and run a simple inference test. For example, example.py can look like this:
    from vllm import LLM, SamplingParams

    prompts = [
        "Hello, my name is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    # The first run will take about 3-5 mins (10 MB/s) to download models
    llm = LLM(model="Qwen/Qwen3-0.6B")

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Then run:
    python example.py
If you encounter a connection error with Hugging Face (e.g., We couldn't connect to 'https://huggingface.co' to load the files, and couldn't find them in the cached files.), run the following commands to use ModelScope as an alternative:
    export VLLM_USE_MODELSCOPE=True
    pip install modelscope
    python example.py
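If you prefer to keep the switch inside the script rather than in the shell, the same environment variable can be set from Python. This is a minimal sketch: the helper name `use_modelscope` is hypothetical, and setting the variable before importing vllm is a conservative assumption rather than a documented requirement. The modelscope package still has to be installed as shown above.

```python
import os


def use_modelscope(env=None):
    """Set VLLM_USE_MODELSCOPE=True in `env` (defaults to os.environ).

    Returns the mapping that was modified so callers can inspect it.
    """
    target = os.environ if env is None else env
    target["VLLM_USE_MODELSCOPE"] = "True"
    return target


# Typical use at the very top of example.py (assumption: set it before
# vllm is imported so the setting is visible when the model is resolved):
#
#     use_modelscope()
#     from vllm import LLM, SamplingParams
```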
The following log lines show that the Ascend platform was successfully detected by vLLM:

    INFO 05-27 11:40:38 [__init__.py:44] Available plugins for group vllm.platform_plugins:
    INFO 05-27 11:40:38 [__init__.py:46] - ascend -> vllm_ascend:register
    INFO 05-27 11:40:38 [__init__.py:49] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
    INFO 05-27 11:40:38 [__init__.py:238] Platform plugin ascend is activated
The final output looks like the following. This sample was captured with a longer prompt list than the two prompts in example.py, and the generated text varies between runs because sampling uses temperature=0.8:

    Prompt: 'Hello, my name is', Generated text: ' Lucy and I am an 8 year old who loves to draw and write stories'
    Prompt: 'The president of the United States is', Generated text: " a key leader in the federal government, and the president's role in the executive"
    Prompt: 'The capital of France is', Generated text: ' a city. What is the capital of France? The capital of France is Paris'
    Prompt: 'The future of AI is', Generated text: ' a topic that is being discussed in various contexts. In the business world, AI'
The following log lines appear when the process exits after offline inference. They do not affect the inference results:

    (EngineCore pid=970) INFO 05-12 11:36:00 [core.py:1201] Shutdown initiated (timeout=0)
    (EngineCore pid=970) INFO 05-12 11:36:00 [core.py:1224] Shutdown complete
    ERROR 05-12 11:36:01 [core_client.py:704] Engine core proc EngineCore died unexpectedly, shutting down client.
    sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
vLLM can also be deployed as a server that implements the OpenAI API protocol. Run the following command to start the vLLM server with the Qwen/Qwen3-0.6B model:
    # Deploy vLLM server (The first run will take about 3-5 mins (10 MB/s) to download models)
    vllm serve Qwen/Qwen3-0.6B &
If you see log output like the following:

    INFO:     Started server process [3594]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Congratulations, you have successfully started the vLLM server!
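Because the server is started in the background and the first run may spend several minutes downloading the model, scripts often need to wait until it is ready instead of watching the log. The sketch below polls the `/v1/models` endpoint using only the standard library. The function names, the timeout and the polling interval are illustrative assumptions, not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def models_endpoint_ready(base_url="http://localhost:8000"):
    """Return True if GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not up yet.
        return False


def wait_until_ready(probe=models_endpoint_ready, timeout_s=600.0,
                     interval_s=5.0, sleep=time.sleep, clock=time.monotonic):
    """Call `probe` until it returns True or `timeout_s` seconds elapse.

    `sleep` and `clock` are injectable so the loop can be exercised
    without real waiting. Returns True on success, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Calling `wait_until_ready()` right after launching the server blocks until the endpoint responds or roughly ten minutes have passed, whichever comes first.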
You can query the list of models:
    curl http://localhost:8000/v1/models | python3 -m json.tool
You can also query the model with input prompts:
    curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
    "model": "Qwen/Qwen3-0.6B",
    "prompt": "Beijing is a",
    "max_completion_tokens": 5,
    "temperature": 0
    }' | python3 -m json.tool
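The same completion request can be sent from Python without extra dependencies. The following is a minimal sketch that mirrors the curl payload above. The helper names are hypothetical, and reading the result from `choices[0].text` assumes the server returns the usual OpenAI-style completions response.

```python
import json
import urllib.request


def build_completion_request(prompt, model="Qwen/Qwen3-0.6B",
                             base_url="http://localhost:8000"):
    """Build a POST request carrying the same JSON body as the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_completion_tokens": 5,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body):
    """Extract choices[0].text from an OpenAI-style completions response."""
    return json.loads(response_body)["choices"][0]["text"]


def complete(prompt, **kwargs):
    """Send the request to a running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return first_completion_text(resp.read())
```

With the server running, `complete("Beijing is a")` returns the generated continuation as a string.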
vLLM is now running as a background process. Use kill -2 $VLLM_PID to stop it gracefully; this sends the same signal (SIGINT) as pressing Ctrl-C on a foreground vLLM process:
    VLLM_PID=$(pgrep -f "vllm serve")
    kill -2 "$VLLM_PID"
The output is as below:
    INFO:     Shutting down FastAPI HTTP server.
    INFO:     Shutting down
    INFO:     Waiting for application shutdown.
    INFO:     Application shutdown complete.
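For automation, the graceful stop can also be issued from Python. Signal 2 is SIGINT, so `os.kill(pid, signal.SIGINT)` is the counterpart of `kill -2`. The sketch below is an illustration for Linux with hypothetical helper names. It relies on the `pgrep` utility being available and, unlike the shell snippet, handles `pgrep` returning more than one PID.

```python
import os
import signal
import subprocess


def find_pids(pattern="vllm serve"):
    """Return PIDs whose full command line matches `pattern` (via pgrep -f)."""
    result = subprocess.run(
        ["pgrep", "-f", pattern],
        capture_output=True, text=True, check=False,
    )
    # pgrep prints one PID per line and prints nothing when no process matches.
    return [int(token) for token in result.stdout.split()]


def stop_gracefully(pid):
    """Send SIGINT (signal 2), the signal that kill -2 and Ctrl-C deliver."""
    os.kill(pid, signal.SIGINT)


def stop_all(pattern="vllm serve"):
    """Send SIGINT to every matching process and return the PIDs signalled."""
    pids = find_pids(pattern)
    for pid in pids:
        stop_gracefully(pid)
    return pids
```

`stop_all()` returns an empty list when no server is running, so it is safe to call in cleanup code.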
Finally, you can exit the container by pressing Ctrl-D.
