URL: https://docs.spheron.ai/quick-guides/llms/frameworks/lmdeploy

LMDeploy | TurboMind Inference Toolkit on GPU | Spheron Docs



Deploy LMDeploy with the TurboMind inference engine on Spheron A100 or H100 instances. LMDeploy supports AWQ quantization for memory-efficient inference and exposes an OpenAI-compatible API.

Recommended hardware

| Model Size | Recommended GPU | Instance Type | Notes |
|---|---|---|---|
| 7B (AWQ) | RTX 4090 (24GB) | Dedicated or Spot | ~8GB VRAM with W4A16 AWQ |
| 7B (FP16) | A100 40GB | Dedicated | Full precision |
| 30B+ | A100 80GB (1–2×) | Dedicated | Use --tp 2 for tensor parallelism |
| 70B+ | H100 80GB (2× or more) | Cluster | TurboMind multi-GPU |
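The sizing in the table can be sanity-checked with a rough weights-only estimate. The sketch below is an approximation, not a measurement: it assumes memory is parameter count times bits per weight, and it ignores the KV cache, activations, runtime buffers, and the scales that quantized formats store alongside the weights. The helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


# 7B parameters at 16 bits (FP16) versus 4 bits (W4A16 AWQ).
print(f"7B FP16 weights:  {weight_memory_gb(7e9, 16):.1f} GB")
print(f"7B 4-bit weights: {weight_memory_gb(7e9, 4):.1f} GB")

# 70B parameters at FP16 exceed a single 80 GB card.
print(f"70B FP16 weights: {weight_memory_gb(70e9, 16):.1f} GB")
```

The KV cache and runtime overhead come on top of these figures, which is why the table's VRAM numbers are higher than the weights alone.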

Manual setup

Use these steps to set up the server manually after SSH-ing into your instance. This works on any provider regardless of cloud-init support.

Step 1: Connect to your instance

ssh <user>@<ipAddress>

Replace <user> with the username shown in the instance details panel in the dashboard (e.g., ubuntu for Spheron AI instances) and <ipAddress> with your instance's public IP.

Step 2: Install LMDeploy

sudo apt-get update -y
sudo apt-get install -y python3-pip
pip install lmdeploy

Step 3: Start the server

Run the server in the foreground to verify it works:

python3 -m lmdeploy serve api_server \
 Qwen/Qwen2.5-7B-Instruct \
 --server-port 23333 \
 --backend turbomind

Press Ctrl+C to stop. Replace Qwen/Qwen2.5-7B-Instruct with your target model.
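Loading a model can take a while, so it may help to poll the server from a second terminal until it answers. The following is a minimal sketch using only the standard library. It assumes the server exposes the OpenAI-style GET /v1/models route; the helper names extract_model_ids and wait_until_ready, and the default timings, are illustrative choices rather than anything LMDeploy prescribes.

```python
import json
import time
import urllib.error
import urllib.request


def extract_model_ids(payload: dict) -> list[str]:
    """Pull the model IDs out of an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url: str, deadline_s: float = 300.0,
                     interval_s: float = 5.0) -> list[str]:
    """Poll {base_url}/v1/models until it responds, then return the model IDs.

    Raises TimeoutError if no successful response arrives within deadline_s.
    """
    stop_at = time.monotonic() + deadline_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5.0) as resp:
                return extract_model_ids(json.load(resp))
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            # Server not listening yet (or still loading); retry until the deadline.
            if time.monotonic() >= stop_at:
                raise TimeoutError(f"{base_url} not ready after {deadline_s} s")
            time.sleep(interval_s)
```

Calling wait_until_ready("http://localhost:23333") on the instance returns the list of served model names once the server is up. Those names are the values to pass as model in client requests.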

Step 4: Run as a background service

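To keep the server running after you close your SSH session, create a systemd service. The unit below runs as root unless you add a User= line to the [Service] section; if you ran pip install as a non-root user in Step 2, set User=<user> so that python3 can find the lmdeploy package.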

sudo tee /etc/systemd/system/lmdeploy.service > /dev/null << 'EOF'
[Unit]
Description=LMDeploy Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m lmdeploy serve api_server \
 Qwen/Qwen2.5-7B-Instruct \
 --server-port 23333 \
 --backend turbomind
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable lmdeploy
sudo systemctl start lmdeploy
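If you deploy several models or ports, the unit file can be generated rather than edited by hand. This is a sketch under stated assumptions: the template mirrors the unit above with ExecStart on a single line, and render_unit is a hypothetical helper, not part of LMDeploy.

```python
UNIT_TEMPLATE = """\
[Unit]
Description=LMDeploy Inference Server
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/python3 -m lmdeploy serve api_server {model} --server-port {port} --backend turbomind
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
"""


def render_unit(model: str, port: int = 23333) -> str:
    """Fill the unit template with a model ID (or local path) and a port."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    if not model or any(ch.isspace() for ch in model):
        raise ValueError("model must be a non-empty string without whitespace")
    return UNIT_TEMPLATE.format(model=model, port=port)


print(render_unit("Qwen/Qwen2.5-7B-Instruct"))
```

Pipe the printed text into sudo tee /etc/systemd/system/lmdeploy.service, then run the same daemon-reload, enable, and start commands as above.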

AWQ quantization

To convert a model to 4-bit AWQ (W4A16) before serving, which cuts VRAM use by roughly 50% compared with FP16:

lmdeploy lite auto_awq \
 Qwen/Qwen2.5-7B-Instruct \
 --calib-dataset ptb \
 --calib-samples 128 \
 --work-dir ./qwen-7b-awq

Then serve the quantized model:

lmdeploy serve api_server ./qwen-7b-awq \
 --server-port 23333 \
 --backend turbomind \
 --model-format awq

Accessing the server

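SSH tunnel

Forward local port 23333 to the server port on the instance, then send requests to http://localhost:23333 from your own machine:

ssh -L 23333:localhost:23333 <user>@<ipAddress>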

Usage example

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:23333/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is AWQ quantization?"}],
)
print(response.choices[0].message.content)
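For environments without the openai package, the same request can be sketched with the standard library. This assumes the server follows the OpenAI chat completions wire format (a POST to /v1/chat/completions carrying model and messages), which is what the client above sends. The helper names build_chat_request and send_chat are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout_s: float = 120.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the tunnel open, send_chat(build_chat_request("http://localhost:23333/v1", "Qwen/Qwen2.5-7B-Instruct", "What is AWQ quantization?")) returns the reply text.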

Performance flags

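| Flag | Description |
|---|---|
| --backend turbomind | Use TurboMind engine (default, fastest) |
| --tp | Tensor parallel degree (number of GPUs the model is sharded across) |
| --cache-max-entry-count | Fraction of GPU memory reserved for the KV cache (a value between 0 and 1) |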
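To get a feel for how --tp and --cache-max-entry-count interact, a back-of-the-envelope estimate helps. The sketch below rests on two assumptions that should be checked against the LMDeploy documentation for your version: that the cache fraction applies to the GPU memory left free after the weights are loaded, and that tensor parallelism splits the weights evenly across GPUs. The helper name kv_cache_budget_gb is hypothetical.

```python
def kv_cache_budget_gb(gpu_total_gb: float, weights_gb: float,
                       cache_fraction: float, tp: int = 1) -> float:
    """Estimate the per-GPU KV cache budget in GB.

    Assumes cache_fraction applies to memory left free after loading the
    weights, and that weights are sharded evenly across tp GPUs.
    """
    if tp < 1:
        raise ValueError("tp must be at least 1")
    if not 0.0 < cache_fraction <= 1.0:
        raise ValueError("cache_fraction must be in (0, 1]")
    free_gb = gpu_total_gb - weights_gb / tp
    if free_gb <= 0:
        raise ValueError("weights do not fit on this GPU configuration")
    return free_gb * cache_fraction


# 7B FP16 (about 14 GB of weights) on one 40 GB GPU with a 0.5 cache fraction.
print(kv_cache_budget_gb(40.0, 14.0, 0.5))  # 13.0
```

Lowering the fraction leaves more headroom for other processes on the GPU at the cost of fewer cached tokens; raising --tp reduces the per-GPU weight share and so frees more memory for the cache.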

Check server logs

journalctl -u lmdeploy -f

Cloud-init startup script (optional)

If your provider supports cloud-init, you can paste this into the Startup Script field when deploying to automate the setup above.

#cloud-config
runcmd:
  - apt-get update -y
  - apt-get install -y python3-pip
  - pip install lmdeploy
  - |
    cat > /etc/systemd/system/lmdeploy.service << 'EOF'
    [Unit]
    Description=LMDeploy Inference Server
    After=network.target

    [Service]
    Type=simple
    ExecStart=/usr/bin/python3 -m lmdeploy serve api_server \
      Qwen/Qwen2.5-7B-Instruct \
      --server-port 23333 \
      --backend turbomind
    Restart=on-failure
    RestartSec=10

    [Install]
    WantedBy=multi-user.target
    EOF
  - systemctl daemon-reload
  - systemctl enable lmdeploy
  - systemctl start lmdeploy

Replace Qwen/Qwen2.5-7B-Instruct with your target model.
