Ring-1T-FP8 Usage Guide
This guide describes how to run Ring-1T-FP8.
Installing vLLM
uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto
Installing vLLM (For AMD ROCm: MI300x/MI325x/MI355x)
uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
Running Ring-1T-FP8 with FP8 KV Cache on 8xH200
This section covers the simplest way to run the model: pure tensor parallelism across 8 GPUs.
# Start server with FP8 model on 8 GPUs
vllm serve inclusionAI/Ring-1T-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.97 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --compilation-config '{"use_inductor": false}' \
  --served-model-name Ring-1T-FP8
- You can set --max-model-len to preserve memory. --max-model-len=65536 is usually good for most scenarios.
- You can set --max-num-batched-tokens to balance throughput and latency: a higher value gives higher throughput but also higher latency. --max-num-batched-tokens=32768 is usually good for prompt-heavy workloads, but you can reduce it to 16384 or 8192 to lower activation memory usage and latency.
- In the example, --gpu-memory-utilization 0.97 allocates 97% of GPU memory to this model. Reduce it to a smaller value if an out-of-memory (OOM) error occurs.
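The tuning flags above can be combined with the 8xH200 launch command. The following is a minimal sketch, not a recommended configuration: `start_ring_server` and the environment variable names `MAX_MODEL_LEN`, `MAX_NUM_BATCHED_TOKENS`, and `GPU_MEMORY_UTILIZATION` are hypothetical names chosen for this example, and the defaults are simply the starting values suggested in the list.

```shell
# Hypothetical helper (not part of vLLM): wraps the launch command so the
# tuning values can be overridden through environment variables.
# Defaults are the starting points suggested in the notes above.
start_ring_server() {
  vllm serve inclusionAI/Ring-1T-FP8 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION:-0.97}" \
    --max-num-seqs 32 \
    --kv-cache-dtype fp8 \
    --max-model-len "${MAX_MODEL_LEN:-65536}" \
    --max-num-batched-tokens "${MAX_NUM_BATCHED_TOKENS:-32768}" \
    --compilation-config '{"use_inductor": false}' \
    --served-model-name Ring-1T-FP8
}
```

For example, `MAX_NUM_BATCHED_TOKENS=8192 GPU_MEMORY_UTILIZATION=0.9 start_ring_server` would trade some throughput for lower activation memory and leave more headroom if the default settings run out of memory.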
Running Ring-1T-FP8 with FP8 KV Cache on 8xMI300x/MI325x/MI355x
# Start server with FP8 model on 8 GPUs
export VLLM_ROCM_USE_AITER=1
vllm serve inclusionAI/Ring-1T-FP8 \
  --trust-remote-code \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 32 \
  --kv-cache-dtype fp8 \
  --served-model-name Ring-1T-FP8
Set VLLM_ROCM_USE_AITER=1 for better performance on AMD GPUs. The default is VLLM_ROCM_USE_AITER=0.
Sending Example Request
You can send a request like the following to quickly verify the deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Ring-1T-FP8",
    "messages": [
      {
        "role": "user",
        "content": "9.11 and 9.8, which is greater?"
      }
    ]
  }'
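Before sending a chat request, it can help to confirm that the server is up and exposes the expected model name. The sketch below assumes the server's OpenAI-compatible `/v1/models` endpoint returns a JSON list whose entries carry the served model name in an `id` field; `model_is_served` is a hypothetical helper name, and the `grep` match is a rough check rather than a real JSON parser.

```shell
# Hypothetical helper: succeeds only if the server at the given base URL
# lists the given model name in its /v1/models response.
# Assumes each model entry carries the served name in an "id" field.
model_is_served() {
  base_url=$1
  model=$2
  # -s: silent, -f: fail on HTTP errors, --max-time: give up after 5 seconds
  curl -sf --max-time 5 "$base_url/v1/models" | grep -q "\"id\": *\"$model\""
}
```

For example, `model_is_served http://localhost:8000 Ring-1T-FP8 && echo "server ready"` prints the message only once the model is listed. The name to check is the value passed to --served-model-name.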
