
URL: https://docs.vllm.ai/projects/recipes/en/latest/inclusionAI/Ring-1T-FP8.html

Ring-1T-FP8 Usage Guide

This guide describes how to run Ring-1T-FP8.

Installing vLLM

uv venv
source .venv/bin/activate
uv pip install -U vllm --torch-backend auto

Installing vLLM (For AMD ROCm: MI300x/MI325x/MI355x)

uv pip install vllm --extra-index-url https://wheels.vllm.ai/rocm/0.14.1/rocm700
⚠️ The vLLM wheel for ROCm is compatible with Python 3.12, ROCm 7.0, and glibc >= 2.35. If your environment is incompatible, please use the Docker flow in vLLM.
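
The Python and glibc requirements above can be checked before installing the wheel. The following is a minimal sketch, not part of vLLM: the helper names `version_ge` and `check_rocm_wheel_env` are hypothetical, the version comparison assumes GNU `sort -V` is available, and the ROCm version itself is not checked.

```shell
# Hypothetical helpers (not part of vLLM) for a quick pre-install check.

# Succeeds when dotted version $1 is greater than or equal to $2.
# Assumes GNU coreutils `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Reports whether the current Python and glibc match the wheel's stated
# requirements (Python 3.12, glibc >= 2.35). Returns non-zero on mismatch.
check_rocm_wheel_env() {
  py_ver="$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
  # `getconf GNU_LIBC_VERSION` prints e.g. "glibc 2.35" on glibc systems.
  libc_ver="$(getconf GNU_LIBC_VERSION | awk '{print $2}')"
  status=0
  if [ "$py_ver" != "3.12" ]; then
    echo "Python $py_ver found; the ROCm wheel expects Python 3.12"
    status=1
  fi
  if ! version_ge "$libc_ver" 2.35; then
    echo "glibc $libc_ver found; the ROCm wheel expects glibc >= 2.35"
    status=1
  fi
  return "$status"
}
```

Running `check_rocm_wheel_env` inside the activated virtual environment prints one line per mismatch and returns a non-zero status if either requirement is not met.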

Running Ring-1T-FP8 with FP8 KV Cache on 8xH200

This section covers the simplest way to run the model: pure tensor parallelism across 8 GPUs.

# Start server with FP8 model on 8 GPUs
vllm serve inclusionAI/Ring-1T-FP8 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.97 \
    --max-num-seqs 32 \
    --kv-cache-dtype fp8 \
    --compilation-config '{"use_inductor": false}' \
    --served-model-name Ring-1T-FP8
  • Set --max-model-len to conserve memory. --max-model-len=65536 is usually sufficient for most scenarios.
  • Set --max-num-batched-tokens to balance throughput and latency: higher values give higher throughput at the cost of higher latency. --max-num-batched-tokens=32768 usually suits prompt-heavy workloads; reduce it to 16384 or 8192 to lower activation memory usage and latency.
  • The example uses 97% of total GPU memory for this model (--gpu-memory-utilization 0.97). Lower this value if an out-of-memory (OOM) error occurs.
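
The tuning flags above can be combined with the base command. The sketch below wraps the launch in a shell function so the three values are easy to vary; the function name `serve_ring_1t` and its positional parameters are hypothetical conveniences, while the flags themselves are the ones named in this guide.

```shell
# Hypothetical wrapper (not part of vLLM) around the 8xH200 launch command.
# Arguments (all optional):
#   $1  value for --max-model-len            (default 65536)
#   $2  value for --max-num-batched-tokens   (default 32768)
#   $3  value for --gpu-memory-utilization   (default 0.97)
serve_ring_1t() {
  max_model_len="${1:-65536}"
  max_num_batched_tokens="${2:-32768}"
  gpu_memory_utilization="${3:-0.97}"

  vllm serve inclusionAI/Ring-1T-FP8 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization "$gpu_memory_utilization" \
    --max-num-seqs 32 \
    --kv-cache-dtype fp8 \
    --compilation-config '{"use_inductor": false}' \
    --max-model-len "$max_model_len" \
    --max-num-batched-tokens "$max_num_batched_tokens" \
    --served-model-name Ring-1T-FP8
}
```

For example, `serve_ring_1t 65536 16384 0.95` would start the server with a smaller token batch and a lower memory fraction, which is one way to respond to an OOM error on startup.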

Running Ring-1T-FP8 with FP8 KV Cache on 8xMI300x/MI325x/MI355x

# Start server with FP8 model on 8 GPUs
export VLLM_ROCM_USE_AITER=1
vllm serve inclusionAI/Ring-1T-FP8 \
    --trust-remote-code \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.9 \
    --max-num-seqs 32 \
    --kv-cache-dtype fp8 \
    --served-model-name Ring-1T-FP8
  • Set VLLM_ROCM_USE_AITER=1 for better performance on AMD GPUs. The default is VLLM_ROCM_USE_AITER=0.

Sending Example Request

You can send a request like the following to quickly verify the deployment.

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Ring-1T-FP8",
        "messages": [
            {
                "role": "user",
                "content": "9.11 and 9.8, which is greater?"
            }
        ]
    }'
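
The request above only succeeds once the server has finished loading the model. A minimal sketch for waiting until then is shown below. It polls the `/health` endpoint of vLLM's OpenAI-compatible server; the function name `wait_for_vllm`, the retry count, and the delay are hypothetical choices, and the base URL assumes the default port used in this guide.

```shell
# Hypothetical helper (not part of vLLM): poll the server until it responds.
# Arguments (all optional):
#   $1  base URL of the server     (default http://localhost:8000)
#   $2  maximum number of attempts (default 60)
#   $3  seconds between attempts   (default 10)
wait_for_vllm() {
  base_url="${1:-http://localhost:8000}"
  retries="${2:-60}"
  delay="${3:-10}"
  attempt=0
  while [ "$attempt" -lt "$retries" ]; do
    # -s silences progress output; -f makes curl fail on HTTP errors.
    if curl -sf "$base_url/health" >/dev/null; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 1
}
```

Calling `wait_for_vllm && echo "server is ready"` before sending the example request avoids connection errors while the server is still starting; a non-zero return means the server did not respond within the configured number of attempts.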