URL: https://docs.vllm.ai/projects/vllm-omni/en/latest/getting_started/quickstart/

Quickstart

This guide will help you quickly get started with vLLM-Omni to perform:

  • Offline batched inference
  • Online serving using an OpenAI-compatible server

Prerequisites

  • OS: Linux
  • Python: 3.12

Installation

For installation on GPU from source:

uv venv --python 3.12 --seed
source .venv/bin/activate
# On CUDA
uv pip install vllm==0.23.0 --torch-backend=auto
# On ROCm
uv pip install vllm==0.23.0+rocm722 --extra-index-url https://wheels.vllm.ai/rocm/0.23.0/rocm722
git clone https://github.com/vllm-project/vllm-omni.git
cd vllm-omni
uv pip install -e .

For additional installation methods, please see the installation guide.

Note

It is important to install matching major and minor versions of vLLM and vLLM-Omni; otherwise things may not work as expected. If the versions are misaligned, you will see a warning when you import vLLM-Omni.

If the vllm command does not handle the --omni flag correctly, you most likely have a version mismatch (vLLM older than 0.23.0 together with vLLM-Omni 0.23.0), because vLLM-Omni no longer hijacks the vLLM entrypoint. Updating vLLM should resolve this issue.

Offline Inference

Text-to-image generation quickstart with vLLM-Omni:

from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(model="Tongyi-MAI/Z-Image-Turbo")
    prompt = "a cup of coffee on the table"
    outputs = omni.generate(prompt)
    images = outputs[0].request_output.images
    images[0].save("coffee.png")

You can also pass a list of prompts and wait for all of them to finish processing, as shown below.

Info

Passing a list of prompts is not currently recommended: not all models support batch inference, and batching requests usually does not provide a significant performance improvement, despite the impression that it does. This feature exists primarily for interface compatibility with vLLM and to allow for future improvements.

from vllm_omni.entrypoints.omni import Omni

if __name__ == "__main__":
    omni = Omni(
        model="Tongyi-MAI/Z-Image-Turbo",
        # stage_configs_path="./stage-config.yaml",  # See below
    )
    prompts = [
        "a cup of coffee on a table",
        "a toy dinosaur on a sandy beach",
        "a fox waking up in bed and yawning",
    ]
    omni_outputs = omni.generate(prompts)
    for i_prompt, prompt_output in enumerate(omni_outputs):
        this_request_output = prompt_output.request_output
        this_images = this_request_output.images
        for i_image, image in enumerate(this_images):
            image.save(f"p{i_prompt}-img{i_image}.jpg")
            print("saved to", f"p{i_prompt}-img{i_image}.jpg")
    # saved to p0-img0.jpg
    # saved to p1-img0.jpg
    # saved to p2-img0.jpg

Info

For diffusion pipelines, the stage config field stage_args.[].engine_args.max_num_seqs is 1 by default, and the input list is sliced into single-item requests before feeding into the diffusion pipeline. For models that do internally support batched inputs, you can modify this configuration to let the model accept a longer batch of prompts.
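
To make the slicing behavior concrete, here is a plain-Python illustration of how a prompt list is split when the batch limit is 1 versus a larger value. It is a sketch of the idea only, not vLLM-Omni's internal code: `slice_prompts` is a hypothetical helper, and its `max_num_seqs` argument merely mirrors the config field named above.

```python
def slice_prompts(prompts: list[str], max_num_seqs: int = 1) -> list[list[str]]:
    """Split prompts into consecutive groups of at most max_num_seqs items."""
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be at least 1")
    return [
        prompts[start:start + max_num_seqs]
        for start in range(0, len(prompts), max_num_seqs)
    ]


prompts = [
    "a cup of coffee on a table",
    "a toy dinosaur on a sandy beach",
    "a fox waking up in bed and yawning",
]

# With the default limit of 1, every prompt becomes its own request.
single = slice_prompts(prompts)
print(len(single))  # 3

# With a limit of 2, the first two prompts share a request.
pairs = slice_prompts(prompts, max_num_seqs=2)
print([len(group) for group in pairs])  # [2, 1]
```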

For more usage examples, please refer to the offline inference documentation.

Online Serving with OpenAI-Completions API

Text-to-image generation quickstart with vLLM-Omni:

# Start the server
vllm serve Tongyi-MAI/Z-Image-Turbo --omni --port 8091

# In a separate terminal, send a request and decode the returned image
curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "a cup of coffee on the table"}
    ],
    "extra_body": {
      "height": 1024,
      "width": 1024,
      "num_inference_steps": 50,
      "guidance_scale": 4.0,
      "seed": 42
    }
  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2 | base64 -d > coffee.png
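
The same request can be sent from Python using only the standard library. This is a minimal sketch rather than an official client: `build_payload`, `extract_image_bytes`, and `generate_image` are hypothetical helper names, and the response shape is assumed to be exactly what the jq filter in the curl example reads (a base64 data URL at `.choices[0].message.content[0].image_url.url`).

```python
import base64
import json
import urllib.request


def build_payload(prompt: str, **extra_body) -> dict:
    """Build the same JSON body as the curl example."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": extra_body,
    }


def extract_image_bytes(response: dict) -> bytes:
    """Decode the image found at choices[0].message.content[0].image_url.url."""
    data_url = response["choices"][0]["message"]["content"][0]["image_url"]["url"]
    # Like `cut -d',' -f2`: keep the base64 text after the first comma.
    encoded = data_url.split(",", 1)[1]
    return base64.b64decode(encoded)


def generate_image(
    prompt: str,
    url: str = "http://localhost:8091/v1/chat/completions",
    **extra_body,
) -> bytes:
    """POST the prompt to a running server and return the decoded image bytes."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt, **extra_body)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as reply:
        return extract_image_bytes(json.load(reply))
```

With the server from the example running, calling `generate_image("a cup of coffee on the table", height=1024, width=1024, num_inference_steps=50, guidance_scale=4.0, seed=42)` and writing the returned bytes to `coffee.png` should mirror the curl pipeline, provided the response has the assumed shape.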

For more details, please refer to online serving.