VOOZH about

URL: https://docs.vllm.ai/en/latest/features/multimodal_inputs/

⇱ Multimodal Inputs - vLLM


Skip to content

Multimodal Inputs

This page teaches you how to pass multi-modal inputs to multi-modal models in vLLM.

Note

We are actively iterating on multi-modal support. See this RFC for upcoming changes, and open an issue on GitHub if you have any feedback or feature requests.

Tip

When serving multi-modal models, consider setting --allowed-media-domains to restrict domain that vLLM can access to prevent it from accessing arbitrary endpoints that can potentially be vulnerable to Server-Side Request Forgery (SSRF) attacks. You can provide a list of domains for this arg. For example: --allowed-media-domains upload.wikimedia.org github.com www.bogotobogo.com

Also, consider setting VLLM_MEDIA_URL_ALLOW_REDIRECTS=0 to prevent HTTP redirects from being followed to bypass domain restrictions.

This restriction is especially important if you run vLLM in a containerized environment where the vLLM pods may have unrestricted access to internal networks.

Offline Inference

To input multi-modal data, follow this schema in vllm.inputs.PromptType:

  • prompt: The prompt should follow the format that is documented on HuggingFace.
  • multi_modal_data: This is a dictionary that follows the schema defined in vllm.inputs.MultiModalDataDict.

Image Inputs

You can pass a single image to the 'image' field of the multi-modal dictionary, as shown in the following examples:

Full example: examples/generate/multimodal/vision_language_offline.py

To substitute multiple images inside the same text prompt, you can pass in a list of images instead:

Full example: examples/generate/multimodal/vision_language_multi_image_offline.py

If using the LLM.chat method, you can pass images directly in the message content using various formats: image URLs, PIL Image objects, or pre-computed embeddings:

Multi-image input can be extended to perform video captioning. We show this with Qwen2-VL as it supports videos:

Custom RGBA Background Color

When loading RGBA images (images with transparency), vLLM converts them to RGB format. By default, transparent pixels are replaced with white background. You can customize this background color using the rgba_background_color parameter in media_io_kwargs.

Note

  • The rgba_background_color accepts RGB values as a list [R, G, B] or tuple (R, G, B) where each value is 0-255
  • This setting only affects RGBA images with transparency; RGB images are unchanged
  • If not specified, the default white background (255, 255, 255) is used for backward compatibility

Moondream3 Prompt Recipes

Moondream3ForCausalLM supports two task-specific prompt formats:

  • query: ask a question about the image.
  • caption: generate a caption for the image.
fromvllmimport LLM, SamplingParams
fromvllm.assets.imageimport ImageAsset
llm = LLM(
 model="moondream/moondream3-preview",
 tokenizer="moondream/starmie-v1",
 trust_remote_code=True,
 max_model_len=2048,
 limit_mm_per_prompt={"image": 1},
)
image = ImageAsset("stop_sign").pil_image
defmake_query_prompt(question: str) -> str:
 return (
 "<|endoftext|><image><|md_reserved_0|>query<|md_reserved_1|>"
 f"{question}<|md_reserved_2|>"
 )
defmake_caption_prompt(length: str = "normal") -> str:
 return (
 "<|endoftext|><image><|md_reserved_0|>"
 f"describe<|md_reserved_1|>{length}<|md_reserved_2|>"
 )
query_out = llm.generate(
 {
 "prompt": make_query_prompt("What is shown in this image?"),
 "multi_modal_data": {"image": image},
 },
 SamplingParams(max_tokens=64, temperature=0),
)[0].outputs[0].text
caption_out = llm.generate(
 {
 "prompt": make_caption_prompt(),
 "multi_modal_data": {"image": image},
 },
 SamplingParams(max_tokens=100, temperature=0),
)[0].outputs[0].text
print("query:", query_out)
print("caption:", caption_out)

Note

The native Moondream3 model also has detect and point skills. Those require custom coordinate decoding and are not exposed by this vLLM implementation.

Video Inputs

You can pass a list of NumPy arrays directly to the 'video' field of the multi-modal dictionary instead of using multi-image input.

Instead of NumPy arrays, you can also pass 'torch.Tensor' instances, as shown in this example using Qwen2.5-VL:

Full example: examples/generate/multimodal/vision_language_offline.py

Audio Inputs

You can pass a tuple (array, sampling_rate) to the 'audio' field of the multi-modal dictionary.

Full example: examples/generate/multimodal/audio_language_offline.py

Chunking Long Audio for Transcription

Speech-to-text models like Whisper have a maximum audio length they can process (typically 30 seconds). For longer audio files, vLLM provides a utility to intelligently split audio into chunks at quiet points to minimize cutting through speech.

fromvllmimport LLM, SamplingParams
fromvllm.multimodal.audioimport split_audio
fromvllm.multimodal.media.audioimport load_audio
# Load long audio file
audio, sr = load_audio("long_audio.wav", sr=16000)
# Split into chunks at low-energy (quiet) regions
chunks = split_audio(
 audio_data=audio,
 sample_rate=sr,
 max_clip_duration_s=30.0, # Maximum chunk length in seconds
 overlap_duration_s=1.0, # Search window for finding quiet split points
 min_energy_window_size=1600, # Window size for energy calculation (~100ms at 16kHz)
)
# Initialize Whisper model
llm = LLM(model="openai/whisper-large-v3-turbo")
sampling_params = SamplingParams(temperature=0, max_tokens=256)
# Transcribe each chunk
transcriptions = []
for chunk in chunks:
 outputs = llm.generate({
 "prompt": "<|startoftranscript|><|en|><|transcribe|><|notimestamps|>",
 "multi_modal_data": {"audio": (chunk, sr)},
 }, sampling_params)
 transcriptions.append(outputs[0].outputs[0].text)
# Combine results
full_transcription = " ".join(transcriptions)

The split_audio function:

  • Splits audio at quiet points to avoid cutting through speech
  • Uses RMS energy to find low-amplitude regions within the overlap window
  • Preserves all audio samples (no data loss)
  • Supports any sample rate

Automatic Audio Channel Normalization

vLLM automatically normalizes audio channels for models that require specific audio formats. When loading audio with libraries like torchaudio, stereo files return shape [channels, time], but many audio models (particularly Whisper-based models) expect mono audio with shape [time].

Supported models with automatic mono conversion:

  • Whisper and all Whisper-based models
  • Qwen2-Audio
  • Qwen2.5-Omni / Qwen3-Omni (inherits from Qwen2.5-Omni)
  • Ultravox

For these models, vLLM automatically:

  1. Detects if the model requires mono audio via the feature extractor
  2. Converts multi-channel audio to mono using channel averaging
  3. Handles both (channels, time) format (torchaudio) and (time, channels) format (soundfile)

Example with stereo audio:

importtorchaudio
fromvllmimport LLM
# Load stereo audio file - returns (channels, time) shape
audio, sr = torchaudio.load("stereo_audio.wav")
print(f"Original shape: {audio.shape}") # e.g., torch.Size([2, 16000])
# vLLM automatically converts to mono for Whisper-based models
llm = LLM(model="openai/whisper-large-v3")
outputs = llm.generate({
 "prompt": "",
 "multi_modal_data": {"audio": (audio.numpy(), sr)},
})

No manual conversion is needed - vLLM handles the channel normalization automatically based on the model's requirements.

Embedding Inputs

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model, pass a tensor of shape (..., hidden_size of LM) to the corresponding field of the multi-modal dictionary. The exact shape depends on the model being used.

You must enable this feature via enable_mm_embeds=True.

Warning

The vLLM engine may crash if incorrect shape of embeddings is passed. Only enable this flag for trusted users!

Image Embeddings

For Qwen3-VL, the image_embeds should contain both the base image embedding and deepstack features.

Audio Embedding Inputs

You can pass pre-computed audio embeddings similar to image embeddings:

Cached Inputs

When using multi-modal inputs, vLLM normally hashes each media item by content to enable caching across requests. You can optionally pass multi_modal_uuids to provide your own stable IDs for each item so caching can reuse work across requests without rehashing the raw content.

Using UUIDs, you can also skip sending media data entirely if you expect cache hits for respective items. Note that the request will fail if the skipped media doesn't have a corresponding UUID, or if the UUID fails to hit the cache.

Warning

If both multimodal processor caching and prefix caching are disabled, user-provided multi_modal_uuids are ignored.

Online Serving

Our OpenAI-compatible server accepts multi-modal data via the Chat Completions API. Media inputs also support optional UUIDs users can provide to uniquely identify each media, which is used to cache the media results across requests.

Important

A chat template is required to use Chat Completions API. For HF format models, the default chat template is defined inside chat_template.json or tokenizer_config.json.

If no default chat template is available, we will first look for a built-in fallback in vllm/transformers_utils/chat_templates/registry.py. If no fallback is available, an error is raised and you have to provide the chat template manually via the --chat-template argument.

For certain models, we provide alternative chat templates inside examples. For example, VLM2Vec uses examples/pooling/embed/template/vlm2vec_phi3v.jinja which is different from the default one for Phi-3-Vision.

Image Inputs

Image input is supported according to OpenAI Vision API. Here is a simple example using Phi-3.5-Vision.

First, launch the OpenAI-compatible server:

vllmservemicrosoft/Phi-3.5-vision-instruct--runnergenerate\
--trust-remote-code--max-model-len4096--limit-mm-per-prompt.image2

Then, you can use the OpenAI client as follows:

Full example: examples/generate/multimodal/openai_chat_completion_client_for_multimodal.py

Tip

Loading from local file paths is also supported on vLLM: You can specify the allowed local media path via --allowed-local-media-path when launching the API server/engine, and pass the file path as url in the API request.

Tip

There is no need to place image placeholders in the text content of the API request - they are already represented by the image content. In fact, you can place image placeholders in the middle of the text by interleaving text and image content.

Note

By default, the timeout for fetching images through HTTP URL is 5 seconds. You can override this by setting the environment variable:

exportVLLM_IMAGE_FETCH_TIMEOUT=<timeout>

Video Inputs

Instead of image_url, you can pass a video file via video_url. Here is a simple example using LLaVA-OneVision.

First, launch the OpenAI-compatible server:

vllmservellava-hf/llava-onevision-qwen2-0.5b-ov-hf--runnergenerate--max-model-len8192

Then, you can use the OpenAI client as follows:

Full example: examples/generate/multimodal/openai_chat_completion_client_for_multimodal.py

Note

By default, the timeout for fetching videos through HTTP URL is 30 seconds. You can override this by setting the environment variable:

exportVLLM_VIDEO_FETCH_TIMEOUT=<timeout>

Video Frame Recovery

For improved robustness when processing potentially corrupted or truncated video files, vLLM supports optional frame recovery using a dynamic window forward-scan approach. When enabled, if a target frame fails to load during sequential reading, the next successfully grabbed frame (before the next target frame) will be used in its place.

To enable video frame recovery, pass the frame_recovery parameter via --media-io-kwargs:

# Example: Enable frame recovery
vllmserveQwen/Qwen3-VL-30B-A3B-Instruct\
--media-io-kwargs'{"video": {"frame_recovery": true}}'

Parameters:

  • frame_recovery: Boolean flag to enable forward-scan recovery. When true, failed frames are recovered using the next available frame within the dynamic window (up to the next target frame). Default is false.

How it works:

  1. The system reads frames sequentially
  2. If a target frame fails to grab, it's marked as "failed"
  3. The next successfully grabbed frame (before reaching the next target) is used to recover the failed frame
  4. This approach handles both mid-video corruption and end-of-video truncation

Works with common video formats like MP4 when using OpenCV backends.

Pre-extracted Frame Sequences with media_io_kwargs

When you extract video frames on the client side and send them as video/jpeg (base64-concatenated JPEG frames), you can preserve the original video metadata by using media_io_kwargs in your request. This enables more accurate video understanding by preserving temporal information that would otherwise be lost during client-side frame extraction.

Supported Parameters:

Parameter Type Description
fps float Frame rate of the original video
frames_indices list[int] Indices of the actually sampled frames
total_num_frames int Total frame count of the original video
duration float Duration of the original video in seconds
do_sample_frames bool Whether to perform frame sampling

Why use media_io_kwargs?

When extracting frames client-side, the server loses important context about the original video:

  • Temporal information: Which frames were sampled and their positions in the original timeline
  • Video duration: How long the original video was
  • Frame rate: The original playback speed

By passing this metadata, the model can better understand the temporal distribution of the sampled frames and whether important moments might have been skipped.

Custom RGBA Background Color

To use a custom background color for RGBA images, pass the rgba_background_color parameter via --media-io-kwargs:

# Example: Black background for dark theme
vllmservellava-hf/llava-1.5-7b-hf\
--media-io-kwargs'{"image": {"rgba_background_color": [0, 0, 0]}}'
# Example: Custom gray background
vllmservellava-hf/llava-1.5-7b-hf\
--media-io-kwargs'{"image": {"rgba_background_color": [128, 128, 128]}}'

Audio Inputs

Audio input is supported according to OpenAI Audio API. Here is a simple example using Ultravox-v0.5-1B.

First, launch the OpenAI-compatible server:

vllmservefixie-ai/ultravox-v0_5-llama-3_2-1b

Then, you can use the OpenAI client as follows:

Alternatively, you can pass audio_url, which is the audio counterpart of image_url for image input:

Full example: examples/generate/multimodal/openai_chat_completion_client_for_multimodal.py

Note

By default, the timeout for fetching audios through HTTP URL is 10 seconds. You can override this by setting the environment variable:

exportVLLM_AUDIO_FETCH_TIMEOUT=<timeout>

Embedding Inputs

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model, pass a tensor of shape (..., hidden_size of LM) for each item to the corresponding field of the multi-modal dictionary.

Important

Unlike offline inference, the embeddings for each item must be passed separately in order for placeholder tokens to be applied correctly by the chat template.

You must enable this feature via the --enable-mm-embeds flag in vllm serve.

Warning

The vLLM engine may crash if incorrect shape of embeddings is passed. Only enable this flag for trusted users!

Image Embedding Inputs

For image embeddings, you can pass the base64-encoded tensor to the image_embeds field. The following example demonstrates how to pass image embeddings to the OpenAI server:

Cached Inputs

Just like with offline inference, you can skip sending media if you expect cache hits with provided UUIDs. You can do so by sending media like this: