`vllm` ¶

vLLM: a high-throughput and memory-efficient inference engine for LLMs

Modules:

assets –
benchmarks –
collect_env –
compilation –
config –
connections –
cute_utils –
device_allocator –
distributed –
engine –
entrypoints –
env_override –
envs –
exceptions –

Custom exceptions for vLLM.
forward_context –
inputs –
ir –
kernels –

Kernel implementations for vLLM.
logger –

Logging configuration for vLLM.
logging_utils –
logits_process –
logprobs –
lora –
model_executor –
model_inspection –

Model inspection utilities for vLLM.
models –
multimodal –
outputs –
parser –
platforms –
plugins –
pooling_params –
profiler –
ray –
reasoning –
renderers –
sampling_params –

Sampling parameters for text generation.
scalar_type –
sequence –

Sequence and its related classes.
third_party –
tokenizers –
tool_parsers –
tracing –
transformers_utils –
triton_utils –
usage –
utils –
v1 –
version –
vllm_flash_attn –

Classes:

AsyncEngineArgs –

Arguments for asynchronous vLLM engine.
ClassificationOutput –

The output data of one classification output of a request.
CompletionOutput –

The output data of one completion output of a request.
EmbeddingOutput –

The output data of one embedding output of a request.
EngineArgs –

Arguments for vLLM engine.
LLM –

An LLM for generating texts from given prompts and sampling parameters.
PoolingOutput –

The output data of one pooling output of a request.
PoolingParams –

API parameters for pooling models.
PoolingRequestOutput –

The output data of a pooling request to the LLM.
RequestOutput –

The output data of a completion request to the LLM.
SamplingParams –

Sampling parameters for text generation.
ScoringOutput –

The output data of one scoring output of a request.
TextPrompt –

Schema for a text prompt.
TokensPrompt –

Schema for a tokenized prompt.

Functions:

initialize_ray_cluster –

Initialize the distributed cluster with Ray.

Attributes:

AsyncLLMEngine –

The AsyncLLMEngine class is an alias of vllm.v1.engine.async_llm.AsyncLLM.
LLMEngine –

The LLMEngine class is an alias of vllm.v1.engine.llm_engine.LLMEngine.
PromptType (TypeAlias) –

Schema for any prompt, regardless of model type.

`AsyncLLMEngine = AsyncLLM` `module-attribute` ¶

The AsyncLLMEngine class is an alias of vllm.v1.engine.async_llm.AsyncLLM.

`LLMEngine = V1LLMEngine` `module-attribute` ¶

The LLMEngine class is an alias of vllm.v1.engine.llm_engine.LLMEngine.

`PromptType = DecoderOnlyPrompt | EncoderDecoderPrompt` `module-attribute` ¶

Schema for any prompt, regardless of model type.

This is the input format accepted by most LLM APIs.

`AsyncEngineArgs` `dataclass` ¶

Bases: EngineArgs

Arguments for asynchronous vLLM engine.

`ClassificationOutput` `dataclass` ¶

The output data of one classification output of a request.

Parameters:

probs ¶
(list[float]) –

The probability vector, which is a list of floats. Its length depends on the number of classes.

`CompletionOutput` `dataclass` ¶

The output data of one completion output of a request.

Parameters:

index ¶
(int) –

The index of the output in the request.
text ¶
(str) –

The generated output text.
token_ids ¶
(Sequence[int]) –

The token IDs of the generated output text.
cumulative_logprob ¶
(float | None) –

The cumulative log probability of the generated output text.
logprobs ¶
(SampleLogprobs | None) –

The log probabilities of the top probability words at each position if the logprobs are requested.
finish_reason ¶
(str | None, default: None ) –

The reason why the sequence is finished.
stop_reason ¶
(int | str | None, default: None ) –

The stop string or token id that caused the completion to stop, None if the completion finished for some other reason including encountering the EOS token.
lora_request ¶
(LoRARequest | None, default: None ) –

The LoRA request that was used to generate the output.

`EmbeddingOutput` `dataclass` ¶

The output data of one embedding output of a request.

Parameters:

embedding ¶
(list[float]) –

The embedding vector, which is a list of floats. Its length depends on the hidden dimension of the model.

`EngineArgs` `dataclass` ¶

Arguments for vLLM engine.

Methods:

add_cli_args –

Shared CLI arguments for vLLM engine.
create_engine_config –

Create the VllmConfig.
create_speculative_config –

Initializes and returns a SpeculativeConfig object based on

Attributes:

logits_processors (list[str | type[LogitsProcessor]] | None) –

Custom logitproc types
quantization_config (dict[str, Any] | QuantizationConfigArgs | None) –

User-facing quantization configuration. Carries per-layer-kind

`logits_processors = ModelConfig.logits_processors` `class-attribute` `instance-attribute` ¶

Custom logitproc types

`quantization_config = None` `class-attribute` `instance-attribute` ¶

User-facing quantization configuration. Carries per-layer-kind QuantSpecs (linear, moe) and ignore patterns; see :class:QuantizationConfigArgs. Auto-populated from the matching online shorthand when quantization is one of the values in ONLINE_QUANT_SHORTHAND_NAMES.

`_check_feature_supported()` ¶

Raise an error if the feature is not supported.

`_get_min_mm_batched_tokens(model_config)` `staticmethod` ¶

Get the minimum max_num_batched_tokens needed for a multimodal prefix-LM model to process at least one item of any supported modality.

Returns (token_count, modality_name) for the most expensive modality, or None if the value cannot be determined at this stage.

`add_cli_args(parser)` `staticmethod` ¶

Shared CLI arguments for vLLM engine.

`create_engine_config(usage_context=None, headless=False)` ¶

Create the VllmConfig.

NOTE: If VllmConfig is incompatible, we raise an error.

`create_speculative_config(target_model_config, target_parallel_config)` ¶

Initializes and returns a SpeculativeConfig object based on speculative_config.

`LLM` ¶

Bases: BeamSearchOfflineMixin, PoolingOfflineMixin, OfflineInferenceMixin

An LLM for generating texts from given prompts and sampling parameters.

This class includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). Given a batch of prompts and sampling parameters, this class generates texts from the model, using an intelligent batching mechanism and efficient memory management.

Parameters:

model ¶
(str) –

The name or path of a HuggingFace Transformers model.
tokenizer ¶
(str | None, default: None ) –

The name or path of a HuggingFace Transformers tokenizer.
tokenizer_mode ¶
(TokenizerMode | str, default: 'auto' ) –

The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.
skip_tokenizer_init ¶
(bool, default: False ) –

If true, skip initialization of tokenizer and detokenizer. Expect valid prompt_token_ids and None for prompt from the input.
trust_remote_code ¶
(bool, default: False ) –

Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.
allowed_local_media_path ¶
(str, default: '' ) –

Allowing API requests to read local images or videos from directories specified by the server file system. This is a security risk. Should only be enabled in trusted environments.
allowed_media_domains ¶
(list[str] | None, default: None ) –

If set, only media URLs that belong to this domain can be used for multi-modal inputs.
tensor_parallel_size ¶
(int, default: 1 ) –

The number of GPUs to use for distributed execution with tensor parallelism.
dtype ¶
(ModelDType, default: 'auto' ) –

The data type for the model weights and activations. Currently, we support float32, float16, and bfloat16. If auto, we use the dtype attribute of the Transformers model's config. However, if the dtype in the config is float32, we will use float16 instead.
quantization ¶
(QuantizationMethods | None, default: None ) –

The method used to quantize the model weights. Currently, we support "awq", "gptq", and "fp8" (experimental). If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.
revision ¶
(str | None, default: None ) –

The specific model version to use. It can be a branch name, a tag name, or a commit id.
tokenizer_revision ¶
(str | None, default: None ) –

The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id.
chat_template ¶
(Path | str | None, default: None ) –

The chat template to apply.
seed ¶
(int, default: 0 ) –

The seed to initialize the random number generator for sampling.
gpu_memory_utilization ¶
(float, default: 0.92 ) –

The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of- memory (OOM) errors.
kv_cache_memory_bytes ¶
(int | None, default: None ) –

Size of KV Cache per GPU in bytes. By default, this is set to None and vllm can automatically infer the kv cache size based on gpu_memory_utilization. However, users may want to manually specify the kv cache memory size. kv_cache_memory_bytes allows more fine-grain control of how much memory gets used when compared with using gpu_memory_utilization. Note that kv_cache_memory_bytes (when not-None) ignores gpu_memory_utilization
cpu_offload_gb ¶
(float, default: 0 ) –

The size (GiB) of CPU memory to use for offloading the model weights. This virtually increases the GPU memory space you can use to hold the model weights, at the cost of CPU-GPU data transfer for every forward pass.
offload_group_size ¶
(int, default: 0 ) –

Prefetch offloading: Group every N layers together. Offload last offload_num_in_group layers of each group. Default is 0 (disabled).
offload_num_in_group ¶
(int, default: 1 ) –

Prefetch offloading: Number of layers to offload per group. Default is 1.
offload_prefetch_step ¶
(int, default: 1 ) –

Prefetch offloading: Number of layers to prefetch ahead. Higher values hide more latency but use more GPU memory. Default is 1.
offload_params ¶
(set[str] | None, default: None ) –

Prefetch offloading: Set of parameter name segments to selectively offload. Only parameters whose names contain one of these segments will be offloaded (e.g., {"gate_up_proj", "down_proj"} for MLP weights, or {"w13_weight", "w2_weight"} for MoE expert weights). If None or empty, all parameters are offloaded.
enforce_eager ¶
(bool, default: False ) –

Whether to enforce eager execution. If True, we will disable CUDA graph and always execute the model in eager mode. If False, we will use CUDA graph and eager execution in hybrid.
enable_return_routed_experts ¶
(bool, default: False ) –

Whether to return routed experts.
disable_custom_all_reduce ¶
(bool, default: False ) –

See ParallelConfig.
hf_token ¶
(bool | str | None, default: None ) –

The token to use as HTTP bearer authorization for remote files . If True, will use the token generated when running hf auth login (stored in ~/.cache/huggingface/token).
hf_overrides ¶
(HfOverrides | None, default: None ) –

If a dictionary, contains arguments to be forwarded to the HuggingFace config. If a callable, it is called to update the HuggingFace config.
mm_processor_kwargs ¶
(dict[str, Any] | None, default: None ) –

Arguments to be forwarded to the model's processor for multi-modal data, e.g., image processor. Overrides for the multi-modal processor obtained from AutoProcessor.from_pretrained. The available overrides depend on the model that is being run. For example, for Phi-3-Vision: {"num_crops": 4}.
pooler_config ¶
(PoolerConfig | None, default: None ) –

Initialize non-default pooling config for the pooling model, e.g., PoolerConfig(seq_pooling_type="MEAN", use_activation=False).
compilation_config ¶
(int | dict[str, Any] | CompilationConfig | None, default: None ) –

Either an integer or a dictionary. If it is an integer, it is used as the mode of compilation optimization. If it is a dictionary, it can specify the full compilation configuration.
attention_config ¶
(dict[str, Any] | AttentionConfig | None, default: None ) –

Configuration for attention mechanisms. Can be a dictionary or an AttentionConfig instance. If a dictionary, it will be converted to an AttentionConfig. Allows specifying the attention backend and other attention-related settings.
spec_method ¶
(str | None, default: None ) –

Top-level alias for speculative_config["method"].
spec_model ¶
(str | None, default: None ) –

Top-level alias for speculative_config["model"].
spec_tokens ¶
(int | None, default: None ) –

Top-level alias for speculative_config["num_speculative_tokens"].
**kwargs ¶
(Any, default: {} ) –

Arguments for EngineArgs.

Methods:

__init__ –

LLM constructor.
__repr__ –

Return a transformers-style hierarchical view of the model.
apply_model –

Run a function directly on the model inside each worker,
chat –

Generate responses for a chat conversation.
collective_rpc –

Execute an RPC call on all workers.
enqueue –

Enqueue prompts for generation without waiting for completion.
enqueue_chat –

Enqueue chat conversations for generation without waiting.
finish_weight_update –

Finish the current weight update.
from_engine_args –

Create an LLM instance from EngineArgs.
generate –

Generates the completions for the input prompts.
get_metrics –

Return a snapshot of aggregated metrics from Prometheus.
get_world_size –

Get the world size from the parallel config.
init_weight_transfer_engine –

Initialize weight transfer for RL training.
sleep –

Put the engine to sleep. The engine should not process any requests.
start_profile –

Start profiling with optional custom trace prefix.
start_weight_update –

Start a new weight update.
update_weights –

Update the weights of the model.
wait_for_completion –

Wait for all enqueued requests to complete and return results.
wake_up –

Wake up the engine from sleep mode. See the sleep

init(model, *, runner='auto', convert='auto', tokenizer=None, tokenizer_mode='auto', skip_tokenizer_init=False, trust_remote_code=False, allowed_local_media_path='', allowed_media_domains=None, tensor_parallel_size=1, dtype='auto', quantization=None, revision=None, tokenizer_revision=None, chat_template=None, seed=0, gpu_memory_utilization=0.92, cpu_offload_gb=0, offload_group_size=0, offload_num_in_group=1, offload_prefetch_step=1, offload_params=None, enforce_eager=False, enable_return_routed_experts=False, disable_custom_all_reduce=False, hf_token=None, hf_overrides=None, mm_processor_kwargs=None, pooler_config=None, structured_outputs_config=None, profiler_config=None, attention_config=None, kv_cache_memory_bytes=None, compilation_config=None, quantization_config=None, logits_processors=None, spec_method=None, spec_model=None, spec_tokens=None, **kwargs) ¶

LLM constructor.

`repr()` ¶

Return a transformers-style hierarchical view of the model.

`apply_model(func)` ¶

Run a function directly on the model inside each worker, returning the result for each of them.

Warning

To reduce the overhead of data transfer, avoid returning large arrays or tensors from this method. If you must return them, make sure you move them to CPU first to avoid taking up additional VRAM!

`chat(messages, sampling_params=None, use_tqdm=True, lora_request=None, chat_template=None, chat_template_content_format='auto', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None, tokenization_kwargs=None, mm_processor_kwargs=None)` ¶

Generate responses for a chat conversation.

The chat conversation is converted into a text prompt using the tokenizer and calls the generate method to generate the responses.

Multi-modal inputs can be passed in the same way you would pass them to the OpenAI API.

Parameters:

messages ¶
(list[ChatCompletionMessageParam] | Sequence[list[ChatCompletionMessageParam]]) –
A sequence of conversations or a single conversation.
- Each conversation is represented as a list of messages.
- Each message is a dictionary with 'role' and 'content' keys.
sampling_params ¶
(SamplingParams | Sequence[SamplingParams] | None, default: None ) –

The sampling parameters for text generation. If None, we use the default sampling parameters. When it is a single value, it is applied to every prompt. When it is a list, the list must have the same length as the prompts and it is paired one by one with the prompt.
use_tqdm ¶
(bool | Callable[..., tqdm], default: True ) –

If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.
lora_request ¶
(Sequence[LoRARequest] | LoRARequest | None, default: None ) –

LoRA request to use for generation, if any.
chat_template ¶
(str | None, default: None ) –

The template to use for structuring the chat. If not provided, the model's default chat template will be used.
chat_template_content_format ¶
(ChatTemplateContentFormatOption, default: 'auto' ) –
The format to render message content.
- "string" will render the content as a string. Example: "Who are you?"
- "openai" will render the content as a list of dictionaries, similar to OpenAI schema. Example: [{"type": "text", "text": "Who are you?"}]
add_generation_prompt ¶
(bool, default: True ) –

If True, adds a generation template to each message.
continue_final_message ¶
(bool, default: False ) –

If True, continues the final message in the conversation instead of starting a new one. Cannot be True if add_generation_prompt is also True.
chat_template_kwargs ¶
(dict[str, Any] | None, default: None ) –

Additional kwargs to pass to the chat template.
tokenization_kwargs ¶
(dict[str, Any] | None, default: None ) –

Overrides for tokenizer.encode.
mm_processor_kwargs ¶
(dict[str, Any] | None, default: None ) –

Overrides for processor.__call__.

Returns:

list[RequestOutput] –

A list of RequestOutput objects containing the generated
list[RequestOutput] –

responses in the same order as the input messages.

`collective_rpc(method, timeout=None, args=(), kwargs=None)` ¶

Execute an RPC call on all workers.

Parameters:

method ¶
(str | Callable[..., _R]) –

Name of the worker method to execute, or a callable that is serialized and sent to all workers to execute.

If the method is a callable, it should accept an additional self argument, in addition to the arguments passed in args and kwargs. The self argument will be the worker object.
timeout ¶
(float | None, default: None ) –

Maximum time in seconds to wait for execution. Raises a TimeoutError on timeout. None means wait indefinitely.
args ¶
(tuple, default: () ) –

Positional arguments to pass to the worker method.
kwargs ¶
(dict[str, Any] | None, default: None ) –

Keyword arguments to pass to the worker method.

Returns:

list[_R] –

A list containing the results from each worker.

`enqueue(prompts, sampling_params=None, lora_request=None, priority=None, use_tqdm=True, tokenization_kwargs=None, mm_processor_kwargs=None)` ¶

Enqueue prompts for generation without waiting for completion.

This method adds requests to the engine queue but does not start processing them. Use wait_for_completion() to process the queued requests and get results.

Parameters:

prompts ¶
(PromptType | Sequence[PromptType]) –

The prompts to the LLM. See generate() for details.
sampling_params ¶
(SamplingParams | Sequence[SamplingParams] | None, default: None ) –

The sampling parameters for text generation.
lora_request ¶
(Sequence[LoRARequest] | LoRARequest | None, default: None ) –

LoRA request to use for generation, if any.
priority ¶
(list[int] | None, default: None ) –

The priority of the requests, if any.
use_tqdm ¶
(bool | Callable[..., tqdm], default: True ) –

If True, shows a tqdm progress bar while adding requests.
tokenization_kwargs ¶
(dict[str, Any] | None, default: None ) –

Overrides for tokenizer.encode.
mm_processor_kwargs ¶
(dict[str, Any] | None, default: None ) –

Overrides for processor.__call__.

Returns:

list[str] –

A list of request IDs for the enqueued requests.

`enqueue_chat(messages, sampling_params=None, use_tqdm=True, lora_request=None, priority=None, chat_template=None, chat_template_content_format='auto', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None, tokenization_kwargs=None, mm_processor_kwargs=None)` ¶

Enqueue chat conversations for generation without waiting.

This method renders chat conversations and adds the resulting requests to the engine queue. Use wait_for_completion() to get results. To guarantee that all requests are queued before scheduling starts, pause scheduling with sleep(level=0) before calling this method and resume it with wake_up(tags=["scheduling"]) afterward.

Parameters:

messages ¶
(list[ChatCompletionMessageParam] | Sequence[list[ChatCompletionMessageParam]]) –

A sequence of conversations or a single conversation. Each conversation is represented as a list of messages.
sampling_params ¶
(SamplingParams | Sequence[SamplingParams] | None, default: None ) –

The sampling parameters for text generation. If None, we use the default sampling parameters.
use_tqdm ¶
(bool | Callable[..., tqdm], default: True ) –

If True, shows a tqdm progress bar while rendering conversations.
lora_request ¶
(Sequence[LoRARequest] | LoRARequest | None, default: None ) –

LoRA request to use for generation, if any.
priority ¶
(list[int] | None, default: None ) –

The priority of the requests, if any.
chat_template ¶
(str | None, default: None ) –

The template to use for structuring the chat.
chat_template_content_format ¶
(ChatTemplateContentFormatOption, default: 'auto' ) –

The format to render message content.
add_generation_prompt ¶
(bool, default: True ) –

If True, adds a generation template to each message.
continue_final_message ¶
(bool, default: False ) –

If True, continues the final message in the conversation instead of starting a new one.
tools ¶
(list[dict[str, Any]] | None, default: None ) –

Tools to make available to the model, if any.
chat_template_kwargs ¶
(dict[str, Any] | None, default: None ) –

Additional kwargs to pass to the chat template.
tokenization_kwargs ¶
(dict[str, Any] | None, default: None ) –

Overrides for tokenizer.encode.
mm_processor_kwargs ¶
(dict[str, Any] | None, default: None ) –

Overrides for processor.__call__.

Returns:

list[str] –

A list of request IDs for the enqueued requests.

`finish_weight_update()` ¶

Finish the current weight update.

`from_engine_args(engine_args)` `classmethod` ¶

Create an LLM instance from EngineArgs.

`generate(prompts, sampling_params=None, *, use_tqdm=True, lora_request=None, priority=None, tokenization_kwargs=None, mm_processor_kwargs=None)` ¶

Generates the completions for the input prompts.

This class automatically batches the given prompts, considering the memory constraint. For the best performance, put all of your prompts into a single list and pass it to this method.

Parameters:

prompts ¶
(PromptType | Sequence[PromptType]) –

The prompts to the LLM. You may pass a sequence of prompts for batch inference. See PromptType for more details about the format of each prompt.
sampling_params ¶
(SamplingParams | Sequence[SamplingParams] | None, default: None ) –

The sampling parameters for text generation. If None, we use the default sampling parameters. When it is a single value, it is applied to every prompt. When it is a list, the list must have the same length as the prompts and it is paired one by one with the prompt.
use_tqdm ¶
(bool | Callable[..., tqdm], default: True ) –

If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.
lora_request ¶
(Sequence[LoRARequest] | LoRARequest | None, default: None ) –

LoRA request to use for generation, if any.
priority ¶
(list[int] | None, default: None ) –

The priority of the requests, if any. Only applicable when priority scheduling policy is enabled. If provided, must be a list of integers matching the length of prompts, where each priority value corresponds to the prompt at the same index.
tokenization_kwargs ¶
(dict[str, Any] | None, default: None ) –

Overrides for tokenizer.encode.
mm_processor_kwargs ¶
(dict[str, Any] | None, default: None ) –

Overrides for processor.__call__.

Returns:

list[RequestOutput] –

A list of RequestOutput objects containing the
list[RequestOutput] –

generated completions in the same order as the input prompts.

`get_metrics()` ¶

Return a snapshot of aggregated metrics from Prometheus.

Returns:

list[Metric] –

A MetricSnapshot instance capturing the current state
list[Metric] –

of all aggregated metrics from Prometheus.

`get_world_size(include_dp=True)` ¶

Get the world size from the parallel config.

Parameters:

include_dp ¶
(bool, default: True ) –

If True (default), returns the world size including data parallelism (TP * PP * DP). If False, returns the world size without data parallelism (TP * PP).

Returns:

int –

The world size (tensor_parallel_size * pipeline_parallel_size),
int –

optionally multiplied by data_parallel_size if include_dp is True.

`init_weight_transfer_engine(request)` ¶

Initialize weight transfer for RL training.

Parameters:

request ¶
(WeightTransferInitRequest | dict) –

Weight transfer initialization request with backend-specific info

`sleep(level=1, mode='abort')` ¶

Put the engine to sleep. The engine should not process any requests. The caller should guarantee that no requests are being processed during the sleep period, before wake_up is called.

Parameters:

level ¶
(int, default: 1 ) –

The sleep level. - Level 0: Pause scheduling but continue accepting requests. Requests are queued but not processed. - Level 1: Offload model weights to CPU, discard KV cache. The content of kv cache is forgotten. Good for sleeping and waking up the engine to run the same model again. Please make sure there's enough CPU memory to store the model weights. - Level 2: Discard all GPU memory (weights + KV cache). Good for sleeping and waking up the engine to run a different model or update the model, where previous model weights are not needed. It reduces CPU memory pressure.
mode ¶
(PauseMode, default: 'abort' ) –

How to handle any existing requests, can be "abort", "wait", or "keep".

`start_profile(profile_prefix=None)` ¶

Start profiling with optional custom trace prefix.

Parameters:

profile_prefix ¶
(str | None, default: None ) –

Optional prefix for the trace file names. If provided, trace files will be named as "

`start_weight_update(is_checkpoint_format=True)` ¶

Start a new weight update.

`update_weights(request)` ¶

Update the weights of the model.

Parameters:

request ¶
(WeightTransferUpdateRequest | dict) –

Weight update request with backend-specific update info

`wait_for_completion(output_type=None, *, use_tqdm=True)` ¶

wait_for_completion(
 *, use_tqdm: bool | Callable[..., tqdm] = True
) -> list[RequestOutput | PoolingRequestOutput]

wait_for_completion(
 output_type: type[_O] | tuple[type[_O], ...],
 *,
 use_tqdm: bool | Callable[..., tqdm] = True,
) -> list[_O]

Wait for all enqueued requests to complete and return results.

This method processes all requests currently in the engine queue and returns their outputs. Use after enqueue() to get results.

Parameters:

output_type ¶
(type[Any] | tuple[type[Any], ...] | None, default: None ) –

The expected output type(s). If not provided, accepts both RequestOutput and PoolingRequestOutput.
use_tqdm ¶
(bool | Callable[..., tqdm], default: True ) –

If True, shows a tqdm progress bar.

Returns:

list[Any] –

A list of output objects for all completed requests.

`wake_up(tags=None)` ¶

Wake up the engine from sleep mode. See the sleep method for more details.

Parameters:

tags ¶
(list[str] | None, default: None ) –

An optional list of tags to reallocate the engine memory for specific memory allocations. Values must be in ("weights", "kv_cache", "scheduling"). If None, all memory is reallocated. wake_up should be called with all tags (or None) before the engine is used again. Use tags=["scheduling"] to resume from level 0 sleep.

`PoolingOutput` `dataclass` ¶

The output data of one pooling output of a request.

Parameters:

data ¶
(Tensor) –

The extracted hidden states.

`PoolingParams` ¶

Bases: Struct

API parameters for pooling models.

Attributes:

use_activation (bool | None) –

Whether to apply activation function to the pooler outputs. None uses the pooler's default, which is True in most cases.
dimensions (int | None) –

Reduce the dimensions of embeddings if model support matryoshka representation.

Methods:

clone –

Returns a deep copy of the PoolingParams instance.

`clone()` ¶

Returns a deep copy of the PoolingParams instance.

`PoolingRequestOutput` ¶

Bases: Generic[_O]

The output data of a pooling request to the LLM.

Parameters:

request_id ¶
(str) –

A unique identifier for the pooling request.
outputs ¶
(PoolingOutput) –

The pooling results for the given input.
prompt_token_ids ¶
(list[int]) –

A list of token IDs used in the prompt.
num_cached_tokens ¶
(int) –

The number of tokens with prefix cache hit.
finished ¶
(bool) –

A flag indicating whether the pooling is completed.

`RequestOutput` ¶

The output data of a completion request to the LLM.

Parameters:

request_id ¶
(str) –

The unique ID of the request.
prompt ¶
(str | None) –

The prompt string of the request. For encoder/decoder models, this is the decoder input prompt.
prompt_token_ids ¶
(list[int] | None) –

The token IDs of the prompt. For encoder/decoder models, this is the decoder input prompt token ids.
prompt_logprobs ¶
(PromptLogprobs | None) –

The log probabilities to return per prompt token.
outputs ¶
(list[CompletionOutput]) –

The output sequences of the request.
finished ¶
(bool) –

Whether the whole request is finished.
metrics ¶
(RequestStateStats | None, default: None ) –

Metrics associated with the request.
lora_request ¶
(LoRARequest | None, default: None ) –

The LoRA request that was used to generate the output.
encoder_prompt ¶
(str | None, default: None ) –

The encoder prompt string of the request. None if decoder-only.
encoder_prompt_token_ids ¶
(list[int] | None, default: None ) –

The token IDs of the encoder prompt. None if decoder-only.
num_cached_tokens ¶
(int | None, default: None ) –

The number of tokens with prefix cache hit.
kv_transfer_params ¶
(dict[str, Any] | None, default: None ) –

The params for remote K/V transfer.

Methods:

add –

Merge subsequent RequestOutput into this one

`add(next_output, aggregate)` ¶

Merge subsequent RequestOutput into this one

`SamplingParams` ¶

Bases: PydanticMsgspecMixin, Struct

Sampling parameters for text generation.

Overall, we follow the sampling parameters from the OpenAI text completion API (https://platform.openai.com/docs/api-reference/completions/create). In addition, we support beam search, which is not supported by OpenAI.

Methods:

clone –

If skip_clone is True, uses shallow copy instead of deep copy.
for_sampler_warmup –

Set parameters to exercise all sampler logic.
update_from_generation_config –

Update if there are non-default values from generation_config

Attributes:

allowed_token_ids (list[int] | None) –

If provided, the engine will construct a logits processor which only
bad_words (list[str] | None) –

Words that are not allowed to be generated. More precisely, only the
detokenize (bool) –

Whether to detokenize the output.
extra_args (dict[str, Any] | None) –

Arbitrary additional args, that can be used by custom sampling
flat_logprobs (bool) –

Whether to return logprobs in flatten format (i.e. FlatLogprob)
frequency_penalty (float) –

Penalizes new tokens based on their frequency in the generated text so
ignore_eos (bool) –

Whether to ignore the EOS token and continue generating
include_stop_str_in_output (bool) –

Whether to include the stop strings in output text.
logit_bias (dict[int, float] | None) –

If provided, the engine will construct a logits processor that applies
logprob_token_ids (list[int] | None) –

Specific token IDs to return logprobs for. More efficient than
logprobs (int | None) –

Number of log probabilities to return per output token. When set to
max_tokens (int | None) –

Maximum number of tokens to generate per output sequence.
min_p (float) –

Represents the minimum probability for a token to be considered,
min_tokens (int) –

Minimum number of tokens to generate per output sequence before EOS or
n (int) –

Number of outputs to return for the given prompt request.
num_logprobs (int | None) –

Number of sample logprobs to return per output token, or None if
presence_penalty (float) –

Penalizes new tokens based on whether they appear in the generated text
prompt_logprobs (int | None) –

Number of log probabilities to return per prompt token.
repetition_detection (RepetitionDetectionParams | None) –

Parameters for detecting repetitive N-gram patterns in output tokens.
repetition_penalty (float) –

Penalizes new tokens based on whether they appear in the prompt and the
routed_experts_prompt_start (int) –

When enable_return_routed_experts is active, skip the first
seed (int | None) –

Random seed to use for the generation.
skip_clone (bool) –

Internal flag indicating that this SamplingParams instance is safe to
skip_special_tokens (bool) –

Whether to skip special tokens in the output.
spaces_between_special_tokens (bool) –

Whether to add spaces between special tokens in the output.
stop (str | list[str] | None) –

String(s) that stop the generation when they are generated. The returned
stop_token_ids (list[int] | None) –

Token IDs that stop the generation when they are generated. The returned
structured_outputs (StructuredOutputsParams | None) –

Parameters for configuring structured outputs.
temperature (float) –

Controls the randomness of the sampling. Lower values make the model
thinking_token_budget (int | None) –

Maximum number of tokens allowed for thinking operations.
top_k (int) –

Controls the number of top tokens to consider. Set to 0 (or -1) to
top_p (float) –

Controls the cumulative probability of the top tokens to consider. Must

`allowed_token_ids = None` `class-attribute` `instance-attribute` ¶

If provided, the engine will construct a logits processor which only retains scores for the given token ids.

`bad_words = None` `class-attribute` `instance-attribute` ¶

Words that are not allowed to be generated. More precisely, only the last token of a corresponding token sequence is not allowed when the next generated token can complete the sequence.

`detokenize = True` `class-attribute` `instance-attribute` ¶

Whether to detokenize the output.

`extra_args = None` `class-attribute` `instance-attribute` ¶

Arbitrary additional args, that can be used by custom sampling implementations, plugins, etc. Not used by any in-tree sampling implementations.

`flat_logprobs = False` `class-attribute` `instance-attribute` ¶

Whether to return logprobs in flatten format (i.e. FlatLogprob) for better performance. NOTE: GC costs of FlatLogprobs is significantly smaller than list[dict[int, Logprob]]. After enabled, PromptLogprobs and SampleLogprobs would populated as FlatLogprobs.

`frequency_penalty = 0.0` `class-attribute` `instance-attribute` ¶

Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.

`ignore_eos = False` `class-attribute` `instance-attribute` ¶

Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.

`include_stop_str_in_output = False` `class-attribute` `instance-attribute` ¶

Whether to include the stop strings in output text.

`logit_bias = None` `class-attribute` `instance-attribute` ¶

If provided, the engine will construct a logits processor that applies these logit biases.

`logprob_token_ids = None` `class-attribute` `instance-attribute` ¶

Specific token IDs to return logprobs for. More efficient than logprobs=-1 when you only need logprobs for a small set of tokens. When set, logprobs for exactly these token IDs will be returned, in addition to the sampled token. This is useful for scoring tasks where you want to compare probabilities of specific label tokens.

`logprobs = None` `class-attribute` `instance-attribute` ¶

Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response. When set to -1, return all vocab_size log probabilities.

`max_tokens = 16` `class-attribute` `instance-attribute` ¶

Maximum number of tokens to generate per output sequence.

`min_p = 0.0` `class-attribute` `instance-attribute` ¶

Represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

`min_tokens = 0` `class-attribute` `instance-attribute` ¶

Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated

`n = 1` `class-attribute` `instance-attribute` ¶

Number of outputs to return for the given prompt request.

The maximum allowed value is controlled by the VLLM_MAX_N_SEQUENCES environment variable (default: 16384).

`num_logprobs` `property` ¶

Number of sample logprobs to return per output token, or None if no sample logprobs were requested. Takes logprob_token_ids into account: when logprobs is unset but logprob_token_ids is set, returns len(logprob_token_ids).

`presence_penalty = 0.0` `class-attribute` `instance-attribute` ¶

Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.

`prompt_logprobs = None` `class-attribute` `instance-attribute` ¶

Number of log probabilities to return per prompt token. When set to -1, return all vocab_size log probabilities.

`repetition_detection = None` `class-attribute` `instance-attribute` ¶

Parameters for detecting repetitive N-gram patterns in output tokens. If such repetition is detected, generation will be ended early. LLMs can sometimes generate repetitive, unhelpful token patterns, stopping only when they hit the maximum output length (e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature can detect such behavior and terminate early, saving time and tokens.

`repetition_penalty = 1.0` `class-attribute` `instance-attribute` ¶

Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.

`routed_experts_prompt_start = 0` `class-attribute` `instance-attribute` ¶

When enable_return_routed_experts is active, skip the first routed_experts_prompt_start prompt tokens from the returned routing data. In multi-turn agent scenarios, set this to the length of the already-returned prefix to avoid duplicating routing for prompt tokens covered by earlier turns. Default 0 returns routing for all prompt tokens.

`seed = None` `class-attribute` `instance-attribute` ¶

Random seed to use for the generation.

`skip_clone = False` `class-attribute` `instance-attribute` ¶

Internal flag indicating that this SamplingParams instance is safe to reuse without cloning. When True, clone() will return self without performing a deep copy. This should only be set when the params object is guaranteed to be dedicated to a single request and won't be modified in ways that would affect other uses.

`skip_special_tokens = True` `class-attribute` `instance-attribute` ¶

Whether to skip special tokens in the output.

`spaces_between_special_tokens = True` `class-attribute` `instance-attribute` ¶

Whether to add spaces between special tokens in the output.

`stop = None` `class-attribute` `instance-attribute` ¶

String(s) that stop the generation when they are generated. The returned output will not contain the stop strings.

`stop_token_ids = None` `class-attribute` `instance-attribute` ¶

Token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.

`structured_outputs = None` `class-attribute` `instance-attribute` ¶

Parameters for configuring structured outputs.

`temperature = 1.0` `class-attribute` `instance-attribute` ¶

Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.

`thinking_token_budget = None` `class-attribute` `instance-attribute` ¶

Maximum number of tokens allowed for thinking operations.

`top_k = 0` `class-attribute` `instance-attribute` ¶

Controls the number of top tokens to consider. Set to 0 (or -1) to consider all tokens.

`top_p = 1.0` `class-attribute` `instance-attribute` ¶

Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

`_validate_logit_bias(model_config)` ¶

Validate logit_bias token IDs are within vocabulary range.

`clone()` ¶

If skip_clone is True, uses shallow copy instead of deep copy.

`for_sampler_warmup()` `staticmethod` ¶

Set parameters to exercise all sampler logic.

`update_from_generation_config(generation_config, eos_token_id=None)` ¶

Update if there are non-default values from generation_config

`ScoringOutput` `dataclass` ¶

The output data of one scoring output of a request.

Parameters:

score ¶
(float) –

The similarity score, which is a scalar value.

`TextPrompt` ¶

Bases: _PromptOptions

Schema for a text prompt.

Attributes:

prompt (str) –

The input text to be tokenized before passing to the model.

`prompt` `instance-attribute` ¶

The input text to be tokenized before passing to the model.

`TokensPrompt` ¶

Bases: _PromptOptions

Schema for a tokenized prompt.

Attributes:

prompt (NotRequired[str]) –

The prompt text corresponding to the token IDs, if available.
prompt_token_ids (list[int]) –

A list of token IDs to pass to the model.
token_type_ids (NotRequired[list[int]]) –

A list of token type IDs to pass to the cross encoder model.

`prompt` `instance-attribute` ¶

The prompt text corresponding to the token IDs, if available.

`prompt_token_ids` `instance-attribute` ¶

A list of token IDs to pass to the model.

`token_type_ids` `instance-attribute` ¶

A list of token type IDs to pass to the cross encoder model.

`initialize_ray_cluster(parallel_config, ray_address=None, require_gpu_on_driver=True)` ¶

Initialize the distributed cluster with Ray.

it will connect to the Ray cluster and create a placement group for the workers, which includes the specification of the resources for each distributed worker.

Parameters:

parallel_config ¶
(ParallelConfig) –

The configurations for parallel execution.
ray_address ¶
(str | None, default: None ) –

The address of the Ray cluster. If None, uses the default Ray cluster address.
require_gpu_on_driver ¶
(bool, default: True ) –

If True (default), require at least one GPU on the current (driver) node and pin the first PG bundle to it. Set to False for executors like RayExecutorV2 where all GPU work is delegated to remote Ray actors.

URL: https://docs.vllm.ai/en/latest/api/vllm/index.html

⇱ vllm - vLLM

vllm ¶

AsyncLLMEngine = AsyncLLM module-attribute ¶

LLMEngine = V1LLMEngine module-attribute ¶

PromptType = DecoderOnlyPrompt | EncoderDecoderPrompt module-attribute ¶

AsyncEngineArgs dataclass ¶

ClassificationOutput dataclass ¶

probs ¶

CompletionOutput dataclass ¶

index ¶

text ¶

token_ids ¶

cumulative_logprob ¶

logprobs ¶

finish_reason ¶

stop_reason ¶

lora_request ¶

EmbeddingOutput dataclass ¶

embedding ¶

EngineArgs dataclass ¶

logits_processors = ModelConfig.logits_processors class-attribute instance-attribute ¶

quantization_config = None class-attribute instance-attribute ¶

_check_feature_supported() ¶

_get_min_mm_batched_tokens(model_config) staticmethod ¶

add_cli_args(parser) staticmethod ¶

create_engine_config(usage_context=None, headless=False) ¶

create_speculative_config(target_model_config, target_parallel_config) ¶

LLM ¶

model ¶

tokenizer ¶

tokenizer_mode ¶

skip_tokenizer_init ¶

trust_remote_code ¶

allowed_local_media_path ¶

allowed_media_domains ¶

tensor_parallel_size ¶

dtype ¶

quantization ¶

revision ¶

tokenizer_revision ¶

chat_template ¶

seed ¶

gpu_memory_utilization ¶

kv_cache_memory_bytes ¶

cpu_offload_gb ¶

offload_group_size ¶

offload_num_in_group ¶

offload_prefetch_step ¶

offload_params ¶

enforce_eager ¶

enable_return_routed_experts ¶

disable_custom_all_reduce ¶

hf_token ¶

hf_overrides ¶

mm_processor_kwargs ¶

pooler_config ¶

compilation_config ¶

attention_config ¶

spec_method ¶

spec_model ¶

spec_tokens ¶

**kwargs ¶

__repr__() ¶

apply_model(func) ¶

chat(messages, sampling_params=None, use_tqdm=True, lora_request=None, chat_template=None, chat_template_content_format='auto', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None, tokenization_kwargs=None, mm_processor_kwargs=None) ¶

messages ¶

sampling_params ¶

use_tqdm ¶

lora_request ¶

chat_template ¶

chat_template_content_format ¶

add_generation_prompt ¶

continue_final_message ¶

chat_template_kwargs ¶

tokenization_kwargs ¶

mm_processor_kwargs ¶

collective_rpc(method, timeout=None, args=(), kwargs=None) ¶

method ¶

timeout ¶

`vllm` ¶

`AsyncLLMEngine = AsyncLLM` `module-attribute` ¶

`LLMEngine = V1LLMEngine` `module-attribute` ¶

`PromptType = DecoderOnlyPrompt | EncoderDecoderPrompt` `module-attribute` ¶

`AsyncEngineArgs` `dataclass` ¶

`ClassificationOutput` `dataclass` ¶

`probs` ¶

`CompletionOutput` `dataclass` ¶

`index` ¶

`text` ¶

`token_ids` ¶

`cumulative_logprob` ¶

`logprobs` ¶

`finish_reason` ¶

`stop_reason` ¶

`lora_request` ¶

`EmbeddingOutput` `dataclass` ¶

`embedding` ¶

`EngineArgs` `dataclass` ¶

`logits_processors = ModelConfig.logits_processors` `class-attribute` `instance-attribute` ¶

`quantization_config = None` `class-attribute` `instance-attribute` ¶

`_check_feature_supported()` ¶

`_get_min_mm_batched_tokens(model_config)` `staticmethod` ¶

`add_cli_args(parser)` `staticmethod` ¶

`create_engine_config(usage_context=None, headless=False)` ¶

`create_speculative_config(target_model_config, target_parallel_config)` ¶

`LLM` ¶

`model` ¶

`tokenizer` ¶

`tokenizer_mode` ¶

`skip_tokenizer_init` ¶

`trust_remote_code` ¶

`allowed_local_media_path` ¶

`allowed_media_domains` ¶

`tensor_parallel_size` ¶

`dtype` ¶

`quantization` ¶

`revision` ¶

`tokenizer_revision` ¶

`chat_template` ¶

`seed` ¶

`gpu_memory_utilization` ¶

`kv_cache_memory_bytes` ¶

`cpu_offload_gb` ¶

`offload_group_size` ¶

`offload_num_in_group` ¶

`offload_prefetch_step` ¶

`offload_params` ¶

`enforce_eager` ¶

`enable_return_routed_experts` ¶

`disable_custom_all_reduce` ¶

`hf_token` ¶

`hf_overrides` ¶

`mm_processor_kwargs` ¶

`pooler_config` ¶

`compilation_config` ¶

`attention_config` ¶

`spec_method` ¶

`spec_model` ¶

`spec_tokens` ¶

`kwargs`** ¶

`repr()` ¶

`apply_model(func)` ¶

`chat(messages, sampling_params=None, use_tqdm=True, lora_request=None, chat_template=None, chat_template_content_format='auto', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None, tokenization_kwargs=None, mm_processor_kwargs=None)` ¶

`messages` ¶

`sampling_params` ¶

`use_tqdm` ¶

`lora_request` ¶

`chat_template` ¶

`chat_template_content_format` ¶

`add_generation_prompt` ¶

`continue_final_message` ¶

`chat_template_kwargs` ¶

`tokenization_kwargs` ¶

`mm_processor_kwargs` ¶

`collective_rpc(method, timeout=None, args=(), kwargs=None)` ¶

`method` ¶

`timeout` ¶

`args` ¶

`kwargs` ¶