VOOZH about

URL: https://docs.vllm.ai/en/latest/api/vllm/index.html

⇱ vllm - vLLM


Skip to content

vllm

vLLM: a high-throughput and memory-efficient inference engine for LLMs

Modules:

Classes:

Functions:

Attributes:

AsyncLLMEngine = AsyncLLM module-attribute

The AsyncLLMEngine class is an alias of vllm.v1.engine.async_llm.AsyncLLM.

LLMEngine = V1LLMEngine module-attribute

The LLMEngine class is an alias of vllm.v1.engine.llm_engine.LLMEngine.

PromptType = DecoderOnlyPrompt | EncoderDecoderPrompt module-attribute

Schema for any prompt, regardless of model type.

This is the input format accepted by most LLM APIs.

AsyncEngineArgs dataclass

Bases: EngineArgs

Arguments for asynchronous vLLM engine.

ClassificationOutput dataclass

The output data of one classification output of a request.

Parameters:

  • probs

    (list[float]) –

    The probability vector, which is a list of floats. Its length depends on the number of classes.

CompletionOutput dataclass

The output data of one completion output of a request.

Parameters:

  • index

    (int) –

    The index of the output in the request.

  • text

    (str) –

    The generated output text.

  • token_ids

    (Sequence[int]) –

    The token IDs of the generated output text.

  • cumulative_logprob

    (float | None) –

    The cumulative log probability of the generated output text.

  • logprobs

    (SampleLogprobs | None) –

    The log probabilities of the top probability words at each position if the logprobs are requested.

  • finish_reason

    (str | None, default: None ) –

    The reason why the sequence is finished.

  • stop_reason

    (int | str | None, default: None ) –

    The stop string or token id that caused the completion to stop, None if the completion finished for some other reason including encountering the EOS token.

  • lora_request

    (LoRARequest | None, default: None ) –

    The LoRA request that was used to generate the output.

EmbeddingOutput dataclass

The output data of one embedding output of a request.

Parameters:

  • embedding

    (list[float]) –

    The embedding vector, which is a list of floats. Its length depends on the hidden dimension of the model.

EngineArgs dataclass

Arguments for vLLM engine.

Methods:

Attributes:

logits_processors = ModelConfig.logits_processors class-attribute instance-attribute

Custom logitproc types

quantization_config = None class-attribute instance-attribute

User-facing quantization configuration. Carries per-layer-kind QuantSpecs (linear, moe) and ignore patterns; see :class:QuantizationConfigArgs. Auto-populated from the matching online shorthand when quantization is one of the values in ONLINE_QUANT_SHORTHAND_NAMES.

_check_feature_supported()

Raise an error if the feature is not supported.

_get_min_mm_batched_tokens(model_config) staticmethod

Get the minimum max_num_batched_tokens needed for a multimodal prefix-LM model to process at least one item of any supported modality.

Returns (token_count, modality_name) for the most expensive modality, or None if the value cannot be determined at this stage.

add_cli_args(parser) staticmethod

Shared CLI arguments for vLLM engine.

create_engine_config(usage_context=None, headless=False)

Create the VllmConfig.

NOTE: If VllmConfig is incompatible, we raise an error.

create_speculative_config(target_model_config, target_parallel_config)

Initializes and returns a SpeculativeConfig object based on speculative_config.

LLM

Bases: BeamSearchOfflineMixin, PoolingOfflineMixin, OfflineInferenceMixin

An LLM for generating texts from given prompts and sampling parameters.

This class includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). Given a batch of prompts and sampling parameters, this class generates texts from the model, using an intelligent batching mechanism and efficient memory management.

Parameters:

  • model

    (str) –

    The name or path of a HuggingFace Transformers model.

  • tokenizer

    (str | None, default: None ) –

    The name or path of a HuggingFace Transformers tokenizer.

  • tokenizer_mode

    (TokenizerMode | str, default: 'auto' ) –

    The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.

  • skip_tokenizer_init

    (bool, default: False ) –

    If true, skip initialization of tokenizer and detokenizer. Expect valid prompt_token_ids and None for prompt from the input.

  • trust_remote_code

    (bool, default: False ) –

    Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.

  • allowed_local_media_path

    (str, default: '' ) –

    Allowing API requests to read local images or videos from directories specified by the server file system. This is a security risk. Should only be enabled in trusted environments.

  • allowed_media_domains

    (list[str] | None, default: None ) –

    If set, only media URLs that belong to this domain can be used for multi-modal inputs.

  • tensor_parallel_size

    (int, default: 1 ) –

    The number of GPUs to use for distributed execution with tensor parallelism.

  • dtype

    (ModelDType, default: 'auto' ) –

    The data type for the model weights and activations. Currently, we support float32, float16, and bfloat16. If auto, we use the dtype attribute of the Transformers model's config. However, if the dtype in the config is float32, we will use float16 instead.

  • quantization

    (QuantizationMethods | None, default: None ) –

    The method used to quantize the model weights. Currently, we support "awq", "gptq", and "fp8" (experimental). If None, we first check the quantization_config attribute in the model config file. If that is None, we assume the model weights are not quantized and use dtype to determine the data type of the weights.

  • revision

    (str | None, default: None ) –

    The specific model version to use. It can be a branch name, a tag name, or a commit id.

  • tokenizer_revision

    (str | None, default: None ) –

    The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id.

  • chat_template

    (Path | str | None, default: None ) –

    The chat template to apply.

  • seed

    (int, default: 0 ) –

    The seed to initialize the random number generator for sampling.

  • gpu_memory_utilization

    (float, default: 0.92 ) –

    The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of- memory (OOM) errors.

  • kv_cache_memory_bytes

    (int | None, default: None ) –

    Size of KV Cache per GPU in bytes. By default, this is set to None and vllm can automatically infer the kv cache size based on gpu_memory_utilization. However, users may want to manually specify the kv cache memory size. kv_cache_memory_bytes allows more fine-grain control of how much memory gets used when compared with using gpu_memory_utilization. Note that kv_cache_memory_bytes (when not-None) ignores gpu_memory_utilization

  • cpu_offload_gb

    (float, default: 0 ) –

    The size (GiB) of CPU memory to use for offloading the model weights. This virtually increases the GPU memory space you can use to hold the model weights, at the cost of CPU-GPU data transfer for every forward pass.

  • offload_group_size

    (int, default: 0 ) –

    Prefetch offloading: Group every N layers together. Offload last offload_num_in_group layers of each group. Default is 0 (disabled).

  • offload_num_in_group

    (int, default: 1 ) –

    Prefetch offloading: Number of layers to offload per group. Default is 1.

  • offload_prefetch_step

    (int, default: 1 ) –

    Prefetch offloading: Number of layers to prefetch ahead. Higher values hide more latency but use more GPU memory. Default is 1.

  • offload_params

    (set[str] | None, default: None ) –

    Prefetch offloading: Set of parameter name segments to selectively offload. Only parameters whose names contain one of these segments will be offloaded (e.g., {"gate_up_proj", "down_proj"} for MLP weights, or {"w13_weight", "w2_weight"} for MoE expert weights). If None or empty, all parameters are offloaded.

  • enforce_eager

    (bool, default: False ) –

    Whether to enforce eager execution. If True, we will disable CUDA graph and always execute the model in eager mode. If False, we will use CUDA graph and eager execution in hybrid.

  • enable_return_routed_experts

    (bool, default: False ) –

    Whether to return routed experts.

  • disable_custom_all_reduce

    (bool, default: False ) –
  • hf_token

    (bool | str | None, default: None ) –

    The token to use as HTTP bearer authorization for remote files . If True, will use the token generated when running hf auth login (stored in ~/.cache/huggingface/token).

  • hf_overrides

    (HfOverrides | None, default: None ) –

    If a dictionary, contains arguments to be forwarded to the HuggingFace config. If a callable, it is called to update the HuggingFace config.

  • mm_processor_kwargs

    (dict[str, Any] | None, default: None ) –

    Arguments to be forwarded to the model's processor for multi-modal data, e.g., image processor. Overrides for the multi-modal processor obtained from AutoProcessor.from_pretrained. The available overrides depend on the model that is being run. For example, for Phi-3-Vision: {"num_crops": 4}.

  • pooler_config

    (PoolerConfig | None, default: None ) –

    Initialize non-default pooling config for the pooling model, e.g., PoolerConfig(seq_pooling_type="MEAN", use_activation=False).

  • compilation_config

    (int | dict[str, Any] | CompilationConfig | None, default: None ) –

    Either an integer or a dictionary. If it is an integer, it is used as the mode of compilation optimization. If it is a dictionary, it can specify the full compilation configuration.

  • attention_config

    (dict[str, Any] | AttentionConfig | None, default: None ) –

    Configuration for attention mechanisms. Can be a dictionary or an AttentionConfig instance. If a dictionary, it will be converted to an AttentionConfig. Allows specifying the attention backend and other attention-related settings.

  • spec_method

    (str | None, default: None ) –

    Top-level alias for speculative_config["method"].

  • spec_model

    (str | None, default: None ) –

    Top-level alias for speculative_config["model"].

  • spec_tokens

    (int | None, default: None ) –

    Top-level alias for speculative_config["num_speculative_tokens"].

  • **kwargs

    (Any, default: {} ) –

    Arguments for EngineArgs.

Methods:

  • __init__

    LLM constructor.

  • __repr__

    Return a transformers-style hierarchical view of the model.

  • apply_model

    Run a function directly on the model inside each worker,

  • chat

    Generate responses for a chat conversation.

  • collective_rpc

    Execute an RPC call on all workers.

  • enqueue

    Enqueue prompts for generation without waiting for completion.

  • enqueue_chat

    Enqueue chat conversations for generation without waiting.

  • finish_weight_update

    Finish the current weight update.

  • from_engine_args

    Create an LLM instance from EngineArgs.

  • generate

    Generates the completions for the input prompts.

  • get_metrics

    Return a snapshot of aggregated metrics from Prometheus.

  • get_world_size

    Get the world size from the parallel config.

  • init_weight_transfer_engine

    Initialize weight transfer for RL training.

  • sleep

    Put the engine to sleep. The engine should not process any requests.

  • start_profile

    Start profiling with optional custom trace prefix.

  • start_weight_update

    Start a new weight update.

  • update_weights

    Update the weights of the model.

  • wait_for_completion

    Wait for all enqueued requests to complete and return results.

  • wake_up

    Wake up the engine from sleep mode. See the sleep

__init__(model, *, runner='auto', convert='auto', tokenizer=None, tokenizer_mode='auto', skip_tokenizer_init=False, trust_remote_code=False, allowed_local_media_path='', allowed_media_domains=None, tensor_parallel_size=1, dtype='auto', quantization=None, revision=None, tokenizer_revision=None, chat_template=None, seed=0, gpu_memory_utilization=0.92, cpu_offload_gb=0, offload_group_size=0, offload_num_in_group=1, offload_prefetch_step=1, offload_params=None, enforce_eager=False, enable_return_routed_experts=False, disable_custom_all_reduce=False, hf_token=None, hf_overrides=None, mm_processor_kwargs=None, pooler_config=None, structured_outputs_config=None, profiler_config=None, attention_config=None, kv_cache_memory_bytes=None, compilation_config=None, quantization_config=None, logits_processors=None, spec_method=None, spec_model=None, spec_tokens=None, **kwargs)

LLM constructor.

__repr__()

Return a transformers-style hierarchical view of the model.

apply_model(func)

Run a function directly on the model inside each worker, returning the result for each of them.

Warning

To reduce the overhead of data transfer, avoid returning large arrays or tensors from this method. If you must return them, make sure you move them to CPU first to avoid taking up additional VRAM!

chat(messages, sampling_params=None, use_tqdm=True, lora_request=None, chat_template=None, chat_template_content_format='auto', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None, tokenization_kwargs=None, mm_processor_kwargs=None)

Generate responses for a chat conversation.

The chat conversation is converted into a text prompt using the tokenizer and calls the generate method to generate the responses.

Multi-modal inputs can be passed in the same way you would pass them to the OpenAI API.

Parameters:

  • messages

    (list[ChatCompletionMessageParam] | Sequence[list[ChatCompletionMessageParam]]) –

    A sequence of conversations or a single conversation.

    • Each conversation is represented as a list of messages.
    • Each message is a dictionary with 'role' and 'content' keys.
  • sampling_params

    (SamplingParams | Sequence[SamplingParams] | None, default: None ) –

    The sampling parameters for text generation. If None, we use the default sampling parameters. When it is a single value, it is applied to every prompt. When it is a list, the list must have the same length as the prompts and it is paired one by one with the prompt.

  • use_tqdm

    (bool | Callable[..., tqdm], default: True ) –

    If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.

  • lora_request

    (Sequence[LoRARequest] | LoRARequest | None, default: None ) –

    LoRA request to use for generation, if any.

  • chat_template

    (str | None, default: None ) –

    The template to use for structuring the chat. If not provided, the model's default chat template will be used.

  • chat_template_content_format

    (ChatTemplateContentFormatOption, default: 'auto' ) –

    The format to render message content.

    • "string" will render the content as a string. Example: "Who are you?"
    • "openai" will render the content as a list of dictionaries, similar to OpenAI schema. Example: [{"type": "text", "text": "Who are you?"}]
  • add_generation_prompt

    (bool, default: True ) –

    If True, adds a generation template to each message.

  • continue_final_message

    (bool, default: False ) –

    If True, continues the final message in the conversation instead of starting a new one. Cannot be True if add_generation_prompt is also True.

  • chat_template_kwargs

    (dict[str, Any] | None, default: None ) –

    Additional kwargs to pass to the chat template.

  • tokenization_kwargs

    (dict[str, Any] | None, default: None ) –

    Overrides for tokenizer.encode.

  • mm_processor_kwargs

    (dict[str, Any] | None, default: None ) –

    Overrides for processor.__call__.

Returns:

collective_rpc(method, timeout=None, args=(), kwargs=None)

Execute an RPC call on all workers.

Parameters:

  • method

    (str | Callable[..., _R]) –

    Name of the worker method to execute, or a callable that is serialized and sent to all workers to execute.

    If the method is a callable, it should accept an additional self argument, in addition to the arguments passed in args and kwargs. The self argument will be the worker object.

  • timeout

    (float | None, default: None ) –

    Maximum time in seconds to wait for execution. Raises a TimeoutError on timeout. None means wait indefinitely.

  • args

    (tuple, default: () ) –

    Positional arguments to pass to the worker method.

  • kwargs

    (dict[str, Any] | None, default: None ) –

    Keyword arguments to pass to the worker method.

Returns:

  • list[_R]

    A list containing the results from each worker.

enqueue(prompts, sampling_params=None, lora_request=None, priority=None, use_tqdm=True, tokenization_kwargs=None, mm_processor_kwargs=None)

Enqueue prompts for generation without waiting for completion.

This method adds requests to the engine queue but does not start processing them. Use wait_for_completion() to process the queued requests and get results.

Parameters:

  • prompts

    (PromptType | Sequence[PromptType]) –

    The prompts to the LLM. See generate() for details.

  • sampling_params

    (SamplingParams | Sequence[SamplingParams] | None, default: None ) –

    The sampling parameters for text generation.

  • lora_request

    (Sequence[LoRARequest] | LoRARequest | None, default: None ) –

    LoRA request to use for generation, if any.

  • priority

    (list[int] | None, default: None ) –

    The priority of the requests, if any.

  • use_tqdm

    (bool | Callable[..., tqdm], default: True ) –

    If True, shows a tqdm progress bar while adding requests.

  • tokenization_kwargs

    (dict[str, Any] | None, default: None ) –

    Overrides for tokenizer.encode.

  • mm_processor_kwargs

    (dict[str, Any] | None, default: None ) –

    Overrides for processor.__call__.

Returns:

  • list[str]

    A list of request IDs for the enqueued requests.

enqueue_chat(messages, sampling_params=None, use_tqdm=True, lora_request=None, priority=None, chat_template=None, chat_template_content_format='auto', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None, tokenization_kwargs=None, mm_processor_kwargs=None)

Enqueue chat conversations for generation without waiting.

This method renders chat conversations and adds the resulting requests to the engine queue. Use wait_for_completion() to get results. To guarantee that all requests are queued before scheduling starts, pause scheduling with sleep(level=0) before calling this method and resume it with wake_up(tags=["scheduling"]) afterward.

Parameters:

  • messages

    (list[ChatCompletionMessageParam] | Sequence[list[ChatCompletionMessageParam]]) –

    A sequence of conversations or a single conversation. Each conversation is represented as a list of messages.

  • sampling_params

    (SamplingParams | Sequence[SamplingParams] | None, default: None ) –

    The sampling parameters for text generation. If None, we use the default sampling parameters.

  • use_tqdm

    (bool | Callable[..., tqdm], default: True ) –

    If True, shows a tqdm progress bar while rendering conversations.

  • lora_request

    (Sequence[LoRARequest] | LoRARequest | None, default: None ) –

    LoRA request to use for generation, if any.

  • priority

    (list[int] | None, default: None ) –

    The priority of the requests, if any.

  • chat_template

    (str | None, default: None ) –

    The template to use for structuring the chat.

  • chat_template_content_format

    (ChatTemplateContentFormatOption, default: 'auto' ) –

    The format to render message content.

  • add_generation_prompt

    (bool, default: True ) –

    If True, adds a generation template to each message.

  • continue_final_message

    (bool, default: False ) –

    If True, continues the final message in the conversation instead of starting a new one.

  • tools

    (list[dict[str, Any]] | None, default: None ) –

    Tools to make available to the model, if any.

  • chat_template_kwargs

    (dict[str, Any] | None, default: None ) –

    Additional kwargs to pass to the chat template.

  • tokenization_kwargs

    (dict[str, Any] | None, default: None ) –

    Overrides for tokenizer.encode.

  • mm_processor_kwargs

    (dict[str, Any] | None, default: None ) –

    Overrides for processor.__call__.

Returns:

  • list[str]

    A list of request IDs for the enqueued requests.

finish_weight_update()

Finish the current weight update.

from_engine_args(engine_args) classmethod

Create an LLM instance from EngineArgs.

generate(prompts, sampling_params=None, *, use_tqdm=True, lora_request=None, priority=None, tokenization_kwargs=None, mm_processor_kwargs=None)

Generates the completions for the input prompts.

This class automatically batches the given prompts, considering the memory constraint. For the best performance, put all of your prompts into a single list and pass it to this method.

Parameters:

  • prompts

    (PromptType | Sequence[PromptType]) –

    The prompts to the LLM. You may pass a sequence of prompts for batch inference. See PromptType for more details about the format of each prompt.

  • sampling_params

    (SamplingParams | Sequence[SamplingParams] | None, default: None ) –

    The sampling parameters for text generation. If None, we use the default sampling parameters. When it is a single value, it is applied to every prompt. When it is a list, the list must have the same length as the prompts and it is paired one by one with the prompt.

  • use_tqdm

    (bool | Callable[..., tqdm], default: True ) –

    If True, shows a tqdm progress bar. If a callable (e.g., functools.partial(tqdm, leave=False)), it is used to create the progress bar. If False, no progress bar is created.

  • lora_request

    (Sequence[LoRARequest] | LoRARequest | None, default: None ) –

    LoRA request to use for generation, if any.

  • priority

    (list[int] | None, default: None ) –

    The priority of the requests, if any. Only applicable when priority scheduling policy is enabled. If provided, must be a list of integers matching the length of prompts, where each priority value corresponds to the prompt at the same index.

  • tokenization_kwargs

    (dict[str, Any] | None, default: None ) –

    Overrides for tokenizer.encode.

  • mm_processor_kwargs

    (dict[str, Any] | None, default: None ) –

    Overrides for processor.__call__.

Returns:

get_metrics()

Return a snapshot of aggregated metrics from Prometheus.

Returns:

  • list[Metric]

    A MetricSnapshot instance capturing the current state

  • list[Metric]

    of all aggregated metrics from Prometheus.

get_world_size(include_dp=True)

Get the world size from the parallel config.

Parameters:

  • include_dp

    (bool, default: True ) –

    If True (default), returns the world size including data parallelism (TP * PP * DP). If False, returns the world size without data parallelism (TP * PP).

Returns:

  • int

    The world size (tensor_parallel_size * pipeline_parallel_size),

  • int

    optionally multiplied by data_parallel_size if include_dp is True.

init_weight_transfer_engine(request)

Initialize weight transfer for RL training.

Parameters:

sleep(level=1, mode='abort')

Put the engine to sleep. The engine should not process any requests. The caller should guarantee that no requests are being processed during the sleep period, before wake_up is called.

Parameters:

  • level

    (int, default: 1 ) –

    The sleep level. - Level 0: Pause scheduling but continue accepting requests. Requests are queued but not processed. - Level 1: Offload model weights to CPU, discard KV cache. The content of kv cache is forgotten. Good for sleeping and waking up the engine to run the same model again. Please make sure there's enough CPU memory to store the model weights. - Level 2: Discard all GPU memory (weights + KV cache). Good for sleeping and waking up the engine to run a different model or update the model, where previous model weights are not needed. It reduces CPU memory pressure.

  • mode

    (PauseMode, default: 'abort' ) –

    How to handle any existing requests, can be "abort", "wait", or "keep".

start_profile(profile_prefix=None)

Start profiling with optional custom trace prefix.

Parameters:

  • profile_prefix

    (str | None, default: None ) –

    Optional prefix for the trace file names. If provided, trace files will be named as "

start_weight_update(is_checkpoint_format=True)

Start a new weight update.

update_weights(request)

Update the weights of the model.

Parameters:

wait_for_completion(output_type=None, *, use_tqdm=True)

wait_for_completion(
 *, use_tqdm: bool | Callable[..., tqdm] = True
) -> list[RequestOutput | PoolingRequestOutput]
wait_for_completion(
 output_type: type[_O] | tuple[type[_O], ...],
 *,
 use_tqdm: bool | Callable[..., tqdm] = True,
) -> list[_O]

Wait for all enqueued requests to complete and return results.

This method processes all requests currently in the engine queue and returns their outputs. Use after enqueue() to get results.

Parameters:

  • output_type

    (type[Any] | tuple[type[Any], ...] | None, default: None ) –

    The expected output type(s). If not provided, accepts both RequestOutput and PoolingRequestOutput.

  • use_tqdm

    (bool | Callable[..., tqdm], default: True ) –

    If True, shows a tqdm progress bar.

Returns:

  • list[Any]

    A list of output objects for all completed requests.

wake_up(tags=None)

Wake up the engine from sleep mode. See the sleep method for more details.

Parameters:

  • tags

    (list[str] | None, default: None ) –

    An optional list of tags to reallocate the engine memory for specific memory allocations. Values must be in ("weights", "kv_cache", "scheduling"). If None, all memory is reallocated. wake_up should be called with all tags (or None) before the engine is used again. Use tags=["scheduling"] to resume from level 0 sleep.

PoolingOutput dataclass

The output data of one pooling output of a request.

Parameters:

  • data

    (Tensor) –

    The extracted hidden states.

PoolingParams

Bases: Struct

API parameters for pooling models.

Attributes:

  • use_activation (bool | None) –

    Whether to apply activation function to the pooler outputs. None uses the pooler's default, which is True in most cases.

  • dimensions (int | None) –

    Reduce the dimensions of embeddings if model support matryoshka representation.

Methods:

  • clone

    Returns a deep copy of the PoolingParams instance.

clone()

Returns a deep copy of the PoolingParams instance.

PoolingRequestOutput

Bases: Generic[_O]

The output data of a pooling request to the LLM.

Parameters:

  • request_id

    (str) –

    A unique identifier for the pooling request.

  • outputs

    (PoolingOutput) –

    The pooling results for the given input.

  • prompt_token_ids

    (list[int]) –

    A list of token IDs used in the prompt.

  • num_cached_tokens

    (int) –

    The number of tokens with prefix cache hit.

  • finished

    (bool) –

    A flag indicating whether the pooling is completed.

RequestOutput

The output data of a completion request to the LLM.

Parameters:

  • request_id

    (str) –

    The unique ID of the request.

  • prompt

    (str | None) –

    The prompt string of the request. For encoder/decoder models, this is the decoder input prompt.

  • prompt_token_ids

    (list[int] | None) –

    The token IDs of the prompt. For encoder/decoder models, this is the decoder input prompt token ids.

  • prompt_logprobs

    (PromptLogprobs | None) –

    The log probabilities to return per prompt token.

  • outputs

    (list[CompletionOutput]) –

    The output sequences of the request.

  • finished

    (bool) –

    Whether the whole request is finished.

  • metrics

    (RequestStateStats | None, default: None ) –

    Metrics associated with the request.

  • lora_request

    (LoRARequest | None, default: None ) –

    The LoRA request that was used to generate the output.

  • encoder_prompt

    (str | None, default: None ) –

    The encoder prompt string of the request. None if decoder-only.

  • encoder_prompt_token_ids

    (list[int] | None, default: None ) –

    The token IDs of the encoder prompt. None if decoder-only.

  • num_cached_tokens

    (int | None, default: None ) –

    The number of tokens with prefix cache hit.

  • kv_transfer_params

    (dict[str, Any] | None, default: None ) –

    The params for remote K/V transfer.

Methods:

  • add

    Merge subsequent RequestOutput into this one

add(next_output, aggregate)

Merge subsequent RequestOutput into this one

SamplingParams

Bases: PydanticMsgspecMixin, Struct

Sampling parameters for text generation.

Overall, we follow the sampling parameters from the OpenAI text completion API (https://platform.openai.com/docs/api-reference/completions/create). In addition, we support beam search, which is not supported by OpenAI.

Methods:

Attributes:

allowed_token_ids = None class-attribute instance-attribute

If provided, the engine will construct a logits processor which only retains scores for the given token ids.

bad_words = None class-attribute instance-attribute

Words that are not allowed to be generated. More precisely, only the last token of a corresponding token sequence is not allowed when the next generated token can complete the sequence.

detokenize = True class-attribute instance-attribute

Whether to detokenize the output.

extra_args = None class-attribute instance-attribute

Arbitrary additional args, that can be used by custom sampling implementations, plugins, etc. Not used by any in-tree sampling implementations.

flat_logprobs = False class-attribute instance-attribute

Whether to return logprobs in flatten format (i.e. FlatLogprob) for better performance. NOTE: GC costs of FlatLogprobs is significantly smaller than list[dict[int, Logprob]]. After enabled, PromptLogprobs and SampleLogprobs would populated as FlatLogprobs.

frequency_penalty = 0.0 class-attribute instance-attribute

Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.

ignore_eos = False class-attribute instance-attribute

Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.

include_stop_str_in_output = False class-attribute instance-attribute

Whether to include the stop strings in output text.

logit_bias = None class-attribute instance-attribute

If provided, the engine will construct a logits processor that applies these logit biases.

logprob_token_ids = None class-attribute instance-attribute

Specific token IDs to return logprobs for. More efficient than logprobs=-1 when you only need logprobs for a small set of tokens. When set, logprobs for exactly these token IDs will be returned, in addition to the sampled token. This is useful for scoring tasks where you want to compare probabilities of specific label tokens.

logprobs = None class-attribute instance-attribute

Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response. When set to -1, return all vocab_size log probabilities.

max_tokens = 16 class-attribute instance-attribute

Maximum number of tokens to generate per output sequence.

min_p = 0.0 class-attribute instance-attribute

Represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.

min_tokens = 0 class-attribute instance-attribute

Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated

n = 1 class-attribute instance-attribute

Number of outputs to return for the given prompt request.

The maximum allowed value is controlled by the VLLM_MAX_N_SEQUENCES environment variable (default: 16384).

num_logprobs property

Number of sample logprobs to return per output token, or None if no sample logprobs were requested. Takes logprob_token_ids into account: when logprobs is unset but logprob_token_ids is set, returns len(logprob_token_ids).

presence_penalty = 0.0 class-attribute instance-attribute

Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.

prompt_logprobs = None class-attribute instance-attribute

Number of log probabilities to return per prompt token. When set to -1, return all vocab_size log probabilities.

repetition_detection = None class-attribute instance-attribute

Parameters for detecting repetitive N-gram patterns in output tokens. If such repetition is detected, generation will be ended early. LLMs can sometimes generate repetitive, unhelpful token patterns, stopping only when they hit the maximum output length (e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature can detect such behavior and terminate early, saving time and tokens.

repetition_penalty = 1.0 class-attribute instance-attribute

Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.

routed_experts_prompt_start = 0 class-attribute instance-attribute

When enable_return_routed_experts is active, skip the first routed_experts_prompt_start prompt tokens from the returned routing data. In multi-turn agent scenarios, set this to the length of the already-returned prefix to avoid duplicating routing for prompt tokens covered by earlier turns. Default 0 returns routing for all prompt tokens.

seed = None class-attribute instance-attribute

Random seed to use for the generation.

skip_clone = False class-attribute instance-attribute

Internal flag indicating that this SamplingParams instance is safe to reuse without cloning. When True, clone() will return self without performing a deep copy. This should only be set when the params object is guaranteed to be dedicated to a single request and won't be modified in ways that would affect other uses.

skip_special_tokens = True class-attribute instance-attribute

Whether to skip special tokens in the output.

spaces_between_special_tokens = True class-attribute instance-attribute

Whether to add spaces between special tokens in the output.

stop = None class-attribute instance-attribute

String(s) that stop the generation when they are generated. The returned output will not contain the stop strings.

stop_token_ids = None class-attribute instance-attribute

Token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.

structured_outputs = None class-attribute instance-attribute

Parameters for configuring structured outputs.

temperature = 1.0 class-attribute instance-attribute

Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.

thinking_token_budget = None class-attribute instance-attribute

Maximum number of tokens allowed for thinking operations.

top_k = 0 class-attribute instance-attribute

Controls the number of top tokens to consider. Set to 0 (or -1) to consider all tokens.

top_p = 1.0 class-attribute instance-attribute

Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.

_validate_logit_bias(model_config)

Validate logit_bias token IDs are within vocabulary range.

clone()

If skip_clone is True, uses shallow copy instead of deep copy.

for_sampler_warmup() staticmethod

Set parameters to exercise all sampler logic.

update_from_generation_config(generation_config, eos_token_id=None)

Update if there are non-default values from generation_config

ScoringOutput dataclass

The output data of one scoring output of a request.

Parameters:

  • score

    (float) –

    The similarity score, which is a scalar value.

TextPrompt

Bases: _PromptOptions

Schema for a text prompt.

Attributes:

  • prompt (str) –

    The input text to be tokenized before passing to the model.

prompt instance-attribute

The input text to be tokenized before passing to the model.

TokensPrompt

Bases: _PromptOptions

Schema for a tokenized prompt.

Attributes:

prompt instance-attribute

The prompt text corresponding to the token IDs, if available.

prompt_token_ids instance-attribute

A list of token IDs to pass to the model.

token_type_ids instance-attribute

A list of token type IDs to pass to the cross encoder model.

initialize_ray_cluster(parallel_config, ray_address=None, require_gpu_on_driver=True)

Initialize the distributed cluster with Ray.

it will connect to the Ray cluster and create a placement group for the workers, which includes the specification of the resources for each distributed worker.

Parameters:

  • parallel_config

    (ParallelConfig) –

    The configurations for parallel execution.

  • ray_address

    (str | None, default: None ) –

    The address of the Ray cluster. If None, uses the default Ray cluster address.

  • require_gpu_on_driver

    (bool, default: True ) –

    If True (default), require at least one GPU on the current (driver) node and pin the first PG bundle to it. Set to False for executors like RayExecutorV2 where all GPU work is delegated to remote Ray actors.