vllm ¶
vLLM: a high-throughput and memory-efficient inference engine for LLMs
Modules:
-
assets– -
benchmarks– -
collect_env– -
compilation– -
config– -
connections– -
cute_utils– -
device_allocator– -
distributed– -
engine– -
entrypoints– -
env_override– -
envs– -
exceptions–Custom exceptions for vLLM.
-
forward_context– -
inputs– -
ir– -
kernels–Kernel implementations for vLLM.
-
logger–Logging configuration for vLLM.
-
logging_utils– -
logits_process– -
logprobs– -
lora– -
model_executor– -
model_inspection–Model inspection utilities for vLLM.
-
models– -
multimodal– -
outputs– -
parser– -
platforms– -
plugins– -
pooling_params– -
profiler– -
ray– -
reasoning– -
renderers– -
sampling_params–Sampling parameters for text generation.
-
scalar_type– -
sequence–Sequence and its related classes.
-
third_party– -
tokenizers– -
tool_parsers– -
tracing– -
transformers_utils– -
triton_utils– -
usage– -
utils– -
v1– -
version– -
vllm_flash_attn–
Classes:
-
AsyncEngineArgs–Arguments for asynchronous vLLM engine.
-
ClassificationOutput–The output data of one classification output of a request.
-
CompletionOutput–The output data of one completion output of a request.
-
EmbeddingOutput–The output data of one embedding output of a request.
-
EngineArgs–Arguments for vLLM engine.
-
LLM–An LLM for generating texts from given prompts and sampling parameters.
-
PoolingOutput–The output data of one pooling output of a request.
-
PoolingParams–API parameters for pooling models.
-
PoolingRequestOutput–The output data of a pooling request to the LLM.
-
RequestOutput–The output data of a completion request to the LLM.
-
SamplingParams–Sampling parameters for text generation.
-
ScoringOutput–The output data of one scoring output of a request.
-
TextPrompt–Schema for a text prompt.
-
TokensPrompt–Schema for a tokenized prompt.
Functions:
-
initialize_ray_cluster–Initialize the distributed cluster with Ray.
Attributes:
-
AsyncLLMEngine–The
AsyncLLMEngineclass is an alias of vllm.v1.engine.async_llm.AsyncLLM. -
LLMEngine–The
LLMEngineclass is an alias of vllm.v1.engine.llm_engine.LLMEngine. -
PromptType(TypeAlias) –Schema for any prompt, regardless of model type.
AsyncLLMEngine = AsyncLLM module-attribute ¶
The AsyncLLMEngine class is an alias of vllm.v1.engine.async_llm.AsyncLLM.
LLMEngine = V1LLMEngine module-attribute ¶
The LLMEngine class is an alias of vllm.v1.engine.llm_engine.LLMEngine.
PromptType = DecoderOnlyPrompt | EncoderDecoderPrompt module-attribute ¶
Schema for any prompt, regardless of model type.
This is the input format accepted by most LLM APIs.
AsyncEngineArgs dataclass ¶
Bases: EngineArgs
Arguments for asynchronous vLLM engine.
ClassificationOutput dataclass ¶
CompletionOutput dataclass ¶
The output data of one completion output of a request.
Parameters:
-
(index¶int) –The index of the output in the request.
-
(text¶str) –The generated output text.
-
(token_ids¶Sequence[int]) –The token IDs of the generated output text.
-
(cumulative_logprob¶float | None) –The cumulative log probability of the generated output text.
-
(logprobs¶SampleLogprobs | None) –The log probabilities of the top probability words at each position if the logprobs are requested.
-
(finish_reason¶str | None, default:None) –The reason why the sequence is finished.
-
(stop_reason¶int | str | None, default:None) –The stop string or token id that caused the completion to stop, None if the completion finished for some other reason including encountering the EOS token.
-
(lora_request¶LoRARequest | None, default:None) –The LoRA request that was used to generate the output.
EmbeddingOutput dataclass ¶
EngineArgs dataclass ¶
Arguments for vLLM engine.
Methods:
-
add_cli_args–Shared CLI arguments for vLLM engine.
-
create_engine_config–Create the VllmConfig.
-
create_speculative_config–Initializes and returns a SpeculativeConfig object based on
Attributes:
-
logits_processors(list[str | type[LogitsProcessor]] | None) –Custom logitproc types
-
quantization_config(dict[str, Any] | QuantizationConfigArgs | None) –User-facing quantization configuration. Carries per-layer-kind
logits_processors = ModelConfig.logits_processors class-attribute instance-attribute ¶
Custom logitproc types
quantization_config = None class-attribute instance-attribute ¶
User-facing quantization configuration. Carries per-layer-kind QuantSpecs (linear, moe) and ignore patterns; see :class:QuantizationConfigArgs. Auto-populated from the matching online shorthand when quantization is one of the values in ONLINE_QUANT_SHORTHAND_NAMES.
_check_feature_supported() ¶
Raise an error if the feature is not supported.
_get_min_mm_batched_tokens(model_config) staticmethod ¶
Get the minimum max_num_batched_tokens needed for a multimodal prefix-LM model to process at least one item of any supported modality.
Returns (token_count, modality_name) for the most expensive modality, or None if the value cannot be determined at this stage.
add_cli_args(parser) staticmethod ¶
Shared CLI arguments for vLLM engine.
create_engine_config(usage_context=None, headless=False) ¶
Create the VllmConfig.
NOTE: If VllmConfig is incompatible, we raise an error.
create_speculative_config(target_model_config, target_parallel_config) ¶
Initializes and returns a SpeculativeConfig object based on speculative_config.
LLM ¶
Bases: BeamSearchOfflineMixin, PoolingOfflineMixin, OfflineInferenceMixin
An LLM for generating texts from given prompts and sampling parameters.
This class includes a tokenizer, a language model (possibly distributed across multiple GPUs), and GPU memory space allocated for intermediate states (aka KV cache). Given a batch of prompts and sampling parameters, this class generates texts from the model, using an intelligent batching mechanism and efficient memory management.
Parameters:
-
(model¶str) –The name or path of a HuggingFace Transformers model.
-
(tokenizer¶str | None, default:None) –The name or path of a HuggingFace Transformers tokenizer.
-
(tokenizer_mode¶TokenizerMode | str, default:'auto') –The tokenizer mode. "auto" will use the fast tokenizer if available, and "slow" will always use the slow tokenizer.
-
(skip_tokenizer_init¶bool, default:False) –If true, skip initialization of tokenizer and detokenizer. Expect valid prompt_token_ids and None for prompt from the input.
-
(trust_remote_code¶bool, default:False) –Trust remote code (e.g., from HuggingFace) when downloading the model and tokenizer.
-
(allowed_local_media_path¶str, default:'') –Allowing API requests to read local images or videos from directories specified by the server file system. This is a security risk. Should only be enabled in trusted environments.
-
(allowed_media_domains¶list[str] | None, default:None) –If set, only media URLs that belong to this domain can be used for multi-modal inputs.
-
(tensor_parallel_size¶int, default:1) –The number of GPUs to use for distributed execution with tensor parallelism.
-
(dtype¶ModelDType, default:'auto') –The data type for the model weights and activations. Currently, we support
float32,float16, andbfloat16. Ifauto, we use thedtypeattribute of the Transformers model's config. However, if thedtypein the config isfloat32, we will usefloat16instead. -
(quantization¶QuantizationMethods | None, default:None) –The method used to quantize the model weights. Currently, we support "awq", "gptq", and "fp8" (experimental). If None, we first check the
quantization_configattribute in the model config file. If that is None, we assume the model weights are not quantized and usedtypeto determine the data type of the weights. -
(revision¶str | None, default:None) –The specific model version to use. It can be a branch name, a tag name, or a commit id.
-
(tokenizer_revision¶str | None, default:None) –The specific tokenizer version to use. It can be a branch name, a tag name, or a commit id.
-
(chat_template¶Path | str | None, default:None) –The chat template to apply.
-
(seed¶int, default:0) –The seed to initialize the random number generator for sampling.
-
(gpu_memory_utilization¶float, default:0.92) –The ratio (between 0 and 1) of GPU memory to reserve for the model weights, activations, and KV cache. Higher values will increase the KV cache size and thus improve the model's throughput. However, if the value is too high, it may cause out-of- memory (OOM) errors.
-
(kv_cache_memory_bytes¶int | None, default:None) –Size of KV Cache per GPU in bytes. By default, this is set to None and vllm can automatically infer the kv cache size based on gpu_memory_utilization. However, users may want to manually specify the kv cache memory size. kv_cache_memory_bytes allows more fine-grain control of how much memory gets used when compared with using gpu_memory_utilization. Note that kv_cache_memory_bytes (when not-None) ignores gpu_memory_utilization
-
(cpu_offload_gb¶float, default:0) –The size (GiB) of CPU memory to use for offloading the model weights. This virtually increases the GPU memory space you can use to hold the model weights, at the cost of CPU-GPU data transfer for every forward pass.
-
(offload_group_size¶int, default:0) –Prefetch offloading: Group every N layers together. Offload last
offload_num_in_grouplayers of each group. Default is 0 (disabled). -
(offload_num_in_group¶int, default:1) –Prefetch offloading: Number of layers to offload per group. Default is 1.
-
(offload_prefetch_step¶int, default:1) –Prefetch offloading: Number of layers to prefetch ahead. Higher values hide more latency but use more GPU memory. Default is 1.
-
(offload_params¶set[str] | None, default:None) –Prefetch offloading: Set of parameter name segments to selectively offload. Only parameters whose names contain one of these segments will be offloaded (e.g., {"gate_up_proj", "down_proj"} for MLP weights, or {"w13_weight", "w2_weight"} for MoE expert weights). If None or empty, all parameters are offloaded.
-
(enforce_eager¶bool, default:False) –Whether to enforce eager execution. If True, we will disable CUDA graph and always execute the model in eager mode. If False, we will use CUDA graph and eager execution in hybrid.
-
(enable_return_routed_experts¶bool, default:False) –Whether to return routed experts.
-
(disable_custom_all_reduce¶bool, default:False) –See ParallelConfig.
-
(hf_token¶bool | str | None, default:None) –The token to use as HTTP bearer authorization for remote files . If
True, will use the token generated when runninghf auth login(stored in~/.cache/huggingface/token). -
(hf_overrides¶HfOverrides | None, default:None) –If a dictionary, contains arguments to be forwarded to the HuggingFace config. If a callable, it is called to update the HuggingFace config.
-
(mm_processor_kwargs¶dict[str, Any] | None, default:None) –Arguments to be forwarded to the model's processor for multi-modal data, e.g., image processor. Overrides for the multi-modal processor obtained from
AutoProcessor.from_pretrained. The available overrides depend on the model that is being run. For example, for Phi-3-Vision:{"num_crops": 4}. -
(pooler_config¶PoolerConfig | None, default:None) –Initialize non-default pooling config for the pooling model, e.g.,
PoolerConfig(seq_pooling_type="MEAN", use_activation=False). -
(compilation_config¶int | dict[str, Any] | CompilationConfig | None, default:None) –Either an integer or a dictionary. If it is an integer, it is used as the mode of compilation optimization. If it is a dictionary, it can specify the full compilation configuration.
-
(attention_config¶dict[str, Any] | AttentionConfig | None, default:None) –Configuration for attention mechanisms. Can be a dictionary or an AttentionConfig instance. If a dictionary, it will be converted to an AttentionConfig. Allows specifying the attention backend and other attention-related settings.
-
(spec_method¶str | None, default:None) –Top-level alias for
speculative_config["method"]. -
(spec_model¶str | None, default:None) –Top-level alias for
speculative_config["model"]. -
(spec_tokens¶int | None, default:None) –Top-level alias for
speculative_config["num_speculative_tokens"]. -
(**kwargs¶Any, default:{}) –Arguments for
EngineArgs.
Methods:
-
__init__–LLM constructor.
-
__repr__–Return a transformers-style hierarchical view of the model.
-
apply_model–Run a function directly on the model inside each worker,
-
chat–Generate responses for a chat conversation.
-
collective_rpc–Execute an RPC call on all workers.
-
enqueue–Enqueue prompts for generation without waiting for completion.
-
enqueue_chat–Enqueue chat conversations for generation without waiting.
-
finish_weight_update–Finish the current weight update.
-
from_engine_args–Create an LLM instance from EngineArgs.
-
generate–Generates the completions for the input prompts.
-
get_metrics–Return a snapshot of aggregated metrics from Prometheus.
-
get_world_size–Get the world size from the parallel config.
-
init_weight_transfer_engine–Initialize weight transfer for RL training.
-
sleep–Put the engine to sleep. The engine should not process any requests.
-
start_profile–Start profiling with optional custom trace prefix.
-
start_weight_update–Start a new weight update.
-
update_weights–Update the weights of the model.
-
wait_for_completion–Wait for all enqueued requests to complete and return results.
-
wake_up–Wake up the engine from sleep mode. See the sleep
__init__(model, *, runner='auto', convert='auto', tokenizer=None, tokenizer_mode='auto', skip_tokenizer_init=False, trust_remote_code=False, allowed_local_media_path='', allowed_media_domains=None, tensor_parallel_size=1, dtype='auto', quantization=None, revision=None, tokenizer_revision=None, chat_template=None, seed=0, gpu_memory_utilization=0.92, cpu_offload_gb=0, offload_group_size=0, offload_num_in_group=1, offload_prefetch_step=1, offload_params=None, enforce_eager=False, enable_return_routed_experts=False, disable_custom_all_reduce=False, hf_token=None, hf_overrides=None, mm_processor_kwargs=None, pooler_config=None, structured_outputs_config=None, profiler_config=None, attention_config=None, kv_cache_memory_bytes=None, compilation_config=None, quantization_config=None, logits_processors=None, spec_method=None, spec_model=None, spec_tokens=None, **kwargs) ¶
LLM constructor.
__repr__() ¶
Return a transformers-style hierarchical view of the model.
apply_model(func) ¶
Run a function directly on the model inside each worker, returning the result for each of them.
Warning
To reduce the overhead of data transfer, avoid returning large arrays or tensors from this method. If you must return them, make sure you move them to CPU first to avoid taking up additional VRAM!
chat(messages, sampling_params=None, use_tqdm=True, lora_request=None, chat_template=None, chat_template_content_format='auto', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None, tokenization_kwargs=None, mm_processor_kwargs=None) ¶
Generate responses for a chat conversation.
The chat conversation is converted into a text prompt using the tokenizer and calls the generate method to generate the responses.
Multi-modal inputs can be passed in the same way you would pass them to the OpenAI API.
Parameters:
-
(messages¶list[ChatCompletionMessageParam] | Sequence[list[ChatCompletionMessageParam]]) –A sequence of conversations or a single conversation.
- Each conversation is represented as a list of messages.
- Each message is a dictionary with 'role' and 'content' keys.
-
(sampling_params¶SamplingParams | Sequence[SamplingParams] | None, default:None) –The sampling parameters for text generation. If None, we use the default sampling parameters. When it is a single value, it is applied to every prompt. When it is a list, the list must have the same length as the prompts and it is paired one by one with the prompt.
-
(use_tqdm¶bool | Callable[..., tqdm], default:True) –If
True, shows a tqdm progress bar. If a callable (e.g.,functools.partial(tqdm, leave=False)), it is used to create the progress bar. IfFalse, no progress bar is created. -
(lora_request¶Sequence[LoRARequest] | LoRARequest | None, default:None) –LoRA request to use for generation, if any.
-
(chat_template¶str | None, default:None) –The template to use for structuring the chat. If not provided, the model's default chat template will be used.
-
(chat_template_content_format¶ChatTemplateContentFormatOption, default:'auto') –The format to render message content.
- "string" will render the content as a string. Example:
"Who are you?" - "openai" will render the content as a list of dictionaries, similar to OpenAI schema. Example:
[{"type": "text", "text": "Who are you?"}]
- "string" will render the content as a string. Example:
-
(add_generation_prompt¶bool, default:True) –If True, adds a generation template to each message.
-
(continue_final_message¶bool, default:False) –If True, continues the final message in the conversation instead of starting a new one. Cannot be
Trueifadd_generation_promptis alsoTrue. -
(chat_template_kwargs¶dict[str, Any] | None, default:None) –Additional kwargs to pass to the chat template.
-
(tokenization_kwargs¶dict[str, Any] | None, default:None) –Overrides for
tokenizer.encode. -
(mm_processor_kwargs¶dict[str, Any] | None, default:None) –Overrides for
processor.__call__.
Returns:
-
list[RequestOutput]–A list of
RequestOutputobjects containing the generated -
list[RequestOutput]–responses in the same order as the input messages.
collective_rpc(method, timeout=None, args=(), kwargs=None) ¶
Execute an RPC call on all workers.
Parameters:
-
(method¶str | Callable[..., _R]) –Name of the worker method to execute, or a callable that is serialized and sent to all workers to execute.
If the method is a callable, it should accept an additional
selfargument, in addition to the arguments passed inargsandkwargs. Theselfargument will be the worker object. -
(timeout¶float | None, default:None) –Maximum time in seconds to wait for execution. Raises a
TimeoutErroron timeout.Nonemeans wait indefinitely. -
(args¶tuple, default:()) –Positional arguments to pass to the worker method.
-
(kwargs¶dict[str, Any] | None, default:None) –Keyword arguments to pass to the worker method.
Returns:
-
list[_R]–A list containing the results from each worker.
enqueue(prompts, sampling_params=None, lora_request=None, priority=None, use_tqdm=True, tokenization_kwargs=None, mm_processor_kwargs=None) ¶
Enqueue prompts for generation without waiting for completion.
This method adds requests to the engine queue but does not start processing them. Use wait_for_completion() to process the queued requests and get results.
Parameters:
-
(prompts¶PromptType | Sequence[PromptType]) –The prompts to the LLM. See generate() for details.
-
(sampling_params¶SamplingParams | Sequence[SamplingParams] | None, default:None) –The sampling parameters for text generation.
-
(lora_request¶Sequence[LoRARequest] | LoRARequest | None, default:None) –LoRA request to use for generation, if any.
-
(priority¶list[int] | None, default:None) –The priority of the requests, if any.
-
(use_tqdm¶bool | Callable[..., tqdm], default:True) –If True, shows a tqdm progress bar while adding requests.
-
(tokenization_kwargs¶dict[str, Any] | None, default:None) –Overrides for
tokenizer.encode. -
(mm_processor_kwargs¶dict[str, Any] | None, default:None) –Overrides for
processor.__call__.
Returns:
enqueue_chat(messages, sampling_params=None, use_tqdm=True, lora_request=None, priority=None, chat_template=None, chat_template_content_format='auto', add_generation_prompt=True, continue_final_message=False, tools=None, chat_template_kwargs=None, tokenization_kwargs=None, mm_processor_kwargs=None) ¶
Enqueue chat conversations for generation without waiting.
This method renders chat conversations and adds the resulting requests to the engine queue. Use wait_for_completion() to get results. To guarantee that all requests are queued before scheduling starts, pause scheduling with sleep(level=0) before calling this method and resume it with wake_up(tags=["scheduling"]) afterward.
Parameters:
-
(messages¶list[ChatCompletionMessageParam] | Sequence[list[ChatCompletionMessageParam]]) –A sequence of conversations or a single conversation. Each conversation is represented as a list of messages.
-
(sampling_params¶SamplingParams | Sequence[SamplingParams] | None, default:None) –The sampling parameters for text generation. If None, we use the default sampling parameters.
-
(use_tqdm¶bool | Callable[..., tqdm], default:True) –If
True, shows a tqdm progress bar while rendering conversations. -
(lora_request¶Sequence[LoRARequest] | LoRARequest | None, default:None) –LoRA request to use for generation, if any.
-
(priority¶list[int] | None, default:None) –The priority of the requests, if any.
-
(chat_template¶str | None, default:None) –The template to use for structuring the chat.
-
(chat_template_content_format¶ChatTemplateContentFormatOption, default:'auto') –The format to render message content.
-
(add_generation_prompt¶bool, default:True) –If True, adds a generation template to each message.
-
(continue_final_message¶bool, default:False) –If True, continues the final message in the conversation instead of starting a new one.
-
(tools¶list[dict[str, Any]] | None, default:None) –Tools to make available to the model, if any.
-
(chat_template_kwargs¶dict[str, Any] | None, default:None) –Additional kwargs to pass to the chat template.
-
(tokenization_kwargs¶dict[str, Any] | None, default:None) –Overrides for
tokenizer.encode. -
(mm_processor_kwargs¶dict[str, Any] | None, default:None) –Overrides for
processor.__call__.
Returns:
finish_weight_update() ¶
Finish the current weight update.
from_engine_args(engine_args) classmethod ¶
Create an LLM instance from EngineArgs.
generate(prompts, sampling_params=None, *, use_tqdm=True, lora_request=None, priority=None, tokenization_kwargs=None, mm_processor_kwargs=None) ¶
Generates the completions for the input prompts.
This class automatically batches the given prompts, considering the memory constraint. For the best performance, put all of your prompts into a single list and pass it to this method.
Parameters:
-
(prompts¶PromptType | Sequence[PromptType]) –The prompts to the LLM. You may pass a sequence of prompts for batch inference. See PromptType for more details about the format of each prompt.
-
(sampling_params¶SamplingParams | Sequence[SamplingParams] | None, default:None) –The sampling parameters for text generation. If None, we use the default sampling parameters. When it is a single value, it is applied to every prompt. When it is a list, the list must have the same length as the prompts and it is paired one by one with the prompt.
-
(use_tqdm¶bool | Callable[..., tqdm], default:True) –If
True, shows a tqdm progress bar. If a callable (e.g.,functools.partial(tqdm, leave=False)), it is used to create the progress bar. IfFalse, no progress bar is created. -
(lora_request¶Sequence[LoRARequest] | LoRARequest | None, default:None) –LoRA request to use for generation, if any.
-
(priority¶list[int] | None, default:None) –The priority of the requests, if any. Only applicable when priority scheduling policy is enabled. If provided, must be a list of integers matching the length of
prompts, where each priority value corresponds to the prompt at the same index. -
(tokenization_kwargs¶dict[str, Any] | None, default:None) –Overrides for
tokenizer.encode. -
(mm_processor_kwargs¶dict[str, Any] | None, default:None) –Overrides for
processor.__call__.
Returns:
-
list[RequestOutput]–A list of
RequestOutputobjects containing the -
list[RequestOutput]–generated completions in the same order as the input prompts.
get_metrics() ¶
get_world_size(include_dp=True) ¶
init_weight_transfer_engine(request) ¶
Initialize weight transfer for RL training.
Parameters:
-
(request¶WeightTransferInitRequest | dict) –Weight transfer initialization request with backend-specific info
sleep(level=1, mode='abort') ¶
Put the engine to sleep. The engine should not process any requests. The caller should guarantee that no requests are being processed during the sleep period, before wake_up is called.
Parameters:
-
(level¶int, default:1) –The sleep level. - Level 0: Pause scheduling but continue accepting requests. Requests are queued but not processed. - Level 1: Offload model weights to CPU, discard KV cache. The content of kv cache is forgotten. Good for sleeping and waking up the engine to run the same model again. Please make sure there's enough CPU memory to store the model weights. - Level 2: Discard all GPU memory (weights + KV cache). Good for sleeping and waking up the engine to run a different model or update the model, where previous model weights are not needed. It reduces CPU memory pressure.
-
(mode¶PauseMode, default:'abort') –How to handle any existing requests, can be "abort", "wait", or "keep".
start_profile(profile_prefix=None) ¶
start_weight_update(is_checkpoint_format=True) ¶
Start a new weight update.
update_weights(request) ¶
Update the weights of the model.
Parameters:
-
(request¶WeightTransferUpdateRequest | dict) –Weight update request with backend-specific update info
wait_for_completion(output_type=None, *, use_tqdm=True) ¶
wait_for_completion(
*, use_tqdm: bool | Callable[..., tqdm] = True
) -> list[RequestOutput | PoolingRequestOutput]
wait_for_completion(
output_type: type[_O] | tuple[type[_O], ...],
*,
use_tqdm: bool | Callable[..., tqdm] = True,
) -> list[_O]
Wait for all enqueued requests to complete and return results.
This method processes all requests currently in the engine queue and returns their outputs. Use after enqueue() to get results.
Parameters:
-
(output_type¶type[Any] | tuple[type[Any], ...] | None, default:None) –The expected output type(s). If not provided, accepts both RequestOutput and PoolingRequestOutput.
-
(use_tqdm¶bool | Callable[..., tqdm], default:True) –If True, shows a tqdm progress bar.
Returns:
wake_up(tags=None) ¶
Wake up the engine from sleep mode. See the sleep method for more details.
Parameters:
-
(tags¶list[str] | None, default:None) –An optional list of tags to reallocate the engine memory for specific memory allocations. Values must be in
("weights", "kv_cache", "scheduling"). If None, all memory is reallocated. wake_up should be called with all tags (or None) before the engine is used again. Use tags=["scheduling"] to resume from level 0 sleep.
PoolingOutput dataclass ¶
PoolingParams ¶
Bases: Struct
API parameters for pooling models.
Attributes:
-
use_activation(bool | None) –Whether to apply activation function to the pooler outputs.
Noneuses the pooler's default, which isTruein most cases. -
dimensions(int | None) –Reduce the dimensions of embeddings if model support matryoshka representation.
Methods:
-
clone–Returns a deep copy of the PoolingParams instance.
clone() ¶
Returns a deep copy of the PoolingParams instance.
PoolingRequestOutput ¶
Bases: Generic[_O]
The output data of a pooling request to the LLM.
Parameters:
-
(request_id¶str) –A unique identifier for the pooling request.
-
(outputs¶PoolingOutput) –The pooling results for the given input.
-
(prompt_token_ids¶list[int]) –A list of token IDs used in the prompt.
-
(num_cached_tokens¶int) –The number of tokens with prefix cache hit.
-
(finished¶bool) –A flag indicating whether the pooling is completed.
RequestOutput ¶
The output data of a completion request to the LLM.
Parameters:
-
(request_id¶str) –The unique ID of the request.
-
(prompt¶str | None) –The prompt string of the request. For encoder/decoder models, this is the decoder input prompt.
-
(prompt_token_ids¶list[int] | None) –The token IDs of the prompt. For encoder/decoder models, this is the decoder input prompt token ids.
-
(prompt_logprobs¶PromptLogprobs | None) –The log probabilities to return per prompt token.
-
(outputs¶list[CompletionOutput]) –The output sequences of the request.
-
(finished¶bool) –Whether the whole request is finished.
-
(metrics¶RequestStateStats | None, default:None) –Metrics associated with the request.
-
(lora_request¶LoRARequest | None, default:None) –The LoRA request that was used to generate the output.
-
(encoder_prompt¶str | None, default:None) –The encoder prompt string of the request. None if decoder-only.
-
(encoder_prompt_token_ids¶list[int] | None, default:None) –The token IDs of the encoder prompt. None if decoder-only.
-
(num_cached_tokens¶int | None, default:None) –The number of tokens with prefix cache hit.
-
(kv_transfer_params¶dict[str, Any] | None, default:None) –The params for remote K/V transfer.
Methods:
-
add–Merge subsequent RequestOutput into this one
add(next_output, aggregate) ¶
Merge subsequent RequestOutput into this one
SamplingParams ¶
Bases: PydanticMsgspecMixin, Struct
Sampling parameters for text generation.
Overall, we follow the sampling parameters from the OpenAI text completion API (https://platform.openai.com/docs/api-reference/completions/create). In addition, we support beam search, which is not supported by OpenAI.
Methods:
-
clone–If skip_clone is True, uses shallow copy instead of deep copy.
-
for_sampler_warmup–Set parameters to exercise all sampler logic.
-
update_from_generation_config–Update if there are non-default values from generation_config
Attributes:
-
allowed_token_ids(list[int] | None) –If provided, the engine will construct a logits processor which only
-
bad_words(list[str] | None) –Words that are not allowed to be generated. More precisely, only the
-
detokenize(bool) –Whether to detokenize the output.
-
extra_args(dict[str, Any] | None) –Arbitrary additional args, that can be used by custom sampling
-
flat_logprobs(bool) –Whether to return logprobs in flatten format (i.e. FlatLogprob)
-
frequency_penalty(float) –Penalizes new tokens based on their frequency in the generated text so
-
ignore_eos(bool) –Whether to ignore the EOS token and continue generating
-
include_stop_str_in_output(bool) –Whether to include the stop strings in output text.
-
logit_bias(dict[int, float] | None) –If provided, the engine will construct a logits processor that applies
-
logprob_token_ids(list[int] | None) –Specific token IDs to return logprobs for. More efficient than
-
logprobs(int | None) –Number of log probabilities to return per output token. When set to
-
max_tokens(int | None) –Maximum number of tokens to generate per output sequence.
-
min_p(float) –Represents the minimum probability for a token to be considered,
-
min_tokens(int) –Minimum number of tokens to generate per output sequence before EOS or
-
n(int) –Number of outputs to return for the given prompt request.
-
num_logprobs(int | None) –Number of sample logprobs to return per output token, or
Noneif -
presence_penalty(float) –Penalizes new tokens based on whether they appear in the generated text
-
prompt_logprobs(int | None) –Number of log probabilities to return per prompt token.
-
repetition_detection(RepetitionDetectionParams | None) –Parameters for detecting repetitive N-gram patterns in output tokens.
-
repetition_penalty(float) –Penalizes new tokens based on whether they appear in the prompt and the
-
routed_experts_prompt_start(int) –When enable_return_routed_experts is active, skip the first
-
seed(int | None) –Random seed to use for the generation.
-
skip_clone(bool) –Internal flag indicating that this SamplingParams instance is safe to
-
skip_special_tokens(bool) –Whether to skip special tokens in the output.
-
spaces_between_special_tokens(bool) –Whether to add spaces between special tokens in the output.
-
stop(str | list[str] | None) –String(s) that stop the generation when they are generated. The returned
-
stop_token_ids(list[int] | None) –Token IDs that stop the generation when they are generated. The returned
-
structured_outputs(StructuredOutputsParams | None) –Parameters for configuring structured outputs.
-
temperature(float) –Controls the randomness of the sampling. Lower values make the model
-
thinking_token_budget(int | None) –Maximum number of tokens allowed for thinking operations.
-
top_k(int) –Controls the number of top tokens to consider. Set to 0 (or -1) to
-
top_p(float) –Controls the cumulative probability of the top tokens to consider. Must
allowed_token_ids = None class-attribute instance-attribute ¶
If provided, the engine will construct a logits processor which only retains scores for the given token ids.
bad_words = None class-attribute instance-attribute ¶
Words that are not allowed to be generated. More precisely, only the last token of a corresponding token sequence is not allowed when the next generated token can complete the sequence.
detokenize = True class-attribute instance-attribute ¶
Whether to detokenize the output.
extra_args = None class-attribute instance-attribute ¶
Arbitrary additional args, that can be used by custom sampling implementations, plugins, etc. Not used by any in-tree sampling implementations.
flat_logprobs = False class-attribute instance-attribute ¶
Whether to return logprobs in flatten format (i.e. FlatLogprob) for better performance. NOTE: GC costs of FlatLogprobs is significantly smaller than list[dict[int, Logprob]]. After enabled, PromptLogprobs and SampleLogprobs would populated as FlatLogprobs.
frequency_penalty = 0.0 class-attribute instance-attribute ¶
Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
ignore_eos = False class-attribute instance-attribute ¶
Whether to ignore the EOS token and continue generating tokens after the EOS token is generated.
include_stop_str_in_output = False class-attribute instance-attribute ¶
Whether to include the stop strings in output text.
logit_bias = None class-attribute instance-attribute ¶
If provided, the engine will construct a logits processor that applies these logit biases.
logprob_token_ids = None class-attribute instance-attribute ¶
Specific token IDs to return logprobs for. More efficient than logprobs=-1 when you only need logprobs for a small set of tokens. When set, logprobs for exactly these token IDs will be returned, in addition to the sampled token. This is useful for scoring tasks where you want to compare probabilities of specific label tokens.
logprobs = None class-attribute instance-attribute ¶
Number of log probabilities to return per output token. When set to None, no probability is returned. If set to a non-None value, the result includes the log probabilities of the specified number of most likely tokens, as well as the chosen tokens. Note that the implementation follows the OpenAI API: The API will always return the log probability of the sampled token, so there may be up to logprobs+1 elements in the response. When set to -1, return all vocab_size log probabilities.
max_tokens = 16 class-attribute instance-attribute ¶
Maximum number of tokens to generate per output sequence.
min_p = 0.0 class-attribute instance-attribute ¶
Represents the minimum probability for a token to be considered, relative to the probability of the most likely token. Must be in [0, 1]. Set to 0 to disable this.
min_tokens = 0 class-attribute instance-attribute ¶
Minimum number of tokens to generate per output sequence before EOS or stop_token_ids can be generated
n = 1 class-attribute instance-attribute ¶
Number of outputs to return for the given prompt request.
The maximum allowed value is controlled by the VLLM_MAX_N_SEQUENCES environment variable (default: 16384).
num_logprobs property ¶
Number of sample logprobs to return per output token, or None if no sample logprobs were requested. Takes logprob_token_ids into account: when logprobs is unset but logprob_token_ids is set, returns len(logprob_token_ids).
presence_penalty = 0.0 class-attribute instance-attribute ¶
Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat tokens.
prompt_logprobs = None class-attribute instance-attribute ¶
Number of log probabilities to return per prompt token. When set to -1, return all vocab_size log probabilities.
repetition_detection = None class-attribute instance-attribute ¶
Parameters for detecting repetitive N-gram patterns in output tokens. If such repetition is detected, generation will be ended early. LLMs can sometimes generate repetitive, unhelpful token patterns, stopping only when they hit the maximum output length (e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature can detect such behavior and terminate early, saving time and tokens.
repetition_penalty = 1.0 class-attribute instance-attribute ¶
Penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
routed_experts_prompt_start = 0 class-attribute instance-attribute ¶
When enable_return_routed_experts is active, skip the first routed_experts_prompt_start prompt tokens from the returned routing data. In multi-turn agent scenarios, set this to the length of the already-returned prefix to avoid duplicating routing for prompt tokens covered by earlier turns. Default 0 returns routing for all prompt tokens.
seed = None class-attribute instance-attribute ¶
Random seed to use for the generation.
skip_clone = False class-attribute instance-attribute ¶
Internal flag indicating that this SamplingParams instance is safe to reuse without cloning. When True, clone() will return self without performing a deep copy. This should only be set when the params object is guaranteed to be dedicated to a single request and won't be modified in ways that would affect other uses.
skip_special_tokens = True class-attribute instance-attribute ¶
Whether to skip special tokens in the output.
spaces_between_special_tokens = True class-attribute instance-attribute ¶
Whether to add spaces between special tokens in the output.
stop = None class-attribute instance-attribute ¶
String(s) that stop the generation when they are generated. The returned output will not contain the stop strings.
stop_token_ids = None class-attribute instance-attribute ¶
Token IDs that stop the generation when they are generated. The returned output will contain the stop tokens unless the stop tokens are special tokens.
structured_outputs = None class-attribute instance-attribute ¶
Parameters for configuring structured outputs.
temperature = 1.0 class-attribute instance-attribute ¶
Controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
thinking_token_budget = None class-attribute instance-attribute ¶
Maximum number of tokens allowed for thinking operations.
top_k = 0 class-attribute instance-attribute ¶
Controls the number of top tokens to consider. Set to 0 (or -1) to consider all tokens.
top_p = 1.0 class-attribute instance-attribute ¶
Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
_validate_logit_bias(model_config) ¶
Validate logit_bias token IDs are within vocabulary range.
clone() ¶
If skip_clone is True, uses shallow copy instead of deep copy.
for_sampler_warmup() staticmethod ¶
Set parameters to exercise all sampler logic.
update_from_generation_config(generation_config, eos_token_id=None) ¶
Update if there are non-default values from generation_config
ScoringOutput dataclass ¶
TextPrompt ¶
Bases: _PromptOptions
Schema for a text prompt.
Attributes:
prompt instance-attribute ¶
The input text to be tokenized before passing to the model.
TokensPrompt ¶
Bases: _PromptOptions
Schema for a tokenized prompt.
Attributes:
-
prompt(NotRequired[str]) –The prompt text corresponding to the token IDs, if available.
-
prompt_token_ids(list[int]) –A list of token IDs to pass to the model.
-
token_type_ids(NotRequired[list[int]]) –A list of token type IDs to pass to the cross encoder model.
initialize_ray_cluster(parallel_config, ray_address=None, require_gpu_on_driver=True) ¶
Initialize the distributed cluster with Ray.
it will connect to the Ray cluster and create a placement group for the workers, which includes the specification of the resources for each distributed worker.
Parameters:
-
(parallel_config¶ParallelConfig) –The configurations for parallel execution.
-
(ray_address¶str | None, default:None) –The address of the Ray cluster. If None, uses the default Ray cluster address.
-
(require_gpu_on_driver¶bool, default:True) –If True (default), require at least one GPU on the current (driver) node and pin the first PG bundle to it. Set to False for executors like RayExecutorV2 where all GPU work is delegated to remote Ray actors.
