Use vLLM for high-throughput production deployments, batch processing, or serving models via an API.
Installation
Install vLLM v0.14 or a more recent version:
uv pip install vllm==0.14
Basic Usage
The LLM class provides a simple interface for offline inference. Use the chat() method to automatically apply the chat template and generate text:
from vllm import LLM, SamplingParams
# Initialize the model
llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")
# Define sampling parameters
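sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,
)
# chat() expects a conversation: a list of messages with "role" and "content" keys
messages = [{"role": "user", "content": "What is C. elegans?"}]
# Generate answer
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)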
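vLLM documents the input to chat() as a conversation: a list of message dictionaries with "role" and "content" keys. The sketch below shows one way to wrap plain strings into that shape. The helper name build_conversation and its system_prompt parameter are illustrative assumptions, not part of vLLM, and the system message assumes the model's chat template accepts a system role.

```python
def build_conversation(user_prompt, system_prompt=None):
    """Wrap plain strings into a single conversation (a list of messages)."""
    messages = []
    if system_prompt is not None:
        # Optional system message placed before the user turn
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


conversation = build_conversation(
    "What is C. elegans?",
    system_prompt="Answer in one short paragraph.",
)
print(conversation)
```

The resulting list can be passed as the first argument of chat(), for example llm.chat(conversation, sampling_params).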
Sampling Parameters
Control text generation behavior using SamplingParams. Key parameters:
- temperature (float, default 1.0): Controls randomness (0.0 = deterministic, higher = more random). Typical range: 0.1-2.0
- top_p (float, default 1.0): Nucleus sampling; limits sampling to the most probable tokens whose cumulative probability is within top_p. Typical range: 0.1-1.0
- top_k (int, default -1): Limits sampling to the k most probable tokens (-1 = disabled). Typical range: 1-100
- min_p (float): Minimum token probability, measured relative to the most likely token. Typical range: 0.01-0.2
- max_tokens (int): Maximum number of tokens to generate
- repetition_penalty (float, default 1.0): Penalty for repeating tokens (>1.0 = discourage repetition). Typical range: 1.0-1.5
- stop (str or list[str]): Strings that terminate generation when encountered
Create a SamplingParams object:
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,
)
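To make the effect of temperature, top_k, top_p, and min_p more concrete, here is a simplified pure-Python sketch of how such filters could reshape a next-token distribution. It is not vLLM's implementation: the function names are hypothetical, the logits are made-up values, and repetition_penalty and stop handling are left out.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature 0 is treated as greedy."""
    if temperature == 0.0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_probs(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Zero out filtered tokens and renormalize the survivors."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order)
    if top_k > 0:
        # Keep only the k most probable tokens
        keep &= set(order[:top_k])
    if top_p < 1.0:
        # Keep the smallest prefix of ranked tokens whose mass reaches top_p
        nucleus, cumulative = set(), 0.0
        for i in order:
            nucleus.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        keep &= nucleus
    if min_p > 0.0:
        # Threshold is relative to the most likely token
        threshold = min_p * probs[order[0]]
        keep &= {i for i in order if probs[i] >= threshold}
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for four candidate tokens
for temperature in (1.0, 0.1):
    print(temperature, [round(p, 3) for p in softmax(logits, temperature)])
print([round(p, 3) for p in filter_probs(softmax(logits), top_k=2)])
```

Lowering the temperature concentrates probability on the top token, while the filters remove unlikely tokens before sampling. That is why a low temperature combined with top_k or min_p tends to give stable, focused output.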
Batched Generation
vLLM automatically batches multiple prompts for efficient processing. You can control batch behavior and generate responses for large datasets:
from vllm import LLM, SamplingParams
llm = LLM(model="LiquidAI/LFM2.5-1.2B-Instruct")
sampling_params = SamplingParams(
    temperature=0.1,
    top_k=50,
    repetition_penalty=1.05,
    max_tokens=512,
)
# Large batch of prompts
prompts = [
    "Explain quantum computing in one sentence.",
    "What are the benefits of exercise?",
    "Write a haiku about programming.",
    # ... many more prompts
]
# chat() expects one conversation (a list of messages) per prompt
conversations = [[{"role": "user", "content": prompt}] for prompt in prompts]
# Generate list of answers
outputs = llm.chat(conversations, sampling_params)
for i, (prompt, output) in enumerate(zip(prompts, outputs)):
    print(f"Prompt {i}: {prompt}")
    print(f"Generated: {output.outputs[0].text}\n")
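For very large datasets it can be convenient to submit prompts in chunks, so results can be saved incrementally instead of being held in memory all at once. The sketch below shows one possible approach. The helpers chunked and to_conversations, the chunk size, and the placeholder prompts are assumptions for illustration. vLLM already batches internally, so chunking is not required for throughput.

```python
from itertools import islice


def to_conversations(prompts):
    """Turn plain prompt strings into one single-turn conversation each."""
    return [[{"role": "user", "content": prompt}] for prompt in prompts]


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Placeholder prompts standing in for a real dataset
prompts = [f"Summarize document {i}." for i in range(10)]

for chunk in chunked(prompts, size=4):
    conversations = to_conversations(chunk)
    # With a loaded model, generation for this chunk would go here, e.g.:
    # outputs = llm.chat(conversations, sampling_params)
    print(f"Prepared {len(conversations)} conversations")
```

Each chunk is an ordinary list of conversations, so it can be passed to chat() unchanged and its outputs written to disk before the next chunk starts.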
OpenAI-Compatible Server
vLLM can serve models through an OpenAI-compatible API, allowing you to use existing OpenAI client libraries. Start the server with vllm serve:
vllm serve LiquidAI/LFM2.5-1.2B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype auto
Optional parameters:
- --max-model-len L: Set the maximum context length to L tokens
- --gpu-memory-utilization 0.9: Set the fraction of GPU memory to use (0.0-1.0)
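When the server is launched from a script, the same command can be assembled programmatically. The following is a minimal sketch that uses only the flags shown above. The helper name build_serve_command and the context length of 4096 are illustrative assumptions.

```python
import shlex


def build_serve_command(model, host="0.0.0.0", port=8000, dtype="auto",
                        max_model_len=None, gpu_memory_utilization=None):
    """Return the `vllm serve` invocation as an argument list."""
    args = ["vllm", "serve", model,
            "--host", host, "--port", str(port), "--dtype", dtype]
    if max_model_len is not None:
        args += ["--max-model-len", str(max_model_len)]
    if gpu_memory_utilization is not None:
        # The flag takes a fraction of GPU memory, so reject values outside (0, 1]
        if not 0.0 < gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0.0, 1.0]")
        args += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    return args


command = build_serve_command(
    "LiquidAI/LFM2.5-1.2B-Instruct",
    max_model_len=4096,  # example value, adjust to your workload
    gpu_memory_utilization=0.9,
)
print(shlex.join(command))
```

The returned list can be handed to subprocess.run(command) to start the server, which avoids shell quoting issues.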
Chat Completions
Once running, you can use the OpenAI Python client or any OpenAI-compatible tool:
from openai import OpenAI
# Point to your vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # vLLM doesn't require authentication by default
)
# Chat completion
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"top_k": 50, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
# Streaming response
stream = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "Tell me a story."}
    ],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")
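Streaming code often needs the complete text as well as the incremental output. The sketch below collects streamed deltas into a single string. The helper collect_stream is hypothetical, and the fake chunks only imitate the chunk.choices[0].delta.content shape used in the loop above so that the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream, on_text=None):
    """Join the text deltas of a chat-completion stream into one string."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # skip chunks that carry no choices
        text = chunk.choices[0].delta.content
        if text is None:
            continue  # skip chunks without text content
        if on_text is not None:
            on_text(text)  # e.g. print each piece as it arrives
        parts.append(text)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object shaped like a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("Once "), fake_chunk(None), fake_chunk("upon a time.")]
story = collect_stream(fake_stream, on_text=lambda t: print(t, end="", flush=True))
print()
```

With a real server, the stream object returned by client.chat.completions.create(..., stream=True) would take the place of fake_stream.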
Vision Models
Installation for Vision Models
To use LFM Vision Models with vLLM, install the required versions:
uv pip install vllm==0.19.0
uv pip install transformers==5.5.0 pillow
Basic Usage
Initialize a vision model and process text and image inputs:
from vllm import LLM, SamplingParams
def build_messages(parts):
    # Convert a list of {"type", "value"} items into one user message
    content = []
    for item in parts:
        if item["type"] == "text":
            content.append({"type": "text", "text": item["value"]})
        elif item["type"] == "image":
            content.append({"type": "image_url", "image_url": {"url": item["value"]}})
        else:
            raise ValueError(f"Unknown item type: {item['type']}")
    return [{"role": "user", "content": content}]
IMAGE_URL = "http://images.cocodataset.org/val2017/000000039769.jpg"
llm = LLM(
    model="LiquidAI/LFM2.5-VL-1.6B",
    max_model_len=1024,
)
sampling_params = SamplingParams(
    temperature=0.1,
    min_p=0.15,
    repetition_penalty=1.05,
    max_tokens=1024,
)
# Batch multiple prompts - text-only and multimodal
prompts = [
    [{"type": "text", "value": "What is C. elegans?"}],
    [{"type": "text", "value": "Say hi in JSON format"}],
    [{"type": "text", "value": "Define AI in Spanish"}],
    [
        {"type": "image", "value": IMAGE_URL},
        {"type": "text", "value": "Describe what you see in this image."},
    ],
]
conversations = [build_messages(p) for p in prompts]
outputs = llm.chat(conversations, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
OpenAI-Compatible API
You can also serve vision models through the OpenAI-compatible API:
vllm serve LiquidAI/LFM2.5-VL-1.6B \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto
from openai import OpenAI
from PIL import Image
import base64
from io import BytesIO
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",
)
# Load and encode image
image = Image.open("path/to/image.jpg")
buffered = BytesIO()
image.save(buffered, format="JPEG")
image_base64 = base64.b64encode(buffered.getvalue()).decode()
# Chat completion with image
response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_base64}"}},
            ],
        }
    ],
    temperature=0.1,
    max_tokens=512,
    extra_body={"min_p": 0.15, "repetition_penalty": 1.05},
)
print(response.choices[0].message.content)
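If re-encoding through Pillow is not needed, the image file's bytes can be placed into a data URL directly. The following sketch uses only the standard library. The helper names image_file_to_data_url and image_message are hypothetical, and the approach assumes the file extension matches the actual image format.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def image_file_to_data_url(path):
    """Encode an image file as a base64 data URL without re-encoding it."""
    path = Path(path)
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(path, question):
    """Build one user message that pairs a question with a local image."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url", "image_url": {"url": image_file_to_data_url(path)}},
        ],
    }


# Demonstration with placeholder bytes; a real call would point at an actual image
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jpg"
    sample.write_bytes(b"\xff\xd8\xff\xd9")
    message = image_message(sample, "Describe this image in detail.")
    print(message["content"][1]["image_url"]["url"][:40])
```

The returned dictionary has the same shape as the message in the request above, so it can be used as an element of the messages list.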
For a complete working example, see the vLLM Vision Model Colab notebook.
