
URL: https://www.buildfastwithai.com/blogs/supercharge-llm-inference-with-vllm

Supercharge LLM Inference with vLLM



Introduction

Large Language Models (LLMs) are at the forefront of AI-driven applications, but running them efficiently remains a challenge due to their high computational and memory requirements. vLLM is an open-source inference engine built to address this: it raises throughput and cuts memory waste through techniques such as PagedAttention and continuous batching. This blog walks through using vLLM, covering installation, model loading, text generation, batch processing, embeddings, and text classification.

By the end of this article, you will:

  • Understand how to install and set up vLLM.
  • Learn how to load and use LLMs efficiently with vLLM.
  • Explore batch processing for handling multiple prompts simultaneously.
  • Generate embeddings and perform text classification using vLLM.

Installation and Setup

Before using vLLM, install the library. In a notebook (Jupyter or Colab), run:

!pip install vllm

In a regular terminal, drop the leading "!". The command installs vLLM together with its dependencies.
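To confirm the install worked before loading a model, a quick check along these lines can help. It is a minimal sketch that uses only the Python standard library; the helper name describe_package is made up for this example and is not part of vLLM.

```python
from importlib import metadata, util


def describe_package(name: str) -> str:
    """Report whether a package is importable and, if known, its version."""
    # find_spec returns None when a top-level package cannot be found
    if util.find_spec(name) is None:
        return f"{name}: not installed"
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata (e.g. a stdlib module)
        version = "unknown version"
    return f"{name}: {version}"


print(describe_package("vllm"))
```

If the line reports "not installed", the pip command most likely ran in a different environment than the one executing your code.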

Initializing and Using vLLM

Loading a Model

To begin, load an LLM using vLLM. The example below loads facebook/opt-125m, a small 125M-parameter OPT model from the Hugging Face Hub that is convenient for quick experiments:

from vllm import LLM

llm = LLM(model="facebook/opt-125m")

This downloads the model weights on first use (if they are not already cached) and initializes the engine, making it ready for inference.

Configuring Sampling Parameters

Sampling parameters control the randomness and diversity of text generation. Here’s how you can configure them:

from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.8,  # Controls randomness; higher values give more varied output
    top_p=0.95,       # Nucleus sampling: sample only from the most likely tokens whose cumulative probability reaches 0.95
    max_tokens=256,   # Maximum number of tokens to generate per prompt
)

These settings influence the model’s output diversity and length.
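To make the effect of temperature and top_p concrete, here is a pure-Python sketch of the two steps on a toy distribution. It illustrates the general technique only and is not vLLM's implementation; the function names and the five-token toy_logits list are assumptions invented for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.8, top_p=0.95, rng=None):
    """Sample one token id from toy logits (temperature must be > 0)."""
    rng = rng or random.Random()
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    token_ids = list(candidates)
    weights = [candidates[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]


# Hypothetical logits for a 5-token vocabulary
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
print(sample_token(toy_logits, rng=random.Random(0)))
```

Lowering temperature concentrates probability on the top token, while lowering top_p removes the unlikely tail before sampling. The two settings can be combined, as in the configuration above.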

Generating Text with vLLM

Now that the model is loaded and configured, let’s generate text from different prompts:

prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is",
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Expected Output

Prompt: 'Hello, my name is', Generated text: 'Alice and I love AI research.'
Prompt: 'The capital of France is', Generated text: 'Paris, a city known for its rich history and culture.'
Prompt: 'The future of AI is', Generated text: 'full of possibilities, revolutionizing industries worldwide.'

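Treat this output as illustrative. Because sampling is random (temperature=0.8), completions differ from run to run, and a model as small as OPT-125M often produces less polished text than shown here. The point is the workflow: a single generate() call returns one result per prompt, in the same order as the input list.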

Batch Processing for Large Workloads

vLLM batches requests automatically. Pass a list of prompts to generate() and the engine schedules them together on the GPU instead of processing them one at a time, which improves throughput.

prompts = [
    "What is the meaning of life?",
    "Write a short story about a cat.",
    "Translate 'hello' to Spanish.",
    "What is the capital of Japan?",
    "Explain the theory of relativity.",
    "Write a poem about the ocean.",
    "What is the highest mountain in the world?",
    "Write a Python function to calculate the factorial of a number.",
] * 10  # 8 prompts repeated 10 times = 80 prompts

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Expected Performance Output

Processed prompts: 100%|██████████| 80/80 [00:02<00:00, 37.47it/s, est. speed input: 346.85 toks/s, output: 3708.88 toks/s]

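In this run, all 80 prompts finished in about 2 seconds (37.47 prompts per second), with roughly 3,700 output tokens generated per second. Exact numbers depend on the GPU, the model, and the vLLM version.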
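The figures in the progress line can be turned into per-prompt averages with a little arithmetic. The sketch below parses the line shown above with a regular expression; the function name and the pattern are assumptions written for this exact line format, and it assumes "it/s" counts completed prompts.

```python
import re

# Progress line copied from the run above
PROGRESS_LINE = (
    "Processed prompts: 100%|██████████| 80/80 [00:02<00:00, 37.47it/s, "
    "est. speed input: 346.85 toks/s, output: 3708.88 toks/s]"
)

PATTERN = re.compile(
    r"\d+/(\d+) .*?([\d.]+)it/s, est\. speed input: ([\d.]+) toks/s, "
    r"output: ([\d.]+) toks/s"
)


def summarize_progress(line):
    """Derive elapsed time and average tokens per prompt from a progress line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognized progress line")
    total_prompts = int(match.group(1))
    prompts_per_s = float(match.group(2))
    input_tok_per_s = float(match.group(3))
    output_tok_per_s = float(match.group(4))
    return {
        "prompts": total_prompts,
        "elapsed_s": total_prompts / prompts_per_s,
        # tokens per second divided by prompts per second = tokens per prompt
        "avg_input_tokens": input_tok_per_s / prompts_per_s,
        "avg_output_tokens": output_tok_per_s / prompts_per_s,
    }


for key, value in summarize_progress(PROGRESS_LINE).items():
    print(f"{key}: {value:.2f}")
```

For this run the averages work out to roughly 9 input tokens and 99 output tokens per prompt, over about 2.1 seconds in total.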


Generating Embeddings with vLLM

Embeddings convert text into numerical vectors, useful for NLP tasks like similarity comparison and clustering.

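prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Note: each LLM instance reserves most of the GPU memory by default, so release
# any earlier instance (or restart the runtime) before creating this one.
# The `task` argument has changed across vLLM releases; check the docs for your version.
model = LLM(
    model="facebook/opt-125m",
    task="embed",        # Run the model as an embedding (pooling) model
    enforce_eager=True,  # Skip CUDA graph capture for faster startup
)

outputs = model.embed(prompts)

for prompt, output in zip(prompts, outputs):
    embeds = output.outputs.embedding
    # Show only the first 16 dimensions to keep the printout readable
    embeds_trimmed = ((str(embeds[:16])[:-1] + ", ...]") if len(embeds) > 16 else embeds)
    print(f"Prompt: {prompt!r} | Embeddings: {embeds_trimmed} (size={len(embeds)})")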

Expected Output

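Prompt: 'Hello, my name is' | Embeddings: [0.024, -0.017, 0.152, ..., 0.101] (size=768)

The values above are illustrative. The loop prints one such line per prompt, showing the first 16 of the 768 dimensions.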
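Once you have embedding vectors, similarity comparison usually comes down to cosine similarity. The following sketch shows the calculation in plain Python. The short 4-dimensional vectors and their texts are hypothetical stand-ins for real 768-dimensional embeddings, chosen only to keep the example readable.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical vectors standing in for real model embeddings
toy_embeddings = {
    "The capital of France is": [0.9, 0.1, 0.0, 0.2],
    "Paris is the capital of": [0.8, 0.2, 0.1, 0.3],
    "The future of AI is": [0.1, 0.9, 0.4, 0.0],
}

query = "The capital of France is"
for text, vector in toy_embeddings.items():
    score = cosine_similarity(toy_embeddings[query], vector)
    print(f"{text!r}: {score:.3f}")
```

With real embeddings you would pass output.outputs.embedding for each prompt instead of the toy vectors. For large collections, a vectorized library or a vector index is usually a better fit than a Python loop.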

Text Classification with vLLM

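Text classification assigns input text to one of a set of predefined classes. The example below reuses facebook/opt-125m to keep the walkthrough consistent, but a base language model has no trained classification head, so the probabilities are only meaningful with a checkpoint fine-tuned for sequence classification.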

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

model = LLM(
    model="facebook/opt-125m",
    task="classify",
    enforce_eager=True,
)

outputs = model.classify(prompts)

for prompt, output in zip(prompts, outputs):
    probs = output.outputs.probs
    probs_trimmed = ((str(probs[:16])[:-1] + ", ...]") if len(probs) > 16 else probs)
    print(f"Prompt: {prompt!r} | Class Probabilities: {probs_trimmed} (size={len(probs)})")

Expected Output

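Prompt: 'The capital of France is' | Class Probabilities: [0.89, 0.05, 0.02, ...] (size=5)

The values above are illustrative. The length of the probability list equals the number of labels defined for the model you load.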
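A probability list only becomes useful once it is mapped to a label. The sketch below picks the most likely class and optionally abstains when confidence is low. The label names, the probabilities, and the min_confidence threshold are all hypothetical; in practice the labels come from the classification model you load.

```python
def top_class(probs, labels, min_confidence=0.0):
    """Return (label, probability) for the most likely class,
    or (None, probability) if it falls below min_confidence."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < min_confidence:
        return None, probs[best]
    return labels[best], probs[best]


# Hypothetical labels and probabilities for a 3-class model
labels = ["negative", "neutral", "positive"]
probs = [0.1, 0.7, 0.2]

label, confidence = top_class(probs, labels)
print(f"Predicted: {label} ({confidence:.2f})")

# With a stricter threshold the same prediction is rejected
print(top_class(probs, labels, min_confidence=0.9))
```

Returning None below the threshold lets the calling code route uncertain inputs to a fallback, such as human review, instead of acting on a weak prediction.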

Conclusion

vLLM is a powerful tool for fast and efficient LLM inference. Key takeaways:

  • It improves inference speed and reduces memory usage.
  • It supports batch processing, real-time streaming, text generation, embeddings, and classification.
  • It is open-source and easy to integrate into NLP pipelines.
