URL: https://bentoml.com

Bento: Run Inference at Scale


Inference On Your Terms

An inference platform built for speed and control. Deploy any model anywhere, with tailored optimization, efficient scaling, and streamlined operations.

Scale Inference, Without Complexity

A complete platform that simplifies inference infrastructure while giving you full control over your deployment.

Deploy Any Model

Open Model Catalog

Deploy popular open-source models with a few clicks.

Llama 4
DeepSeek
Ling/Ring
Flux
Qwen
GPT-OSS

Custom Models

Unified framework for packaging and deploying models of any architecture, framework, or modality.

Fine-tuned open-source models
Your custom models

Manage Inference

Bento Inference Platform

A complete platform for managing, monitoring, and optimizing AI model inference.

Deployment automation and CI/CD
Comprehensive observability
Fine-grained access control
Resource and quota tracking
Performance tuning

Scale Efficiently

Bento Compute Engine

Intelligent resource management for optimal compute utilization.

Cross-region scaling
Elastic auto-scaling
Cold-start acceleration
Multi-cloud compute orchestration
Scale-to-zero

Orchestrate Compute

Your Cloud

Complete control over your infrastructure and deployment environment.

Bring Your Own Cloud
On-Prem
Kubernetes

Bento Cloud

Access to cutting-edge GPU hardware without the procurement hassle.

Nvidia GPUs
AMD GPUs
B200
H100
MI300X
More...

Any Open Models

Build and launch faster than ever. Easily run and scale any model with unified deployment across frameworks.

Open Source Model Launcher

Pre-optimized models for inference with day 1 access to newly released models.

Llama 4
DeepSeek
GPT-OSS
Ling/Ring
Flux
Qwen

Custom Model Serving

Deploy models of any architecture, framework, or modality with full customization.

vLLM
TRT-LLM
JAX
SGLang
PyTorch
Transformers
import typing

import annotated_types
import bentoml
import typing_extensions

# MAX_TOKENS is not defined in this excerpt; it is assumed to be an
# integer constant defined elsewhere in the module.

vllm_image = (
    bentoml.images.Image(python_version='3.11')
    .system_packages('curl', 'git')
    .requirements_file('requirements.txt')
)


@bentoml.service(
    image=vllm_image,
    resources={'gpu': 1, 'gpu_type': 'nvidia-h100'},
)
class VLLM:
    model = bentoml.models.HuggingFaceModel("meta-llama/Meta-Llama-3.1-8B-Instruct")

    def __init__(self) -> None: ...

    @bentoml.api
    async def generate(
        self,
        prompt: str,
        max_tokens: typing_extensions.Annotated[
            int, annotated_types.Ge(128), annotated_types.Le(MAX_TOKENS)
        ] = MAX_TOKENS,
    ) -> typing.AsyncGenerator[str, None]: ...
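Once a service like the one above is running, its generate method can be called over HTTP. The sketch below uses only the Python standard library and assumes the service listens on http://localhost:3000 and exposes the method at POST /generate with a JSON body. The base URL, the helper names, and the example values are illustrative assumptions, not part of the original snippet.

```python
import codecs
import json
import urllib.request
from typing import Iterator


def build_generate_request(base_url: str, prompt: str, max_tokens: int) -> urllib.request.Request:
    """Build a POST request for the service's generate endpoint.

    The /generate path is an assumption based on the method name.
    """
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def stream_generate(base_url: str, prompt: str, max_tokens: int) -> Iterator[str]:
    """Send the request and yield decoded text chunks as they arrive."""
    request = build_generate_request(base_url, prompt, max_tokens)
    # An incremental decoder avoids breaking multi-byte characters
    # that happen to be split across two chunks.
    decoder = codecs.getincrementaldecoder("utf-8")()
    with urllib.request.urlopen(request) as response:
        while chunk := response.read(1024):
            yield decoder.decode(chunk)
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail


# Building the request does not touch the network.
example = build_generate_request("http://localhost:3000", "Hello", max_tokens=256)
print(example.get_method(), example.full_url)
```

To consume the stream against a running service, iterate over stream_generate("http://localhost:3000", "Hello", 256) and print each chunk as it arrives.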

Production-Ready Inference, Now

A complete platform that simplifies inference infrastructure while giving you full control over your deployment.

Tailored Optimization

Bento’s inference stack is built for easy customization. Tune every layer of your deployment to balance speed, cost, and quality for your use case.

Optimize for your goals

Automatically find the optimal configuration based on your latency, throughput, or cost requirements.
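As a rough illustration of choosing a configuration from requirements, the sketch below filters benchmarked candidates by a latency and throughput target and picks the cheapest one that qualifies. It is not Bento's actual optimizer. The Candidate fields, the pick_config helper, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """One benchmarked deployment configuration (all values are made up)."""
    name: str
    p95_latency_ms: float
    throughput_rps: float
    cost_per_hour: float


def pick_config(
    candidates: list[Candidate],
    max_latency_ms: float,
    min_throughput_rps: float,
) -> Optional[Candidate]:
    """Return the cheapest candidate that meets both targets, or None."""
    feasible = [
        c for c in candidates
        if c.p95_latency_ms <= max_latency_ms
        and c.throughput_rps >= min_throughput_rps
    ]
    return min(feasible, key=lambda c: c.cost_per_hour, default=None)


CANDIDATES = [
    Candidate("1-gpu-small-batch", p95_latency_ms=180, throughput_rps=12, cost_per_hour=4.0),
    Candidate("1-gpu-large-batch", p95_latency_ms=450, throughput_rps=30, cost_per_hour=4.0),
    Candidate("2-gpu-tensor-parallel", p95_latency_ms=120, throughput_rps=25, cost_per_hour=8.0),
]

best = pick_config(CANDIDATES, max_latency_ms=200, min_throughput_rps=20)
print(best.name if best else "no feasible configuration")
```

Treating latency and throughput as hard constraints and cost as the objective is one design choice among several. A cost ceiling with latency as the objective would follow the same pattern.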

Advanced performance tuning

Fine-tune every component to squeeze maximum efficiency from your hardware.

Distributed LLM inference

Run large models across multiple GPUs for faster, scalable inference.

Smart Scaling

AI inference workloads have unique scaling patterns that differ from traditional microservices. Our intelligent scaling adapts to inference-specific metrics and patterns for optimal resource utilization.

Auto-scale based on traffic

Intelligent scaling that adapts to demand patterns.

Blazing fast cold start

Ultra-fast initialization for responsive scaling.

Inference-specific metrics

Specialized scaling for auto-regressive models.
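One common inference-specific signal is the number of in-flight requests, since an auto-regressive model holds each request for many decoding steps. The sketch below computes a replica count from that signal. It is a simplified illustration, not the platform's actual scaling logic, and the function name, parameters, and defaults are assumptions.

```python
import math


def desired_replicas(
    in_flight_requests: int,
    target_concurrency: int,
    min_replicas: int = 0,
    max_replicas: int = 10,
) -> int:
    """Replicas needed so each handles at most `target_concurrency` requests.

    With min_replicas=0 the result drops to zero when there is no traffic.
    """
    if target_concurrency <= 0:
        raise ValueError("target_concurrency must be positive")
    needed = math.ceil(in_flight_requests / target_concurrency)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, needed))


for load in (0, 5, 40, 500):
    print(load, desired_replicas(load, target_concurrency=8))
```

A production autoscaler would normally also smooth the signal over a time window to avoid flapping. That step is left out here for brevity.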

Advanced Serving Patterns

Choose the right serving architecture for your specific use case. From real-time interactions to large-scale batch processing, optimize your deployment for maximum efficiency.

Interactive applications

For chatbots, recommendations, and other sub-second latency AI features.

Async long-running tasks

Handle long-running AI tasks that don’t need instant results.

Large-scale batch inference

Batch and process large datasets while minimizing compute overhead.
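The idea behind batch inference can be sketched in a few lines: group inputs into fixed-size batches so the model is invoked once per batch rather than once per item. The example below is a generic illustration with a stand-in model. The helper names and fake_model are hypothetical and do not represent a Bento API.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Iterable[T], batch_size: int) -> Iterator[list[T]]:
    """Yield consecutive batches of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: list[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


def run_batch_inference(
    items: Iterable[T],
    predict_batch: Callable[[list[T]], list[R]],
    batch_size: int,
) -> list[R]:
    """Call `predict_batch` once per batch and collect results in input order."""
    results: list[R] = []
    for batch in batched(items, batch_size):
        results.extend(predict_batch(batch))
    return results


def fake_model(batch: list[str]) -> list[int]:
    """Stand-in for a real model: returns the length of each input."""
    return [len(text) for text in batch]


outputs = run_batch_inference(["a", "bb", "ccc", "dddd", "eeeee"], fake_model, batch_size=2)
print(outputs)
```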

Orchestrate complex workflows

Chain multiple models for advanced RAG and compound AI systems.

Faster Path to Production AI

Everything developers need to build, ship, and scale AI inference.

Dev Codespace

Iterate in the cloud as fast as you do locally

From local edits to instant cloud GPU runs in seconds

LLM Gateway

Unified interface for all LLM providers

One unified API for all LLMs, giving you centralized cost control and optimization

Streamlined Operations

Complete deployment lifecycle management

Version control with rollbacks, plus canary, shadow, and A/B testing for faster, safer releases
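A canary release sends a small, fixed share of traffic to the new version. One simple way to do that, sketched below, is to hash a request identifier into a bucket so the same request always lands on the same version. This is a generic illustration, not a description of how the platform routes traffic. The route function and the version labels are hypothetical.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Deterministically assign a request to 'canary' or 'stable'."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash to a bucket in the range 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "canary" if bucket < canary_percent else "stable"


ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(r, 10) == "canary" for r in ids) / len(ids)
print(f"canary share: {share:.3f}")
```

Hashing instead of random sampling keeps routing stable across retries, which makes it easier to compare versions and to roll back by setting the percentage to zero.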

Full Observability

Comprehensive monitoring and insights

Track compute and performance, monitor LLM-specific metrics, and stay on top of system health

Built For Enterprise

Enterprise-grade security, compliance, and operational capabilities for mission-critical AI deployments.

Reliability

Infrastructure you can count on

Performance SLAs
24/7 monitoring
Uptime guarantee
Automatic failover

Forward Deployed Engineering

Dedicated technical experts for your team

Inference optimization research
Use case specific optimizations
Training & knowledge sharing
Continuous benchmarking

Data Sovereignty

Full control over your data

In Their Words

Hear from the teams who have transformed their AI/ML operations with BentoML.

BentoML enables our Data Science and Engineering teams to work independently, without the need for constant coordination. This allows us to build and deploy AI services with incredible efficiency while giving the ML Engineering team the flexibility to refactor when needed. What used to take days, now takes just hours.

Michael Misiewicz

Director of Data Science

BentoML's infrastructure gave us the platform we needed to launch our initial product and scale it without hiring any infrastructure engineers. Features like scale-to-zero and BYOC have saved us a considerable amount of money.

Patric Fulop

CTO, Neurolabs

With BentoML, we've been able to swiftly test new AI services based on the latest models, with the option to scale them up rapidly.

Massimiliano Ungheretti

Staff Data Scientist

Ready to accelerate your AI inference?

Talk to our engineers to discuss how we can help build an inference solution that’s faster, more cost-efficient, and tailored to your needs.

Our Blog

Models

The Best Open-Source LLMs in 2026

Models

ChatGPT Usage Limits: What They Are and How to Get Rid of Them

Models

The Complete Guide to DeepSeek Models: V3, R1, V4 and Beyond
