Dedicated Model Inference
Deploy models on dedicated infrastructure, engineered for speed
Purpose-built for teams who need control and the best economics in the market.
Why Dedicated Inference with Together AI?
Designed for production workloads that need consistent performance and operational control.
Built for production inference
Scale to thousands of GPUs for always-on, production inference deployments.
Industry-leading unit economics
We provide the fastest deployments, enabling the best price-performance on top GPUs.
Powered by frontier AI systems research
We continuously roll out the latest innovations to keep your deployments running fast.
Build with leading models
Explore top-performing models across text, image, video, code, and voice.
Chat
DeepSeek V4 Pro
Chat
Gemma-4-31B-it-Pearl
Chat
Qwen3.7-Max
Chat
NVIDIA Nemotron 3 Ultra
Chat
MiniMax M3
Chat
Kimi K2.7 Code
Chat
Qwen3.7-Plus
Chat
GLM-5.2
Chat
gpt-oss-120B
Chat
LFM2 24B A2B
Chat
Qwen3.5-397B-A17B
Chat
MiniMax M2.5
Chat
GLM-5
Chat
Qwen3-Coder-Next
Chat
Kimi K2.5
Image
Wan 2.6 Image
Image
GPT Image 1.5
Chat
Qwen3.5 9B
Chat
GLM-5.1
Chat
Gemma 4 31B
Have your own model?
Deploy custom containers on Together's managed GPU infrastructure with automatic scaling, job queues, and built-in observability.
Key capabilities, purpose-built for AI natives
Scale from self-serve instant clusters to thousands of GPUs, all optimized for better performance with the Together Kernel Collection.
Adaptive speculative decoding
Cut latency on dedicated infrastructure with ATLAS, Together's AdapTive-LeArning Speculator System. Predict and validate multiple tokens per step to accelerate workloads continuously. No decoding bottlenecks.
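To make "predict and validate multiple tokens per step" concrete, here is a minimal sketch of generic greedy speculative decoding. It is not ATLAS itself: the adaptive, learn-from-traffic part is not shown, and `draft_next`, `target_next`, `speculative_step`, and `generate` are hypothetical names with toy stand-in models chosen only for illustration.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Return the tokens accepted in one speculative step (always at least one)."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposed: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # 2) The target model validates the proposals. A real engine scores all k
    #    positions in one batched forward pass; this loop is sequential only
    #    to keep the sketch readable.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if expected != tok:
            accepted.append(expected)  # first mismatch: keep the target's token, stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every proposal matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prompt: List[Token], draft_next: NextToken, target_next: NextToken,
             max_new: int, k: int = 4) -> List[Token]:
    """Generate max_new tokens; output equals plain greedy decoding with the target."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[: len(prompt) + max_new]


# Toy stand-ins (assumptions): the target counts upward modulo 10,
# and the draft agrees except right after seeing the token 2.
def target_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[Token]) -> Token:
    return 9 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate([0], draft_next, target_next, max_new=8))
```

Because a mismatch is always replaced by the target model's own token, the output is the same as decoding with the target alone; the speedup comes from accepting several draft tokens per target pass when the draft guesses well.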
Deploy in minutes
Launch dedicated endpoints in minutes by selecting a target model and hardware configuration. Establish production-ready inference environments without requiring deep infrastructure expertise.
Bring your own language model
Deploy custom models directly from Hugging Face or S3 onto dedicated endpoints via the UI or CLI. Maintain complete ownership while offloading infrastructure management.
Research that ships
Our research team doesn't just publish. They build the optimizations that power every inference request.
- ATLAS
- Megakernel
- ThunderKittens
ATLAS performance
3.18x faster
ATLAS, our AdapTive-LeArning Speculator System, continuously learns from live traffic, outperforming static speculators and specialized hardware.
Together AI CPD vs 2P1D
+40% throughput
Long-context inference without the latency penalty. CPD (cache-aware prefill-decode disaggregation) separates warm and cold requests, cutting time-to-first-token and boosting throughput by up to 40%.
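One way to picture "separating warm and cold requests" is a router that checks how much of a prompt's prefix already has cached state before choosing a worker pool. The sketch below is an illustrative assumption, not Together AI's CPD implementation: the block size, the 0.5 threshold, the pool names, and every function name are hypothetical.

```python
from typing import List, Set, Tuple

BLOCK = 4  # tokens per cache block (hypothetical value)
PrefixKey = Tuple[int, ...]


def block_keys(tokens: List[int]) -> List[PrefixKey]:
    """One key per full block; each key is the whole prefix up to that block's end."""
    n_full = len(tokens) // BLOCK
    return [tuple(tokens[: (i + 1) * BLOCK]) for i in range(n_full)]


def cached_prefix_tokens(tokens: List[int], cache: Set[PrefixKey]) -> int:
    """Number of leading tokens whose blocks are already present in the cache."""
    hit = 0
    for key in block_keys(tokens):
        if key not in cache:
            break  # the cached prefix must be contiguous from the start
        hit = len(key)
    return hit


def route(tokens: List[int], cache: Set[PrefixKey], warm_threshold: float = 0.5) -> str:
    """Label a request 'warm' (mostly cached prefix) or 'cold' (needs full prefill)."""
    fraction = cached_prefix_tokens(tokens, cache) / len(tokens) if tokens else 0.0
    return "warm" if fraction >= warm_threshold else "cold"


if __name__ == "__main__":
    # Pretend an earlier request cached the prefix 1..8.
    cache = set(block_keys([1, 2, 3, 4, 5, 6, 7, 8]))
    print(route([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], cache))  # long shared prefix
    print(route([9, 9, 9, 9, 1, 2], cache))               # nothing shared
```

In this sketch, cold requests would go to workers dedicated to heavy prefill so they do not delay warm requests, which only need to process the uncached tail of the prompt.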
Megakernel vs baseline
Up to 3.6x faster
Megakernel fuses an entire model's forward pass into a single GPU kernel. Made using the ThunderKittens framework, Megakernel eliminates the idle gaps between operations that rob GPUs of their full potential.
ParallelKittens vs NCCL
Up to 1.79x faster
ParallelKittens, an extension to ThunderKittens for multi-GPU workloads developed in collaboration with Stanford's Hazy Lab, cuts the synchronization overhead that large multi-GPU models pay on every single forward pass.
Deployment options
Run models using different deployment options depending on latency needs, traffic patterns, and infrastructure control.
Serverless Inference
Real-time
A fully managed inference API that automatically scales with request volume.
Best for
Variable or unpredictable traffic
Rapid prototyping and iteration
Cost-sensitive or early-stage production workloads
Batch
Process massive workloads of up to 30 billion tokens asynchronously, at up to 50% lower cost.
Best for
Classifying large datasets
Offline summarization
Synthetic data generation
Dedicated Inference
Dedicated Model Inference
An inference endpoint backed by reserved, isolated compute resources and the Together AI inference engine.
Best for
Predictable or steady traffic
Latency-sensitive applications
High-throughput production workloads
Dedicated Container Inference
Run inference with your own engine and model on fully-managed, scalable infrastructure.
Best for
Generative media models
Non-standard runtimes
Custom inference pipelines
Production-grade security and data privacy
We take security and compliance seriously, with strict data privacy controls to keep your information protected. Your data and models remain fully under your ownership, safeguarded by robust security measures.
