Most LLM deployments fail at the same choke points: developers get a model working locally, spin up a cloud instance, and then spend weeks firefighting latency spikes, OOM crashes, and unbounded GPU bills. This guide maps the full path from first ollama run to a load-balanced, monitored, auto-recovering production service.
For teams evaluating whether to run inference on edge hardware, cloud GPUs, or both, see the hybrid cloud-edge inference decision guide.
TL;DR
| Phase | What You Do | Tool | Cost Ballpark |
|---|---|---|---|
| Prototype | Run model locally | Ollama | $0 (local GPU/CPU) |
| Validate | Cloud GPU, real traffic test | vLLM + Spheron | ~$2.50/hr (H100 on-demand) |
| Optimize | Benchmark engines, tune flags | vLLM, llama.cpp | Same instance, no extra cost |
| Production | Systemd, health checks, monitoring | systemd + Prometheus | +$0 tooling on existing instance |
| Scale | Multi-instance, spot, sharding | Nginx LB + spot GPUs | 50-70% lower cost on spot vs on-demand |
Phase 1: Prototype Locally with Ollama
Before spending on cloud compute, validate that your chosen model actually meets your quality requirements. This phase costs nothing if you have a local GPU or CPU with enough RAM.
For a deeper look at running models locally, see our guide on running LLMs locally with Ollama. If you are developing on NVIDIA DGX Spark locally, the DGX Spark + GPU cloud pipeline covers the specific steps to move from local to production with the same Docker image.
Pick Your Model
Model size determines VRAM requirements and inference speed. If you are targeting newer large models like Llama 4, see deploying Llama 4 on GPU cloud for model-specific setup. For other popular models, check our DeepSeek V4 deployment guide, Qwen 3.6 Plus deployment guide, Gemma 4 deployment guide, and GLM 5.1 deployment guide. Here is the practical decision table:
| Model Size | VRAM Required (FP16) | VRAM Required (Q4) | Best For |
|---|---|---|---|
| 7B | ~14GB | ~5GB | Fast iteration, cost-sensitive API |
| 13B | ~26GB | ~8GB | Better quality, still single-GPU |
| 30B | ~60GB | ~18GB | High-quality outputs, multi-GPU or Q4 |
| 70B | ~140GB | ~35-40GB | Near-GPT-4 quality, needs H100 or multi-GPU |
If you are targeting a 7B model, a local machine with a 24GB GPU (RTX 3090 or 4090) handles it in FP16. A 13B model needs ~26GB in FP16, so on a 24GB card run it quantized (Q4 needs ~8GB). For 30B+ you are looking at quantization locally, or moving to cloud for unquantized serving.
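The FP16 column in the table follows from simple arithmetic: parameter count times bytes per parameter. The sketch below is a rough estimate under that assumption only. The function names are illustrative, and the result covers weights alone, not KV cache, activations, or runtime overhead, which is why the table's Q4 figures sit above the raw 4-bit number.

```python
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate VRAM for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # Billions of parameters times bytes per parameter gives gigabytes.
    return params_billion * bytes_per_param


def fits_on_gpu(params_billion: float, bits_per_param: float, gpu_vram_gb: float) -> bool:
    """True if the weights alone fit; real deployments need extra headroom."""
    return weight_vram_gb(params_billion, bits_per_param) <= gpu_vram_gb


if __name__ == "__main__":
    for size in (7, 13, 30, 70):
        fp16 = weight_vram_gb(size, 16)
        four_bit = weight_vram_gb(size, 4)
        print(f"{size}B  FP16: {fp16:.0f} GB  raw 4-bit: {four_bit:.1f} GB")
```

Treat the output as a lower bound when choosing hardware, then confirm with an actual model load.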
Run It with Ollama
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull a model (e.g. Llama 3.1 8B)
ollama pull llama3.1:8b
# Run interactively
ollama run llama3.1:8b
# Or test via API
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Explain how KV cache works in transformer inference.",
"stream": false
}'
What to Test in Phase 1
Work through this checklist before moving on:
- Output quality: Does the model answer your specific domain questions correctly? Run 20-30 prompts representative of your actual workload.
- Latency on your hardware: Note time-to-first-token (TTFT) and tokens/second. This is your baseline before GPU cloud.
- Context window limits: Test with your longest expected prompts. Does quality degrade at 4K tokens? 8K? 16K?
- Model behavior with your prompts: Does it follow your system prompt? Does it hallucinate on your domain?
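For the latency item in the checklist, Ollama's non-streaming /api/generate response carries timing fields in nanoseconds (load_duration, prompt_eval_duration, eval_count, eval_duration, total_duration). The sketch below turns them into baseline numbers. The helper name and sample values are illustrative, and treating load time plus prompt evaluation time as TTFT is an approximation, not a measured first-token time.

```python
NS_PER_SECOND = 1_000_000_000


def summarize_ollama_timing(response: dict) -> dict:
    """Derive baseline latency numbers from an Ollama /api/generate response."""
    load_s = response.get("load_duration", 0) / NS_PER_SECOND
    prefill_s = response.get("prompt_eval_duration", 0) / NS_PER_SECOND
    decode_s = response["eval_duration"] / NS_PER_SECOND
    return {
        # Approximation: time before the first generated token appears.
        "approx_ttft_s": load_s + prefill_s,
        # Generated tokens divided by time spent generating them.
        "tokens_per_second": response["eval_count"] / decode_s,
        "total_s": response["total_duration"] / NS_PER_SECOND,
    }


if __name__ == "__main__":
    # Made-up sample shaped like the JSON returned when "stream" is false.
    sample = {
        "load_duration": 50_000_000,
        "prompt_eval_duration": 200_000_000,
        "eval_count": 120,
        "eval_duration": 4_000_000_000,
        "total_duration": 4_300_000_000,
    }
    print(summarize_ollama_timing(sample))
```

Record these per prompt across your 20-30 test prompts so you have a distribution to compare against the cloud GPU later, not a single number.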
When to Move On
Move to Phase 2 when any of these are true:
- Your p50 TTFT exceeds 1 second under single-user load.
- You need to handle more than 5 concurrent users.
- Your target model requires more VRAM than your local GPU has.
- You need 24/7 availability without tying up your laptop.
Phase 2: Validate on a Cloud GPU with vLLM
Phase 2 is about getting a realistic read on throughput and cost before committing to production architecture. Pick a GPU, deploy vLLM, and run a real load test.
For a full breakdown of inference GPUs by cost-per-token, see the AI inference GPU comparison.
Choose Your GPU
Match your model size to the right GPU. Overpaying for a bigger GPU than you need is the most common Phase 2 mistake.
| Model Size | Recommended GPU | VRAM | On-Demand Price | Spot Price |
|---|---|---|---|---|
| 7B (FP16) | L40S | 48GB | $0.72/hr | N/A |
| 13B (FP16) | L40S | 48GB | $0.72/hr | N/A |
| 30B (FP16) | A100 SXM4 80GB | 80GB | $1.07/hr | $0.60/hr |
| 70B (FP8) | H100 SXM5 80GB | 80GB | $2.50/hr | $1.03/hr |
| 70B (FP16) | 2x H100 SXM5 80GB | 160GB | $5.00/hr | $2.06/hr |
Prices as of 15 Apr 2026 on Spheron GPU rental. Pricing fluctuates based on GPU availability and may have changed since then. See current GPU pricing for live rates.
Spin Up a Spheron Instance
- Log in to app.spheron.ai and select your GPU model from the catalog. See the Spheron GPU pricing for a full list of available GPU instances and regions.
- Choose your region and provider, then launch the instance.
- SSH into the instance and verify the GPU is visible:
nvidia-smi
You should see your GPU with the expected VRAM (e.g., 80034MiB for an H100 80GB). If you provisioned multiple GPUs, all should appear.
- Confirm Docker and the NVIDIA Container Toolkit are installed:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Deploy with vLLM
For a complete guide on replacing OpenAI API calls with your own vLLM server, see self-hosted OpenAI-compatible API with vLLM.
Run the vLLM OpenAI-compatible server. This single command covers the common case for a 7B model on an L40S:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--max-num-seqs 256
Key flags:
- --ipc=host: required for shared memory between GPU processes. Skipping causes CUDA errors under load.
- --gpu-memory-utilization 0.90: leaves 10% headroom for CUDA overhead. Go to 0.92-0.95 if you need more KV cache space.
- --max-num-seqs 256: maximum concurrent sequences in a batch. Raise this for high-throughput workloads.
- --max-model-len 8192: context window limit. Lower this to reduce KV cache memory pressure if you do not need long contexts.
For a 70B model in FP8 on H100:
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--quantization fp8 \
--gpu-memory-utilization 0.92 \
--max-model-len 4096 \
--max-num-seqs 64
Test the deployment:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello, are you running?"}]
}'
Load Test with Real Traffic
Single-request latency does not predict production performance. Run a load test to measure throughput under concurrency:
# Install locust
pip install locust
# locustfile.py
from locust import HttpUser, task, between

class LLMUser(HttpUser):
    wait_time = between(0.1, 0.5)

    @task
    def chat(self):
        self.client.post("/v1/chat/completions", json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "Write a two-sentence summary of transformer architecture."}],
            "max_tokens": 100
        }, timeout=120)

# Run with 50 concurrent users
locust -f locustfile.py --headless -u 50 -r 5 --host http://localhost:8000 --run-time 60s
Capture these numbers before moving to Phase 3:
- Throughput: requests/second and tokens/second at steady state
- p50 and p95 TTFT: time-to-first-token at median and 95th percentile
- p95 end-to-end latency: total request time including generation
- Error rate: any 429 (queue full) or 500 errors at your target concurrency
Pass criteria: if your p95 TTFT is under your SLA at your target concurrency, you have the right GPU. If not, raise --max-num-seqs when requests are queuing (check vllm:num_requests_waiting), upgrade the GPU, or add a second instance.
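The pass criterion can be checked mechanically once you have per-request TTFT samples from the load test. The sketch below uses a nearest-rank percentile; the function names and the sample list are illustrative, and load-testing tools may use a slightly different percentile definition.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct%
    of all samples at or below it. pct is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_sla(ttft_seconds, sla_seconds, pct=95):
    """True if the chosen TTFT percentile is within the SLA."""
    return percentile(ttft_seconds, pct) <= sla_seconds


if __name__ == "__main__":
    # Hypothetical TTFT samples in seconds from one load-test run.
    ttft = [0.21, 0.25, 0.19, 0.40, 0.33, 0.95, 0.28, 0.31, 0.24, 0.27]
    print("p50:", percentile(ttft, 50), "p95:", percentile(ttft, 95))
    print("meets 0.5s SLA:", meets_sla(ttft, 0.5))
```

A single slow request decides p95 when the sample count is small, so run the load test long enough to collect a few hundred samples before drawing conclusions.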
Phase 2 Cost Snapshot
| Configuration | Tokens/sec | On-Demand $/hr | Cost per 1M Tokens |
|---|---|---|---|
| 7B on L40S | 2,500-4,000 | $0.72 | ~$0.050-0.080 |
| 30B on A100 80GB | 800-1,500 | $1.07 | ~$0.198-0.370 |
| 70B (FP8) on H100 | 400-700 | $2.50 | ~$0.99-1.74 |
| 70B (FP16) on 2x H100 | 600-1,000 | $5.00 | ~$1.39-2.31 |
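The last column of the table is derived from the other two: hourly price divided by tokens generated per hour. The sketch below reproduces that arithmetic so you can plug in your own load-test throughput. The function name is illustrative, and the calculation assumes the GPU is fully utilized for the whole hour.

```python
SECONDS_PER_HOUR = 3600


def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # (label, on-demand $/hr, low tokens/sec, high tokens/sec) from the table above.
    rows = [
        ("7B on L40S", 0.72, 2500, 4000),
        ("30B on A100 80GB", 1.07, 800, 1500),
        ("70B (FP8) on H100", 2.50, 400, 700),
        ("70B (FP16) on 2x H100", 5.00, 600, 1000),
    ]
    for label, price, low, high in rows:
        best = cost_per_million_tokens(price, high)
        worst = cost_per_million_tokens(price, low)
        print(f"{label}: ${best:.3f}-{worst:.3f} per 1M tokens")
```

Real utilization is rarely 100%, so divide the throughput by your expected utilization fraction for a more conservative figure.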
Phase 3: Optimize Your Inference Engine
Phase 2 confirmed your GPU size and baseline throughput. Phase 3 is about getting more out of that hardware before you harden it for production.
For a detailed comparison of inference frameworks including benchmark numbers, see the full vLLM production deployment guide.
vLLM vs llama.cpp vs TGI: A Practical Comparison
| Factor | vLLM | llama.cpp | TGI (Text Generation Inference) |
|---|---|---|---|
| Throughput (concurrent) | Highest | Medium | High |
| p50 latency | Low | Low | Low |
| Multi-GPU support | Yes (tensor + pipeline) | Limited | Yes |
| Quantization | FP8, INT4, GPTQ, AWQ | GGUF (Q2-Q8) | GPTQ, AWQ, FP8 |
| Ops complexity | Medium | Low | Medium |
| Best for | Production batch/API serving | Single-user, CPU inference | Hugging Face-native stack |
The right answer for most production deployments is vLLM. llama.cpp is worth considering for CPU-only or very low-concurrency use cases where its GGUF quantization formats are a better fit. TGI is a viable alternative if you are already on the Hugging Face stack and want tighter integration with their ecosystem.
Tuning vLLM for Your Workload
Start with the defaults, then tune based on your Phase 2 load test results:
# High-throughput batch workload
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 512 \
--max-model-len 4096 \
--performance-mode throughput
# Low-latency interactive workload
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.85 \
--max-num-seqs 64 \
--max-model-len 8192 \
--performance-mode interactivity
Key flags to tune:
- --gpu-memory-utilization: higher value means more KV cache space and higher concurrency ceiling, at the risk of OOM. Tune in 0.02 increments.
- --max-num-seqs: directly caps concurrent sequences. Set to 2-3x your expected peak concurrency.
- --quantization fp8: use on H100 and Blackwell GPUs. Gives ~1.5-2x throughput improvement. Not available on older GPU architectures.
- --performance-mode: new in vLLM v0.17.0. throughput favors batching, interactivity favors TTFT, balanced is the default.
- --kv-cache-dtype fp8: also store the KV cache in FP8 on H100. Saves additional VRAM, enabling longer context windows or more concurrent sequences.
For workloads where GPU KV cache still fills after FP8 quantization, see NVMe KV Cache Offloading for LLM Inference for adding a third NVMe storage tier with LMCache.
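To see why --gpu-memory-utilization and --kv-cache-dtype matter, it helps to estimate how many tokens of KV cache fit in the memory left after weights. The sketch below uses the standard per-token formula (keys and values, per layer, per KV head). The function names are illustrative, and the Llama 3.1 8B shape used in the example (32 layers, 8 KV heads, head dimension 128) is an assumption you should check against the model's config.json; vLLM's real allocation also depends on block size and other overheads.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int) -> int:
    """KV cache bytes for one token: keys plus values across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(kv_budget_gb: float, bytes_per_token: int) -> int:
    """Tokens that fit in a KV cache budget given in GB (1 GB = 1e9 bytes)."""
    return int(kv_budget_gb * 1e9 // bytes_per_token)


if __name__ == "__main__":
    # Assumed model shape; verify against the model's config.json.
    layers, kv_heads, head_dim = 32, 8, 128
    fp16 = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2)
    fp8 = kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=1)
    budget_gb = 20  # hypothetical VRAM left for KV cache after weights
    print("FP16 KV cache tokens:", max_cached_tokens(budget_gb, fp16))
    print("FP8 KV cache tokens: ", max_cached_tokens(budget_gb, fp8))
```

Dividing the token capacity by your typical prompt-plus-output length gives a rough ceiling on useful concurrency, which is a sanity check for the --max-num-seqs value you pick.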
Quantization Trade-offs
| Precision | VRAM for 70B | Throughput (relative) | Quality Impact |
|---|---|---|---|
| FP16 | ~140GB | 1x baseline | No loss |
| FP8 | ~70GB | 1.5-2x | Less than 1-2% on benchmarks |
| INT4 (AWQ) | ~40GB | 1.2-1.5x | 2-5% on benchmarks, varies by model |
| GGUF Q4_K_M | ~38GB | 0.6-1x | 3-6% on benchmarks, CPU-friendly |
FP8 is the practical choice on H100 and Blackwell. The throughput gain is real and the quality loss is marginal for most use cases. INT4 is worth considering if you need a 70B model on a single GPU where ~70GB of FP8 weights would leave too little room for KV cache (e.g., an A100 80GB, which is a tight fit).
Benchmark Your Setup
After tuning, capture baseline numbers you can compare against later:
# GPU-level metrics (run in a separate terminal while vLLM is under load)
nvidia-smi dmon -s pucvmet -d 5
# vLLM internal metrics (raw Prometheus format)
curl http://localhost:8000/metrics | grep -E "num_requests_waiting|gpu_cache_usage|time_to_first_token"
Numbers to record before Phase 4:
- GPU compute utilization at peak load
- GPU memory bandwidth utilization at peak load
- vllm:gpu_cache_usage_perc at peak load
- vllm:time_to_first_token_seconds p50 and p95
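If you want to log these numbers programmatically instead of reading grep output, the /metrics payload is plain Prometheus text format and is easy to parse. The sketch below extracts gauge values by name. The helper name and the sample payload are illustrative, and it assumes one series per metric with no trailing timestamps, which is typical for a single-model server but not guaranteed.

```python
def parse_gauges(metrics_text: str, wanted: set[str]) -> dict[str, float]:
    """Extract the latest value for each wanted metric name from
    Prometheus text exposition format."""
    values: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            values[name] = float(raw_value)
    return values


if __name__ == "__main__":
    # Illustrative payload; fetch the real one from http://localhost:8000/metrics.
    sample = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.42
"""
    names = {"vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc"}
    print(parse_gauges(sample, names))
```

Histogram metrics such as vllm:time_to_first_token_seconds are exposed as bucket series, so percentiles for those are better computed in Prometheus than by hand.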
Phase 4: Harden for Production
You have a working, tuned vLLM deployment. Phase 4 is about making it survive a long weekend without manual intervention.
For architecture patterns that apply to any GPU production workload, see our production GPU cloud architecture patterns guide.
Systemd Service Unit
Wrap the Docker container in a systemd service so it restarts automatically on crashes, reboots, and OOM kills:
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI-compatible inference server
After=docker.service
Requires=docker.service
[Service]
Type=simple
Restart=always
RestartSec=10
ExecStartPre=-/usr/bin/docker stop vllm-server
ExecStartPre=-/usr/bin/docker rm vllm-server
ExecStart=/usr/bin/docker run \
--name vllm-server \
--gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-8B-Instruct \
--dtype float16 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 256 \
--max-model-len 8192
ExecStop=/usr/bin/docker stop vllm-server
[Install]
WantedBy=multi-user.target
Enable and start:
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo systemctl status vllm
Health Checks
vLLM exposes a /health endpoint. Poll it and trigger a service restart if it stops responding:
#!/bin/bash
# /usr/local/bin/vllm-healthcheck.sh
# Skip restart if the service has been active for less than 15 minutes (model loading grace period)
active_since=$(systemctl show vllm --property=ActiveEnterTimestamp --value 2>/dev/null)
if [ -n "$active_since" ]; then
start_epoch=$(date -d "$active_since" +%s 2>/dev/null); [ -z "$start_epoch" ] && exit 0
now_epoch=$(date +%s)
uptime_seconds=$((now_epoch - start_epoch))
if [ "$uptime_seconds" -lt 900 ]; then
echo "vLLM has been up for ${uptime_seconds}s, within grace period, skipping health check"
exit 0
fi
fi
# Also skip if service is still in activating state (i.e., loading)
service_state=$(systemctl is-active vllm 2>/dev/null)
if [ "$service_state" = "activating" ]; then
echo "vLLM is still activating, skipping health check"
exit 0
fi
response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 --connect-timeout 5 http://localhost:8000/health)
if [ "$response" != "200" ]; then
echo "vLLM health check failed (HTTP $response), restarting service"
systemctl restart vllm
fi
Add a cron job to run this every minute:
sudo chmod +x /usr/local/bin/vllm-healthcheck.sh
echo "* * * * * root /usr/local/bin/vllm-healthcheck.sh >> /var/log/vllm-healthcheck.log 2>&1" \
  | sudo tee /etc/cron.d/vllm-healthcheck
Monitoring Setup
For GPU monitoring details including DCGM and Grafana setup, see GPU monitoring with Prometheus and Grafana.
Add a Prometheus scrape job for vLLM metrics:
# prometheus.yml (add to scrape_configs)
scrape_configs:
  - job_name: 'vllm'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
Three alerts to configure from the start:
# vllm_alerts.yml
groups:
  - name: vllm
    rules:
      - alert: VLLMQueueDepth
        expr: vllm:num_requests_waiting > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "vLLM request queue is backing up"
      - alert: VLLMKVCachePressure
        expr: vllm:gpu_cache_usage_perc > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM KV cache above 90%, consider scaling"
      - alert: VLLMHighTTFT
        expr: histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2.0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "vLLM p95 TTFT above 2s SLA"
Load Balancing Two Instances
For redundancy or throughput beyond a single GPU, add an Nginx upstream block:
# /etc/nginx/sites-available/vllm
upstream vllm_backend {
    least_conn;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
}
server {
    listen 80;
    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_buffering off;
    }
}
least_conn is the right load balancing strategy for LLM inference because request duration varies significantly. Round-robin can pile long requests onto one backend while the other sits idle.
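The difference between the two strategies is easy to see in a toy simulation. The sketch below is not how Nginx is implemented; it is a simplified model with one request arriving per time unit, and the function name and workload are illustrative. The workload is a contrived worst case in which long and short generations alternate.

```python
def peak_connections(durations, n_backends, policy):
    """Simulate one request arriving per time unit and return the peak
    number of active connections seen on each backend."""
    finish_times = [[] for _ in range(n_backends)]
    peak = [0] * n_backends
    for t, duration in enumerate(durations):
        # Drop requests that have completed by time t.
        for b in range(n_backends):
            finish_times[b] = [f for f in finish_times[b] if f > t]
        if policy == "round_robin":
            target = t % n_backends
        elif policy == "least_conn":
            # Pick the backend with the fewest active connections.
            target = min(range(n_backends), key=lambda b: len(finish_times[b]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        finish_times[target].append(t + duration)
        peak[target] = max(peak[target], len(finish_times[target]))
    return peak


if __name__ == "__main__":
    workload = [10, 1] * 10  # long generation, short generation, repeated
    print("round_robin:", peak_connections(workload, 2, "round_robin"))
    print("least_conn: ", peak_connections(workload, 2, "least_conn"))
```

With this workload, round-robin sends every long request to the same backend, while least-connections spreads them out. Real traffic is less adversarial, but the same imbalance shows up whenever generation lengths vary widely.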
Phase 4 Architecture Diagram
              Client
                |
                v
Nginx (least_conn load balancer, port 80)
        |               |
        v               v
     vLLM-1          vLLM-2
   (port 8000)     (port 8000)
        |               |
        v               v
      GPU-1           GPU-2
        |               |
        v               v
   Prometheus      Prometheus
   (scrape /metrics every 10s)
Phase 5: Scale
Single-instance is fine for development and low-traffic production. Phase 5 is for when your traffic outgrows it.
For GPU cost reduction strategies that apply across all phases, see GPU cost optimization strategies.
Horizontal Scaling: When and How
Calculate how many instances you need:
instances_needed = ceil(peak_rps / single_instance_rps)
Example: if your load test showed a single L40S handles 12 requests/second at p95 TTFT under 500ms, and your peak traffic is 60 requests/second, you need:
ceil(60 / 12) = 5 L40S instances
Add 20-30% buffer for traffic spikes: plan for 6-7 instances.
For stateless LLM inference (no session affinity), horizontal scaling is straightforward. Each instance runs an independent vLLM server. Nginx distributes load across all of them. No shared state to coordinate.
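The sizing rule above, including the spike buffer, fits in a few lines. This is a sketch of the arithmetic only. The function name and the integer-percent buffer parameter are illustrative choices, and the per-instance rate should come from your own load test at the SLA you care about.

```python
import math


def instances_needed(peak_rps: float, single_instance_rps: float,
                     buffer_pct: int = 0) -> int:
    """Instances required for peak traffic, plus an optional spike buffer
    given as a whole-number percentage (20 means 20%)."""
    if single_instance_rps <= 0:
        raise ValueError("single_instance_rps must be positive")
    base = math.ceil(peak_rps / single_instance_rps)
    return math.ceil(base * (100 + buffer_pct) / 100)


if __name__ == "__main__":
    # Figures from the example above: 60 rps peak, 12 rps per L40S.
    for buffer_pct in (0, 20, 30):
        count = instances_needed(60, 12, buffer_pct)
        print(f"buffer {buffer_pct}%: {count} instances")
```

Rounding up twice is deliberate: you cannot run a fraction of an instance, and the buffer is applied to the whole-instance count.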
Model Sharding for 70B+ Models
When a model does not fit on a single GPU even in FP8, use tensor parallelism to split it across multiple GPUs on the same host:
# 70B model in FP16 across 2x H100 80GB
docker run --gpus all \
--ipc=host \
-p 8000:8000 \
vllm/vllm-openai:latest \
--model meta-llama/Llama-3.1-70B-Instruct \
--dtype float16 \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 128
When to use tensor parallelism vs adding more single-GPU instances:
- Tensor parallelism: when the model does not fit on one GPU, or when you want to reduce TTFT (prefill is parallelized across GPUs).
- More instances: when throughput is the bottleneck and the model fits on one GPU. Multiple independent instances scale throughput linearly with no NVLink overhead.
For 140B+ models, consider pipeline parallelism (--pipeline-parallel-size) in addition to tensor parallelism. Pipeline parallelism assigns different transformer layers to different GPUs, which works better when inter-GPU bandwidth is the bottleneck (for example, across nodes without NVLink).
Spot Instances for Cost Reduction
For production inference on a stable traffic pattern, spot instances cut costs significantly. On Spheron, spot pricing for H100 and L40S varies by availability, but savings typically run 50-70% versus on-demand:
| GPU | On-Demand $/hr | Spot $/hr (approx.) | Monthly Savings (1 instance) |
|---|---|---|---|
| H100 SXM5 80GB | $2.40 | ~$0.80-1.00 | ~$1,008-1,152 |
| L40S 48GB | $1.80 | ~$0.30-0.35 | ~$1,044-1,080 |
| A100 SXM4 80GB | $1.05 | ~$0.40-0.55 | ~$360-468 |
To use spot with vLLM safely, the service must handle interruptions gracefully. LLM inference is stateless between requests (there is no session state to restore after an interruption), so the main concern is requests that are in flight at the moment of preemption. A reasonable approach:
- Run spot instances behind the Nginx load balancer.
- Set a short connection drain timeout (10-30 seconds) so Nginx stops sending new requests to a preempting instance.
- Keep at least one on-demand instance in the pool for reliability. Blend spot and on-demand based on your availability tolerance.
Cost at Scale: Projections
| Tier | Configuration | On-Demand $/month | Spot $/month (approx.) |
|---|---|---|---|
| Dev / Low traffic | 1x L40S | $518 | N/A |
| Small production | 2x H100 SXM5 | $3,600 | ~$1,483 |
| Scale production | 4x H100 SXM5 (spot blend) | $7,200 | ~$2,966 |
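The monthly figures in the table are hourly rate times instance count times 720 hours. The sketch below reproduces them and adds a blended case with one on-demand instance kept in the pool, as recommended in the spot section. The function names are illustrative, the rates are the 15 Apr 2026 prices quoted in this guide, and the 1+3 split is a hypothetical example.

```python
HOURS_PER_MONTH = 720  # 30 days of 24/7 operation, as assumed in the table


def monthly_cost(hourly_rate: float, instances: int,
                 hours: int = HOURS_PER_MONTH) -> float:
    """Monthly cost for a number of identical instances running continuously."""
    return hourly_rate * instances * hours


def blended_monthly_cost(on_demand_rate: float, spot_rate: float,
                         on_demand_count: int, spot_count: int) -> float:
    """Monthly cost for a pool that mixes on-demand and spot instances."""
    return (monthly_cost(on_demand_rate, on_demand_count)
            + monthly_cost(spot_rate, spot_count))


if __name__ == "__main__":
    h100_on_demand, h100_spot = 2.50, 1.03
    print(f"4x H100 on-demand: ${monthly_cost(h100_on_demand, 4):,.0f}")
    print(f"4x H100 spot:      ${monthly_cost(h100_spot, 4):,.0f}")
    # Hypothetical blend: 1 on-demand for reliability plus 3 spot.
    blend = blended_monthly_cost(h100_on_demand, h100_spot, 1, 3)
    print(f"1 on-demand + 3 spot: ${blend:,.0f}")
```

A blend costs more than pure spot, which is the price of keeping capacity online when spot instances are reclaimed.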
See H100 GPU rental and View all GPU pricing for current rates before budgeting.
Related Resources
The five phases above each have deeper coverage in related posts:
- Running LLMs locally with Ollama covers Phase 1 model selection and local testing in more depth.
- Full vLLM production deployment guide covers multi-GPU tensor parallelism, FP8, and production monitoring in much more detail than Phase 2-3 above.
- AI inference GPU comparison has cost-per-token benchmarks across GPU models for Phase 2 GPU selection.
- GPU monitoring with Prometheus and Grafana covers the full Prometheus and Grafana setup referenced in Phase 4.
- Production GPU cloud architecture patterns covers failover, checkpointing, and multi-provider redundancy beyond what Phase 4 covers.
- GPU cost optimization strategies covers reserved instances, spot strategies, and idle GPU elimination in detail for Phase 5.
Every phase in this guide runs on Spheron, from a single L40S for vLLM validation to multi-GPU H100 configurations for production. Per-minute billing means you only pay for the phases you are in, not idle capacity between them.
Quick Setup Guide
Phase 1: Run the model locally with Ollama
Install Ollama on your local machine and pull your target model with `ollama pull`. Run it with `ollama run` and send test prompts to measure output quality, latency, and behavior at your target context length. This phase costs nothing and lets you validate model selection before spending on cloud compute. Move to Phase 2 when you need concurrent users, faster inference, or larger models than your local hardware supports.
Phase 2: Move to a cloud GPU and validate with real traffic
Provision a GPU instance on Spheron that matches your model size: L40S at $0.72/hr for 7B-13B models, A100 80GB at $1.07/hr for 30B models, H100 at $2.50/hr for 70B models in FP8. Deploy vLLM via Docker with the OpenAI-compatible server. Run a load test using locust or wrk to measure throughput (tokens/sec), latency (TTFT and inter-token latency), and error rate under your expected concurrent user count. This phase surfaces memory pressure, batching limits, and any model-specific configuration issues before you invest in production hardening.
Phase 3: Benchmark and tune your inference engine
Compare vLLM against llama.cpp and TGI on your specific model and workload. Run the same load test against each engine and measure throughput, p50/p95 latency, VRAM usage, and operational complexity. For most production deployments, vLLM wins on throughput under concurrent load. Tune vLLM flags: --gpu-memory-utilization 0.90, --max-num-seqs for your concurrency target, --quantization fp8 on H100/Blackwell for a 1.5-2x throughput boost, and --performance-mode throughput for batch-heavy workloads.
Phase 4: Harden for production
Wrap vLLM in a systemd service unit with automatic restart on failure. Add a health check script that polls the /health endpoint and triggers a service restart if it returns non-200. Configure Prometheus to scrape vLLM's /metrics endpoint and set alerts on queue depth (vllm:num_requests_waiting), KV cache fill rate (vllm:gpu_cache_usage_perc), and time-to-first-token (vllm:time_to_first_token_seconds). For more than one instance, add an Nginx upstream block to distribute load across instances.
Phase 5: Scale across instances
Calculate the number of instances you need using the formula: instances = ceil(peak_rps / single_instance_rps). For 70B+ models that exceed single-GPU VRAM even in FP8, use --tensor-parallel-size to split the model across multiple GPUs on the same host. For cost reduction on stable workloads, move to spot instances with automatic restart: spot H100s on Spheron cost roughly 60% less than on-demand, and a properly configured vLLM service recovers automatically after a spot interruption.
Frequently Asked Questions
How do I deploy an LLM to production?
Start by running the model locally with Ollama to validate output quality and latency. Then provision a GPU cloud instance on Spheron (L40S at $0.72/hr for 7B-13B models, H100 at $2.50/hr for 70B), deploy vLLM via Docker, and run load tests to measure real throughput. Once you hit your latency targets, add a systemd service for auto-restart, configure Prometheus monitoring on vLLM's /metrics endpoint, and set up an Nginx load balancer if you need more than one instance. Total time from zero to a hardened single-instance deployment: 4-8 hours.
What is the cheapest way to run a 70B model?
Use FP8 quantization on a single H100 SXM5 80GB at $2.50/hr on-demand, which gets 70B weights to ~70GB and fits on one GPU. Enable it with --quantization fp8 in vLLM. For sustained production traffic, spot instances on Spheron at $1.03/hr can reduce costs significantly, provided the service restarts automatically after an interruption. At 720 hours per month, on-demand costs $1,800/month for a single 70B inference instance, while spot runs $741.60/month. Scaling to two spot H100s for redundancy runs $1,483/month versus $3,600/month for two on-demand H100s.
What is the difference between Ollama and vLLM?
Ollama is for local development and prototyping. It is easy to install, runs on consumer GPUs and CPUs, and is great for testing a model before committing to a cloud deployment. It does not support PagedAttention-based continuous batching or multi-GPU tensor parallelism, which means it saturates quickly under concurrent load. vLLM is for production. It supports continuous batching (handles hundreds of concurrent requests efficiently), FP8 quantization on H100 and Blackwell GPUs, multi-GPU tensor parallelism, and exposes a Prometheus metrics endpoint. Use Ollama in Phase 1 to pick your model, then switch to vLLM in Phase 2 when you move to cloud.
How many GPUs do I need to serve 1,000 concurrent users?
This depends on your model size, average request length, and acceptable latency. A practical estimate: a single H100 running vLLM with a 7B model handles roughly 80-120 concurrent requests at under 200ms TTFT (time to first token). For 1,000 concurrent users with a 7B model, plan for 9-13 H100 instances, or fewer GPUs if you can accept higher per-request latency. For a 70B model, throughput drops to 15-30 concurrent requests per H100, so 1,000 concurrent users needs roughly 34-67 H100 instances. Load test with your specific model and target SLA before sizing for production.
How much does it cost to run a 7B model in production?
On Spheron, an L40S instance (48GB VRAM, sufficient for 7B-13B models in FP16) costs $0.72/hr on-demand as of April 2026. At full utilization running 24/7, that is $518/month. A 7B model on an L40S with vLLM handles roughly 80-120 concurrent requests and produces 2,500-4,000 tokens/second. At $0.72/hr and 3,000 tokens/second average throughput, the cost per million tokens is approximately $0.067. For context, the same throughput on AWS would cost roughly $2-4/hr for a comparable GPU, making the cost per million tokens 3-6x higher.
