URL: https://www.spheron.network/blog/llm-deployment-guide/

From Prototype to Production: A Complete LLM Deployment Guide | Spheron Blog


Most LLM deployments fail at the same choke points: developers get a model working locally, spin up a cloud instance, and then spend weeks firefighting latency spikes, OOM crashes, and unbounded GPU bills. This guide maps the full path from first ollama run to a load-balanced, monitored, auto-recovering production service.

For teams evaluating whether to run inference on edge hardware, cloud GPUs, or both, see the hybrid cloud-edge inference decision guide.

TL;DR

| Phase | What You Do | Tool | Cost Ballpark |
|---|---|---|---|
| Prototype | Run model locally | Ollama | $0 (local GPU/CPU) |
| Validate | Cloud GPU, real traffic test | vLLM + Spheron | ~$2/hr (H100 on-demand) |
| Optimize | Benchmark engines, tune flags | vLLM, llama.cpp | Same instance, no extra cost |
| Production | Systemd, health checks, monitoring | systemd + Prometheus | +$0 tooling on existing instance |
| Scale | Multi-instance, spot, sharding | Nginx LB + spot GPUs | 60-70% cost reduction vs on-demand |

Phase 1: Prototype Locally with Ollama

Before spending on cloud compute, validate that your chosen model actually meets your quality requirements. This phase costs nothing if you have a local GPU or CPU with enough RAM.

For a deeper look at running models locally, see our guide on running LLMs locally with Ollama. If you are developing on NVIDIA DGX Spark locally, the DGX Spark + GPU cloud pipeline covers the specific steps to move from local to production with the same Docker image.

Pick Your Model

Model size determines VRAM requirements and inference speed; the table below is a practical starting point. If you are targeting newer large models like Llama 4, see deploying Llama 4 on GPU cloud for model-specific setup. For other popular models, check our DeepSeek V4 deployment guide, Qwen 3.6 Plus deployment guide, Gemma 4 deployment guide, and GLM 5.1 deployment guide.

| Model Size | VRAM Required (FP16) | VRAM Required (Q4) | Best For |
|---|---|---|---|
| 7B | ~14GB | ~5GB | Fast iteration, cost-sensitive API |
| 13B | ~26GB | ~8GB | Better quality, still single-GPU |
| 30B | ~60GB | ~18GB | High-quality outputs, multi-GPU or Q4 |
| 70B | ~140GB | ~35-40GB | Near-GPT-4 quality, needs H100 or multi-GPU |

If you are targeting a 7B model, a local machine with a 24GB GPU (RTX 3090 or 4090) handles it in FP16. A 13B model needs ~26GB in FP16, so on that hardware run it quantized (Q4 fits in ~8GB). For 30B+ you are looking at quantization locally, or moving to cloud for unquantized serving.
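The FP16 column of the table follows from simple arithmetic: parameters times bytes per weight. A minimal sketch of that estimate is below. The function names and the 10% headroom default are illustrative assumptions, and the result covers weights only; KV cache and runtime overhead come on top. Real Q4 files also land above a flat 4 bits per weight because of quantization scales, which is why the table's Q4 column is higher than this formula gives.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights only, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 weights) * bytes_per_weight / 1e9 bytes per GB
    return params_billion * bytes_per_weight


def fits_on_gpu(params_billion: float, bits_per_weight: float,
                gpu_vram_gb: float, headroom_fraction: float = 0.10) -> bool:
    """True if the weights fit while keeping headroom_fraction of VRAM free.

    The 10% default headroom is an assumption for illustration, not a rule.
    """
    usable_gb = gpu_vram_gb * (1 - headroom_fraction)
    return weight_vram_gb(params_billion, bits_per_weight) <= usable_gb


if __name__ == "__main__":
    for size in (7, 13, 30, 70):
        print(f"{size}B FP16 weights: ~{weight_vram_gb(size, 16):.0f} GB")
    print("7B FP16 on a 24GB GPU fits:", fits_on_gpu(7, 16, 24))
    print("13B FP16 on a 24GB GPU fits:", fits_on_gpu(13, 16, 24))
```

Treat the output as a first filter for model selection, then confirm with an actual run.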

Run It with Ollama

bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (e.g. Llama 3.1 8B)
ollama pull llama3.1:8b

# Run interactively
ollama run llama3.1:8b

# Or test via API
curl http://localhost:11434/api/generate -d '{
 "model": "llama3.1:8b",
 "prompt": "Explain how KV cache works in transformer inference.",
 "stream": false
}'

What to Test in Phase 1

Work through this checklist before moving on:

  • Output quality: Does the model answer your specific domain questions correctly? Run 20-30 prompts representative of your actual workload.
  • Latency on your hardware: Note time-to-first-token (TTFT) and tokens/second. This is your baseline before GPU cloud.
  • Context window limits: Test with your longest expected prompts. Does quality degrade at 4K tokens? 8K? 16K?
  • Model behavior with your prompts: Does it follow your system prompt? Does it hallucinate on your domain?
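To record the latency baseline from the checklist, you can read the timing fields that Ollama returns with a non-streaming `/api/generate` response. The sketch below assumes the response carries `load_duration`, `prompt_eval_duration`, `eval_count`, `eval_duration`, and `total_duration`, with durations in nanoseconds. The helper names are hypothetical, and "load plus prompt evaluation" is only an approximation of TTFT.

```python
import json
import urllib.request


def generate(prompt: str, model: str = "llama3.1:8b",
             url: str = "http://localhost:11434/api/generate") -> dict:
    """POST a non-streaming request to a local Ollama server, return the JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)


def ollama_timing(resp: dict) -> dict:
    """Derive latency figures from an Ollama response (durations are nanoseconds)."""
    ns = 1e9
    prompt_s = resp["prompt_eval_duration"] / ns
    gen_s = resp["eval_duration"] / ns
    return {
        # Model load plus prompt processing approximates time-to-first-token.
        "approx_ttft_s": resp.get("load_duration", 0) / ns + prompt_s,
        # Generated tokens divided by generation time.
        "tokens_per_s": resp["eval_count"] / gen_s if gen_s > 0 else 0.0,
        "total_s": resp["total_duration"] / ns,
    }


# Usage against a running Ollama server:
#   print(ollama_timing(generate("Explain how KV cache works.")))
```

Run it over your 20-30 representative prompts and keep the numbers; they are the comparison point for Phase 2.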

When to Move On

Move to Phase 2 when any of these are true:

  • Your p50 TTFT exceeds 1 second under single-user load.
  • You need to handle more than 5 concurrent users.
  • Your target model requires more VRAM than your local GPU has.
  • You need 24/7 availability without tying up your laptop.

Phase 2: Validate on a Cloud GPU with vLLM

Phase 2 is about getting a realistic read on throughput and cost before committing to production architecture. Pick a GPU, deploy vLLM, and run a real load test.

For a full breakdown of inference GPUs by cost-per-token, see the AI inference GPU comparison.

Choose Your GPU

Match your model size to the right GPU. Overpaying for a bigger GPU than you need is the most common Phase 2 mistake.

| Model Size | Recommended GPU | VRAM | On-Demand Price | Spot Price |
|---|---|---|---|---|
| 7B (FP16) | L40S | 48GB | $0.72/hr | N/A |
| 13B (FP16) | L40S | 48GB | $0.72/hr | N/A |
| 30B (FP16) | A100 SXM4 80GB | 80GB | $1.07/hr | $0.60/hr |
| 70B (FP8) | H100 SXM5 80GB | 80GB | $2.50/hr | $1.03/hr |
| 70B (FP16) | 2x H100 SXM5 80GB | 160GB | $5.00/hr | $2.06/hr |

Prices are as of 15 Apr 2026 on Spheron GPU rental and fluctuate with GPU availability. See current GPU pricing for live rates.

Spin Up a Spheron Instance

  1. Log in to app.spheron.ai and select your GPU model from the catalog. See the Spheron GPU pricing for a full list of available GPU instances and regions.
  2. Choose your region and provider, then launch the instance.
  3. SSH into the instance and verify the GPU is visible:
bash
nvidia-smi

You should see your GPU with the expected VRAM (roughly 80GB, reported in MiB, for an H100 80GB). If you provisioned multiple GPUs, all should appear.

  4. Confirm Docker and the NVIDIA Container Toolkit are installed:
bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Deploy with vLLM

For a complete guide on replacing OpenAI API calls with your own vLLM server, see self-hosted OpenAI-compatible API with vLLM.

Run the vLLM OpenAI-compatible server. This single command covers the common case for a 7B model on an L40S:

bash
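# meta-llama weights are gated on Hugging Face; pass a token (assumes HF_TOKEN is exported).
# The cache mount avoids re-downloading weights on every container start.
docker run --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -e HF_TOKEN="$HF_TOKEN" \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 256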

Key flags:

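  • --ipc=host: gives the container access to the host's shared memory, which PyTorch uses to pass data between processes (tensor-parallel workers in particular). Without it, or an explicit --shm-size, expect errors under load.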
  • --gpu-memory-utilization 0.90: leaves 10% headroom for CUDA overhead. Go to 0.92-0.95 if you need more KV cache space.
  • --max-num-seqs 256: maximum concurrent sequences in a batch. Raise this for high-throughput workloads.
  • --max-model-len 8192: context window limit. Lower this to reduce KV cache memory pressure if you do not need long contexts.
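The interaction between --gpu-memory-utilization, --max-model-len, and --max-num-seqs is easier to reason about with a rough KV cache budget. The sketch below assumes the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128), about 16GB of FP16 weights, and ignores activation memory and other overhead. Function names are illustrative, and vLLM's own startup log is the authoritative source for the real cache size.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_vram_gb: float, gpu_memory_utilization: float,
                      weights_gb: float, per_token_bytes: int) -> int:
    """Tokens of KV cache that fit after the weights (overhead ignored)."""
    budget_bytes = (gpu_vram_gb * gpu_memory_utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


if __name__ == "__main__":
    # Assumed Llama 3.1 8B shape in FP16 on a 48GB L40S at 0.90 utilization.
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    tokens = max_cached_tokens(48, 0.90, 16, per_token)
    print(f"KV cache per token: {per_token / 1024:.0f} KiB")
    print(f"Tokens that fit: {tokens}")
    print(f"Full 8192-token sequences: {tokens // 8192}")
```

Under these assumptions the cache holds roughly 200K tokens, or about 25 sequences at the full 8192-token length. That suggests --max-num-seqs 256 is only reachable when most requests are much shorter than the context limit.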

For a 70B model in FP8 on H100:

bash
docker run --gpus all \
 --ipc=host \
 -p 8000:8000 \
 vllm/vllm-openai:latest \
 --model meta-llama/Llama-3.1-70B-Instruct \
 --quantization fp8 \
 --gpu-memory-utilization 0.92 \
 --max-model-len 4096 \
 --max-num-seqs 64

Test the deployment:

bash
curl http://localhost:8000/v1/chat/completions \
 -H "Content-Type: application/json" \
 -d '{
 "model": "meta-llama/Llama-3.1-8B-Instruct",
 "messages": [{"role": "user", "content": "Hello, are you running?"}]
 }'

Load Test with Real Traffic

Single-request latency does not predict production performance. Run a load test to measure throughput under concurrency:

bash
# Install locust
pip install locust

python
# locustfile.py
from locust import HttpUser, task, between


class LLMUser(HttpUser):
    # Each simulated user pauses 0.1-0.5s between requests
    wait_time = between(0.1, 0.5)

    @task
    def chat(self):
        self.client.post("/v1/chat/completions", json={
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [{"role": "user", "content": "Write a two-sentence summary of transformer architecture."}],
            "max_tokens": 100
        }, timeout=120)

bash
# Run with 50 concurrent users
locust -f locustfile.py --headless -u 50 -r 5 --host http://localhost:8000 --run-time 60s

Capture these numbers before moving to Phase 3:

  • Throughput: requests/second and tokens/second at steady state
  • p50 and p95 TTFT: time-to-first-token at median and 95th percentile
  • p95 end-to-end latency: total request time including generation
  • Error rate: any 429 (queue full) or 500 errors at your target concurrency

Pass criteria: if your p95 TTFT is under your SLA at your target concurrency, you have the right GPU. If not, either increase --max-num-seqs, upgrade the GPU, or add a second instance.
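If you export per-request TTFT samples from your load test, the pass criteria reduce to a percentile check. The sketch below uses the nearest-rank definition of a percentile. The function names and the sample values are illustrative, and your load-testing tool may interpolate slightly differently.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def passes_sla(ttft_samples_s, sla_s: float, pct: int = 95) -> bool:
    """True if the pct-th percentile TTFT is within the SLA."""
    return percentile(ttft_samples_s, pct) <= sla_s


if __name__ == "__main__":
    # Made-up TTFT samples in seconds, for illustration only.
    samples = [0.21, 0.25, 0.19, 0.40, 0.33, 0.28, 1.10, 0.24, 0.30, 0.27]
    print("p50:", percentile(samples, 50))
    print("p95:", percentile(samples, 95))
    print("meets 0.5s SLA:", passes_sla(samples, 0.5))
```

A single slow outlier is enough to fail p95 on a small sample, so run the test long enough to collect a few hundred requests before drawing conclusions.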

Phase 2 Cost Snapshot

| Configuration | Tokens/sec | On-Demand $/hr | Cost per 1M Tokens |
|---|---|---|---|
| 7B on L40S | 2,500-4,000 | $0.72 | ~$0.050-0.080 |
| 30B on A100 80GB | 800-1,500 | $1.07 | ~$0.198-0.370 |
| 70B (FP8) on H100 | 400-700 | $2.50 | ~$0.99-1.74 |
| 70B (FP16) on 2x H100 | 600-1,000 | $5.00 | ~$1.39-2.31 |
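The last column is derived from the other two: hourly price divided by tokens produced per hour, scaled to one million tokens. A minimal sketch, with an illustrative function name, so you can plug in your own measured throughput:

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Dollar cost to generate one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Rows from the cost snapshot: (label, $/hr, low tok/s, high tok/s)
    rows = [
        ("7B on L40S", 0.72, 2500, 4000),
        ("30B on A100 80GB", 1.07, 800, 1500),
        ("70B (FP8) on H100", 2.50, 400, 700),
        ("70B (FP16) on 2x H100", 5.00, 600, 1000),
    ]
    for label, price, low, high in rows:
        best = cost_per_million_tokens(price, high)
        worst = cost_per_million_tokens(price, low)
        print(f"{label}: ${best:.3f}-{worst:.3f} per 1M tokens")
```

The figure assumes the GPU is kept busy; at 50% utilization the effective cost per token doubles.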

Phase 3: Optimize Your Inference Engine

Phase 2 confirmed your GPU size and baseline throughput. Phase 3 is about getting more out of that hardware before you harden it for production.

For a detailed comparison of inference frameworks including benchmark numbers, see the full vLLM production deployment guide.

vLLM vs llama.cpp vs TGI: A Practical Comparison

| Factor | vLLM | llama.cpp | TGI (Text Generation Inference) |
|---|---|---|---|
| Throughput (concurrent) | Highest | Medium | High |
| p50 latency | Low | Low | Low |
| Multi-GPU support | Yes (tensor + pipeline) | Limited | Yes |
| Quantization | FP8, INT4, GPTQ, AWQ | GGUF (Q2-Q8) | GPTQ, AWQ, FP8 |
| Ops complexity | Medium | Low | Medium |
| Best for | Production batch/API serving | Single-user, CPU inference | Hugging Face-native stack |

The right answer for most production deployments is vLLM. llama.cpp is worth considering for CPU-only or very low-concurrency use cases where its GGUF quantization formats are a better fit. TGI is a viable alternative if you are already on the Hugging Face stack and want tighter integration with their ecosystem.

Tuning vLLM for Your Workload

Start with the defaults, then tune based on your Phase 2 load test results:

bash
# High-throughput batch workload
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
 --model meta-llama/Llama-3.1-8B-Instruct \
 --dtype float16 \
 --gpu-memory-utilization 0.92 \
 --max-num-seqs 512 \
 --max-model-len 4096 \
 --performance-mode throughput

# Low-latency interactive workload
docker run --gpus all --ipc=host -p 8000:8000 vllm/vllm-openai:latest \
 --model meta-llama/Llama-3.1-8B-Instruct \
 --dtype float16 \
 --gpu-memory-utilization 0.85 \
 --max-num-seqs 64 \
 --max-model-len 8192 \
 --performance-mode interactivity

Key flags to tune:

  • --gpu-memory-utilization: higher value means more KV cache space and higher concurrency ceiling, at the risk of OOM. Tune in 0.02 increments.
  • --max-num-seqs: directly caps concurrent sequences. Set to 2-3x your expected peak concurrency.
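  • --quantization fp8: use on H100 and Blackwell GPUs. Gives ~1.5-2x throughput improvement. vLLM's --dtype flag only accepts formats such as float16 and bfloat16, so FP8 weights are requested through --quantization. FP8 compute is not available in hardware on older architectures such as the A100.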
  • --performance-mode: new in vLLM v0.17.0. throughput favors batching, interactivity favors TTFT, balanced is the default.
  • --kv-cache-dtype fp8: also store the KV cache in FP8 on H100. Saves additional VRAM, enabling longer context windows or more concurrent sequences.

For workloads where GPU KV cache still fills after FP8 quantization, see NVMe KV Cache Offloading for LLM Inference for adding a third NVMe storage tier with LMCache.

Quantization Trade-offs

| Precision | VRAM for 70B | Throughput (relative) | Quality Impact |
|---|---|---|---|
| FP16 | ~140GB | 1x baseline | No loss |
| FP8 | ~70GB | 1.5-2x | Less than 1-2% on benchmarks |
| INT4 (AWQ) | ~40GB | 1.2-1.5x | 2-5% on benchmarks, varies by model |
| GGUF Q4_K_M | ~38GB | 0.6-1x | 3-6% on benchmarks, CPU-friendly |

FP8 is the practical choice on H100 and Blackwell. The throughput gain is real and the quality loss is marginal for most use cases. INT4 is worth considering if you need a 70B model on hardware where ~70GB of FP8 weights leaves too little room for KV cache, or where FP8 is not supported in hardware (e.g., a single A100 80GB).

Benchmark Your Setup

After tuning, capture baseline numbers you can compare against later:

bash
# GPU-level metrics (run in a separate terminal while vLLM is under load)
nvidia-smi dmon -s pucvmet -d 5

# vLLM internal metrics (raw Prometheus format)
curl http://localhost:8000/metrics | grep -E "num_requests_waiting|gpu_cache_usage|time_to_first_token"

Numbers to record before Phase 4:

  • GPU compute utilization at peak load
  • GPU memory bandwidth utilization at peak load
  • vllm:gpu_cache_usage_perc at peak load
  • vllm:time_to_first_token_seconds p50 and p95
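To log the gauge values from this list without a full Prometheus setup, you can parse the `/metrics` text directly. The sketch below handles the plain `name{labels} value` line shape without timestamps. The sample text is made up for illustration and is not real vLLM output, and the function name is hypothetical. If several series share a name (for example one per model), the last one wins.

```python
def parse_metrics(text: str, wanted: tuple) -> dict:
    """Extract sample values for the wanted metric names from Prometheus text format."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (assumes no timestamp).
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name in wanted:
            values[name] = float(raw_value)
    return values


# Illustrative sample only; fetch the real text from http://localhost:8000/metrics
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-8B-Instruct"} 3.0
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.42
"""

if __name__ == "__main__":
    wanted = ("vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc")
    print(parse_metrics(SAMPLE, wanted))
```

Metric names can change between vLLM releases, so check the names your version actually exposes before wiring this into a script.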

Phase 4: Harden for Production

You have a working, tuned vLLM deployment. Phase 4 is about making it survive a long weekend without manual intervention.

For architecture patterns that apply to any GPU production workload, see our production GPU cloud architecture patterns guide.

Systemd Service Unit

Wrap the Docker container in a systemd service so it restarts automatically on crashes, reboots, and OOM kills:

ini
# /etc/systemd/system/vllm.service
[Unit]
Description=vLLM OpenAI-compatible inference server
After=docker.service
Requires=docker.service

[Service]
Type=simple
Restart=always
RestartSec=10
ExecStartPre=-/usr/bin/docker stop vllm-server
ExecStartPre=-/usr/bin/docker rm vllm-server
ExecStart=/usr/bin/docker run \
 --name vllm-server \
 --gpus all \
 --ipc=host \
 -p 8000:8000 \
 vllm/vllm-openai:latest \
 --model meta-llama/Llama-3.1-8B-Instruct \
 --dtype float16 \
 --gpu-memory-utilization 0.90 \
 --max-num-seqs 256 \
 --max-model-len 8192
ExecStop=/usr/bin/docker stop vllm-server

[Install]
WantedBy=multi-user.target

Enable and start:

bash
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
sudo systemctl status vllm

Health Checks

vLLM exposes a /health endpoint. Poll it and trigger a service restart if it stops responding:

bash
#!/bin/bash
# /usr/local/bin/vllm-healthcheck.sh (the shebang must be the first line of the file)

# Skip restart if the service has been active for less than 15 minutes (model loading grace period)
active_since=$(systemctl show vllm --property=ActiveEnterTimestamp --value 2>/dev/null)
if [ -n "$active_since" ]; then
  start_epoch=$(date -d "$active_since" +%s 2>/dev/null); [ -z "$start_epoch" ] && exit 0
  now_epoch=$(date +%s)
  uptime_seconds=$((now_epoch - start_epoch))
  if [ "$uptime_seconds" -lt 900 ]; then
    echo "vLLM has been up for ${uptime_seconds}s, within grace period, skipping health check"
    exit 0
  fi
fi

# Also skip if service is still in activating state (i.e., loading)
service_state=$(systemctl is-active vllm 2>/dev/null)
if [ "$service_state" = "activating" ]; then
  echo "vLLM is still activating, skipping health check"
  exit 0
fi

# curl reports 000 when the connection fails or times out, which also triggers a restart
response=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 --connect-timeout 5 http://localhost:8000/health)
if [ "$response" != "200" ]; then
  echo "vLLM health check failed (HTTP $response), restarting service"
  systemctl restart vllm
fi

Add a cron job to run this every minute:

bash
sudo chmod +x /usr/local/bin/vllm-healthcheck.sh
echo "* * * * * root /usr/local/bin/vllm-healthcheck.sh >> /var/log/vllm-healthcheck.log 2>&1" \
 | sudo tee /etc/cron.d/vllm-healthcheck

Monitoring Setup

For GPU monitoring details including DCGM and Grafana setup, see GPU monitoring with Prometheus and Grafana.

Add a Prometheus scrape job for vLLM metrics:

yaml
# prometheus.yml (add to scrape_configs)
scrape_configs:
  - job_name: 'vllm'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'

Three alerts to configure from the start:

yaml
# vllm_alerts.yml
groups:
  - name: vllm
    rules:
      - alert: VLLMQueueDepth
        expr: vllm:num_requests_waiting > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "vLLM request queue is backing up"

      - alert: VLLMKVCachePressure
        expr: vllm:gpu_cache_usage_perc > 0.90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "vLLM KV cache above 90%, consider scaling"

      - alert: VLLMHighTTFT
        expr: histogram_quantile(0.95, sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le)) > 2.0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "vLLM p95 TTFT above 2s SLA"
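The TTFT alert relies on `histogram_quantile`, which estimates a percentile from cumulative bucket counts rather than from raw samples. The sketch below mirrors the linear-interpolation idea behind that function in simplified form. It is not Prometheus's implementation, and the bucket counts are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    The highest bucket must have upper bound math.inf, as in Prometheus
    histograms. Values are assumed to be spread evenly inside each bucket.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile lies beyond the last finite bucket
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Invented cumulative counts: 50 requests <= 0.5s, 90 <= 1s, 100 <= 2s.
    ttft_buckets = [(0.5, 50), (1.0, 90), (2.0, 100), (math.inf, 100)]
    p95 = histogram_quantile(0.95, ttft_buckets)
    print(f"estimated p95 TTFT: {p95:.2f}s, alert fires: {p95 > 2.0}")
```

Because the estimate is interpolated within a bucket, its precision depends on where the bucket boundaries sit relative to your SLA threshold.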

Load Balancing Two Instances

For redundancy or throughput beyond a single GPU, add an Nginx upstream block:

nginx
# /etc/nginx/sites-available/vllm
upstream vllm_backend {
    least_conn;
    server 10.0.0.1:8000;
    server 10.0.0.2:8000;
}

server {
    listen 80;

    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_read_timeout 300s;
        proxy_buffering off;
    }
}

least_conn is the right load balancing strategy for LLM inference because request duration varies significantly. Round-robin can pile long requests onto one backend while the other sits idle.
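A toy simulation makes the difference concrete. The sketch below is a deliberately simplified model, not Nginx's actual scheduler: one request arrives per tick, durations alternate between long and short, and the function reports the worst pile-up on any single backend. All names and numbers are illustrative.

```python
def simulate(durations, num_backends: int, policy: str) -> int:
    """Return the peak number of in-flight requests seen on any single backend.

    One request arrives per tick; durations[i] is how many ticks request i runs.
    """
    finish_ticks = [[] for _ in range(num_backends)]  # per-backend end times
    peak = 0
    for tick, duration in enumerate(durations):
        # Drop requests that have completed by this tick.
        for b in range(num_backends):
            finish_ticks[b] = [t for t in finish_ticks[b] if t > tick]
        if policy == "round_robin":
            target = tick % num_backends
        else:
            # least_conn: fewest in-flight requests, lowest index wins ties.
            target = min(range(num_backends), key=lambda b: len(finish_ticks[b]))
        finish_ticks[target].append(tick + duration)
        peak = max(peak, len(finish_ticks[target]))
    return peak


if __name__ == "__main__":
    # Alternating long (10 ticks) and short (1 tick) generations.
    workload = [10, 1] * 5
    print("round_robin peak:", simulate(workload, 2, "round_robin"))
    print("least_conn peak:", simulate(workload, 2, "least_conn"))
```

With this workload, round-robin sends every long request to the same backend, while least_conn spreads them out, which is the failure mode described above.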

Phase 4 Architecture Diagram

Client
   |
   v
Nginx (least_conn load balancer, port 80)
   |                |
   v                v
vLLM-1           vLLM-2
(port 8000)      (port 8000)
   |                |
   v                v
GPU-1            GPU-2
   |                |
   v                v
Prometheus       Prometheus
(scrape /metrics every 10s)

Phase 5: Scale

Single-instance is fine for development and low-traffic production. Phase 5 is for when your traffic outgrows it.

For GPU cost reduction strategies that apply across all phases, see GPU cost optimization strategies.

Horizontal Scaling: When and How

Calculate how many instances you need:

instances_needed = ceil(peak_rps / single_instance_rps)

Example: if your load test showed a single L40S handles 12 requests/second at p95 TTFT under 500ms, and your peak traffic is 60 requests/second, you need:

ceil(60 / 12) = 5 L40S instances

Add 20-30% buffer for traffic spikes: plan for 6-7 instances.
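The sizing arithmetic above fits in a few lines if you want to rerun it as your load test numbers change. In this sketch the buffer is applied to the instance count and rounded up, matching the worked example; the function name and the integer-percent buffer parameter are illustrative choices.

```python
import math


def instances_needed(peak_rps: float, single_instance_rps: float,
                     buffer_pct: int = 0) -> int:
    """Instances required for peak traffic, plus an optional percentage buffer."""
    base = math.ceil(peak_rps / single_instance_rps)
    # Integer percent keeps the buffer arithmetic free of float rounding surprises.
    return math.ceil(base * (100 + buffer_pct) / 100)


if __name__ == "__main__":
    print("no buffer:", instances_needed(60, 12))
    print("20% buffer:", instances_needed(60, 12, buffer_pct=20))
    print("30% buffer:", instances_needed(60, 12, buffer_pct=30))
```

Use the requests/second figure measured at your p95 TTFT target, not the maximum the instance can absorb before errors.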

For stateless LLM inference (no session affinity), horizontal scaling is straightforward. Each instance runs an independent vLLM server. Nginx distributes load across all of them. No shared state to coordinate.

Model Sharding for 70B+ Models

When a model does not fit on a single GPU at your chosen precision (for example, a 70B model in FP16), use tensor parallelism to split it across multiple GPUs on the same host:

bash
# 70B model in FP16 across 2x H100 80GB
docker run --gpus all \
 --ipc=host \
 -p 8000:8000 \
 vllm/vllm-openai:latest \
 --model meta-llama/Llama-3.1-70B-Instruct \
 --dtype float16 \
 --tensor-parallel-size 2 \
 --gpu-memory-utilization 0.90 \
 --max-num-seqs 128

When to use tensor parallelism vs adding more single-GPU instances:

  • Tensor parallelism: when the model does not fit on one GPU, or when you want to reduce TTFT (prefill is parallelized across GPUs).
  • More instances: when throughput is the bottleneck and the model fits on one GPU. Multiple independent instances scale throughput linearly with no NVLink overhead.

For 140B+ models, consider pipeline parallelism (--pipeline-parallel-size) in addition to tensor parallelism. Pipeline parallelism assigns different transformer layers to different GPUs, which works better when inter-GPU bandwidth is the bottleneck, for example across hosts without NVLink.

Spot Instances for Cost Reduction

For production inference on a stable traffic pattern, spot instances cut costs significantly. On Spheron, spot pricing for GPUs such as the H100 and L40S varies by availability, but savings typically run 50-70% versus on-demand:

| GPU | On-Demand $/hr | Spot $/hr (approx.) | Monthly Savings (1 instance) |
|---|---|---|---|
| H100 SXM5 80GB | $2.40 | ~$0.80-1.00 | ~$1,008-1,152 |
| L40S 48GB | $1.80 | ~$0.30-0.35 | ~$1,044-1,080 |
| A100 SXM4 80GB | $1.05 | ~$0.40-0.55 | ~$360-468 |

To use spot with vLLM safely, the service must handle interruptions gracefully. LLM inference is stateless (nothing has to be restored after an interruption), so the main concern is the requests in flight at the moment of preemption, which fail and have to be retried. A reasonable approach:

  1. Run spot instances behind the Nginx load balancer.
  2. Set a short failure window (10-30 seconds), for example with max_fails and fail_timeout on each upstream server, so Nginx stops sending new requests to a preempted instance.
  3. Keep at least one on-demand instance in the pool for reliability. Blend spot and on-demand based on your availability tolerance.
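On the client side, requests that were in flight when a spot instance disappeared need a retry so the load balancer can route them to a surviving backend. The sketch below is a generic retry-with-backoff wrapper, not part of vLLM or Nginx. The function name, attempt count, and delays are assumptions to adapt, and `send` stands in for whatever function issues your HTTP request.

```python
import time


def call_with_retry(send, attempts: int = 3, base_delay_s: float = 0.5,
                    sleep=time.sleep):
    """Call send() until it succeeds, backing off exponentially between tries.

    Retries only on ConnectionError, e.g. a backend that vanished mid-request.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return send()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise last_error


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_request():
        # Stand-in for an HTTP call that fails once, as if its backend was preempted.
        state["calls"] += 1
        if state["calls"] == 1:
            raise ConnectionError("backend preempted")
        return "completion text"

    print(call_with_retry(flaky_request, base_delay_s=0.01))
```

Retrying is safe here because a chat completion request has no server-side state to corrupt; the only cost is the generation time already spent on the lost attempt.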

Cost at Scale: Projections

| Tier | Configuration | On-Demand $/month | Spot $/month (approx.) |
|---|---|---|---|
| Dev / Low traffic | 1x L40S | $518 | N/A |
| Small production | 2x H100 SXM5 | $3,600 | ~$1,483 |
| Scale production | 4x H100 SXM5 (spot blend) | $7,200 | ~$2,966 |
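These projections are consistent with 720 billed hours per month at the H100 rates quoted in this guide ($2.50/hr on-demand, $1.03/hr spot). A small sketch for pricing a mixed pool is below; the function name is illustrative, and the one on-demand plus three spot blend is an example, not a recommendation.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours


def monthly_cost(on_demand_count: int, spot_count: int,
                 on_demand_hourly: float, spot_hourly: float) -> float:
    """Monthly cost of a pool that mixes on-demand and spot instances."""
    hourly = on_demand_count * on_demand_hourly + spot_count * spot_hourly
    return hourly * HOURS_PER_MONTH


if __name__ == "__main__":
    h100_on_demand, h100_spot = 2.50, 1.03
    print(f"2x on-demand: ${monthly_cost(2, 0, h100_on_demand, h100_spot):,.0f}")
    print(f"2x spot:      ${monthly_cost(0, 2, h100_on_demand, h100_spot):,.0f}")
    print(f"1 on-demand + 3 spot: ${monthly_cost(1, 3, h100_on_demand, h100_spot):,.0f}")
```

Rerun it with live rates before budgeting, since spot prices move with availability.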

See H100 GPU rental and View all GPU pricing for current rates before budgeting.


Every phase in this guide runs on Spheron, from a single L40S for vLLM validation to multi-GPU H100 configurations for production. Per-minute billing means you only pay for the phases you are in, not idle capacity between them.

Quick Setup Guide

  1. Phase 1: Run the model locally with Ollama

    Install Ollama on your local machine and pull your target model with `ollama pull`. Run it with `ollama run` and send test prompts to measure output quality, latency, and behavior at your target context length. This phase costs nothing and lets you validate model selection before spending on cloud compute. Move to Phase 2 when you need concurrent users, faster inference, or larger models than your local hardware supports.

  2. Phase 2: Move to a cloud GPU and validate with real traffic

    Provision a GPU instance on Spheron that matches your model size: L40S at $0.72/hr for 7B-13B models, A100 80GB at $1.07/hr for 30B models, H100 at $2.50/hr for 70B models in FP8. Deploy vLLM via Docker with the OpenAI-compatible server. Run a load test using locust or wrk to measure throughput (tokens/sec), latency (TTFT and inter-token latency), and error rate under your expected concurrent user count. This phase surfaces memory pressure, batching limits, and any model-specific configuration issues before you invest in production hardening.

  3. Phase 3: Benchmark and tune your inference engine

    Compare vLLM against llama.cpp and TGI on your specific model and workload. Run the same load test against each engine and measure throughput, p50/p95 latency, VRAM usage, and operational complexity. For most production deployments, vLLM wins on throughput and operational simplicity. Tune vLLM flags: --gpu-memory-utilization 0.90, --max-num-seqs for your concurrency target, --quantization fp8 on H100/Blackwell for a 1.5-2x throughput boost, and --performance-mode throughput for batch-heavy workloads.

  4. Phase 4: Harden for production

    Wrap vLLM in a systemd service unit with automatic restart on failure. Add a health check script that polls the /health endpoint and triggers a service restart if it returns non-200. Configure Prometheus to scrape vLLM's /metrics endpoint and set alerts on queue depth (vllm:num_requests_waiting), KV cache fill rate (vllm:gpu_cache_usage_perc), and time-to-first-token (vllm:time_to_first_token_seconds). For more than one instance, add an Nginx upstream block to distribute load across instances.

  5. Phase 5: Scale across instances

    Calculate the number of instances you need using the formula: instances = ceil(peak_rps / single_instance_rps). For 70B+ models that exceed single-GPU VRAM even in FP8, use --tensor-parallel-size to split the model across multiple GPUs on the same host. For cost reduction on stable workloads, move to spot instances with automatic restart on interruption: spot H100s on Spheron cost roughly 60% less than on-demand, and a properly configured vLLM service recovers automatically after a spot interruption.

Frequently Asked Questions

Start by running the model locally with Ollama to validate output quality and latency. Then provision a GPU cloud instance on Spheron (L40S at $0.72/hr for 7B-13B models, A100 80GB at $1.07/hr for 30B, H100 at $2.50/hr for 70B in FP8), deploy vLLM via Docker, and run load tests to measure real throughput. Once you hit your latency targets, add a systemd service for auto-restart, configure Prometheus monitoring on vLLM's /metrics endpoint, and set up an Nginx load balancer if you need more than one instance. Total time from zero to a hardened single-instance deployment: 4-8 hours.

Use FP8 quantization on a single H100 SXM5 80GB at $2.50/hr on-demand, which gets 70B weights to ~70GB and fits on one GPU. Enable it with --quantization fp8 in vLLM. For sustained production traffic, spot instances on Spheron at $1.03/hr can reduce costs significantly, provided the service restarts automatically after an interruption. At 720 hours per month, on-demand costs $1,800/month for a single 70B inference instance, while spot runs $741.60/month. Scaling to two spot H100s for redundancy runs $1,483/month versus $3,600/month for two on-demand H100s.

Ollama is for local development and prototyping. It is easy to install, runs on consumer GPUs and CPUs, and is great for testing a model before committing to a cloud deployment. It does not support PagedAttention-based continuous batching or multi-GPU tensor parallelism, which means it saturates quickly under concurrent load. vLLM is for production. It supports continuous batching (handles hundreds of concurrent requests efficiently), FP8 quantization on H100 and Blackwell GPUs, multi-GPU tensor parallelism, and exposes a Prometheus metrics endpoint. Use Ollama in Phase 1 to pick your model, then switch to vLLM in Phase 2 when you move to cloud.

This depends on your model size, average request length, and acceptable latency. A practical estimate: a single H100 running vLLM with a 7B model handles roughly 80-120 concurrent requests at under 200ms TTFT (time to first token). For 1,000 concurrent users with a 7B model, plan for 8-12 H100 instances or use a larger model on fewer GPUs with higher per-request latency. For a 70B model, throughput drops to 15-30 concurrent requests per H100, so 1,000 concurrent users needs 35-70 H100 instances. Load test with your specific model and target SLA before sizing for production.

On Spheron, an L40S instance (48GB VRAM, sufficient for 7B-13B models in FP16) costs $0.72/hr on-demand as of April 2026. At full utilization running 24/7, that is $518/month. A 7B model on an L40S with vLLM handles roughly 80-120 concurrent requests and produces 2,000-4,000 tokens/second. At $0.72/hr and 3,000 tokens/second average throughput, the cost per million tokens is approximately $0.067. For context, the same throughput on AWS would cost roughly $2-4/hr for a comparable GPU, making the cost per million tokens 3-6x higher.
