GPU clouds offer three billing models. Pick the wrong one and you'll overpay by 2-5x. Serverless charges per call with no idle cost. On-demand bills by the second, minute, or hour for a dedicated instance, depending on the provider. Reserved locks you in for months at a steep discount. The right answer depends entirely on your workload pattern.

For provider-specific GPU pricing comparisons, see our GPU cloud pricing comparison. For inference-specific GPU selection guidance, see best GPU for AI inference in 2026. For a full cost-per-million-token analysis using the token factory framework, including spot vs on-demand CPM tables, see the token factory guide.

The Three Billing Models

When Should You Use Serverless GPU?

Serverless GPU platforms abstract the hardware entirely. You submit a request or function call, the platform provisions a GPU, runs your code, and bills per inference call or per compute-second. You pay nothing when there is no traffic.

Pros: Zero idle cost, no instance management, scales to zero automatically.

Cons: Cold starts ranging from 200ms-4s for small models on optimized platforms (Modal, RunPod FlashBoot) to 6-60s for large LLM deployments, depending on container size and caching. No hardware control. Limited to what the platform supports. Not available at multi-GPU scale (you cannot run an 8xH100 job serverlessly on most platforms).

Best for: Async batch jobs where cold starts are acceptable, prototyping, demos, and situations where zero instance management is worth paying more for. For low-traffic APIs, per-second on-demand billing (Spheron, RunPod) is cheaper: at 100 requests/day it costs about 80% less than serverless ($3.35/month vs $16.67/month) and avoids cold starts entirely. Serverless makes sense there only if you value zero instance management enough to pay roughly 5x more per month and can tolerate the cold starts described above.

Providers: Modal, Replicate, RunPod Serverless, Baseten, Fal AI.

When Should You Use On-Demand GPU?

On-demand rents a dedicated GPU instance that stays running until you stop it. You get full hardware control, consistent throughput, and instance startup in under 2 minutes. You are billed whether the GPU is busy or idle. Billing granularity varies by provider: Spheron bills per second, Lambda Labs per hour, Vast.ai per minute.
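Billing granularity matters most for short runs. The sketch below is a minimal illustration, assuming the provider rounds each run up to the next whole billing increment; real providers may apply different minimums, so treat `billed_cost` and the 605-second job as hypothetical.

```python
import math

def billed_cost(run_seconds: float, rate_per_hour: float, granularity_seconds: int) -> float:
    """Cost of one run when usage is rounded up to the billing increment.

    Assumption: the provider bills whole increments (1 s, 60 s, or 3600 s).
    """
    increments = math.ceil(run_seconds / granularity_seconds)
    billed_seconds = increments * granularity_seconds
    return billed_seconds * rate_per_hour / 3600

# Hypothetical job: 10 minutes 5 seconds on a $2.01/hr GPU.
run = 10 * 60 + 5
for label, step in [("per second", 1), ("per minute", 60), ("per hour", 3600)]:
    print(f"{label}: ${billed_cost(run, 2.01, step):.4f}")
```

Under hourly billing the same job is charged for a full hour, which is why short, frequent runs favor per-second providers.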

Pros: Full hardware control, predictable throughput, no cold starts, works at any scale including multi-GPU clusters.

Cons: You pay for idle time. If you provision an H100 for 24 hours but only use it 3 hours, you pay for 24.

Best for: Training runs, sustained inference serving, interactive development, any workload requiring consistent throughput.

Providers: Spheron, RunPod, Lambda Labs, CoreWeave, Vast.ai.

When Should You Use Reserved GPU?

Reserved pricing on hyperscalers typically requires a 1-year or 3-year commitment in exchange for a discount off on-demand. GPU reserved discounts typically range from 30-75%, with AWS H100 1-year reserved landing at ~$3.80/hr (roughly a 45% discount off the ~$6.88/hr per-GPU on-demand rate that followed the June 2025 P5 price cut). Some neo-cloud providers (GPU-specialized clouds such as Spheron, RunPod, and Lambda Labs) offer reserved pricing with shorter commitment windows. CoreWeave, for example, offers negotiated reserved pricing with minimum commitment periods (exact terms vary by contract). You pay the reserved rate every month regardless of actual usage.

Quick rule: Reserved makes sense when your utilization exceeds (1 - discount percentage). At a ~45% AWS discount, you need 55% utilization to break even. See full calculations below.

Pros: Large discounts for 24/7 workloads. Cost predictability for budget planning.

Cons: Contract commitment. You pay even if you do not use the GPU. Reserved rates on hyperscalers still exceed neo-cloud on-demand pricing for the same GPU.

Best for: Production workloads running 24/7 on hyperscalers where you are already invested in the AWS/GCP/Azure ecosystem.

Providers: AWS (EC2 reserved), GCP (committed-use contracts), Azure (reserved VMs), CoreWeave (minimum commitment required for discounted rates; terms vary by contract).

Spot Pricing: A Hybrid Option

Spot GPUs are excess capacity sold below the on-demand rate. Providers including Spheron, RunPod, and Vast.ai offer spot instances at variable discounts depending on availability. Spot instances can be interrupted with short notice (2 minutes on AWS, 30 seconds on GCP/Azure), so they require checkpointing or fault-tolerant job design. The 2026 GPU shortage has made spot instances the primary access strategy for teams that missed reserved capacity windows on hyperscalers, since on-demand H100 availability on AWS and GCP has become unreliable for teams without pre-existing reservations.

Spot pricing is not a separate billing model; it sits between on-demand and reserved. You get below-on-demand rates without a contract, at the cost of interruption risk. Best for training jobs and batch inference that checkpoint state regularly.
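To make "checkpoint state regularly" concrete, here is a minimal sketch of a resumable job loop. It is not tied to any training framework: the step counter stands in for real work, the `stop_at` parameter only simulates an interruption, and all function names are hypothetical.

```python
import json
import os
import tempfile
from typing import Optional

def save_checkpoint(path: str, state: dict) -> None:
    """Write state to a temp file, then rename, so a kill mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic when source and target share a filesystem

def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

def run_job(path: str, total_steps: int, checkpoint_every: int,
            stop_at: Optional[int] = None) -> dict:
    """Run a placeholder job, resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            break  # simulated spot interruption; work since the last save is lost
        state["step"] += 1  # stand-in for one training step or one batch
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

With `checkpoint_every=10`, an interruption at step 37 loses only the 7 steps since the save at step 30. The checkpoint interval is a tradeoff between write overhead and the amount of work redone after each interruption.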

Comparison Table

Compared	Serverless	On-Demand	Reserved
Billing unit	Per call / second	Per second / minute / hour (varies by provider)	Monthly flat (committed)
Cold start	200ms-4s (small/optimized); 6-60s+ (large LLMs)	<2 minutes	None
Idle cost	Zero	Full rate	Full rate (committed)
Contract	None	None	Varies (months to years)
GPU control	None (abstracted)	Full	Full
Best for	Intermittent / async	Training / sustained serving	Predictable 24/7 load

Cost Modeling for Four Workload Types

These scenarios use real pricing to show which billing model wins in practice. Serverless rates are approximate since providers change them frequently. Modal's published per-second rate is $0.002778/GPU-second (about $10.00/hr). Under sustained load with warm containers and keep-alive optimization, effective rates can be lower ($3.95-$4.76/hr). The examples in this post use the published per-second rate to show worst-case serverless costs. Check Modal's pricing page for current rates.

Low-Traffic Inference API

Setup: 100 requests/day, each needing 2 seconds of H100 compute. Total compute: 200 seconds/day = 3.33 minutes.

Serverless (Modal, ~$0.002778/GPU-second): 200s x $0.002778 = $0.5556/day = $16.67/month
On-demand H100 PCIe (provider with per-minute minimum billing, e.g., Vast.ai, ~$1.50/hr): 100 requests x 1-min minimum = 100 min x $0.025/min = $2.50/day = $75/month
On-demand H100 PCIe per-second billing (Spheron, $2.01/hr): 3.33 min x $0.0335/min = $0.1116/day = $3.35/month

Winner: Per-second on-demand (Spheron, $3.35/month). It is 80% cheaper than serverless ($16.67/month) and avoids cold starts entirely. Serverless makes sense here only if you want zero instance management and can tolerate cold starts of 200ms-4s for small models or up to 6-60s for large LLMs, and are willing to pay 5x more per month for that convenience. Per-request billing with 1-minute minimums costs ~22x more than per-second on-demand ($75 vs $3.35) and should be avoided for this workload type.
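The three monthly figures above can be reproduced with a short script. This is a sketch under the same assumptions as the scenario: a 30-day month, and on-demand billed only for the seconds of compute actually used, which holds only if the instance is not left running between requests. The helper name `monthly_cost` is hypothetical.

```python
DAYS_PER_MONTH = 30  # assumption: the scenario's monthly figures use a 30-day month

def monthly_cost(requests_per_day: int, seconds_per_request: float,
                 rate_per_hour: float, min_billed_seconds: float = 0) -> float:
    """Monthly cost when each request is billed separately, with an optional minimum."""
    billed_seconds = max(seconds_per_request, min_billed_seconds)
    daily = requests_per_day * billed_seconds * rate_per_hour / 3600
    return daily * DAYS_PER_MONTH

serverless = monthly_cost(100, 2, 0.002778 * 3600)            # Modal published rate
per_minute = monthly_cost(100, 2, 1.50, min_billed_seconds=60)  # 1-minute minimum
per_second = monthly_cost(100, 2, 2.01)                       # per-second billing

print(f"serverless: ${serverless:.2f}")
print(f"per-minute minimum: ${per_minute:.2f}")
print(f"per-second on-demand: ${per_second:.2f}")
```

Changing `requests_per_day` or `seconds_per_request` shows how quickly the per-minute minimum dominates the bill for short requests.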

24/7 Inference Serving

Setup: Sustained production traffic requiring one H100 full-time, 720 hours/month.

On-demand H100 PCIe (Spheron, $2.01/hr): $2.01 x 720 = $1,447/month
On-demand H100 SXM (AWS p5.48xlarge, per-GPU): ~$6.88/hr x 720 = $4,954/month (reflects June 2025 44% AWS price reduction, down from ~$12.29/hr)
Reserved H100 (AWS, 1-year effective rate): ~$3.80/hr x 720 = $2,736/month

Winner: Spheron on-demand at $1,447/month is 71% cheaper than AWS on-demand at $4,954/month and still 47% cheaper than AWS 1-year reserved at $2,736/month, with no contract required. For fault-tolerant workloads, spot pricing on available GPUs can reduce costs further.
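The percentages in this comparison follow directly from the hourly rates. A minimal sketch, using the rates quoted above and a 720-hour month (the helper names are illustrative):

```python
HOURS_PER_MONTH = 720

def monthly(rate_per_hour: float) -> float:
    """Cost of running one GPU for the full month."""
    return rate_per_hour * HOURS_PER_MONTH

def percent_cheaper(rate: float, baseline: float) -> float:
    """How much cheaper `rate` is than `baseline`, as a percentage."""
    return (1 - rate / baseline) * 100

spheron, aws_on_demand, aws_reserved = 2.01, 6.88, 3.80  # $/hr, from the scenario

print(f"Spheron monthly: ${monthly(spheron):,.0f}")
print(f"vs AWS on-demand: {percent_cheaper(spheron, aws_on_demand):.0f}% cheaper")
print(f"vs AWS reserved:  {percent_cheaper(spheron, aws_reserved):.0f}% cheaper")
```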

Short Training Run (7 Days)

Setup: Full H100 PCIe for 168 hours. For GPU selection guidance on training, see best Nvidia GPUs for LLMs.

On-demand H100 PCIe (Spheron, $2.01/hr): $2.01 x 168 = $337.68
On-demand H100 SXM (AWS p5, per GPU): ~$6.88/hr x 168 = $1,155.84 (reflects June 2025 AWS price reduction)
AWS 1-year reserved (effective rate): ~$3.80/hr x 168 = $638.40

Winner: Neo-cloud on-demand (Spheron). Spheron on-demand ($337.68) is 71% cheaper than AWS on-demand ($1,155.84) and still 47% cheaper than the AWS 1-year reserved effective rate, without any contract. For short training jobs, there is no reason to sign a reserved contract.

Monthly Burst Workload

Setup: Need 8x H100 for 4 hours, once a month.

Serverless: Not available at 8-GPU scale on most platforms.
On-demand (Spheron): 8 x $2.01 x 4 = $64.32/month
Reserved (AWS 1-yr, 8x H100): 8 x $3.80 x 720 = $21,888/month, committed whether used or not.

Winner: On-demand by a large margin. Reserved pricing makes no sense for burst workloads.
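One way to see the size of the gap is to compute what each used GPU-hour effectively costs under a reserved commitment. The sketch below uses the rates from this scenario; the function names are hypothetical.

```python
def on_demand_cost(gpus: int, rate_per_hour: float, hours_used: float) -> float:
    """Pay only for the hours the GPUs actually run."""
    return gpus * rate_per_hour * hours_used

def reserved_cost(gpus: int, rate_per_hour: float, hours_in_month: int = 720) -> float:
    """Reserved capacity is billed for every hour of the month, used or not."""
    return gpus * rate_per_hour * hours_in_month

burst = on_demand_cost(gpus=8, rate_per_hour=2.01, hours_used=4)
committed = reserved_cost(gpus=8, rate_per_hour=3.80)
effective_per_gpu_hour = committed / (8 * 4)  # reserved spend per GPU-hour actually used

print(f"on-demand: ${burst:.2f}")
print(f"reserved:  ${committed:,.2f}")
print(f"reserved cost per used GPU-hour: ${effective_per_gpu_hour:.2f}")
```

At 4 hours of use per month, the reserved commitment works out to roughly $684 per GPU-hour actually used, against $2.01 on-demand.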

For hybrid inference architectures that use cloud GPU as a burst tier alongside on-device models, the hybrid cloud-edge inference decision guide covers the routing logic and cost math in detail. Teams with a DGX Spark local machine typically use on-demand cloud instances for production, keeping dev costs near zero. See the DGX Spark local-to-cloud guide for a concrete cost model.

Provider Examples by Billing Model

Serverless GPU Providers

Provider	Pricing	Notes
Modal	Varies by GPU (see Modal pricing page)	Wide GPU selection, fast cold starts (2-4s)
Replicate	Per prediction	Model-specific pricing, 16-60s+ cold starts on custom models
RunPod Serverless	Per second	FlashBoot achieves 200ms-2s cold starts for optimized containers; large model deployments still see 6-12s+
Baseten	Per call	Enterprise inference platform with private VPCs and SLAs; pricing by contract
Fal AI	Per second	Optimized for image/video generation (Flux, SDXL); sub-second inference on popular models

Serverless GPU prices change frequently. Baseten and Fal AI do not publish standard pricing. Treat these as approximate and check their pricing pages or contact sales for current rates.

Fireworks AI is another serverless inference option targeting open-weight models; see the Fireworks AI alternatives comparison for a full pricing breakdown against dedicated GPU.

On-Demand GPU Providers

Provider	H100 On-Demand $/hr	Spot $/hr	Billing unit
Spheron	$2.01 (PCIe)	Variable (select GPUs)	Per second
RunPod	$2.69	Available	Per second
Lambda Labs	$2.49 (PCIe) / $2.99 (SXM, per-GPU rate in 8xH100 config)	N/A	Per hour
CoreWeave	~$4.76 (GPU component only; ~$6.15/GPU bundled with CPU/RAM)	N/A	Per hour
Vast.ai	$1.35-$1.53	Available	Per minute
AWS	~$6.88 per GPU on p5.48xlarge (post-June 2025 44% cut)	Variable (check AWS console for current rates)	Per second (1-min min)
GCP	~$3.00-$9.80 (wide regional variance and price fluctuations in 2025-2026; verify current rates)	Variable (check GCP console for current rates)	Per second (1-min min)

Pricing fluctuates based on GPU availability. The prices above are as of March 23, 2026 and may have changed. Check current GPU pricing for live rates.

For a broader look at the on-demand providers beyond just H100 pricing, see our top 10 cloud GPU providers guide.

Provider availability notes: Lambda Labs H100 availability can be limited during peak demand; check current availability before budgeting. Vast.ai is a marketplace, so pricing is volatile and reliability varies by host. GCP H100 on-demand pricing has seen significant fluctuations in 2025-2026; verify current rates before making cost comparisons.

Reserved GPU Providers

Provider	H100 On-Demand $/hr	H100 Reserved (1yr effective)	Discount
AWS	~$6.88 per GPU on p5.48xlarge (post-June 2025 44% cut)	~$3.80	~45%
GCP	~$3.00-$9.80 (regional variance; verify current rates)	~$4.00 (estimated)	~59% (vs US standard rate; verify for your region)
Azure	~$6.98 (single-GPU NC40ads_H100_v5, East US) / ~$12.29 per GPU (8-GPU ND96isr H100 v5)	~$5.50 (estimated; not publicly listed)	~55% (est., vs the ~$12.29 per-GPU rate)
CoreWeave	~$4.76 (GPU component only)	Negotiated (contact sales)	Not published; reserved clusters are quoted per contract

Azure H100 pricing varies by VM type: $6.98/hr for single-GPU VMs (NC40ads_H100_v5, East US), $12.29/hr per GPU for 8-GPU configurations (ND96isr H100 v5). Pricing is highly region-dependent; verify rates for your target region before budgeting.

For Spheron, volume and reserved pricing is available: contact sales via app.spheron.ai or email. Spheron does not publish fixed reserved rates, but its on-demand rate ($2.01/hr for H100 PCIe) already undercuts AWS 1-year reserved pricing (~$3.80/hr), with no contract commitment required.

When Serverless GPU, On-Demand, or Reserved Saves You Money

Workload type	Daily GPU hours	Cheapest model	Notes
Low-traffic API	<0.5 hr equivalent	Per-second on-demand	80% cheaper than serverless at 100 req/day with per-second billing ($3.35/month vs $16.67/month); choose serverless only if zero instance management matters more than cost
Dev / test	0.5-3 hr	Per-second on-demand	Stop when idle
Batch jobs	2-8 hr	Spot on-demand	Use checkpointing
Production inference	12-24 hr	Spot or on-demand neo-cloud	Beats hyperscaler reserved
Long-term 24/7 production	24 hr for 6+ months	Reserved (if on hyperscaler)	Only if already in AWS/GCP/Azure

What Is the Breakeven Point for Reserved vs On-Demand?

The simplified breakeven rule: breakeven utilization = 1 - (discount percentage). At a ~45% AWS discount, you break even at 55% utilization. Below that, on-demand is cheaper.

Reserved makes sense when:
(Reserved monthly cost) < (On-demand rate x actual hours used per month)

Example 1 (AWS H100 SXM per GPU on p5.48xlarge, 720-hour month, ~$6.88/hr on-demand post-June 2025 reduction):
$3.80/hr x 720 = $2,736 (AWS 1-yr reserved effective)
vs. $6.88/hr x X hours = $2,736
X = 398 hours -> You need to use it more than 398 hours/month
(55% utilization) to break even on AWS reserved vs AWS on-demand

But compare to Spheron on-demand:
$2.01/hr x 720 = $1,447/month (no contract required)
Spheron on-demand at $1,447/month beats AWS 1-year reserved by ~$1,289/month, with no contract.

Example 2 (GCP H100, 720-hour month, ~59% reserved discount vs $9.80/hr US standard rate):
$4.00/hr x 720 = $2,880 (GCP 1-yr reserved effective, estimated)
vs. $9.80/hr x X hours = $2,880
X = 294 hours -> You need 41% utilization to break even on GCP reserved vs GCP on-demand
But GCP reserved at ~$4.00/hr still costs $2,880/month vs Spheron on-demand at $1,447/month.
Switching to a neo-cloud provider beats signing a GCP reserved contract.
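Both examples apply the same formula, which can be sketched as two small functions. The rates are the ones quoted above; the function names are illustrative only.

```python
def breakeven_hours(reserved_rate: float, on_demand_rate: float,
                    hours_in_month: int = 720) -> float:
    """Hours of on-demand use per month that cost as much as the reserved commitment."""
    return reserved_rate * hours_in_month / on_demand_rate

def breakeven_utilization(reserved_rate: float, on_demand_rate: float) -> float:
    """Same threshold as a fraction of the month; equals 1 - discount."""
    return reserved_rate / on_demand_rate

for name, reserved, on_demand in [("AWS H100", 3.80, 6.88), ("GCP H100", 4.00, 9.80)]:
    hours = breakeven_hours(reserved, on_demand)
    util = breakeven_utilization(reserved, on_demand)
    print(f"{name}: {hours:.0f} hours/month ({util:.0%} utilization)")
```

Because the breakeven utilization is just the ratio of the two rates, the month length cancels out: the threshold is the same for a 720-hour or a 744-hour month.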

The key insight: AWS H100 on-demand pricing dropped to ~$6.88/hr per GPU (on p5.48xlarge) in June 2025 after a 44% cut from ~$12.29/hr. AWS 1-year reserved runs ~$3.80/hr effective. Spheron on-demand at $2.01/hr undercuts AWS reserved rates without any contract commitment. For workloads running less than 24/7, the comparison favors Spheron even more strongly. To track actual GPU utilization and avoid paying for idle time, see GPU monitoring best practices.

Spheron's Billing Model

Spheron bills on-demand GPU rentals by the second with no hourly minimum. Spot pricing is available on select GPUs for fault-tolerant workloads at variable rates depending on current GPU availability. There are no contracts, no egress fees, and no reserved commitment required.

GPU	Spheron On-Demand	AWS On-Demand	AWS 1-yr Reserved
H100 SXM (per GPU on p5.48xlarge)	$2.50/hr (SXM5)	~$6.88/hr	~$3.80/hr
A100 80G PCIe	$1.07/hr	~$3.43/hr	~$2.00/hr (est.)
A100 80G SXM4	$1.14/hr	~$3.43/hr	~$2.00/hr (est.)

Spheron pricing as of March 23, 2026. Prices fluctuate based on GPU availability. Check current Spheron pricing for live rates.

Spheron H100 PCIe on-demand ($2.01/hr) undercuts AWS H100 1-year reserved pricing (~$3.80/hr). You get a lower rate with no contract or commitment. For billing details and instance types, see docs.spheron.ai/billing.

For H100 rental, A100 rental, H200 rental, and other GPU options, Spheron offers per-second billing with no upfront commitment.

Decision Framework

Work through these questions in order:

Low-traffic API (fewer than 200 requests/day)? Per-second on-demand is cheaper than serverless at this traffic level if you use a provider that bills by the second with no hourly minimum. At 100 requests/day with 2-second compute each, per-second on-demand costs $3.35/month vs $16.67/month for serverless (80% cheaper). Choose serverless only if you want zero instance management and can tolerate cold start latency of 200ms-4s for small models or 6-60s for large LLMs, and are willing to pay 5x more for that convenience.

Fault-tolerant workload (checkpointed training, batch inference)? Use spot GPU when available. Save on compute costs with no contract. See our GPU cost optimization playbook for checkpoint strategies. Teams running offline document processing pipelines can combine spot instances with batch inference patterns for the lowest possible cost-per-token.

Running 24/7 on AWS or GCP already? Reserved may be worth calculating. But compare the reserved rate to neo-cloud on-demand first. You may save more by switching providers than by signing a contract.

Everything else (training, sustained serving, dev/test, burst workloads)? On-demand with per-second billing. Stop the instance when you are done. No idle waste, no contract risk.
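The questions above can be sketched as an ordered set of checks. This is an illustrative encoding rather than a definitive rule set: the `Workload` fields and the returned labels are hypothetical, and the 200 requests/day cutoff is taken from the first question.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    requests_per_day: int = 0            # 0 means "not a request-driven API"
    needs_zero_management: bool = False  # willing to pay more to avoid managing instances
    fault_tolerant: bool = False         # checkpointed training or batch inference
    runs_24_7: bool = False
    on_hyperscaler: bool = False         # already on AWS or GCP

def recommend(w: Workload) -> str:
    """Walk the questions in order and return the first match."""
    if 0 < w.requests_per_day < 200:
        return "serverless" if w.needs_zero_management else "per-second on-demand"
    if w.fault_tolerant:
        return "spot"
    if w.runs_24_7 and w.on_hyperscaler:
        return "compare reserved vs neo-cloud on-demand"
    return "per-second on-demand"

print(recommend(Workload(requests_per_day=100)))
print(recommend(Workload(fault_tolerant=True)))
print(recommend(Workload(runs_24_7=True, on_hyperscaler=True)))
```

Order matters: a fault-tolerant 24/7 workload on a hyperscaler is routed to spot before reserved is considered, matching the sequence of the questions.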

The most common mistake is defaulting to hyperscaler reserved pricing without comparing to neo-cloud on-demand. AWS H100 1-year reserved at ~$3.80/hr effective requires a 1-year commitment and still costs more than Spheron H100 PCIe on-demand at $2.01/hr, which requires no contract. For workloads running below the reserved breakeven utilization (about 55% at a ~45% discount), on-demand is the better choice even within the same provider.

Serverless GPU Providers Compared

If serverless is the right billing model for your workload, the next question is which provider to use. Five platforms dominate the serverless GPU space in 2026, each with different tradeoffs on cold start, pricing model, and ergonomics:

Provider	Pricing model	Typical cold start	Best for
Modal	Per-second compute + per-second container	200ms-2s small models, 6-30s for LLMs	Python-native deployments, teams that want Pythonic deploy DX
RunPod Serverless	Per-second worker active time	200ms-4s small, 10-60s for large LLMs	OpenAI-compatible LLM endpoints, vLLM workers, FlashBoot-enabled containers, teams already on RunPod
Replicate	Per-second (model-specific rates)	5-30s for image diffusion, 30-90s for video	Hosting open-source models with a public API, one-off generative AI APIs
Baseten	Per-replica-hour billing on serverless tier	10-40s for LLM cold start	Truss-based ML model serving, teams already using their MLOps stack
Cerebrium	Per-second compute	5-15s typical	Serverless inference with simpler config than Modal

Where serverless wins on dollars: traffic under roughly 30% sustained GPU utilization. Below that threshold, you pay only for active inference seconds plus container startup overhead, and the fully-loaded cost beats a dedicated on-demand GPU running idle most of the day. Above 30% utilization, dedicated on-demand on Spheron, RunPod, or any neo-cloud is cheaper because the per-second active billing on serverless platforms carries roughly 2-4x the per-second cost of equivalent dedicated GPU time.
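The utilization threshold follows from the price premium. A minimal sketch, assuming the dedicated GPU is billed for every hour it is left running, serverless is billed only for active time, and cold-start overhead is ignored; the rates in the usage lines are hypothetical.

```python
def serverless_breakeven_utilization(serverless_rate_per_hour: float,
                                     dedicated_rate_per_hour: float) -> float:
    """Utilization below which serverless costs less than an always-on dedicated GPU.

    Serverless cost per hour of wall time = utilization * serverless rate.
    Dedicated cost per hour of wall time  = dedicated rate (idle time is billed too).
    """
    return dedicated_rate_per_hour / serverless_rate_per_hour

def cheaper_option(utilization: float, serverless_rate_per_hour: float,
                   dedicated_rate_per_hour: float) -> str:
    threshold = serverless_breakeven_utilization(serverless_rate_per_hour,
                                                 dedicated_rate_per_hour)
    return "serverless" if utilization < threshold else "dedicated"

# A 2x premium breaks even at 50% utilization, a 4x premium at 25%.
print(serverless_breakeven_utilization(2.0, 1.0))
print(serverless_breakeven_utilization(4.0, 1.0))
print(cheaper_option(0.10, serverless_rate_per_hour=6.0, dedicated_rate_per_hour=2.0))
```

With a 2-4x premium the threshold falls between 25% and 50%, which brackets the roughly 30% figure cited above.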

Where serverless loses: long-running training jobs (the per-second premium adds up over long runs), workloads with strict latency SLAs (cold starts of 6-60 seconds break user-facing applications), and any deployment that needs persistent GPU state (cached models, KV cache reuse across requests). Spheron does not currently offer a serverless tier. For production inference at sustained traffic, dedicated on-demand or spot is the better economic fit.

The honest framing: serverless GPU providers solve a real problem (zero idle cost for spiky low-volume traffic) but at a structural per-second premium. The right play for most teams is dedicated on-demand for sustained workloads and serverless only for genuinely bursty endpoints below 30% utilization. For a more detailed breakdown of when each serverless platform's specific pricing math works, see Modal Alternatives, Replicate Alternatives, and Baseten Alternatives.

For on-demand and reserved GPU workloads, Spheron offers transparent per-second billing with no contracts or egress fees. H100 PCIe starts at $2.01/hr with no commitment required, undercutting AWS 1-year reserved pricing (~$3.80/hr) without the lock-in.

Frequently Asked Questions

Serverless GPU charges per inference call or per compute-second with no idle cost, but has cold start latency ranging from 200ms-4s for small models on optimized platforms (Modal, RunPod FlashBoot) to 6-60s for large LLM deployments, depending on container size and caching. On-demand GPU rents a dedicated instance billed per second, minute, or hour depending on the provider (Spheron bills per second), starts in under 2 minutes, and has predictable throughput. Serverless suits low-traffic APIs; on-demand suits sustained inference or training.

The threshold depends on the discount. The formula is: breakeven utilization = 1 - (discount percentage). AWS P5 (H100 SXM) on-demand dropped 44% in June 2025 to ~$6.88/hr per GPU (p5.48xlarge at $55.04/hr / 8 GPUs). The 1-year reserved effective rate is around $3.80/hr, roughly a 45% discount, so you need about 55% utilization to make reserved cost-effective. Compare the reserved rate to neo-cloud on-demand pricing before committing: Spheron H100 PCIe at $2.01/hr undercuts AWS 1-year reserved with no contract required.

For workloads running fewer than 5-6 hours per day, serverless or per-second on-demand billing is cheaper than any reserved contract. Serverless costs the least for workloads that tolerate cold starts (batch jobs, async inference). On-demand with per-second billing (like Spheron) is the next best option for interactive workloads.

Spheron bills per second with no minimums and no egress fees. Spot pricing is available on select GPUs at variable rates depending on availability. AWS and GCP bill per second with a 1-minute minimum for Linux instances, charge for egress, and require 1 or 3-year reserved contracts for discounts. Spheron has no contracts for on-demand use.

Spot GPUs are excess capacity sold at a discount versus on-demand rates. Discounts vary based on GPU type and current market availability. Spot instances can be interrupted with short notice (2 minutes on AWS, 30 seconds on GCP/Azure). Best for fault-tolerant training jobs that checkpoint regularly. On-demand GPUs are guaranteed to run until you stop them.

For most AI training runs, on-demand with per-second billing offers the best combination of cost, flexibility, and control. Short training jobs (hours to days) on a neo-cloud like Spheron cost less than equivalent AWS or GCP on-demand, and far less than hyperscaler reserved. For longer 24/7 training workloads on hyperscalers, reserved contracts can help, but compare the reserved rate to neo-cloud on-demand first since neo-cloud on-demand often beats hyperscaler reserved without any contract commitment.

URL: https://www.spheron.network/blog/serverless-gpu-vs-on-demand-vs-reserved/

Serverless vs On-Demand vs Reserved GPU: Choose the Right Billing Model (Save 40-80%) | Spheron Blog