Spheron GPU Catalog

NVIDIA L40S GPU: 48GB Specs, Pricing & Rental. Rent L40S GPU from $0.61/hr

48GB GDDR6 ECC Ada Lovelace data center GPU with FP8 Tensor Cores. L40S GPU rentals tuned for inference, video, and visual AI.

At a glance

You can rent an NVIDIA L40S on Spheron starting at $0.61 per GPU per hour on the spot tier, the lowest live marketplace rate. Billing is per minute with no long-term contracts, and instances deploy in under 2 minutes across data center partners in multiple regions. Each card ships with 48GB of GDDR6 ECC memory, 4th generation Tensor Cores with FP8 support, 3rd generation RT Cores, and hardware AV1 encode. The L40S is purpose-built for production inference of 7B-30B LLMs, Stable Diffusion and SDXL serving, video transcoding pipelines, and mixed AI + graphics workloads where you need data center reliability without H100 pricing.

GPU Architecture: NVIDIA Ada Lovelace

VRAM: 48 GB GDDR6 (with ECC)

Memory Bandwidth: 864 GB/s

NVIDIA L40S specifications

GPU Architecture: NVIDIA Ada Lovelace
VRAM: 48 GB GDDR6 (with ECC)
Memory Bandwidth: 864 GB/s
Tensor Cores: 4th Generation
CUDA Cores: 18,176
RT Cores: 3rd Generation
FP32 Performance: 91.6 TFLOPS
FP16 Performance: 183.2 TFLOPS
INT8 Performance: 733 TOPS
System RAM: 128 GB DDR5
vCPUs: 22 vCPUs
Storage: 625 GB NVMe SSD
Network: PCIe Gen4
TDP: 350W

NVIDIA L40S pricing

Provider	Price/hr	Savings
Spheron (your price)	$0.96/hr dedicated, $0.61/hr spot	-
RunPod	$0.79/hr	1.3x more expensive
Lambda Labs	$1.29/hr	2.1x more expensive
CoreWeave	$1.89/hr	3.1x more expensive
AWS (g6e.xlarge)	$1.86/hr	3.0x more expensive
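The Savings column appears to be computed against the $0.61/hr spot rate rather than the $0.96/hr dedicated rate. The short sketch below reproduces that arithmetic from the rates in the table; the variable and function names are illustrative only, and the dedicated-rate comparison is an added calculation, not a figure from the table.

```python
# Hourly rates copied from the pricing table above (USD per GPU-hour).
SPHERON_SPOT = 0.61
SPHERON_DEDICATED = 0.96

competitors = {
    "RunPod": 0.79,
    "Lambda Labs": 1.29,
    "CoreWeave": 1.89,
    "AWS (g6e.xlarge)": 1.86,
}


def price_ratio(rate: float, baseline: float) -> float:
    """Return how many times `rate` is relative to `baseline`."""
    return rate / baseline


# Ratio of each provider's rate to the Spheron spot and dedicated rates.
ratios_vs_spot = {
    name: round(price_ratio(rate, SPHERON_SPOT), 1)
    for name, rate in competitors.items()
}
ratios_vs_dedicated = {
    name: round(price_ratio(rate, SPHERON_DEDICATED), 1)
    for name, rate in competitors.items()
}

for name in competitors:
    print(f"{name}: {ratios_vs_spot[name]}x spot, "
          f"{ratios_vs_dedicated[name]}x dedicated")
```

Against the spot rate the ratios come out to 1.3x, 2.1x, 3.1x, and 3.0x, matching the table. Against the dedicated rate they are smaller (RunPod's listed rate works out to about 0.8x), so it is worth comparing like for like: spot against spot, on-demand against on-demand.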

Custom & Reserved

Need More L40S Than What's Listed?

Large L40S clusters, custom configs, or guaranteed long-term capacity.

Reserved Capacity

Commit to a duration, lock in availability and better rates

Custom Clusters

8 to 512+ GPUs, specific hardware, InfiniBand configs on request

Supplier Matchmaking

Spheron sources from its certified data center network, negotiates pricing, handles setup

Need more L40S capacity? Tell us your requirements and we'll source it from our certified data center network.

Typical turnaround: 24–48 hours

When to pick the L40S

Scenario 01

Pick L40S if

You're running production inference for 7B-30B LLMs, SDXL serving, or video transcoding pipelines and need ECC + data center drivers without H100 pricing. Also the pick when you need FP8 support but don't need HBM bandwidth, and when AV1 hardware encode is on the requirements list.

Recommended fit

Scenario 02

Pick A100 80GB instead if

Your workload is training-heavy and bandwidth-bound. A100 has 2 TB/s HBM2e (vs 864 GB/s GDDR6 on L40S), making it faster for pre-training and fine-tuning. L40S wins at inference, A100 wins at training.

Recommended fit

Scenario 03

Pick RTX 4090 instead if

Your model fits in 24GB and you're running dev / testing workloads where ECC and multi-tenant isolation don't matter. RTX 4090 is roughly half the hourly rate of L40S.

Recommended fit

Scenario 04

Pick H100 instead if

You need HBM3 bandwidth (3.35 TB/s) or NVLink for multi-GPU tensor parallelism. H100 is the right pick for 70B+ inference or any training job where memory bandwidth is the bottleneck.

Recommended fit

NVIDIA L40S use cases

Use case / 01

Optimized

⚡

AI Inference at Scale

Run cost-effective inference workloads with 48GB memory and INT8 support for high-throughput production deployments.

Production LLM inference (up to 30B params)
Multi-model serving
Recommendation system deployment
Real-time classification APIs

Use case / 02

Optimized

🎬

Video Processing & Encoding

Use hardware-accelerated video pipelines for live streaming, transcoding, and video analytics at scale.

Live video transcoding
Cloud gaming
Video analytics
Real-time virtual production

Use case / 03

Optimized

🖼️

Visual Computing & Rendering

Combine AI acceleration with professional graphics capabilities for rendering and visualization workloads.

3D rendering workloads
Virtual desktop infrastructure (VDI)
Architectural visualization
Product design rendering

Use case / 04

Optimized

🔄

Mixed AI + Graphics Workloads

Take advantage of the L40S's unique combination of AI and graphics acceleration for next-generation creative and visual AI applications.

AI-powered video editing
Generative AI for visual content
Neural radiance fields (NeRF)
Real-time style transfer

NVIDIA L40S benchmarks

METRIC 01

LLaMA 2 13B Inference

2,800 tokens/s

FP16 batch 32

METRIC 02

Stable Diffusion XL

32 img/min

1024x1024 FP16

METRIC 03

Video Transcoding

8x real-time

4K H.265 to H.264

METRIC 04

BERT Large Inference

6,200 seq/sec

INT8

METRIC 05

Ray Tracing Performance

3rd Gen RT Cores

hardware RT, A100 has none

METRIC 06

VDI User Density

3x more users

vs previous gen per GPU

Serve Llama 3.1 8B at FP8 on L40S

L40S's 48GB GDDR6 ECC and FP8 Tensor Cores make it a strong fit for production 7B-13B inference with heavy concurrency. vLLM gives you an OpenAI-compatible endpoint in one command.

bash

Spheron

# SSH into your L40S instance
ssh root@<instance-ip>

# Install vLLM
pip install vllm

# Launch Llama 3.1 8B FP8 with high concurrency
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 \
  --max-model-len 16384 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.9

# Test the endpoint
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"Hello","max_tokens":50}'

For 30B-class models, Qwen 2.5 32B fits at FP8 (roughly 32GB of weights) with room for KV cache at moderate batch sizes. Mixtral 8x7B has about 47B total parameters, so it needs 4-bit AWQ quantization to fit in 48GB.
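Once the server is up, the same test request can be issued from application code. The following is a minimal sketch using only the Python standard library; it assumes the server is reachable at `http://localhost:8000` as in the curl test, and the helper names `build_completion_request` and `complete` are made up for this example.

```python
import json
import urllib.request

# Assumption: same host, port, and model name as the curl test above.
BASE_URL = "http://localhost:8000"
MODEL = "meta-llama/Llama-3.1-8B-Instruct"


def build_completion_request(prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style /v1/completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 50, timeout: float = 60.0) -> str:
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style completion responses carry the text under choices[i].text.
    return body["choices"][0]["text"]


# Usage (requires the vLLM server to be running):
# print(complete("Hello"))
```

Keeping request construction separate from the network call makes the payload easy to inspect or unit-test without a live GPU instance.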

NVIDIA L40S guides and resources

01 Read

GPU Cloud Benchmarks 2026

See how L40S performs against A100 and RTX 4090 in real-world benchmarks across GPU cloud providers.

02 Read

Best NVIDIA GPUs for LLMs: Complete Ranking Guide

Where L40S fits in the GPU lineup for LLM inference, and when it's the right budget choice.

03 Read

The GPU Cloud Cost Optimization Playbook

How to cut your AI compute bill by 60%, including when to pick L40S over pricier alternatives.

01 Technical Brief

NVIDIA L40S Release Date and Cloud Availability

The NVIDIA L40S was announced at SIGGRAPH in August 2023 as the data-center inference and visualization sibling of the workstation RTX 6000 Ada Generation, both built on the Ada Lovelace architecture. Production shipments began in Q4 2023, with cloud availability rolling out through H1 2024. RunPod, Lambda Labs, CoreWeave, and the broader neo-cloud ecosystem had L40S capacity by mid-2024.

On Spheron the L40S is available with per-minute billing and no contract, deployed via data center partners. Live availability and pricing are on the pricing page. The L40S is the cost-efficient inference option for 7B-30B parameter LLM serving and Stable Diffusion XL pipelines; for larger models or distributed training, the H100, H200, or B200 is the step up.

02 Technical Brief

L40S VRAM and Memory Bandwidth: 48GB GDDR6 ECC at 864 GB/s

The L40S ships with 48GB of GDDR6 ECC memory at 864 GB/s of bandwidth. That bandwidth is roughly 3.9x lower than the H100 SXM5's (3.35 TB/s HBM3), and the 48GB of VRAM is 60% of the H100's 80GB. For 7B-13B model inference at low to moderate concurrency, the bandwidth gap to the H100 matters less than the price-per-hour gap, making the L40S the cost-efficient pick.
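To get a feel for what the bandwidth gap means, a common back-of-the-envelope estimate treats single-stream decoding as memory-bound: each generated token reads every weight once, so tokens per second is capped at bandwidth divided by weight size. The sketch below applies that rule of thumb. It is an upper bound under simplifying assumptions (no KV cache traffic, no batching, perfect bandwidth utilization), not a benchmark, and the function name is illustrative.

```python
# Memory bandwidth figures quoted in the text above, in GB/s.
L40S_BANDWIDTH_GB_S = 864.0
H100_SXM5_BANDWIDTH_GB_S = 3350.0


def decode_ceiling_tokens_per_s(
    bandwidth_gb_s: float, params_billion: float, bytes_per_param: float
) -> float:
    """Upper bound on single-stream decode speed for a memory-bound model.

    Assumes every weight is read once per token and nothing else
    (KV cache, activations) consumes bandwidth.
    """
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / weight_gb


# An 8B-parameter model in FP16 (2 bytes per parameter) has 16 GB of weights.
l40s = decode_ceiling_tokens_per_s(L40S_BANDWIDTH_GB_S, 8, 2)
h100 = decode_ceiling_tokens_per_s(H100_SXM5_BANDWIDTH_GB_S, 8, 2)

print(f"L40S ceiling: {l40s:.0f} tokens/s per stream")
print(f"H100 ceiling: {h100:.0f} tokens/s per stream")
print(f"Bandwidth ratio: {h100 / l40s:.1f}x")
```

Batching amortizes each weight read across many concurrent requests, which is why aggregate serving throughput can sit far above this per-stream ceiling and why the gap narrows in cost terms for high-concurrency serving.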

Where the 48GB VRAM matters: Llama 3.1 8B fits in FP16 with substantial KV cache headroom for high-concurrency serving, a 13B model fits in FP16 with smaller batches, a 30B-class model fits in INT4 quantization, and Stable Diffusion XL with multiple LoRA adapters and ControlNets runs without OOM. ECC support adds reliability for production serving that consumer GPUs lack. For 70B model inference or anything requiring NVLink for multi-GPU tensor parallelism, step up to the A100 80GB or H100 SXM5. For higher-volume single-card inference where FP4 is acceptable, the B200 spot tier is competitive.
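The fit claims above follow from simple weight-size arithmetic: parameters times bytes per parameter. Here is a rough sketch of that calculation. It counts weights only and uses decimal gigabytes, so real deployments need extra room for KV cache, activations, and framework overhead; the precision table and function names are assumptions made for illustration.

```python
# Approximate storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

L40S_VRAM_GB = 48.0


def weight_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return params_billion * BYTES_PER_PARAM[precision]


def headroom_gb(params_billion: float, precision: str,
                vram_gb: float = L40S_VRAM_GB) -> float:
    """VRAM left for KV cache and overhead; negative means it does not fit."""
    return vram_gb - weight_gb(params_billion, precision)


cases = [(8, "fp16"), (13, "fp16"), (30, "fp16"), (30, "int4"), (70, "fp16")]
for params, precision in cases:
    left = headroom_gb(params, precision)
    verdict = "fits" if left > 0 else "does not fit"
    print(f"{params}B {precision}: {weight_gb(params, precision):.0f} GB weights, "
          f"{left:+.0f} GB headroom ({verdict})")
```

The output lines up with the paragraph: 8B in FP16 leaves about 32 GB free, 13B in FP16 leaves about 22 GB, a 30B model overflows at FP16 but needs only about 15 GB at INT4, and 70B in FP16 is far beyond a single card.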

FAQ / 11

NVIDIA L40S FAQ

The A100 is better suited for training workloads thanks to its HBM2e memory and higher memory bandwidth. The L40S, on the other hand, excels at inference and mixed AI+graphics workloads with its 48GB GDDR6 memory, 3rd generation RT Cores for ray tracing, and lower cost per hour. If your primary use case is inference or visual computing, the L40S offers significantly better value.

Yes, the L40S is excellent for LLM inference. With 48GB of GDDR6 memory, it handles models up to about 13B parameters in FP16 and 30B-class models with FP8 or INT4 quantization. It delivers high throughput with INT8, FP8, and FP16 precision support, making it well suited to production LLM deployment at a lower cost than the H100. For inference-heavy workloads, the L40S provides outstanding price-performance.

The L40S combines strong AI acceleration with professional graphics capabilities, including 3rd generation RT Cores for ray tracing and hardware video encode/decode. It is one of the few data center GPUs that offers both powerful AI inference performance and full graphics capabilities, making it ideal for workloads that require both AI and visual computing, such as AI-powered video editing, generative visual content, and virtual production.

Yes, the L40S can handle training for small to medium-sized models effectively. However, its GDDR6 memory bandwidth is lower than HBM found in A100 and H100, so for large-scale training workloads, those GPUs are better choices. The L40S truly excels at inference, where its 48GB memory and strong INT8/FP16 performance provide excellent throughput at a competitive price.

The L40S features hardware NVENC/NVDEC engines supporting H.264, H.265, and AV1 codecs at up to 8K resolution. This makes it perfect for cloud gaming, live streaming, video transcoding, and video analytics workloads. The combination of AI acceleration and hardware video processing enables advanced use cases like real-time video analytics and AI-powered content creation.
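As an illustration of a hardware transcoding job, the sketch below builds an `ffmpeg` command line that decodes on the GPU and encodes with NVENC. It assumes an `ffmpeg` build with NVENC/NVDEC support is installed on the instance; the file names, bitrate, and the `nvenc_transcode_cmd` helper are placeholders, not settings recommended by Spheron.

```python
import shlex


def nvenc_transcode_cmd(src: str, dst: str,
                        encoder: str = "h264_nvenc",
                        bitrate: str = "8M") -> list[str]:
    """Build an ffmpeg argument list for a GPU decode + NVENC encode.

    `encoder` can be h264_nvenc, hevc_nvenc, or av1_nvenc, provided
    the installed ffmpeg build includes it.
    """
    return [
        "ffmpeg",
        "-hwaccel", "cuda",                # decode on the GPU (NVDEC)
        "-hwaccel_output_format", "cuda",  # keep decoded frames in GPU memory
        "-i", src,
        "-c:v", encoder,                   # encode on the GPU (NVENC)
        "-b:v", bitrate,
        "-c:a", "copy",                    # pass audio through untouched
        dst,
    ]


# Example: 4K H.265 source to H.264 output (placeholder file names).
cmd = nvenc_transcode_cmd("input_4k_hevc.mp4", "output_h264.mp4")
print(shlex.join(cmd))

# To run it on the instance:
# import subprocess; subprocess.run(cmd, check=True)
```

Switching `encoder` to `av1_nvenc` targets the AV1 hardware encoder instead, with the same caveat about ffmpeg build support.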

The L40S has 48GB of memory compared to 24GB on the RTX 4090, along with ECC memory support and data center-grade reliability. This makes the L40S significantly better for production inference workloads where uptime and memory capacity matter. The RTX 4090 is a more affordable option for development and experimentation, but the L40S is the clear choice for deployment at scale.

There's no minimum. Rates are quoted per hour and billed per minute. Rent an L40S for just an hour to test your workload, or keep it running for months. You pay only for what you use, with no long-term contracts or commitments.
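To estimate a bill under per-minute billing, multiply the billed minutes by the hourly rate divided by 60. The sketch below assumes partial minutes are rounded up to the next whole minute, which is an assumption made for illustration; check the pricing page for the actual rounding rule. The rates are the dedicated and spot prices listed in the pricing table.

```python
import math

DEDICATED_RATE_USD_PER_HR = 0.96
SPOT_RATE_USD_PER_HR = 0.61


def billed_cost_usd(hourly_rate_usd: float, runtime_seconds: float) -> float:
    """Cost of one GPU for `runtime_seconds` under per-minute billing.

    Assumption: any started minute is billed as a full minute.
    """
    billed_minutes = math.ceil(runtime_seconds / 60)
    return round(billed_minutes * hourly_rate_usd / 60, 4)


# A 17.5-minute smoke test on a dedicated instance is billed as 18 minutes.
print(billed_cost_usd(DEDICATED_RATE_USD_PER_HR, 17.5 * 60))

# A full hour on a spot instance costs the hourly rate.
print(billed_cost_usd(SPOT_RATE_USD_PER_HR, 3600))
```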

Yes, the 48GB of GDDR6 memory lets you run 2-3 smaller models (around 7B parameters each) side by side, or one larger quantized model (up to 30B parameters). The L40S also supports NVIDIA MPS (Multi-Process Service) for efficient multi-process GPU sharing, so you can serve multiple models concurrently with better resource utilization.

L40S GPUs are currently available in the US, Europe, and Canada. We're continuously expanding capacity and regions. Check our app or contact sales for specific region requirements.

Our platform is plug-and-play for standard deployments. For 100+ GPU clusters, you get dedicated support via Slack or Discord, plus sourcing assistance. Enterprise customers get dedicated support channels and SLA guarantees.

Book a call with our team→

Dedicated L40S instances are non-interruptible, run on a 99.99% SLA, and bill per-minute at the on-demand rate. Spot instances run on spare capacity at meaningfully lower rates but can be preempted when dedicated demand rises. Use spot for fault-tolerant workloads: batch inference, LoRA fine-tuning with checkpointing every 15-30 minutes, or video transcoding jobs that can resume. Use dedicated for customer-facing inference endpoints, live streaming pipelines, and any SLA-bound serving workload. Both tiers live in the same control plane, so you can mix them across a project.
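For spot workloads, the key is that a preempted job can pick up where it left off. Below is a minimal sketch of a resumable batch loop that records progress in a small JSON checkpoint file. It checkpoints every N items rather than on a timer, and all names (`run_batch`, `load_checkpoint`, `save_checkpoint`) are hypothetical. The checkpoint file should live on storage that survives preemption.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Sequence


def load_checkpoint(path: Path) -> int:
    """Return the index of the next unprocessed item (0 if no checkpoint)."""
    if path.exists():
        return json.loads(path.read_text())["next_index"]
    return 0


def save_checkpoint(path: Path, next_index: int) -> None:
    """Write via a temp file and rename so a preemption cannot corrupt it."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent))
    with os.fdopen(fd, "w") as handle:
        json.dump({"next_index": next_index}, handle)
    os.replace(tmp_name, path)


def run_batch(items: Sequence, process: Callable, checkpoint_path: Path,
              every: int = 100) -> int:
    """Process items from the last checkpoint onward; return the count done."""
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items)):
        process(items[i])
        if (i + 1) % every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(items))
    return len(items) - start
```

Items processed after the last checkpoint are repeated on restart, so `process` should be idempotent (for example, writing each output to a deterministic file name). A smaller `every` reduces repeated work at the cost of more frequent writes.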