Why your intuition is wrong
The pitch sounds clean: pool VRAM, run bigger models, scale by adding hardware. Two 24GB cards equals 48GB of memory. The math seems right.
Except GPUs in a multi-GPU setup don’t share memory. Each card has its own VRAM, and the cards talk over PCIe, which is 20-60× slower than the GPU’s internal memory bandwidth. RTX 3090 memory bandwidth is 936 GB/s. PCIe 4.0 ×16 maxes out around 32 GB/s in each direction, roughly a 29× gap.
Two GPUs aren’t a bigger GPU. They’re two GPUs coordinating over a bottleneck.
Every layer that splits across cards forces activations across that PCIe link. Every token decode pays the cost. When the model genuinely fits on one card, the second card adds work without adding capability — that’s the source of the 3-6% slowdown. The math only flips when the model can’t fit on one card at all, and the overhead of using two becomes cheaper than the alternative (CPU offload at ~1 tok/s, or not running the model at all).
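As a rough sanity check on that gap, the sketch below divides the 3090’s memory bandwidth by theoretical PCIe ×16 throughput. The PCIe 3.0 figure (about 16 GB/s) is an assumption added here for comparison; the other two numbers come from the text above.

```python
# Rough bandwidth comparison: on-card VRAM vs. the PCIe link between cards.
VRAM_GBPS = 936.0  # RTX 3090 memory bandwidth, from the text

# Approximate theoretical x16 throughput per direction.
# The PCIe 3.0 entry is an assumption added for comparison.
PCIE_GBPS = {"PCIe 3.0 x16": 16.0, "PCIe 4.0 x16": 32.0}


def slowdown(vram_gbps: float, link_gbps: float) -> float:
    """Return how many times slower the link is than on-card memory."""
    return vram_gbps / link_gbps


ratios = {name: slowdown(VRAM_GBPS, gbps) for name, gbps in PCIE_GBPS.items()}

for name, ratio in ratios.items():
    print(f"{name}: roughly {ratio:.0f}x slower than VRAM")
```

With these inputs the ratio is about 29× on PCIe 4.0 and just under 60× on PCIe 3.0, which is consistent with the 20-60× range quoted above. Real-world link throughput is usually below the theoretical figure, so treat these as best-case numbers.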
The 2026 twist: MoE is shrinking the “too big” class
When this question was easier to answer, “too big for one card” meant a dense 70B. Llama 3 70B at Q4 needed 40-45GB, no consumer card had that much, and dual 3090s were the practical answer.
That math has shifted. The current top open-weight models are sparse Mixture-of-Experts: Qwen 3.6 35B-A3B, DeepSeek V4 Flash (284B total / 13B active), GLM-4.5. Huge total parameter counts, but only a small fraction activates per token.
The practical result: Qwen 3.6 35B-A3B runs on a single RTX 3090. Q4_K_M is about 21GB. UD-Q4_K_XL is 22.4GB and benches at ~100 tok/s on a 3090. A model that would have been “definitely multi-GPU territory” two years ago now fits on one card with room for context. (Setup details here.)
This isn’t a niche case. MoE is where the open-weights frontier is going. As the landscape shifts further toward sparse architectures, the class of models that require multi-GPU shrinks each quarter. The decision question gets harder to answer “yes” — more useful models fit on one card every release cycle.
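The file sizes quoted in this guide can be approximated from parameter count and bits per weight. The sketch below is a back-of-the-envelope estimate, not a measurement: the bits-per-weight values (about 4.8 for Q4_K_M, 8.5 for Q8_0) and the 2GB context allowance are assumptions, and the function names are made up for illustration. For an MoE, the total parameter count is what has to sit in memory, not the active count.

```python
def weights_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return total_params_b * bits_per_weight / 8


def fits_on_card(total_params_b: float, bits_per_weight: float,
                 vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the weights plus an assumed context allowance fit in VRAM."""
    return weights_gb(total_params_b, bits_per_weight) + headroom_gb <= vram_gb


# Assumed effective bits per weight; actual GGUF files vary by model.
Q4_K_M = 4.8
Q8_0 = 8.5

for label, params_b, bpw in [
    ("35B MoE at Q4_K_M", 35, Q4_K_M),
    ("70B dense at Q4_K_M", 70, Q4_K_M),
    ("32B dense at Q8_0", 32, Q8_0),
]:
    size = weights_gb(params_b, bpw)
    verdict = "fits" if fits_on_card(params_b, bpw, vram_gb=24) else "does not fit"
    print(f"{label}: ~{size:.0f} GB, {verdict} on a 24GB card")
```

Under these assumptions the estimates land near the figures quoted in this guide: about 21GB for the 35B MoE, about 42GB for a 70B at Q4, and about 34GB for a 32B at Q8. Long contexts need more than the 2GB allowance, so raise `headroom_gb` accordingly.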
The narrowing decision funnel
Work these in order. Each one is a chance to rule multi-GPU out.
1. Does your target model fit on one card at the quant you need? Pull it and try. If it loads and runs at the speed you can live with, you’re done — a faster single card (4090, 5090 at 32GB) is the upgrade path. Check VRAM math here if you’re not sure.
2. Is your target model an MoE? Check the math at Q4 or Q5 before assuming multi-GPU. Many of them (Qwen 3.6 35B-A3B, GLM-4.5 quants, mid-tier DeepSeek variants) fit on a single 24GB card and don’t benefit from splitting: an MoE reads only its active parameters for each token, so it leans on memory bandwidth far less than a dense model of the same total size. If it fits, the second card is overhead.
3. Do you genuinely need >24GB on a dense model? The two real cases are 70B at Q4 (~40-45GB) and 32B at Q8 (~34GB). Both blow past 24GB. Before saying yes, test on a 27-32B dense or a 30-35B MoE — if the smaller class is sufficient, you don’t need 48GB.
4. Are you serving multiple concurrent users? Multi-GPU scales near-linearly for batch throughput. A single user doesn’t benefit beyond ~2 cards. A team-scale local AI server with 10+ concurrent requests does — each additional GPU adds KV cache space for more simultaneous conversations.
If steps 1 through 3 all pointed to a single card and you’re a single user, multi-GPU isn’t the answer for you. If you answered yes to 3 or 4, the next section is where the second card earns its place.
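The funnel can be written as a small function. This is a minimal sketch under stated assumptions, not a sizing tool: the function name, the 2GB context headroom, and the return strings are hypothetical, and the default threshold of three concurrent sessions is the rough figure this guide uses for multi-user serving.

```python
def recommend(model_gb: float,
              card_vram_gb: float = 24.0,
              concurrent_users: int = 1,
              headroom_gb: float = 2.0,
              multi_user_threshold: int = 3) -> str:
    """Apply the funnel: check concurrent serving, then single-card fit."""
    # Step 4: many simultaneous sessions justify more cards even if the model fits.
    if concurrent_users > multi_user_threshold:
        return "multi-GPU: concurrent serving"
    # Steps 1-2: dense or MoE, if it fits with room for context, stop here.
    if model_gb + headroom_gb <= card_vram_gb:
        return "single GPU"
    # Step 3: the model genuinely needs more than one card's VRAM.
    return "multi-GPU: model exceeds one card"


print(recommend(model_gb=21.0))                       # 35B MoE at Q4
print(recommend(model_gb=42.0))                       # 70B dense at Q4
print(recommend(model_gb=21.0, concurrent_users=10))  # team server
```

The function deliberately leaves out the most important step in the list, which is testing whether a smaller model already does the job before buying a second card.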
When it flips to yes
Two narrow bands.
Dense >24GB: 70B at Q4 or 32B at Q8. No consumer single card fits either. Dual 3090s land at 16-21 tok/s on 70B Q4 — usable for chat and development. Mac Studio 96GB+ is the unified-memory alternative if Apple Silicon fits your stack. Cloud API rental ($50-200/month for occasional use) is the third option if you don’t need 24/7 access. Test smaller first — if Qwen 3.6 27B or 35B-A3B does the job, the second GPU is overhead you don’t need.
Multi-user serving: A dual-3090 setup running vLLM with tensor parallelism scales well for concurrent requests. The cost-per-concurrent-user math favors multi-GPU once you’re past about three simultaneous active sessions. If you’re running a local AI server for a household, team, or small business, this is where the second card pays for itself.
What it actually costs
One consolidated table: all-in cost for the first year, including the GPU, PSU, and electricity at typical usage. Prices reflect typical used-market ranges as of June 10, 2026.
| Setup | All-in cost (year 1) | Performance | Use case |
|---|---|---|---|
| 1× RTX 3090 (24GB) | ~$1,000 | 35-40 tok/s on 32B Q4; ~100 on 35B-A3B MoE | Covers 32B dense + MoE up to ~35B total |
| 1× RTX 4090 (24GB) | ~$2,200-2,600 | Same VRAM as 3090, 40-90% faster decode | Same coverage with more speed |
| 1× RTX 5090 (32GB) | $1,999 MSRP, street usually higher | 32GB single card; fits tighter Q5/Q6 on 32B | New flagship if 32GB matters |
| 2× RTX 3090 (48GB) | ~$2,500-3,000 | 16-21 tok/s on 70B Q4; multi-user via vLLM | 70B+ dense or multi-user serving |
| Mac Studio 96GB+ | $3,000+ | Slower per-token, fits 70B-class in unified memory | If you want 48GB+ usable without multi-GPU |
| Dual 3060 12GB | ~$650 | 18-22 tok/s on 32B Q4 | Worse than single 3090 at every metric |
The second 3090 earns its money in the narrow yes-band: dense >24GB or multi-user serving. The dual-3060 path is the one that looks cheap and isn’t: same total VRAM as a single 3090, well under half the per-card memory bandwidth (360 GB/s versus 936 GB/s), and PCIe overhead on top.
Power, PSU, and case clearance factor in. A second card adds 200-350W under load, needs a 1,200W PSU, and adds $175-300/year in electricity at 24/7 use. The headline GPU price isn’t the full bill.
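The electricity figure is simple to recompute for your own rate. The sketch below assumes $0.10/kWh, a rate chosen only because it roughly reproduces the $175-300 range above; the function name and the `duty_cycle` parameter are illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cost_usd(watts: float, usd_per_kwh: float = 0.10,
                    duty_cycle: float = 1.0) -> float:
    """Yearly electricity cost for a steady load.

    duty_cycle is the fraction of the year the card spends under load
    (1.0 means 24/7).
    """
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR * duty_cycle
    return kwh_per_year * usd_per_kwh


for watts in (200, 350):
    print(f"{watts}W at 24/7: ${annual_cost_usd(watts):.0f}/year")
```

At a higher rate the bill scales linearly, so at $0.30/kWh the same card costs three times as much to run. A card that is under load only a few hours a day costs far less; pass a `duty_cycle` below 1.0 to model that.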
Software support
Once the hardware decision is made, the tooling shape matters. Not every framework handles multi-GPU the same way.
| Tool | Multi-GPU | Mixed Sizes | Parallelism | Notes |
|---|---|---|---|---|
| Ollama | Auto since v0.11.5 (stable 0.30.6) | Yes | Pipeline default; experimental tensor-parallel (PR #19378) | Zero config, just works |
| llama.cpp | Yes | Yes (--tensor-split) | Both | Most control, best for mixed GPUs |
| vLLM | Yes | Pipeline only | Both | Tensor parallel requires matched VRAM |
| ExLlamaV2 | Yes | Yes (--gs) | Tensor (v0.3.2+) | Fast for EXL2 quantizations |
| Razer AIKit | Yes (wraps vLLM) | Via vLLM rules | Both | Turnkey Docker stack |
| Exo | Apple Silicon only | No | Layer sharding | Mac-only distributed inference |
For zero config, Ollama splits automatically. For mixed-size cards, llama.cpp’s --tensor-split gives the most control. For multi-user serving with matched cards, vLLM with tensor parallelism is the production path. See llama.cpp vs Ollama vs vLLM for the deeper comparison.
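For mixed-size cards, llama.cpp’s `--tensor-split` takes a comma-separated list of proportions, one per GPU. A simple starting point is to split in proportion to each card’s VRAM. The helper below is a hypothetical convenience function, not part of llama.cpp, and proportional-to-VRAM is an assumption that you may need to adjust once you see how much memory each card actually has free.

```python
from functools import reduce
from math import gcd


def tensor_split_arg(vram_gb: list[int]) -> str:
    """Build a --tensor-split value proportional to each card's VRAM.

    Takes whole-number VRAM sizes in GB, in GPU index order, and reduces
    them to the smallest equivalent ratio (for example 24 and 12 become 2,1).
    """
    if not vram_gb or any(v <= 0 for v in vram_gb):
        raise ValueError("vram_gb must be a non-empty list of positive integers")
    divisor = reduce(gcd, vram_gb)
    return ",".join(str(v // divisor) for v in vram_gb)


# A 3090 (24GB) paired with a 3060 (12GB).
print("--tensor-split", tensor_split_arg([24, 12]))
```

If the first GPU also drives a display or holds extra buffers, shifting the ratio slightly toward the other card is a reasonable adjustment.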
The decision
The case for multi-GPU keeps shrinking. MoE models put what used to be “multi-GPU territory” inside a single 24GB envelope, and the narrowing band where two cards beat one is mostly dense 70B+ and multi-user serving.
If you’re in that band, dual 3090s at $1,700-2,000 for the pair remain the most practical 48GB-on-used-hardware path. If you’re not, a faster single card — or the same card you already have — is the better spend.
For configuration once you’ve decided, see the multi-GPU setup guide. For card selection, see the GPU buying guide and best used GPUs for 2026. For distributed inference across machines rather than two cards in one, mycoSwarm is the project to look at.