I spent a long time building the gaming PC I wanted, iterating over the last decade and finally landing on a PC that the younger me could have only dreamed of. I've got an Nvidia RTX 5090 and an AMD Ryzen 7 9800X3D, and it handles every game that I throw at it without breaking a sweat. On top of that, I run a lot of heavy computational workloads locally, like machine learning, data analysis, and development.

However, as local LLMs have taken off, I've been playing around with them and seeing what they can do. I now run them every day, and while I had thought the RTX 5090 would be an incredible beast capable of running them at impossible speeds, I realized something very quickly: it's fast, but speed isn't all there is.

Granted, Qwen 3.6 27B is a phenomenal model, and it fits nicely in the 32GB of VRAM that the RTX 5090 has. There are other, more interesting models that I'd love to try out, but they're significantly larger than what I can fit in a mere 32GB pool. Unfortunately, I've come to realize that Apple Silicon is probably the best mainstream way to get into big local LLMs right now, because the architecture massively benefits the workload in ways that I don't think even Apple expected when it first brought its Unified Memory Architecture to the market in 2020.

For the record, I'm not saying that you should go out and buy an Apple Silicon-based machine for local AI, nor am I saying that it's the only way to run local AI. But it's pretty funny that Apple, somewhat accidentally, settled on a memory architecture that positioned it as a better alternative to the best consumer GPUs in the world for a very specific purpose. Apple has also started building more explicit tooling for this world with MLX, its machine-learning framework for Apple Silicon. It's not a CUDA equivalent in maturity or scope, and plenty of local LLM tooling still uses Metal directly, but it shows Apple is aware that unified memory has become one of its strongest AI advantages.

32 GB isn't as high a ceiling as it sounds

Memory bandwidth doesn't matter if the model doesn't load

Image credit: der8auer

The 5090 ships with 32GB of GDDR7 on a 512-bit bus, good for around 1.79 TB/s of memory bandwidth. That's the most VRAM Nvidia has put on a consumer card, and the fastest memory bus they've ever shipped to gamers. On the small stuff, it's incredibly fast, and quantized 7B and 13B models run faster than I can read the output. Even a 30B model in 4-bit quantization sits in VRAM with room to spare.

What that bandwidth buys you only matters if the model fits. If the weights, KV cache, and context buffer don't fit in 32GB, the speed drops off massively. The model starts to offload to system RAM, and that massive bandwidth is suddenly bottlenecked by whatever your DDR5 can achieve. Squeezing a quantized Llama 3.3 70B onto the 5090 is possible, but only at Q3 with a tiny context window, and it takes real effort to get there.
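To make the "does it fit" question concrete, here is a rough back-of-the-envelope sketch. The layer count, KV head count, and head dimension are my assumptions based on the published Llama 3 70B configuration, the 3.5 bits per weight is an assumed average for a Q3-style quant, and the 0.5GB runtime overhead is a guess. Real runtimes will differ, so treat the result as an estimate rather than a guarantee.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: one key and one value vector per layer,
    per KV head, per token, stored at bytes_per_value (2 = FP16)."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value / 1e9


def fits(params_billion: float, bits_per_weight: float, context_tokens: int,
         vram_gb: float = 32.0, overhead_gb: float = 0.5,
         layers: int = 80, kv_heads: int = 8, head_dim: int = 128) -> bool:
    """True if weights + KV cache + assumed runtime overhead fit in VRAM.
    The default shape values are assumptions for a Llama-3-style 70B model."""
    total = (weights_gb(params_billion, bits_per_weight)
             + kv_cache_gb(layers, kv_heads, head_dim, context_tokens)
             + overhead_gb)
    return total <= vram_gb


if __name__ == "__main__":
    # Assumed ~3.5 bits/weight for Q3; try a few context lengths.
    for ctx in (2048, 8192, 32768):
        print(f"70B at ~3.5 bpw, {ctx} tokens of context: fits = {fits(70, 3.5, ctx)}")
```

Under these assumptions the weights alone take about 30.6GB, so only a couple of thousand tokens of FP16 KV cache fit alongside them, which matches the "tiny context window" experience.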

Step up to something like Qwen3-Coder-Next at FP8, which takes up 85GB of storage, and the 5090 isn't even in the same conversation anymore. That model is a mixture-of-experts with only 3B active parameters per token, but the weights still have to live somewhere, and 85GB will never fit in 32GB. You can offload some expert layers to system RAM, which certainly helps, but it will still be slower than keeping everything in VRAM. The reason you can offload this way and still have it be usable is the same reason it works so well on Apple's unified memory: generation is much lighter on bandwidth than it would be if every parameter in the model activated for every token.

Apple's M-series chips don't separate VRAM from system RAM. The CPU and GPU access the same unified memory pool, and local LLM runtimes can use that pool without copying weights across PCIe. On a maxed-out Mac Studio with M3 Ultra, that means up to 512GB the GPU can use directly, and the same holds at the more consumer-friendly end of the lineup. A MacBook Pro with M4 Max scales to 128GB at 546 GB/s, four times the addressable memory of a 5090, in a laptop. A Mac Mini with M4 Pro tops out at 64GB, double the 5090, in a tiny machine.

You can even find M1 Max-based machines with 64GB of RAM for around $1000 depending on the used market, which can be a very reasonable cost depending on what you're buying it for, especially if local LLMs are only incidental rather than the main goal. On top of that, given that the MSRP of the 5090 is $2000 (and it realistically costs a lot more than that right now), a single MacBook Pro or Mac Studio with twice the RAM could set you back less. And that's an entire computer for the cost of a single GPU. More on that in a bit, though.

At the very top, the gap isn't just lopsided, it's straight-up absurd. The full DeepSeek R1 671B model weighs in at around 405GB once quantized to 4-bit. No 5090 runs that, and not even a four-5090 rig, with 128GB between the cards, can keep it resident in VRAM. However, Apple's 512GB Mac Studio M3 Ultra loads it at Q4 and draws roughly 160 to 180W during token generation. That's less than half the TDP of just one 5090.
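For a sense of scale, the arithmetic behind that comparison can be sketched in a few lines. The 405GB, 32GB, and 160 to 180W figures come from the text above, while the 575W board power for the 5090 is my assumption from Nvidia's published spec. The card count is a lower bound, since it ignores KV cache and per-card overhead.

```python
import math

MODEL_GB = 405               # DeepSeek R1 671B at 4-bit, per the article
VRAM_PER_5090_GB = 32
TDP_5090_W = 575             # assumption: Nvidia's published board power
MAC_DRAW_W = (160, 180)      # M3 Ultra draw during generation, per the article


def cards_needed(model_gb: float, vram_per_card_gb: float) -> int:
    """Minimum number of cards whose combined VRAM holds the weights alone."""
    return math.ceil(model_gb / vram_per_card_gb)


if __name__ == "__main__":
    n = cards_needed(MODEL_GB, VRAM_PER_5090_GB)
    print(f"5090s needed for the weights alone: {n}")
    print(f"Combined TDP of that many cards: {n * TDP_5090_W} W")
    print(f"VRAM in a four-card rig: {4 * VRAM_PER_5090_GB} GB")
    for watts in MAC_DRAW_W:
        print(f"Mac at {watts} W is {watts / TDP_5090_W:.0%} of one 5090's TDP")
```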

Slower than the 5090, faster than impossible

A model that runs is better than a model that doesn't run at all

The 5090's speed advantage on the models that fit is a big deal. The M3 Ultra tops out at 819 GB/s of memory bandwidth against the 5090's 1.79 TB/s, and the M3 Ultra is by far the fastest Apple Silicon chip on this metric. For many models that fit entirely in 5090 VRAM, especially under CUDA-optimized runtimes, you can see roughly double the token generation speed of an M3 Ultra, depending on quantization, backend, and context length. For interactive work that needs to feel snappy, the 5090 wins.

Prompt processing widens the gap even further, as Apple Silicon's prefill is meaningfully slower than CUDA at long context. Apple's M5 series does improve it, but the time-to-first-token on a 30,000-token prompt still feels markedly worse on the Mac even when generation speed afterwards is fine. In other words, short prompts and long outputs will feel fine, but pasting an entire codebase into context will be noticeably slower.

However, this comparison suddenly flips in favour of Apple when the model doesn't even fit in 32GB. R1 is a mixture-of-experts model, so only around 37B parameters activate per token, which is why an 819 GB/s machine can serve a 671B model at usable speeds at all. Bandwidth pressure looks closer to a 37B dense model than a 671B one; a genuinely dense model of that size would crawl. With that caveat noted, the M3 Ultra runs DeepSeek R1 at roughly 15 to 20 tokens per second. That is slower than most would like for a reasoning model that uses a lot of tokens just to think, but the model runs and is usable. Given that the 5090 can't even run that model, it's a pretty good trade-off.
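The mixture-of-experts point can be sanity-checked with a simple bandwidth-bound estimate. This is a sketch, assuming every active weight is read from memory once per generated token and ignoring KV cache reads, compute limits, and runtime overhead, so it gives a ceiling rather than a prediction. The function name is mine, and the 4 bits per weight is an assumed flat average for Q4.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_billion: float,
                         bits_per_weight: float = 4.0) -> float:
    """Upper bound on tokens/s when generation is limited purely by how
    fast the active weights can be streamed from memory."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


M3_ULTRA_GB_S = 819     # from the article
RTX_5090_GB_S = 1790    # 1.79 TB/s, from the article

# DeepSeek R1: ~37B active parameters per token out of 671B total.
moe_ceiling = decode_ceiling_tok_s(M3_ULTRA_GB_S, 37)
dense_ceiling = decode_ceiling_tok_s(M3_ULTRA_GB_S, 671)
bandwidth_ratio = RTX_5090_GB_S / M3_ULTRA_GB_S

if __name__ == "__main__":
    print(f"MoE ceiling on M3 Ultra:   {moe_ceiling:.1f} tok/s")
    print(f"Dense 671B ceiling:        {dense_ceiling:.1f} tok/s")
    print(f"5090 / M3 Ultra bandwidth: {bandwidth_ratio:.2f}x")
```

Under these assumptions the ceiling for R1 lands in the mid-40s of tokens per second, comfortably above the 15 to 20 observed, while a dense model of the same size would be capped at a couple of tokens per second. The bandwidth ratio also lines up with the 5090 being roughly twice as fast on models that fit.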

For small and medium models, the 5090 is faster and I prefer using it. For anything genuinely large, the Mac is the only one of my two machines that runs it at all. The question stops being which is faster and starts being which one does the thing I'm trying to do.

The price gets less ridiculous the more you look at it

It's cheaper than some of the most advanced clusters

A 512GB Mac Studio is not cheap. That configuration runs around $9,500 before you've added a keyboard, and that number buys you a decent gaming PC three or four times over, although exactly how many depends on RAM pricing at this point, to be honest.

However, the middle ground is worth looking at. A pair of 5090s gets you to 64GB. A pair of used 3090s gets you to 48GB for a lot cheaper. A single RTX Pro 6000 Blackwell hits 96GB on one card. Any of those clears the 30B-to-70B class comfortably and can reach into the 100B-ish tier depending on quantization and context. For that tier, they're genuinely competitive with a mid-spec Mac. With that said, PCIe hops between cards introduce latency that hurts long-context generation, and multi-GPU orchestration is its own software project to maintain. Plus, a four-5090 rig reaches 128GB at several times the wattage of the entire Mac Studio, and 128GB is not 405GB. Unified memory wins on cost-per-GB at the top, not in the middle.
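The cost-per-GB claim at the top end can be checked with the figures already quoted: roughly $9,500 for the 512GB Mac Studio and the $2,000 MSRP for a 32GB 5090. This is only a sketch. The GPU figure covers the card alone, with no motherboard, CPU, or power supply, and street prices are higher than MSRP, so the real gap is likely wider.

```python
import math


def cost_per_gb(price_usd: float, memory_gb: float) -> float:
    """Dollars per GB of GPU-addressable memory."""
    return price_usd / memory_gb


MAC_STUDIO_512 = (9500, 512)   # approximate price and memory, per the article
RTX_5090 = (2000, 32)          # MSRP and VRAM, per the article

mac = cost_per_gb(*MAC_STUDIO_512)
gpu = cost_per_gb(*RTX_5090)

# Cards (and MSRP spend) needed to match the Mac's 512GB on paper.
cards_to_match = math.ceil(MAC_STUDIO_512[1] / RTX_5090[1])
gpu_spend = cards_to_match * RTX_5090[0]

if __name__ == "__main__":
    print(f"Mac Studio 512GB: ${mac:.2f} per GB")
    print(f"RTX 5090 at MSRP: ${gpu:.2f} per GB")
    print(f"{cards_to_match} cards to match 512GB: ${gpu_spend:,} in GPUs alone")
```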

For the 400GB-plus class, the Nvidia alternative is not a normal stack of consumer cards, but a multi-accelerator server with enough A100/H100/H200-class memory to keep the model resident. And don't forget the power, cooling, chassis, and interconnect complexity that implies. Pricing for that kind of setup starts in the high five figures and walks confidently into six. The Mac, for all its eye-watering RAM upgrade pricing, is the cheap option at that tier.

At the more reasonable end, the comparison gets sharper. A MacBook Pro M4 Max with 128GB and a terabyte of storage costs about the same as a well-specced gaming PC built around a 5090. The PC takes the speed crown for games and small models. The MacBook Pro handles anything between 30B and 100B parameters, which covers most of the interesting models worth running locally.

You don't need to go out and buy a Mac

It's a niche hobby

None of this argues for retiring the gaming PC, and none of it says that you should go out and purchase a Mac just to run local AI. Local AI is still, by and large, a rather niche hobby, but it's interesting to see how Apple Silicon's architecture accidentally ended up as a perfect alternative to the best consumer-grade Nvidia GPUs for local AI.

The RTX 5090 is still a great card, as are many of the other cards below it and in different generations. However, for the specific job of running large local LLMs, the architecture Apple landed on almost by accident as a power-efficiency play for laptops turned out to be the right shape for a workload nobody was thinking about when it was designed. Unified memory at this scale is something Nvidia has no consumer answer for yet. Nvidia's DGX Spark, GB10-based systems like the ThinkStation PGX, and AMD's Strix Halo are early entries in the high-capacity unified-memory space, but they top out far below Apple's 512GB ceiling and offer less memory bandwidth than the M3 Ultra.

For most of what I bought the 5090 for, it's still the obvious choice. My workloads don't just involve local LLMs, and for the kind of machine learning and deep learning projects that I run, CUDA is still incredibly valuable. But for local LLMs specifically? The gap still feels much wider than I expected. Apple Silicon does this better than my high-end gaming PC, and I honestly can't believe it.

URL: https://www.xda-developers.com/rtx-5090-cant-keep-up-apple-silicon-biggest-local-llms/

My RTX 5090 can't keep up with Apple Silicon on the biggest local LLMs, and I hate to admit it
