URL: https://www.hardware-corner.net/mac-rdma-over-thunderbolt-llm/

RDMA over Thunderbolt Lets Mac Studios Run Huge Local LLMs Faster Than Expected | Hardware Corner


Allan Witt • Dec 19, 2025 at 1:53am PDT

Apple quietly unlocked something important for local LLM users in macOS 26.2: RDMA over Thunderbolt. Combined with the public release of Exo 1.0, this turns multiple Mac Studios into a low-latency, memory-pooled system that behaves very differently from the multi-node setups local users are used to.

This is not about cloud AI or benchmarks for press slides. This is about what actually changes when you try to run very large models locally, especially models that do not fit in a single machine and normally slow down badly when split across nodes.

The first real tests come from hardware-focused YouTubers (Jeff Geerling and jakkuh), but the results line up well enough that they are worth treating as an early signal rather than isolated demos.

What RDMA over Thunderbolt Actually Does

RDMA stands for Remote Direct Memory Access. In simple terms, it lets one machine read and write memory on another machine without going through the OS networking stack. No TCP, no kernel buffering, and far fewer context switches.

On Macs with Thunderbolt 5, RDMA runs directly over the Thunderbolt fabric. Latency drops dramatically compared to Ethernet or even standard Thunderbolt networking. In practice, the machines can access each other's unified memory almost like it is local, at least from the point of view of the Exo runtime.

This matters for LLM inference because token generation is latency sensitive. If every token requires waiting on another node over TCP, performance collapses as you add more machines.
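A toy model makes the latency argument concrete. The sketch below assumes each token costs a fixed compute time plus some number of sequential network transfers. The function name and every number in it are hypothetical placeholders, not measurements from these tests.

```python
# Toy model: how sequential network transfers per token cap generation speed.
# All numbers here are hypothetical placeholders, not measured values.

def tokens_per_second(compute_ms: float, transfers: int, transfer_latency_ms: float) -> float:
    """Throughput when every token pays compute time plus `transfers` sequential hops."""
    per_token_ms = compute_ms + transfers * transfer_latency_ms
    return 1000.0 / per_token_ms


if __name__ == "__main__":
    compute_ms = 40.0  # assumed per-token compute time
    links = [("high-latency link", 1.0), ("low-latency link", 0.01)]  # assumed latencies in ms
    for label, latency_ms in links:
        for transfers in (0, 10, 100):
            rate = tokens_per_second(compute_ms, transfers, latency_ms)
            print(f"{label}, {transfers} transfers per token: {rate:.1f} tok/s")
```

The point of the sketch is only the shape of the curve: when per-transfer latency is large, throughput falls quickly as transfers per token grow, and when it is small, the compute time dominates.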

How RDMA Is Enabled on macOS

RDMA is disabled by default. Enabling it requires physical access and recovery mode.

Each Mac Studio must be booted into recovery, a terminal opened, and RDMA turned on with the RDMA control command. After a reboot, RDMA becomes available to software like Exo. This is a deliberate security choice by Apple and not something you can toggle remotely.

You also need Thunderbolt 5 capable Macs. In these tests that means M3 Ultra systems.

The Hardware Setup Being Tested

Both testers are running the same basic configuration: four Mac Studios in a Thunderbolt mesh.

  • Two systems with M3 Ultra, 80-core GPU, and 512 GB unified memory.
  • Two systems with M3 Ultra, 80-core GPU, and 256 GB unified memory.

That gives a total of 1.5 TB of unified memory (2 × 512 GB + 2 × 256 GB = 1,536 GB). For MLX models, that is enough to load an 8-bit quantized model close to 1 TB and still leave room for context, KV cache, and runtime overhead.
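The memory arithmetic can be checked with a short script. This is a rough sketch that uses the node sizes above and the model sizes quoted later in the article. The helper names are made up for illustration, and it counts weights only, ignoring KV cache, the OS, and any limit on how much unified memory the GPU can actually use.

```python
# Rough memory budget for the four-node cluster described in the article.
# Weights only: KV cache, OS overhead, and GPU memory limits are ignored.

NODE_MEMORY_GB = [512, 512, 256, 256]

MODEL_SIZES_GB = {
    "Qwen3 235B A22B, 8 bit": 242,
    "DeepSeek V3.1 671B, 4 bit": 378,
    "Kimi K2 Thinking 1T A32B, 4 bit": 658,
}


def headroom_gb(model_gb, nodes=NODE_MEMORY_GB):
    """Memory left across the whole cluster after loading the weights."""
    return sum(nodes) - model_gb


def min_nodes(model_gb, nodes=NODE_MEMORY_GB):
    """Fewest nodes (largest first) whose combined memory holds the weights, or None."""
    total = 0
    for count, mem in enumerate(sorted(nodes, reverse=True), start=1):
        total += mem
        if total >= model_gb:
            return count
    return None


if __name__ == "__main__":
    print(f"Cluster total: {sum(NODE_MEMORY_GB)} GB")
    for name, size in MODEL_SIZES_GB.items():
        print(f"{name}: needs at least {min_nodes(size)} node(s), "
              f"{headroom_gb(size)} GB left cluster-wide")
```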

All systems are connected directly to each other with Thunderbolt cables. There is no switch. Because of current macOS limitations, only four nodes can participate in RDMA this way.
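With no switch, every pair of machines needs its own cable. A minimal sketch of that full-mesh wiring follows; the node names are hypothetical.

```python
# Enumerate the direct links in a switchless full mesh.
# Node names are hypothetical labels, not the testers' hostnames.
from itertools import combinations


def full_mesh_links(nodes):
    """Every unordered pair of nodes gets one direct cable."""
    return list(combinations(nodes, 2))


def ports_per_node(nodes):
    """Each node connects to every other node once."""
    return len(nodes) - 1


if __name__ == "__main__":
    nodes = ["studio-1", "studio-2", "studio-3", "studio-4"]
    links = full_mesh_links(nodes)
    print(f"{len(links)} cables, {ports_per_node(nodes)} Thunderbolt ports used per node")
    for a, b in links:
        print(f"  {a} <-> {b}")
```

For four nodes this works out to six cables and three ports in use on each machine, and the cable count grows quadratically if more nodes were ever allowed.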

Why Exo Matters More Than llama.cpp Here

llama.cpp clustering works by splitting model layers across nodes and passing partial results token by token. This works, but it scales poorly as node count increases because each token becomes a relay race.

Exo takes a different approach. With RDMA enabled, Exo can treat the cluster as a shared memory pool and shard tensors rather than entire layers. That changes the scaling behavior completely.
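The difference between the two strategies can be shown in a few lines of plain Python. This is a conceptual sketch only, not how llama.cpp or Exo are implemented. The function names are invented for illustration, and the "nodes" are simulated in one process.

```python
# Conceptual contrast between splitting by layer and sharding a tensor.
# Not taken from llama.cpp or Exo; all names here are illustrative.

def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def layer_split_forward(layers, x):
    """Layer split: each node owns whole layers, so nodes must run one after another."""
    for weights in layers:  # imagine one layer per node; each waits on the previous
        x = matvec(weights, x)
    return x


def tensor_sharded_matvec(weights, x, num_nodes):
    """Tensor sharding: every node owns a slice of rows of the same layer."""
    chunk = -(-len(weights) // num_nodes)  # ceiling division
    shards = [weights[i:i + chunk] for i in range(0, len(weights), chunk)]
    partials = [matvec(shard, x) for shard in shards]  # independent, could run concurrently
    return [y for part in partials for y in part]  # gather the slices in order


if __name__ == "__main__":
    w = [[1, 2], [3, 4], [5, 6], [7, 8]]
    x = [1, 1]
    print(matvec(w, x))
    print(tensor_sharded_matvec(w, x, num_nodes=2))  # same result, work divided by rows
```

In the layer-split version only one node is busy at a time for a given token. In the sharded version all nodes work on the same layer at once, but they must exchange partial results for every layer, which is why link latency matters so much.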

The practical question local users care about is simple: does adding nodes actually make things faster, or does it just let bigger models fit?

Performance Results on Large Models

The following tests were run by Jeff Geerling and focus on throughput in tokens per second. Large-context prompt processing was not benchmarked, so these numbers only reflect steady-state generation.

Qwen3 235B A22B, 8 bit MLX, 242 GB

On a single node, RDMA does not help much, since there is no inter-node traffic to speed up. Performance is roughly the same as with TCP-based clustering. This is expected because the model already fits comfortably in one machine.

Once the model is split, the difference becomes clear. With two nodes, Exo with RDMA is noticeably faster than llama.cpp over TCP. With four nodes, the gap widens further, reaching roughly double the throughput of the TCP setup.

[Figure: benchmark graph of Qwen3 235B running on a Mac Studio cluster with RDMA over Thunderbolt]

This answers a common question directly. Yes, even when a model fits in one machine, splitting it across two with RDMA-based memory pooling can be faster than running it on one node alone.

DeepSeek V3.1 671B, 4 bit MLX, 378 GB

This model shows an even clearer benefit. Even on a single node, Exo with RDMA edges out TCP. As nodes are added, TCP performance stagnates or drops, while RDMA scales upward.

[Figure: benchmark graph of DeepSeek V3.1 running on a Mac Studio cluster with RDMA over Thunderbolt]

At four nodes, Exo reaches more than twice the throughput of llama.cpp. This is the first time many local users have seen a 600B class model scale positively beyond two nodes without exotic networking.

Kimi K2 Thinking 1T A32B, 4 bit MLX, 658 GB

This model cannot run on a single node at all: at 658 GB, it exceeds the 512 GB of the largest machine. Two nodes are the minimum.

With two nodes, RDMA gives a modest improvement. With four nodes, the improvement becomes substantial, pushing throughput into a range that feels usable rather than experimental.

[Figure: benchmark graph of Kimi K2 1T running on a Mac Studio cluster with RDMA over Thunderbolt]

Mixture-of-experts models still scale worse than dense models, but RDMA clearly reduces the penalty.

What This Means for Local LLM Enthusiasts

The most important takeaway is not raw tokens per second. It is that multi-node Mac setups no longer behave like a last resort for models that barely fit.

With RDMA and Exo, splitting a model across two or four machines can actually improve throughput instead of hurting it. That changes how you think about buying hardware.

Instead of chasing one massive box, a small cluster of high memory Macs can now act like a single large memory system with reasonable scaling.

Performance per Dollar Reality Check

None of this is cheap. The tested setup is close to forty thousand dollars at retail pricing. That puts it far outside the reach of most hobbyists.

However, the comparison point matters. Running 600B to 1T parameter models locally at usable speeds usually implies enterprise GPUs, InfiniBand, and power budgets far beyond a home lab.

In that context, four quiet machines pulling well under a kilowatt total start to look less absurd, especially for researchers or studios who already value macOS workflows.

Stability and Software Maturity

This is early technology. RDMA over Thunderbolt is new, Exo is young, and MLX-based tooling is still evolving. Crashes and odd limitations were reported, including strict naming requirements and limited model formats.

Large-context performance remains an open question, and that matters a lot for real-world use.

Still, the direction is clear. Apple has enabled something that makes shared memory clustering practical on consumer hardware, and Exo is the first tool to really exploit it.

Bottom Line

RDMA over Thunderbolt does not magically turn Mac Studios into data center nodes. But it does remove the biggest penalty local users faced when splitting large models across machines.

For the first time, adding nodes can mean faster inference, not just bigger models. For local LLM enthusiasts pushing into the 200B to 1T parameter range, that is a meaningful shift.
