
URL: https://www.hardware-corner.net/hacker-unlocks-3-node-nvidia-dgx-spark-clustering-for-distributed-llm-inference/

Hacker Unlocks 3-Node NVIDIA DGX Spark Clustering for Distributed LLM Inference

Allan Witt Jan 12, 2026 at 11:12am PDT

A recent Reddit thread in r/LocalLLaMA has drawn attention from the local LLM community after a developer (u/k-Pomegranate1314) successfully clustered three NVIDIA DGX Spark systems, a configuration NVIDIA does not officially support today. The work required writing a custom NCCL network plugin from scratch, roughly 1500 lines of C, to bypass assumptions baked into NVIDIA’s default networking stack.

For local LLM enthusiasts, this is less about enterprise bragging rights and more about what it reveals: the DGX Spark platform can be pushed further than advertised, and networking, not raw compute, is the real constraint.

What Is DGX Spark

DGX Spark is NVIDIA’s compact ARM-based AI workstation built around unified memory rather than discrete GPU VRAM. It pairs an NVIDIA Grace CPU with Blackwell-class GPU compute and fast LPDDR5 memory, delivering roughly 200 to 275 GB/s of local memory bandwidth. Each unit includes dual ConnectX-7 Ethernet NICs rated at up to 200 GbE, although PCIe Gen5 x4 limits practical throughput per port.

For inference, Spark behaves more like a shared memory accelerator than a traditional multi-GPU box. There is no separate VRAM pool, which changes how RDMA and data movement work.

Why Three Nodes Was a Problem

NVIDIA officially supports clustering two DGX Sparks. Going beyond that exposed a limitation in NCCL’s default networking assumptions. NCCL expects all peers to be reachable through a single NIC on a shared subnet, which is how switched fabrics are normally deployed.

In this setup, the three Sparks were connected directly in a triangle mesh using DAC cables. Each node-to-node link lived on its own subnet. NCCL could not correctly select which NIC to use for which peer, and collective operations failed.

Rather than adding a switch, the developer wrote a custom NCCL plugin that understands the topology.

A Custom NCCL Mesh Plugin

The solution was a fully custom NCCL network backend with subnet-aware NIC selection. For each peer, the plugin chooses the correct ConnectX-7 port, sets up RDMA queue pairs directly using libibverbs, and manages memory registration and completion queues without relying on NCCL’s built-in IB plugin.
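The core of subnet-aware selection can be illustrated with a short Python sketch. This is not the plugin's code (which the thread describes as C); the port names and the 10.0.x.x/30 addresses below are hypothetical, chosen only to show the idea of matching a peer address to the local port that shares its subnet.

```python
import ipaddress

# Hypothetical addressing for one node in a three-node triangle mesh.
# Each local port sits on its own point-to-point subnet (assumption).
LOCAL_PORTS = {
    "port0": ipaddress.ip_interface("10.0.12.1/30"),  # link to peer A
    "port1": ipaddress.ip_interface("10.0.13.1/30"),  # link to peer B
}


def select_port(peer_ip, local_ports=LOCAL_PORTS):
    """Return the name of the local port whose subnet contains peer_ip."""
    peer = ipaddress.ip_address(peer_ip)
    for name, iface in local_ports.items():
        if peer in iface.network:
            return name
    # No directly attached subnet matches: the peer is not a mesh neighbor.
    raise LookupError(f"no local port shares a subnet with {peer_ip}")


if __name__ == "__main__":
    print(select_port("10.0.12.2"))
    print(select_port("10.0.13.2"))
```

A stack that assumes one NIC reaches every peer has no equivalent of this lookup, which is why the direct-attached layout needed custom logic.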

A custom TCP-based handshake was added to avoid deadlocks during connection setup. The data path itself uses RDMA over Converged Ethernet v2 (RoCE v2), not TCP sockets.
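RDMA connections generally need an out-of-band channel so that each side can learn the other's queue pair details before any RDMA traffic flows, and a TCP socket is a common choice for that. The article does not describe the plugin's actual message format, so the layout below (queue pair number, packet sequence number, 16-byte GID) is a hypothetical illustration of what such a handshake message might carry.

```python
import struct

# Hypothetical wire layout (assumption, not the plugin's real format):
# 32-bit QP number, 32-bit packet sequence number, 16-byte GID,
# all in network byte order with no padding.
QP_INFO = struct.Struct("!II16s")


def pack_qp_info(qpn, psn, gid):
    """Serialize local queue pair details for sending over a TCP socket."""
    if len(gid) != 16:
        raise ValueError("GID must be exactly 16 bytes")
    return QP_INFO.pack(qpn, psn, gid)


def unpack_qp_info(payload):
    """Parse the peer's queue pair details received over TCP."""
    qpn, psn, gid = QP_INFO.unpack(payload)
    return qpn, psn, gid


if __name__ == "__main__":
    message = pack_qp_info(0x1234, 0x00ABCD, bytes(range(16)))
    print(len(message), unpack_qp_info(message))
```

Once both sides have exchanged a message like this, the bulk data can move over RDMA while TCP stays limited to setup.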

This approach works around the lack of a shared subnet and allows all three nodes to communicate simultaneously over point-to-point links.

Measured Bandwidth Between Nodes

Early benchmarks shared in the thread focused on all-reduce bandwidth rather than tokens per second. With two nodes, measured throughput reached roughly 10 GB/s. With three nodes in the mesh, bandwidth dropped to around 7.4 to 7.6 GB/s depending on message size.

Given 100 GbE class links without PFC or ECN tuning, the three-node figure corresponds to about 60 percent of the 12.5 GB/s line rate, which is reasonable for RoCE in a non-switched environment.
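The line-rate comparison is simple arithmetic, sketched here under the assumption of a single 100 Gb/s link and ignoring protocol overhead. The measured values are the ones quoted from the thread; the function names are illustrative.

```python
def gbit_to_gbyte_per_s(gbit_per_s):
    """Convert a link speed in Gb/s to GB/s (8 bits per byte)."""
    return gbit_per_s / 8


def utilization(measured_gbyte_per_s, link_gbit_per_s=100):
    """Fraction of raw line rate reached by a measured throughput."""
    return measured_gbyte_per_s / gbit_to_gbyte_per_s(link_gbit_per_s)


if __name__ == "__main__":
    samples = [
        ("two nodes", 10.0),
        ("three nodes, low end", 7.4),
        ("three nodes, high end", 7.6),
    ]
    for label, measured in samples:
        print(f"{label}: {utilization(measured):.0%} of line rate")
```

On these assumptions the two-node result sits near 80 percent of line rate and the three-node results near 60 percent.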

Note that this bandwidth is measured between memory pools on different Sparks. Local memory access within a single Spark remains far higher, in the 200 to 275 GB/s range.

Unified Memory Changes the RDMA Story

One point of confusion in the discussion was GPUDirect RDMA. DGX Spark does not support the traditional nvidia-peermem path used on discrete GPUs. However, because Spark uses unified memory, RDMA writes land directly in memory that the GPU can already access.

In practice, this provides GPUDirect-like behavior without a separate kernel module. The measured throughput suggests there is no extra staging copy through system memory or PCIe.

For local LLM users, this is an important architectural distinction. Spark is closer to a NUMA-style shared memory accelerator than a GPU with a hard VRAM boundary.

Scaling and the Role of Switches

The custom plugin is currently targeted at a three-node setup, but there is no fundamental reason it could not scale further. Ironically, scaling would be easier with a switch, because NCCL’s default assumptions would apply.

Several commenters pointed out that relatively affordable MikroTik switches can handle 100 to 200 GbE RoCE for a few thousand dollars, far less than traditional enterprise InfiniBand gear. The direct mesh approach avoids a switch entirely but increases software complexity.

From a performance-per-dollar perspective, this is a tradeoff local builders will recognize immediately.
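One way to see why three nodes is a natural size for a switchless full mesh: every node needs a direct link to every other node, so the port count per node grows with the cluster. The sketch below assumes two usable ports per unit, as described earlier; the node names are placeholders.

```python
from itertools import combinations


def mesh_links(nodes):
    """All point-to-point links in a full mesh (one per unordered pair)."""
    return list(combinations(nodes, 2))


def ports_needed_per_node(node_count):
    """In a full mesh, each node links directly to every other node."""
    return node_count - 1


def max_full_mesh_nodes(ports_per_node):
    """Largest full mesh possible with a given number of ports per node."""
    return ports_per_node + 1


if __name__ == "__main__":
    nodes = ["spark-a", "spark-b", "spark-c"]  # placeholder names
    print(mesh_links(nodes))
    print(ports_needed_per_node(len(nodes)))
    print(max_full_mesh_nodes(2))
```

With two ports per node, a full mesh tops out at three nodes. Going further without a switch would mean a partial topology, which the article does not describe.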

Why This Matters for Local LLM Inference

This project does not suddenly make DGX Spark the best value inference box. Token generation benchmarks are still in progress, and Spark remains expensive compared to used multi-GPU x86 builds.

What it does show is that Spark’s networking and memory model are more flexible than NVIDIA’s supported configurations suggest. With enough low-level work, multiple units can be combined into a larger shared memory pool, exceeding 350 GB of usable RAM across three nodes.

For very large quantized models that are memory bound rather than compute bound, this kind of clustering is interesting. It also highlights how much performance is left on the table due to software assumptions rather than hardware limits.
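A rough way to judge whether a quantized model fits in a pool of that size is to estimate the weight footprint from parameter count and bits per weight. The sketch below uses the 350 GB figure from above; the 10 percent overhead for KV cache and activations and the two model sizes are illustrative assumptions, not measurements.

```python
def weight_footprint_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB: parameters times bytes per weight."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, pool_gb=350, overhead_fraction=0.10):
    """Check a model against the pool, with an assumed runtime overhead."""
    needed = weight_footprint_gb(params_billion, bits_per_weight)
    return needed * (1 + overhead_fraction) <= pool_gb


if __name__ == "__main__":
    # Illustrative parameter counts in billions, both at 4 bits per weight.
    for params in (405, 671):
        size = weight_footprint_gb(params, 4)
        print(f"{params}B at 4-bit: {size:.1f} GB, fits: {fits(params, 4)}")
```

Real quantization formats carry extra metadata and long contexts enlarge the KV cache, so this should be treated as a first-pass estimate only.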

Takeaway for Enthusiasts

For most local LLM users, a pile of used RTX 3090s still wins on tokens per second per dollar. But this experiment demonstrates that DGX Spark can be coerced into multi-node inference roles with respectable interconnect bandwidth, even without official support.

More broadly, it is a reminder that NCCL, RDMA, and topology awareness matter just as much as FLOPs and memory size. As ARM-based and unified memory systems become more common, we are likely to see more enterprise-style infrastructure work of this kind trickle down into enthusiast setups.
