First Nvidia DGX Spark LLM Benchmarks Are In: Does It Beat Strix Halo?
The long-awaited Nvidia DGX Spark is finally here, and the first benchmarks for local LLM inference have landed. Georgi Gerganov of ggml-org has put the machine through its paces with the latest llama.cpp, giving us the raw data we need. For the home enthusiast focused on price-to-performance, the central question is simple: is this the new hardware to aim for, or do existing DIY solutions and competitors like AMD’s Strix Halo still hold the value crown? Let’s dig into the numbers and see how it stacks up.
More Than Just Hardware: The AI Lab Experience
Before we jump into tokens per second, it is important to understand what the DGX Spark is. It is not just a piece of hardware; it is a pre-configured development environment. This is a machine designed for you to unbox, power on, and immediately start experimenting with advanced AI workflows.
The operating system comes with a suite of useful tools pre-installed and ready to go. You get a RAG (Retrieval-Augmented Generation) and multi-agent software setup right out of the box, complete with tool calling and search capabilities. A number of Docker containers are also pre-loaded to power these features, including models like DeepSeek Coder and Qwen2.5-VL, along with a Postgres database used by the provided Nvidia chat interface. This “AI lab” approach means you spend less time on setup and more time learning and building.
Blackwell’s Quantization Advantage: The Role of NVFP4
A core advantage of the Blackwell architecture, and central to the DGX Spark’s performance, is the introduction of NVFP4. This proprietary 4-bit data format is engineered to address the primary performance ceiling of the hardware: its memory bandwidth. The DGX Spark relies on integrated LPDDR5X memory, which tops out at approximately 275 GB/s. For large models, this can be a significant choke point compared to high-end discrete GPUs.
NVFP4 tackles this challenge by reducing the numerical precision of the model’s weights. Giving up some precision buys a much smaller memory footprint and, critically, a substantial acceleration in inference speed, because less data needs to be moved from memory to the processing cores. This specialized quantization format is not just a theoretical benefit; it is already supported by major inference frameworks, including Nvidia’s own TensorRT-LLM and the widely used vLLM library, making it a practical advantage. Understanding this trade-off is key to seeing how the DGX Spark can efficiently run large models that would otherwise be crippled by its memory system.
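To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch. It assumes token generation is purely memory-bound, so every active weight is read once per generated token, and it ignores KV-cache traffic and the scaling metadata that 4-bit formats carry. The 12B active-parameter figure is a hypothetical workload, and `decode_ceiling_tps` is just an illustrative helper name, not part of any library.

```python
def decode_ceiling_tps(active_params: float, bits_per_weight: float,
                       bandwidth_gbs: float) -> float:
    """Upper bound on tokens/s if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gbs * 1e9 / bytes_per_token


BANDWIDTH_GBS = 275.0   # approximate memory bandwidth quoted above
ACTIVE_PARAMS = 12e9    # hypothetical: 12B weights touched per generated token

for label, bits in [("16-bit", 16.0), ("8-bit", 8.0), ("4-bit", 4.0)]:
    ceiling = decode_ceiling_tps(ACTIVE_PARAMS, bits, BANDWIDTH_GBS)
    print(f"{label}: {ceiling:.1f} t/s ceiling")
```

Under these assumptions the ceiling scales inversely with bits per weight, so going from 16-bit to 4-bit weights raises it fourfold. Real-world numbers will land below the ceiling.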
The Raw Numbers: DGX Spark Performance
Here are the initial benchmark results from llama.cpp running on the DGX Spark. The tests measure prompt processing (pp), which is how quickly the model ingests the initial prompt, and token generation (tg), the speed at which it produces the response. In the test names, pp2048 means a 2,048-token prompt and tg32 means 32 generated tokens.
| Model | Size | Params | Test | Tokens/Second |
|---|---|---|---|---|
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | pp2048 | 3621.59 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | tg32 | 58.97 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | pp2048 | 1723.07 |
| gpt-oss 120B MXFP4 MoE | 59.02 GiB | 116.83 B | tg32 | 38.55 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | pp2048 | 2916.25 |
| qwen3moe 30B.A3B Q8_0 | 30.25 GiB | 30.53 B | tg32 | 47.08 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | pp2048 | 817.00 |
| glm4moe 106B.A12B Q4_K | 67.85 GiB | 110.47 B | tg32 | 18.45 |
*pp – prompt processing; tg – token generation
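Since pp and tg are both rates, they can be combined into a rough end-to-end response time. The sketch below does that for the four models in the table. It assumes the measured rates stay constant at other prompt and output lengths, which is optimistic, since both typically drop as context grows. The 8,192-token prompt and 512-token reply are a hypothetical workload, and `response_seconds` is an illustrative helper.

```python
# (pp2048 t/s, tg32 t/s) copied from the DGX Spark table above
SPARK_RATES = {
    "gpt-oss 20B": (3621.59, 58.97),
    "gpt-oss 120B": (1723.07, 38.55),
    "qwen3moe 30B.A3B": (2916.25, 47.08),
    "glm4moe 106B.A12B": (817.00, 18.45),
}


def response_seconds(prompt_tokens: int, output_tokens: int,
                     pp_rate: float, tg_rate: float) -> float:
    """Time to ingest the prompt plus time to generate the reply."""
    return prompt_tokens / pp_rate + output_tokens / tg_rate


PROMPT_TOKENS, OUTPUT_TOKENS = 8192, 512  # hypothetical workload

for model, (pp, tg) in SPARK_RATES.items():
    total = response_seconds(PROMPT_TOKENS, OUTPUT_TOKENS, pp, tg)
    print(f"{model}: {total:.1f} s")
```

For gpt-oss 120B this works out to roughly 18 seconds, about 13 of which are spent generating, which shows why the tg figure tends to dominate the experience for long replies.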
DGX Spark vs. The Competition
Raw numbers are useful, but context is everything. To understand the DGX Spark’s value, we must compare it to other popular and upcoming hardware for local inference. We will focus on the gpt-oss 120B model, as it provides a common point of comparison across different setups.
| Hardware | Prompt Processing (pp2048 t/s) | Token Generation (tg32 t/s) |
|---|---|---|
| Nvidia DGX Spark (MXFP4) | 1723.07 | 38.55 |
| AMD Strix Halo (MXFP4) | 339.87 | 34.13 |
| Apple M3 Ultra 256GB (MXFP4) | 863.73 | 70.79 |
| 3x RTX 3090 (MXFP4) | 1641.89 | 124.03 |
Analysis: What Do The Numbers Mean for Us?
Looking at the comparison table, a clear picture emerges. The DGX Spark demonstrates extremely strong prompt processing speed, thanks to the Blackwell architecture’s low-precision compute (note that these llama.cpp runs use MXFP4 models rather than NVFP4). It is even slightly faster than a formidable 3x RTX 3090 rig in this metric. This means it can ingest large contexts very quickly.
However, the token generation speed tells a different story. At around 38 tokens per second on the 120B model, it is only marginally faster than the AMD Strix Halo APU. More importantly, it is significantly slower than what a DIY multi-GPU setup can achieve. A well-configured 3x 3090 system, which can be built from used parts for a comparable or even lower price, delivers over three times the token generation speed. Even a single RTX 4090 paired with fast system RAM can achieve similar token generation speeds for large models, while being much faster for smaller models that fit entirely in its VRAM.
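The relative figures are easy to check against the comparison table. This small sketch divides each system’s gpt-oss 120B results by the DGX Spark’s; the dictionary simply restates the table, and `relative_to` is an illustrative helper.

```python
# gpt-oss 120B results from the comparison table: (pp2048 t/s, tg32 t/s)
RESULTS = {
    "Nvidia DGX Spark": (1723.07, 38.55),
    "AMD Strix Halo": (339.87, 34.13),
    "Apple M3 Ultra 256GB": (863.73, 70.79),
    "3x RTX 3090": (1641.89, 124.03),
}


def relative_to(baseline: str, results: dict) -> dict:
    """Express every system's pp and tg as a multiple of the baseline's."""
    base_pp, base_tg = results[baseline]
    return {name: (pp / base_pp, tg / base_tg)
            for name, (pp, tg) in results.items()}


ratios = relative_to("Nvidia DGX Spark", RESULTS)
for name, (pp_ratio, tg_ratio) in ratios.items():
    print(f"{name}: pp x{pp_ratio:.2f}, tg x{tg_ratio:.2f}")
```

By this measure the 3x RTX 3090 rig generates tokens about 3.2 times as fast as the Spark, while the Strix Halo reaches roughly 0.89 of the Spark’s generation speed but only about a fifth of its prompt processing speed.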
This brings us to the crucial matter of price. The DGX Spark is positioned as a premium product ($4,000+). When you can assemble a multi-GPU system with 72GB of VRAM that offers superior generation performance for potentially less money, the value proposition of the Spark for pure inference becomes questionable. On generation speed, it seems to compete more directly with hardware in the $1,500 to $2,000 range, like the Strix Halo systems, rather than high-end DIY builds.
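One way to put a number on that value gap is generation speed per $1,000 spent. The prices below are assumptions for illustration only: $4,000 is the Spark’s lower bound mentioned above, and $2,000 is the top of the range cited for Strix Halo systems. Actual street prices vary, so treat the output as a rough guide.

```python
def tg_per_kusd(tg_tokens_per_s: float, price_usd: float) -> float:
    """Token generation speed bought per $1,000 of hardware."""
    return tg_tokens_per_s / (price_usd / 1000)


# (gpt-oss 120B tg32 t/s from the comparison table, ASSUMED price in USD)
SYSTEMS = {
    "Nvidia DGX Spark": (38.55, 4000),  # assumed: lower bound of "$4,000+"
    "AMD Strix Halo": (34.13, 2000),    # assumed: top of the cited range
}

for name, (tg, price) in SYSTEMS.items():
    print(f"{name}: {tg_per_kusd(tg, price):.1f} t/s per $1,000")
```

With these assumed prices, the Strix Halo delivers about 17.1 t/s per $1,000 against about 9.6 for the Spark. The calculation deliberately ignores prompt processing, where the Spark is far ahead.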
Of course, there is the software ecosystem to consider. The DGX Spark runs on CUDA, which is mature and stable. AMD’s ROCm, while improving at a rapid pace and now reportedly beating Vulkan in some tests, is still seen as a platform in development. For many, the reliability of CUDA is a major selling point. But for the price-conscious enthusiast, AMD’s raw hardware-per-dollar is hard to ignore, especially as its software stack matures.
The Verdict for the Local LLM Enthusiast
So, is the Nvidia DGX Spark the right choice for a local LLM user? The answer depends entirely on your priorities.
If you are a developer or researcher who values a seamless, out-of-the-box experience for experimenting with modern AI workflows like RAG and multi-agent systems, the Spark is an excellent tool. It is an AI lab that saves you setup time and gives you a clear path for scaling your work to larger enterprise systems.
However, if your primary goal is maximizing raw token generation performance for the lowest possible cost, the DGX Spark is not the answer. A carefully planned DIY rig using multiple used GPUs like the RTX 3090 still reigns supreme in performance-per-dollar. Meanwhile, the AMD Strix Halo is emerging as an efficient alternative that offers decent performance in a compact form factor, making the landscape for local AI hardware more interesting than ever.