
URL: https://www.hardware-corner.net/dgx-station-gb300-local-llm/

This Desktop Machine Runs 1T Parameter LLMs Locally | Hardware Corner


This Desktop Machine Runs 1T Parameter LLMs Locally

By Allan Witt | Updated: March 19, 2026

The NVIDIA DGX Station built around the GB300 Grace Blackwell Ultra is not just another workstation with a big GPU. It is closer to a single-node inference server designed around one idea: remove the boundary between VRAM and system RAM while keeping GPU compute in control.

You get 252 GB of HBM3e at 7.1 TB/s and 496 GB of LPDDR5X at 396 GB/s, connected through NVLink-C2C at 900 GB/s. In practice this behaves like a unified memory system, but not in the same way as consumer platforms: the GPU can address system memory directly and remains the device doing the compute on data held there. That is the key difference.

For local LLM use cases, this changes how we think about model size limits.

Unified memory for LLM inference

On a typical desktop, when you overflow VRAM, you offload to system RAM and pay a heavy penalty. The CPU becomes involved, bandwidth drops, and tokens per second collapse.

Here, the GPU still drives execution even when accessing LPDDR5X. That means:

  • The HBM acts as a high speed working set.
  • The LPDDR5X acts as a large extension for weights, KV cache, or MoE experts.

This is especially relevant for Mixture of Experts models. You can keep frequently used experts in HBM and push rarely used ones into system memory without fully falling back to CPU execution.

In practice, this makes very large models usable in a way that is not possible on multi-GPU PCIe setups.
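A rough way to see why expert placement matters is a bandwidth-only estimate. During decoding, the active weights are read roughly once per token, so time per token is about bytes read divided by the bandwidth of wherever those bytes live. The sketch below uses the two memory bandwidths quoted above and assumes the LPDDR5X path is limited by the 396 GB/s memory rather than the 900 GB/s link. The function name, the `hbm_hit_fraction` parameter, and the 8.5 GB example (about 17B active parameters at 4 bits) are illustrative assumptions. It ignores compute, KV cache reads, and any overlap between the two paths, so treat the result as an upper bound, not a benchmark.

```python
HBM_BW_GBPS = 7100.0    # HBM3e bandwidth quoted in the article (7.1 TB/s)
LPDDR_BW_GBPS = 396.0   # LPDDR5X bandwidth quoted in the article


def decode_tokens_per_second(active_gb: float, hbm_hit_fraction: float) -> float:
    """Bandwidth-only upper bound on single-stream decode speed.

    active_gb: weight bytes read per token, in GB (assumption: read once).
    hbm_hit_fraction: share of those bytes served from HBM (0.0 to 1.0).
    """
    if active_gb <= 0:
        raise ValueError("active_gb must be positive")
    if not 0.0 <= hbm_hit_fraction <= 1.0:
        raise ValueError("hbm_hit_fraction must be between 0 and 1")
    hbm_seconds = active_gb * hbm_hit_fraction / HBM_BW_GBPS
    lpddr_seconds = active_gb * (1.0 - hbm_hit_fraction) / LPDDR_BW_GBPS
    return 1.0 / (hbm_seconds + lpddr_seconds)


if __name__ == "__main__":
    for fraction in (1.0, 0.9, 0.5, 0.0):
        rate = decode_tokens_per_second(8.5, fraction)
        print(f"HBM share {fraction:.0%}: up to {rate:.0f} tokens/s")
```

Under these assumptions, even a small share of reads landing in LPDDR5X dominates the per-token time, which is why keeping the hot experts in HBM matters so much.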

Why FP4 and NVFP4 matter

The DGX Station is clearly optimized around FP4. Its peak throughput is at FP4, and NVIDIA is positioning NVFP4 quantization as the next step after FP8.

For local inference, this matters more than raw FP16 numbers.

FP4 allows:

  • Lower memory footprint per parameter
  • Higher effective model size within fixed VRAM
  • Higher throughput due to reduced bandwidth pressure
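The first two points are simple arithmetic, sketched below. The helper names are hypothetical, and the calculation is idealized: it counts only `params × bits / 8` and ignores quantization scales, embeddings, and tensors kept at higher precision, so published checkpoint sizes are usually somewhat larger.

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Idealized weight size in GB: parameters times bits, divided by 8."""
    return params_billion * bits_per_param / 8.0


def max_params_billion(budget_gb: float, bits_per_param: float) -> float:
    """Largest parameter count (in billions) that fits a memory budget."""
    return budget_gb * 8.0 / bits_per_param


if __name__ == "__main__":
    hbm_gb = 252  # HBM3e capacity quoted in the article
    for name, bits in (("BF16", 16), ("FP8", 8), ("FP6", 6), ("FP4", 4)):
        limit = max_params_billion(hbm_gb, bits)
        print(f"{name}: up to ~{limit:.0f}B parameters in {hbm_gb} GB")
```

Halving the bits per parameter doubles the parameter count that fits in the same memory, before any headroom for KV cache is set aside.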

Compared to typical 4-bit integer quantization, NVFP4 is designed to preserve more accuracy in transformer workloads. The expectation is that future inference stacks will favor FP4 over INT4 for high-end deployments.

For enthusiasts, this means the “real ceiling” of the machine is not BF16, but FP4 or FP6.
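To make the format a little more concrete, here is a minimal sketch of block-scaled 4-bit floating point quantization in the style of NVFP4. It assumes the E2M1 value grid (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and one scale per block of 16 values. As a simplification, the scale stays an ordinary Python float instead of being rounded to FP8, and there is no second per-tensor scale. This is an illustration of the idea, not NVIDIA's implementation, and all function names are made up for the example.

```python
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # 4-bit float grid
BLOCK_SIZE = 16  # values sharing one scale


def quantize_block(block):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    scale = amax / E2M1_MAGNITUDES[-1]  # map the largest magnitude to 6.0
    codes = []
    for x in block:
        target = abs(x) / scale
        magnitude = min(E2M1_MAGNITUDES, key=lambda m: abs(target - m))
        codes.append(-magnitude if x < 0 else magnitude)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and grid codes."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize then dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


if __name__ == "__main__":
    weights = [0.02, -0.31, 0.07, 0.55, -0.12, 0.9, -0.44, 0.01]
    for original, restored in zip(weights, fake_quantize(weights)):
        print(f"{original:+.3f} -> {restored:+.3f}")
```

The non-uniform grid gives finer steps near zero, where most weights sit, and the per-block scale keeps one outlier from wasting the range of values far away from it.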


Available DGX Station OEM systems

NVIDIA does not sell a single reference box. Instead, partners ship their own versions built around the same GB300 platform.

Current systems include ASUS ExpertCenter Pro ET900N G3, MSI XpertStation WS300, Dell Pro Max with GB300, Gigabyte W775-V10-L01, Supermicro Super AI Station, and HP ZGX Fury AI Station.

All of them share the same core architecture, so differences are mostly in chassis, support, and configuration options.

What models fit in 252 GB HBM3e

This is the cleanest scenario. No offloading, full GPU speed.

Model               Params   Quantization   Memory
Qwen3.5-122B-A10B   122B     BF16           ~250 GB
Qwen3 235B          235B     6-bit          ~202 GB
Qwen3.5-397B-A17B   397B     4-bit          ~237 GB
MiniMax-M1          456B     up to 6-bit    ~194 GB
MiniMax-M2.5        230B     up to 8-bit    ~243 GB

This is already beyond what most multi-GPU consumer rigs can do cleanly. The key advantage is no tensor-parallel overhead and no PCIe bottleneck.
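The fit check behind a table like this is easy to script. The sketch below classifies a model by its weight size against the two capacity tiers described earlier (252 GB of HBM3e, and 252 GB + 496 GB = 748 GB combined). The `placement` function and its `headroom_gb` parameter are hypothetical names for illustration; how much headroom is needed for KV cache and activations depends on the workload.

```python
HBM_GB = 252                     # HBM3e capacity quoted in the article
LPDDR_GB = 496                   # LPDDR5X capacity quoted in the article
UNIFIED_GB = HBM_GB + LPDDR_GB   # 748 GB coherent pool


def placement(weights_gb: float, headroom_gb: float = 0.0) -> str:
    """Classify where a model lands given weight size plus reserved headroom."""
    needed = weights_gb + headroom_gb
    if needed <= HBM_GB:
        return "hbm"
    if needed <= UNIFIED_GB:
        return "unified"
    return "does not fit"


if __name__ == "__main__":
    # Weight sizes taken from the table above.
    models = {
        "Qwen3.5-122B-A10B BF16": 250,
        "Qwen3.5-397B-A17B 4-bit": 237,
        "MiniMax-M2.5 8-bit": 243,
    }
    for name, size_gb in models.items():
        tight = placement(size_gb)
        padded = placement(size_gb, headroom_gb=20)  # 20 GB is an arbitrary example
        print(f"{name}: {tight} with no headroom, {padded} with 20 GB reserved")
```

Several entries sit close enough to the 252 GB limit that even modest headroom would spill them out of HBM, so "fits" here means the weights alone.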

What models fit with unified memory (HBM + LPDDR5X)

Once you start using the full 748 GB coherent memory pool, the range expands significantly.

Model                   Params   Quantization   Memory
MiniMax-M1              456B     BF16           ~457 GB
Qwen3.5-397B-A17B       397B     8-bit          ~428 GB
Qwen3-Coder-480B-A35B   480B     8-bit          ~548 GB
DeepSeek-V3.1           671B     up to 8-bit    ~574 GB
GLM-5                   744B     up to 6-bit    ~645 GB
Kimi-K2.5               ~1T      native int4    ~595 GB

This is where the DGX Station becomes unique. You are not just fitting models, you are running models that normally require multi-node clusters.

There is one important caveat. These numbers are weights only. KV cache and long context will push memory usage higher. In practice, you will need to budget extra headroom depending on sequence length.
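For a sense of scale, the usual KV cache estimate for standard multi-head or grouped-query attention is 2 (keys and values) × layers × KV heads × head dimension × sequence length × batch × bytes per element. The sketch below implements that formula. The layer count, head count, and head dimension in the example are hypothetical placeholders, not the configuration of any model in the tables, and architectures that compress the cache (such as latent attention variants) will not follow this formula.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                batch: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB for standard MHA/GQA attention.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to a 16-bit cache.
    """
    total_bytes = 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to illustrate the scaling.
    shape = dict(layers=60, kv_heads=8, head_dim=128)
    for context in (8_192, 32_768, 131_072):
        size = kv_cache_gb(seq_len=context, **shape)
        print(f"{context:>7} tokens: ~{size:.1f} GB per sequence")
```

The cache grows linearly with both context length and the number of concurrent sequences, so long contexts at high concurrency can claim a meaningful share of the memory pool.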


Performance reality vs multi-GPU rigs

A common comparison is a 4x RTX PRO 6000 setup.

That setup gives more total VRAM on paper (4 × 96 GB = 384 GB against 252 GB of HBM3e) and can be cheaper. But it has two major limitations:

  • PCIe bandwidth limits scaling.
  • All-reduce overhead kills efficiency in tensor parallel workloads.
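A simple cost model shows where the all-reduce overhead comes from during decoding. The sketch below assumes a ring all-reduce (2 × (N − 1) steps, each GPU moving 2 × (N − 1) / N of the message) and two all-reduces per transformer layer, as in common tensor-parallel layouts. The 64 GB/s link speed (nominal PCIe 5.0 x16), the 10 µs per-step latency, and the model shape are placeholder assumptions, and real communication libraries may pick other algorithms.

```python
def ring_allreduce_seconds(message_bytes: float, gpus: int,
                           link_gb_per_s: float, step_latency_s: float) -> float:
    """Latency plus bandwidth cost of one ring all-reduce."""
    if gpus < 2:
        return 0.0  # nothing to synchronize on a single device
    steps = 2 * (gpus - 1)
    bytes_per_gpu = 2 * (gpus - 1) / gpus * message_bytes
    return steps * step_latency_s + bytes_per_gpu / (link_gb_per_s * 1e9)


def tp_comm_seconds_per_token(layers: int, hidden: int, gpus: int,
                              link_gb_per_s: float = 64.0,
                              step_latency_s: float = 10e-6,
                              bytes_per_elem: int = 2,
                              allreduces_per_layer: int = 2) -> float:
    """Communication time per decoded token under tensor parallelism."""
    message_bytes = hidden * bytes_per_elem  # one token's hidden state
    per_call = ring_allreduce_seconds(message_bytes, gpus,
                                      link_gb_per_s, step_latency_s)
    return layers * allreduces_per_layer * per_call


if __name__ == "__main__":
    # Hypothetical 60-layer model with hidden size 8192 on 4 GPUs.
    seconds = tp_comm_seconds_per_token(layers=60, hidden=8192, gpus=4)
    print(f"communication per token: {seconds * 1e3:.2f} ms")
    print(f"ceiling from communication alone: {1.0 / seconds:.0f} tokens/s")
```

With single-token messages the per-step latency term dominates the bandwidth term, so in this model the cost comes mostly from the number of synchronization points, not from the volume of data moved.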

The DGX Station trades expandability for extremely high internal bandwidth and unified memory. Instead of coordinating multiple GPUs, everything runs inside one coherent system.

For small models or multiple independent models, multi-GPU still makes sense. For single large models, especially MoE, the DGX design is more efficient.

Pricing and value

Pricing is not officially listed in most places, but reported figures place it roughly in the 85,000 to 125,000 USD range depending on configuration and vendor.

That puts it in an unusual position.

For the same money, you could build a multi-GPU system with several high-end cards and more flexibility. Or you could rent a large amount of cloud compute time.
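The cloud comparison comes down to a break-even calculation. The sketch below divides the purchase price by an hourly rental rate. The 10 USD per hour rate is a made-up placeholder, not a quoted price, and the calculation ignores power, hosting, depreciation, and resale value.

```python
def break_even_hours(purchase_usd: float, cloud_usd_per_hour: float) -> float:
    """Hours of cloud rental that cost as much as buying the machine."""
    if cloud_usd_per_hour <= 0:
        raise ValueError("cloud_usd_per_hour must be positive")
    return purchase_usd / cloud_usd_per_hour


if __name__ == "__main__":
    placeholder_rate = 10.0  # hypothetical USD per hour, for illustration only
    for price in (85_000, 125_000):  # price range quoted in the article
        hours = break_even_hours(price, placeholder_rate)
        print(f"{price} USD: {hours:.0f} hours (~{hours / 24 / 365:.1f} years of 24/7 use)")
```

Plugging in a real quote for comparable capacity and your expected utilization gives a quick first answer on whether ownership is even in the right range.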

The DGX Station only makes sense if you specifically need:

  • On-prem inference
  • Very large models in a single node
  • High concurrency and throughput
  • CUDA optimized workflows

For individual enthusiasts, it is hard to justify. For small labs or companies with privacy or latency constraints, it starts to make sense.

Who this is for

This is not a typical local LLM box.

It is aimed at teams that want to run frontier scale models locally without building a cluster. It also targets developers working inside the CUDA ecosystem who need a consistent environment for development and deployment.

For hobbyists, even advanced ones, a multi-GPU setup still offers better performance per dollar in most cases.

Final thoughts

The DGX Station changes one important constraint. It removes the strict boundary between VRAM and system memory while keeping GPU compute active.

That alone enables a class of models that were previously impractical on a single machine.

But it comes at a high cost, both in money and in power.

For most local LLM users, the interesting part is not buying one. It is understanding where the architecture is going. Unified memory with GPU-driven compute and ultra-low-precision formats like FP4 is clearly the direction forward.
