URL: https://www.hardware-corner.net/guides/rtx-pro-6000-gpt-oss-120b-performance/

Testing GPT-OSS 120B on RTX Pro 6000 Blackwell: What 96GB of VRAM Gets You

Author: Allan Witt

OpenAI recently released gpt-oss 120b, a powerful 120-billion parameter open-weight MoE model with 5 billion active parameters. For local LLM enthusiasts, this model represents a new opportunity, but also a significant hardware challenge. Even the quantized versions of the model are very large: the MXFP4 GGUF file I’m testing is a hefty 65GB, which means it won’t fit on any single consumer-grade GPU.

This is where workstation cards like the NVIDIA RTX Pro 6000 Blackwell edition come in. With a massive 96GB of VRAM, it’s the only prosumer GPU that can load this model and its full context. This article details my hands-on testing of the gpt-oss 120b model on this card, focusing purely on inference speed and hardware performance.

The Test System and Software

My setup consists of an Intel Core Ultra 7 265K CPU, 192GB of DDR5 system memory, and the RTX Pro 6000 Blackwell GPU.

On the software side, getting everything to work required some updates. I am running Ubuntu 24.04 LTS. To ensure full compatibility with the new hardware, I had to upgrade to CUDA 12.8 and PyTorch 2.8, along with the latest NVIDIA drivers (version 575.57.08). For inference, I used the llama.cpp server (version 6112) with Open WebUI as the frontend. The model file is Unsloth’s gpt-oss-120b-F16.gguf.
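The exact launch command is not given here, so the following is only a sketch of how a llama-server invocation matching this setup (all layers on the GPU, 131,072-token context) might be assembled. The model path, host, port, and the value 99 for `-ngl` are assumptions for illustration, and flag behavior can differ between llama.cpp builds, so check `llama-server --help` before relying on it.

```python
import shlex


def build_llama_server_command(model_path, ctx_size=131072, gpu_layers=99,
                               flash_attention=False,
                               host="127.0.0.1", port=8080):
    """Assemble a llama-server command line as a list of arguments.

    All defaults are illustrative assumptions, not the author's settings.
    """
    cmd = [
        "llama-server",
        "-m", model_path,          # path to the GGUF file
        "-c", str(ctx_size),       # context size in tokens
        "-ngl", str(gpu_layers),   # layers to offload; a large value offloads all
        "--host", host,
        "--port", str(port),
    ]
    if flash_attention:
        # Assumption: in builds from around the time of this test, -fa is a
        # bare switch. Newer builds may expect a value such as "on".
        cmd.append("-fa")
    return cmd


command = build_llama_server_command("gpt-oss-120b-F16.gguf",
                                     flash_attention=True)
print(shlex.join(command))
```

Building the command as a list keeps each flag and value separate, which makes it easy to toggle Flash Attention between benchmark runs.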

One factor behind the performance numbers is that the latest version of llama.cpp now fully supports “attention sinks.” This optimization can increase prompt processing speed by up to three times for the gpt-oss model. At the time of writing, this feature is only supported for gpt-oss.

Initial Model Loading and VRAM Usage

The first test is simply loading the model. The gpt-oss-120b-F16.gguf file loaded in about 10 seconds. With the full 131,072 token context allocated and Flash Attention turned off, the model consumed 83.17 GB of VRAM and an additional 90 GB of system RAM.

GPU VRAM usage: GPT-OSS-120B loaded with 131k token context consumes 83.17 GB of VRAM on the RTX Pro 6000.

This initial memory footprint immediately shows why a 96GB card is necessary. Even the RTX 5090, with its 32GB of VRAM, would need a three-card setup to handle this workload.
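The card count above follows from simple division. As a minimal sketch, the helper below (a hypothetical name, not part of any tool used in the test) rounds the required VRAM up to whole cards. It deliberately ignores per-card overhead and any inefficiency from splitting a model across GPUs, so real multi-GPU setups may need more margin.

```python
import math


def cards_needed(required_gb: float, vram_per_card_gb: float) -> int:
    """Smallest number of identical cards whose combined VRAM covers the need."""
    return math.ceil(required_gb / vram_per_card_gb)


# 83.17 GB is the measured footprint at load with the full context allocated.
print(cards_needed(83.17, 32))  # RTX 5090 (32 GB each) -> 3
print(cards_needed(83.17, 96))  # RTX Pro 6000 (96 GB)  -> 1
```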

GPU VRAM usage: GPT-OSS-120B loaded with 131k token context and Flash Attention enabled.

Subsequent testing with Flash Attention enabled showed a clear performance increase. The most significant change is the VRAM usage, which dropped from 91 GB to 67 GB at maximum context, a 24 GB reduction that opens up far more headroom on the card.

Performance Benchmarks at Different Context Lengths

The key performance metrics for local LLMs are prompt processing speed (how fast the model ingests your input) and token generation speed (how fast it writes the response). I tested the model at various context lengths to see how performance scales. All layers of the model were offloaded to the GPU.
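Both metrics are plain throughput figures: tokens divided by elapsed seconds. The sketch below shows the arithmetic with a small, hypothetical `Timing` container; the sample values are made up for illustration and are not taken from the benchmark runs.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    """Token counts and wall-clock durations for one request (hypothetical)."""
    prompt_tokens: int
    prompt_seconds: float
    generated_tokens: int
    generation_seconds: float

    @property
    def prompt_tps(self) -> float:
        # Prompt processing speed: how fast the input is ingested.
        return self.prompt_tokens / self.prompt_seconds

    @property
    def generation_tps(self) -> float:
        # Token generation speed: how fast the response is written.
        return self.generated_tokens / self.generation_seconds


# Illustrative numbers only.
sample = Timing(prompt_tokens=1000, prompt_seconds=0.5,
                generated_tokens=300, generation_seconds=3.0)
print(f"prompt: {sample.prompt_tps:.1f} t/s, "
      f"generation: {sample.generation_tps:.1f} t/s")
```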

Without Flash Attention

| Context (tokens) | Prompt Processing | Prompt Eval Time | Token Generation | VRAM Used |
|---|---|---|---|---|
| 12,000 | 2542.57 t/s | 4.89 s | 134.71 t/s | 84 GB |
| 24,000 | 2038.94 t/s | 11.83 s | 115.31 t/s | 85 GB |
| 48,000 | 1388.29 t/s | 34.65 s | 87.08 t/s | 86 GB |
| 80,000 | 953.45 t/s | 84.23 s | 65.53 t/s | 88 GB |
| 131,000 | 605.30 t/s | 216.26 s | 48.45 t/s | 91 GB |

With Flash Attention

| Context (tokens) | Prompt Processing | Prompt Eval Time | Token Generation | VRAM Used |
|---|---|---|---|---|
| 12,000 | 3526.76 t/s | 3.56 s | 148.17 t/s | 62 GB |
| 24,000 | 3418.14 t/s | 7.04 s | 136.51 t/s | 63 GB |
| 48,000 | 3097.07 t/s | 15.68 s | 118.59 t/s | 64 GB |
| 80,000 | 2735.22 t/s | 29.49 s | 101.83 t/s | 65 GB |
| 131,000 | 2259.66 t/s | 57.41 s | 83.03 t/s | 67 GB |

With Flash Attention, prompt processing speeds saw a major boost, increasing by roughly 3.7x at the largest context size (from about 605 t/s to about 2,260 t/s). Token generation is also substantially faster, jumping from 48 t/s to over 83 t/s at full context, making the model feel much more responsive and interactive.
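The speedup at each context length can be recomputed directly from the two benchmark tables. The sketch below copies the measured throughput figures and divides the Flash Attention results by the baseline; the dictionary layout and function name are just one convenient way to organize it.

```python
# context length -> (prompt processing t/s, token generation t/s)
WITHOUT_FA = {
    12_000: (2542.57, 134.71),
    24_000: (2038.94, 115.31),
    48_000: (1388.29, 87.08),
    80_000: (953.45, 65.53),
    131_000: (605.30, 48.45),
}
WITH_FA = {
    12_000: (3526.76, 148.17),
    24_000: (3418.14, 136.51),
    48_000: (3097.07, 118.59),
    80_000: (2735.22, 101.83),
    131_000: (2259.66, 83.03),
}


def speedups(baseline, improved):
    """Return {context: (prompt speedup, generation speedup)}."""
    return {
        ctx: (improved[ctx][0] / baseline[ctx][0],
              improved[ctx][1] / baseline[ctx][1])
        for ctx in baseline
    }


for ctx, (pp, tg) in speedups(WITHOUT_FA, WITH_FA).items():
    print(f"{ctx:>7,} tokens: prompt x{pp:.2f}, generation x{tg:.2f}")
```

The gain grows with context length: prompt processing improves by about 1.4x at 12,000 tokens and about 3.7x at 131,000 tokens.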

Benchmark Analysis

The results show a clear and expected trend. Prompt processing speed is extremely high for smaller contexts but decreases as the context window fills. Even without Flash Attention, ingesting over 130,000 tokens at once still happens at over 600 tokens per second, which is very usable for processing large documents. With Flash Attention, the same prompt is processed at roughly 2,260 tokens per second.

Token generation speed is the metric most users feel directly. Without Flash Attention, the 134 t/s at a 12k context feels fast and fluid. This speed gradually decreases as the context window grows, hitting 48 t/s at the maximum context length (83 t/s with Flash Attention). This is still a very respectable speed for interactive use, largely thanks to the model’s Mixture of Experts (MoE) design, which activates only 5 billion parameters during inference and so stays efficient even with a massive context window.

VRAM consumption scales roughly linearly with the context length. Without Flash Attention, it starts at 84GB and climbs to 91GB at the maximum context, which leaves a 5GB buffer on the 96GB card, enough to avoid out-of-memory errors. With Flash Attention, usage peaks at 67GB, leaving about 29GB free.
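The scaling claim can be checked against the measurements with an ordinary least-squares line. The sketch below fits the five VRAM readings taken without Flash Attention; since those readings are rounded to whole gigabytes, the fitted slope and intercept are only approximate.

```python
CARD_VRAM_GB = 96

# (context tokens, VRAM used in GB), measured without Flash Attention
MEASUREMENTS = [
    (12_000, 84), (24_000, 85), (48_000, 86), (80_000, 88), (131_000, 91),
]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(MEASUREMENTS)
peak_gb = max(gb for _, gb in MEASUREMENTS)

print(f"growth: {slope * 1000:.3f} GB per 1,000 tokens of context")
print(f"extrapolated footprint at zero context: {intercept:.1f} GB")
print(f"headroom at peak: {CARD_VRAM_GB - peak_gb} GB")
```

On this data the fit gives roughly 0.058 GB per 1,000 tokens, with an extrapolated zero-context footprint of about 83.4 GB, close to the 83.17 GB measured right after loading.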

The Value Proposition: Single Card vs. Multi-GPU

Consider the alternative for running this 120B model: three RTX 5090s. At a projected price of $2,500 each, the total cost comes to $7,500. This might seem cheaper than a single workstation card, but the sticker price doesn’t account for the added complexity. You would need a motherboard with adequate PCIe slot spacing, a powerful PSU (likely 1600W or more), and a case with enough airflow to manage three high-power GPUs. The power draw and heat output would be substantial. A single, dual-slot RTX Pro 6000 fits in a standard desktop case and runs on a single 16-pin power connector, making it a simple, plug-and-play solution.

The time saved on managing a complex multi-GPU setup, along with lower power consumption and potentially higher resale value, makes the single-card approach very attractive. It just works.

Conclusion

The RTX Pro 6000 is a beast of a card, especially for a model like gpt-oss 120b. It makes running a state-of-the-art, 100B+ parameter model on a local desktop a practical reality. The performance is solid, offering fast prompt processing and usable generation speeds even with a massive 131,000 token context.

While the price tag is significant, the value comes from its simplicity. For a technical enthusiast who wants to run the largest available models without the headaches of a multi-GPU build, the RTX Pro 6000 provides a straightforward path. You get the VRAM of three to four consumer cards in a single, efficient package. For those serious about pushing the limits of local AI, this card represents a major step forward in capability and convenience.

Allan Witt

Allan Witt is the co-founder and Editor-in-Chief of Hardware-Corner.net. Computers and the web have fascinated him since childhood. In 2011, he began training as an IT specialist at a mid-sized company while launching a tech blog on the side, quickly discovering a passion for writing about hardware and technology.

After completing his training, Allan worked as a system administrator for two years. Alongside that, he started building and upgrading custom gaming PCs at a local hardware shop. What began as a part-time project grew into a full-time career. Today, his work also focuses on building and optimizing PC systems for local AI and LLM workloads, combining hands-on experience with a passion for making complex tech easy to understand.
