VOOZH about

URL: https://www.hardware-corner.net/3x-rtx-3090-gpt-oss-120b-test/

⇱ Can Three RTX 3090s Really Run GPT-OSS 120B with Max Context? I Put It to the Test | Hardware Corner


Can Three RTX 3090s Really Run GPT-OSS 120B with Max Context? I Put It to the Test

By Allan Witt | Updated: October 19, 2025

👁 three rtx 3090 gpus connected for inference on llm

After testing the gpt-oss-20B model on a single RTX 3090, I had to push things further and see what the new heavyweight could do. In addition to the 20B model, OpenAI also released gpt-oss-120B, a massive 120-billion parameter open-weight Mixture-of-Experts (MoE) model with 5.1 billion active parameters.

I first ran some experiments on an RTX Pro 6000 Blackwell, but I was too curious not to try a triple-3090 setup and the 128k max context. I don’t own three 3090s myself, so I rented a multi-GPU instance on Vast.ai to put it to the test. The run was done on Linux with FlashAttention enabled – without it, the model simply can’t load large context windows without choking on memory.

Even in quantized form, this thing is huge. The MXFP4 GGUF file I tested – unsloth-gpt-oss-120b-F16.gguf  – is 65 GB on disk. This is not “just throw it on your gaming rig” territory.

Test Setup

The tests were done under Ubuntu 24.04 LTS, with CUDA 12.6, PyTorch 2.7, and NVIDIA drivers version 575.57.08.Inference was served through llama.cpp (version 6112) with Open WebUI as the frontend. This particular build of llama.cpp has full support for attention sinks, a feature that can triple prompt processing speeds for gpt-oss models – and at the time of writing, only gpt-oss benefits from it.

Hardware Specs

  • GPUs: 3 × NVIDIA GeForce RTX 3090 (24 GB each)
  • System RAM: 67 GB
  • Total VRAM in use: 67.5 GB
  • Quantization: MXFP4 GGUF
  • FlashAttention: Enabled

Here’s what I used to run the model from the command line:

./llama-server \
 --model /home/allan/gpt-oss-120b-F16.gguf \
 --port 10000 \
 --ctx-size 94208 \
 --parallel 1 \
 --flash-attn \
 --n-gpu-layers 999

Benchmarks – 3× RTX 3090 with FlashAttention

Below are several test runs at different context sizes.

Context Prompt Processing Prompt Eval Time Token Generation VRAM Used
12,788 1158.55 t/s 11.04 s 73.26 t/s 67 GB
24,919 1160.22 t/s 21.48 s 65.42 t/s 67 GB
48,641 1026.98 t/s 47.36 s 54.31 t/s 67 GB
93,687 833.07 t/s 112.46 s 41.05 t/s 67 GB

Analysis

The prompt processing speeds here are seriously impressive for such a huge model – over 1,100 tokens/sec for smaller contexts and still 833 t/s at nearly 94k tokens. Token generation speed is naturally much lower, but for a 120B-parameter MoE running locally, 41–73 t/s is entirely usable for interactive work.

FlashAttention is the hero here. Without it, large-context loads simply fail on this hardware. The triple-GPU setup makes a huge difference in VRAM headroom – something you’d never get with a single 24 GB card.

Running on Linux also helps squeeze every last drop of memory out of the system. A minimal OS footprint, efficient driver stack, and llama.cpp’s attention sink optimization combine to make this configuration far more capable than a Windows equivalent would be.

Although three RTX 3090s can’t run the full 128k context, reaching 93k tokens is still impressive and unlocks plenty of practical use cases.

Conclusion

The gpt-oss-120B isn’t just big – it’s hardware-hungry. Even quantized, it demands enterprise-level VRAM capacity. But if you can get your hands on three RTX 3090s, you can actually run it locally at respectable speeds.

For local LLM enthusiasts, this setup shows that MoE models scale well with multi-GPU consumer hardware when paired with the right optimizations. It’s not a daily-driver configuration, but it’s absolutely viable for research, fine-tuning experiments, and high-context interactive work.

Read more: Run LLMs Locally