I Tested gpt-oss 20B on an RTX 3090 – Here’s How Fast It Runs Locally
Update: This article has been updated to include new Linux performance benchmarks with Flash Attention (FA) (2025.10.26)
OpenAI has released gpt-oss-20b, a new open-weight language model under the Apache 2.0 license, alongside the larger gpt-oss-120b. The gpt-oss models are designed to deliver strong performance at a low cost and are optimized for efficient deployment on consumer hardware. The gpt-oss-20b model, in particular, is positioned for on-device use cases and local inference.
As soon as the GGUF files became available, I wanted to see how this new model performs on common enthusiast hardware. This article covers my initial speed tests of the gpt-oss-20b model on a single RTX 3090. The focus here is purely on inference performance (prompt processing and token generation speed), not on the quality of the model’s responses.
Test Setup and Model Details
The tests were run on two systems, both equipped with a 24GB NVIDIA RTX 3090. The first system is my daily-driver desktop running a standard Windows 11 installation. The second is a dedicated machine running Linux. For these tests, I used the gpt-oss-20b-MXFP4.gguf file, as it was the primary format available at the time of writing.
The gpt-oss-20b model is a Mixture-of-Experts (MoE) model, which means that while it has roughly 21 billion total parameters, only about 3.6 billion are active for each generated token. This architecture is key to its speed. The model supports a context window of up to 131,072 tokens, though OpenAI suggests a minimum of 16,384 for its reasoning tasks. In all tests, the full model was offloaded to the GPU’s VRAM.
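To get a feel for how much VRAM the context itself might need, here is a rough KV-cache estimate. It is only a sketch: the layer counts, sliding-window size, KV-head count, head dimension, and 16-bit cache precision used as defaults are assumptions that do not come from this article, so check them against the model’s published configuration before relying on the numbers. It also ignores compute buffers, which can be substantial.

```python
def kv_cache_bytes(context_tokens, n_full_layers=12, n_window_layers=12,
                   window=128, n_kv_heads=8, head_dim=64, bytes_per_value=2):
    """Rough KV-cache size in bytes for a model mixing full and
    sliding-window attention layers.

    Every default here is an ASSUMPTION for illustration, not a value
    taken from the article.
    """
    # Each layer stores a key and a value vector per token per KV head.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value
    # Full-attention layers keep every token in the context.
    full = n_full_layers * context_tokens * per_token_per_layer
    # Sliding-window layers only keep the most recent `window` tokens.
    windowed = n_window_layers * min(context_tokens, window) * per_token_per_layer
    return full + windowed


for ctx in (16_384, 32_768, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.2f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context, which is why the usable context length depends so heavily on how much VRAM is left after the weights are loaded.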
Performance on Windows 11 with an RTX 3090
On my Windows 11 machine using LM Studio, the model feels very responsive. With a 32k context, the model loaded in about 15 seconds. At this context size, the model and context data occupied approximately 18GB of VRAM. My system’s OS and background processes consumed around 3.9GB of VRAM, which leaves only about 20GB of the card’s 24GB for the model.
Screenshot of LM Studio running GPT-OSS on an RTX 3090, pushing 33 tokens/sec. Task: one-shot landing page creation — and honestly, the results weren’t bad at all.
This VRAM overhead is the main limiting factor on Windows. The largest context I could reliably use was 36,000 tokens before running into VRAM limits. With more aggressive VRAM management, it might be possible to fit a slightly larger context, but 36k was the practical ceiling for my setup.
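The VRAM budget described above is simple arithmetic, sketched here with the figures from this section. The function name is hypothetical, and treating the reported sizes as exact, same-unit gigabytes is an assumption.

```python
def vram_headroom_gb(total_gb, os_overhead_gb, model_and_context_gb):
    """VRAM left over after the OS and the loaded model take their share."""
    return total_gb - os_overhead_gb - model_and_context_gb


# Figures reported for the Windows 11 system at a 32k context.
headroom = vram_headroom_gb(total_gb=24.0, os_overhead_gb=3.9, model_and_context_gb=18.0)
print(f"Headroom at 32k context on Windows: {headroom:.1f} GB")
```

With only a couple of gigabytes to spare at 32k, it is not surprising that the practical ceiling was reached just a few thousand tokens later.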
Here are the performance metrics from text summarization and simple coding tasks.
| Context Size | Prompt Processing (t/s) | Token Generation (t/s) |
|---|---|---|
| 2,000 | 2012.66 | 35.89 |
| 4,000 | 1900.03 | 34.78 |
| 8,000 | 1549.58 | 32.14 |
| 10,000 | 1301.33 | 29.81 |
| 16,000 | 1230.87 | 28.37 |
| 20,000 | 1030.44 | 26.49 |
| 32,000 | 821.00 | 24.34 |
| 36,000 | 689.00 | 21.79 |
The results show a clear and expected trend. As the context length increases, both prompt processing and token generation speeds decrease. The drop in generation speed is noticeable, going from nearly 36 t/s at a small context to under 22 t/s at the VRAM limit.
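To put numbers on that trend, the relative slowdown between the smallest and largest context can be computed directly from the table. This is a small sketch using the Windows figures above; the helper name is hypothetical.

```python
# (context size, prompt processing t/s, token generation t/s) from the Windows table
windows_results = [
    (2_000, 2012.66, 35.89),
    (4_000, 1900.03, 34.78),
    (8_000, 1549.58, 32.14),
    (10_000, 1301.33, 29.81),
    (16_000, 1230.87, 28.37),
    (20_000, 1030.44, 26.49),
    (32_000, 821.00, 24.34),
    (36_000, 689.00, 21.79),
]


def percent_drop(first, last):
    """Relative decrease from `first` to `last`, as a percentage."""
    return (first - last) / first * 100


_, pp_first, tg_first = windows_results[0]
_, pp_last, tg_last = windows_results[-1]

pp_drop = percent_drop(pp_first, pp_last)
tg_drop = percent_drop(tg_first, tg_last)
print(f"Prompt processing drop, 2k -> 36k: {pp_drop:.1f}%")
print(f"Token generation drop, 2k -> 36k: {tg_drop:.1f}%")
```

Prompt processing loses roughly two thirds of its speed over this range, while generation loses a bit under 40%, so long prompts hurt time-to-first-token more than they hurt streaming speed.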
Performance on Linux with an RTX 3090
Switching over to my Linux machine, I used llama.cpp’s server with the Open WebUI front-end. The immediate and most significant advantage was the reduced VRAM overhead from the operating system. This allowed for much larger context sizes.
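For readers who want to collect similar numbers from their own llama.cpp server, here is a minimal sketch. It assumes the server exposes a `/completion` endpoint on `localhost:8080` and returns a `timings` object with `prompt_n`, `prompt_ms`, `predicted_n`, and `predicted_ms` fields; the URL, port, and helper names are assumptions, and this is not necessarily how the figures in this article were gathered.

```python
import json
import urllib.request


def tokens_per_second(n_tokens, elapsed_ms):
    """Convert a token count and a duration in milliseconds to tokens/second."""
    return n_tokens / (elapsed_ms / 1000.0) if elapsed_ms > 0 else 0.0


def summarize_timings(timings):
    """Turn a timings dict (assumed llama.cpp server shape) into t/s figures."""
    return {
        "prompt_tps": tokens_per_second(timings["prompt_n"], timings["prompt_ms"]),
        "generation_tps": tokens_per_second(timings["predicted_n"], timings["predicted_ms"]),
    }


def measure(prompt, url="http://localhost:8080/completion", n_predict=128):
    """Send one request to a running server and summarize its timings.

    The URL and the request fields are assumptions about the local setup.
    """
    payload = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return summarize_timings(body["timings"])


# Made-up timings, only to show the shape of the calculation.
sample = {"prompt_n": 4000, "prompt_ms": 2000.0, "predicted_n": 200, "predicted_ms": 4000.0}
print(summarize_timings(sample))
```

Calling `measure("...")` requires a running server. If prompt caching is active, repeated prompts may report far fewer prompt tokens than were sent, so vary the prompt between runs.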
On Linux, I was able to load and work with a context of 52,000 tokens, which consumed about 23GB of the RTX 3090’s 24GB of VRAM. This is a substantial increase of 16,000 tokens over what was possible on my Windows setup.
Here are the original performance benchmarks from the Linux system (without Flash Attention):
| Context Size | Prompt Processing (t/s) | Token Generation (t/s) |
|---|---|---|
| 2,000 | 2895.28 | 114.24 |
| 4,000 | 2843.91 | 109.26 |
| 8,000 | 2505.87 | 99.46 |
| 10,000 | 2426.21 | 97.72 |
| 16,000 | 2075.87 | 86.46 |
| 32,000 | 1542.08 | 69.84 |
| 50,000 | 1002.07 | 55.33 |
Performance on Linux was higher at every context size. The token generation speeds were particularly impressive, staying above 55 t/s even with a 50k context.
New Tests with Flash Attention (FA)
I then re-ran the benchmarks with Flash Attention (FA) enabled on the same RTX 3090 setup with gpt-oss 20B. The addition of FA allowed the model to handle much larger context windows – up to the maximum context of 131,072 tokens supported by gpt-oss 20B – while also achieving notably higher prompt processing and token generation speeds than the previous Linux tests.
| Context Size | Prompt Processing (t/s) | Token Generation (t/s) |
|---|---|---|
| 4,000 | 4400.30 | 147.53 |
| 8,000 | 3989.47 | 140.70 |
| 16,000 | 3243.60 | 128.51 |
| 32,000 | 2547.19 | 112.53 |
| 45,000 | 2136.34 | 103.24 |
| 57,000 | 1862.49 | 94.67 |
| 65,000 | 1720.56 | 89.64 |
| 86,000 | 1395.23 | 79.10 |
| 131,000 | 923.79 | 62.18 |
With Flash Attention, the system not only maintained strong throughput but also achieved significant efficiency improvements, particularly in prompt processing. Even at extreme context lengths beyond 65k tokens, performance remained usable, with generation still above 60 t/s at 131k – an impressive result for a 20B parameter model on a single RTX 3090.
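Throughput figures are easier to judge when converted into waiting time. The sketch below estimates how long a request would take if the prompt filled the whole context, using the Flash Attention table above. The 500-token reply length is a hypothetical value, and treating the measured rates as constant over a request is a simplifying assumption.

```python
# context size -> (prompt processing t/s, token generation t/s), from the FA table
fa_results = {
    4_000: (4400.30, 147.53),
    8_000: (3989.47, 140.70),
    16_000: (3243.60, 128.51),
    32_000: (2547.19, 112.53),
    45_000: (2136.34, 103.24),
    57_000: (1862.49, 94.67),
    65_000: (1720.56, 89.64),
    86_000: (1395.23, 79.10),
    131_000: (923.79, 62.18),
}


def estimate_latency_s(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Return (time to first token, generation time) in seconds."""
    time_to_first_token = prompt_tokens / pp_tps
    generation_time = output_tokens / tg_tps
    return time_to_first_token, generation_time


OUTPUT_TOKENS = 500  # hypothetical reply length

for ctx, (pp, tg) in fa_results.items():
    ttft, gen = estimate_latency_s(ctx, OUTPUT_TOKENS, pp, tg)
    print(f"{ctx:>7} tokens: ~{ttft:6.1f} s to first token, ~{gen:4.1f} s to generate")
```

At short contexts the wait before the first token is under a second, while a prompt that fills the full 131k window takes a couple of minutes to process before generation starts.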
Analysis: Linux with and without Flash Attention
Comparing the two sets of Linux results highlights the substantial gains brought by Flash Attention. While the non-FA benchmarks were already faster than Windows, the FA-enabled runs delivered further acceleration across both prompt and generation speeds. More importantly, the new configuration allowed for much deeper context loading, pushing all the way to 131k tokens, which was previously impossible on this hardware.
In practice, this means smoother handling of extremely long prompts, improved responsiveness, and overall higher utilization of GPU compute resources. For anyone experimenting with large models locally, enabling Flash Attention under Linux provides a tangible leap in both performance and context capability.
Analysis: Windows vs. Linux on the Same Hardware
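Comparing the Windows and Linux results shows a clear performance advantage on Linux. At every comparable context length, the Linux setup delivered faster speeds, and with Flash Attention enabled, the margin widened even further. Keep in mind that the two systems also used different front-ends (LM Studio on Windows, llama.cpp’s server on Linux), so the gap reflects the whole software stack rather than the operating system alone.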
For prompt processing at a 32k context, Linux without FA was already nearly twice as fast as Windows (1542 t/s vs. 821 t/s). With Flash Attention enabled, this jumped to over 2500 t/s, showing just how much more efficiently the GPU can be utilized under Linux when modern attention optimizations are active.
The most substantial difference remains in token generation speed. Without FA, the Linux system produced tokens at 69.84 t/s at 32k context, compared to 24.34 t/s on Windows. With FA, token generation climbed past 110 t/s at the same context size, further improving interactivity and responsiveness during real-time inference.
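The ratios behind these comparisons can be reproduced from the three tables. This sketch uses the 32k rows and takes the Windows run as the baseline; the labels and variable names are only illustrative.

```python
# 32k-context rows: (prompt processing t/s, token generation t/s)
at_32k = {
    "Windows": (821.00, 24.34),
    "Linux, no FA": (1542.08, 69.84),
    "Linux, FA": (2547.19, 112.53),
}

baseline_pp, baseline_tg = at_32k["Windows"]

# Speedup of each configuration relative to the Windows run.
speedups = {
    name: (pp / baseline_pp, tg / baseline_tg)
    for name, (pp, tg) in at_32k.items()
}

for name, (pp_x, tg_x) in speedups.items():
    print(f"{name:<13} prompt: {pp_x:.2f}x  generation: {tg_x:.2f}x")
```

Prompt processing comes out a little under 2x without FA and around 3x with it, while generation improves by almost 3x and more than 4.5x respectively.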
For any enthusiast looking to push their hardware to its limits, these results reaffirm that a dedicated Linux setup – especially with Flash Attention enabled – is the optimal environment for maximizing both performance and usable context length on large local models.
Conclusion
Based on these initial tests, the gpt-oss-20b model is a very fast and promising model for local inference, especially considering its parameter count. The MoE architecture delivers excellent prompt processing and token generation speeds on a consumer card like the RTX 3090.
This evaluation did not assess the model’s quality in reasoning, coding, or instruction following. However, if its capabilities are as strong as its performance, it could become a go-to model for many local LLM users. Its speed makes it a compelling option, and it will be interesting to see how its output quality compares to other models in the 20-35B parameter range, such as recent releases from the Qwen and Gemma families. For now, its performance alone makes it a model that every hardware enthusiast should try.
Allan Witt
Allan Witt is the co-founder and Editor-in-Chief of Hardware-Corner.net. Computers and the web have fascinated him since childhood. In 2011, he began training as an IT specialist at a mid-sized company while launching a tech blog on the side, quickly discovering a passion for writing about hardware and technology.
After completing his training, Allan worked as a system administrator for two years. Alongside that, he started building and upgrading custom gaming PCs at a local hardware shop. What began as a part-time project grew into a full-time career. Today, his work also focuses on building and optimizing PC systems for local AI and LLM workloads, combining hands-on experience with a passion for making complex tech easy to understand.