We Tested Devstral 2 (24B & 123B) — Here’s the Hardware You Actually Need
By Allan Witt | Updated: January 16, 2026
Mistral AI has just released its new coding model, Devstral 2. We've been using its predecessor, Devstral Small, locally for code completion and have been very impressed with its performance. Early reports on Devstral 2 put it on par with other top models like Kimi K2 and DeepSeek V3.2, so we were eager to get our hands on it. This article is not a review of the model's coding quality. Instead, it is our hands-on test of the hardware needed to run it and the speed you can expect on consumer and prosumer gear.
Devstral 2 comes in two dense model sizes, a 24 billion parameter version and a massive 123 billion parameter version. Both models feature an impressive 256K context window, making them very interesting for complex coding tasks that require understanding a large codebase. Let’s dive into what it takes to run them.
Devstral 2 VRAM Requirements
As with any large language model, the first and most significant hurdle is VRAM. We measured the VRAM usage for 4-bit quantized versions of both models across various context lengths. The numbers give a clear picture of the hardware class needed for each model.
| Context Length | Devstral 2 24B VRAM (GB) | Devstral 2 123B VRAM (GB) |
|---|---|---|
| 4K | 15 GB | 72 GB |
| 8K | 16 GB | 74 GB |
| 16K | 17 GB | 76 GB |
| 32K | 20 GB | 82 GB |
| 57K | 24 GB | 90 GB |
| 65K | 25 GB | 93 GB |
| 131K | 35 GB | 123 GB |
| 262K | 55 GB | 170 GB |
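For context lengths that fall between the measured rows, a rough estimate can be read off the table by linear interpolation. The sketch below is only an illustration: the function name `estimate_vram_gb` is hypothetical, and it assumes memory grows roughly linearly between neighbouring measurements, which the article's figures suggest but do not guarantee.

```python
# VRAM measurements copied from the table above: (context in K tokens, GB).
VRAM_24B = [(4, 15), (8, 16), (16, 17), (32, 20),
            (57, 24), (65, 25), (131, 35), (262, 55)]
VRAM_123B = [(4, 72), (8, 74), (16, 76), (32, 82),
             (57, 90), (65, 93), (131, 123), (262, 170)]


def estimate_vram_gb(table, context_k):
    """Linearly interpolate VRAM (GB) for a context length given in K tokens.

    Assumption: usage grows linearly between two measured rows.
    """
    if not table[0][0] <= context_k <= table[-1][0]:
        raise ValueError("context length outside the measured range")
    for (c0, v0), (c1, v1) in zip(table, table[1:]):
        if c0 <= context_k <= c1:
            return v0 + (v1 - v0) * (context_k - c0) / (c1 - c0)


print(round(estimate_vram_gb(VRAM_24B, 100), 1))   # 30.3
print(round(estimate_vram_gb(VRAM_123B, 48), 1))   # 87.1
```

Treat the results as ballpark figures. Real usage also depends on the runtime, the exact quantization, and how much VRAM the desktop environment already occupies.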
Hardware for Devstral 2 24B
The 24B model is quite accessible for a local LLM enthusiast. The best-value GPU for this model is a card with 24GB of VRAM. This will allow you to load the model with a context of up to 57K tokens, which is perfectly usable and very generous for most coding tasks.
If you have a newer GPU like an RTX 5090 with 32GB of VRAM, you will be able to push the context up to around 120K. To reach the model's maximum 256K context, which requires 55GB of VRAM, you'll need a multi-GPU setup. A triple 24GB GPU setup or a dual RTX 5090 32GB rig would work well. For those with a bigger budget, a single professional card like the RTX Pro 6000 with 96GB is another excellent option.
A very viable alternative is a unified memory machine. A system with 64GB of unified memory, like the AMD Strix Halo laptops and mini PCs or an Apple Silicon machine like a MacBook Pro or Mac Studio with an M-series Max or Ultra processor, can handle the full context of the 24B model with ease.
Hardware for Devstral 2 123B
Running the 123B model puts you firmly in the realm of massive models, and the hardware requirements are serious. Just loading the model with a minimal 4K context requires 72GB of VRAM. This means a multi-GPU setup, a high-capacity unified memory machine, or a top-tier professional workstation card is mandatory.
Possible options to get the 96GB of VRAM needed for a usable 71K context include a quadruple 24GB setup or a triple 32GB setup. You could also use dual RTX 4090 48GB cards or a single card like the new RTX Pro 6000 Blackwell.
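The 96GB-to-71K relationship can be cross-checked against the 123B column of the VRAM table by interpolating in the other direction: given a memory budget, find the largest context that fits. This is a minimal sketch with a hypothetical helper `max_context_k`, and it assumes memory use is linear between measured rows.

```python
# 123B measurements from the VRAM table: (context in K tokens, GB).
VRAM_123B = [(4, 72), (8, 74), (16, 76), (32, 82),
             (57, 90), (65, 93), (131, 123), (262, 170)]


def max_context_k(table, vram_gb):
    """Largest context (K tokens) whose interpolated VRAM fits in vram_gb."""
    if vram_gb < table[0][1]:
        return 0.0  # even the smallest measured context does not fit
    if vram_gb >= table[-1][1]:
        return float(table[-1][0])  # the full measured context fits
    for (c0, v0), (c1, v1) in zip(table, table[1:]):
        if v0 <= vram_gb < v1:
            return c0 + (c1 - c0) * (vram_gb - v0) / (v1 - v0)


print(round(max_context_k(VRAM_123B, 96), 1))  # 71.6
print(max_context_k(VRAM_123B, 64))            # 0.0, below the 72 GB minimum
```

On multi-GPU and unified memory systems the usable budget is usually a little below the nominal total, so it is safer to pass in a value with some headroom removed.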
Unified memory machines are again a strong option here, but you will need to step up to configurations with at least 96GB of memory. AMD's Ryzen AI Max+ 395 (Strix Halo) platforms from brands like Xrival, GMKTEC, and BOSGAME offer 96GB and 128GB configurations. These are priced between $1500 for 96GB and up to $2600 for 128GB. For Apple Silicon, you would need a MacBook Pro with an M2 or M3 Max chip for 96GB, or an M3 or M4 Max for 128GB.
Speed Benchmarks
We tested Devstral 2 on a system running Ubuntu 24.04, with CUDA 12.8 and the latest llama.cpp build. All tests used a 4-bit Q4_K quantization. We tested on an RTX 3090 and an RTX Pro 6000 workstation card, and both delivered speeds that are comfortable for coding work. We plan to add benchmarks for Strix Halo and Apple Silicon as soon as we can.
Devstral 2 24B Performance
The RTX 3090 continues to prove its status as the value king for models in the 24B to 32B parameter range. Our benchmarks show that it takes about one minute and 15 seconds to process the maximum possible context of 57K tokens. The subsequent token generation speed is 33 tokens per second, a rate that is perfectly fine for coding assistance. We also tested the older Devstral Small model and found the speeds to be exactly the same.
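The "one minute and 15 seconds" figure follows directly from the benchmark numbers: prompt time is prompt length divided by prompt processing speed. The sketch below reproduces that arithmetic. It assumes "57K" means 57,000 tokens and that the reported rate is the average over the whole prompt; the function names and the 500-token reply are illustrative only.

```python
def prompt_seconds(prompt_tokens, pp_tokens_per_s):
    """Time to ingest the prompt at a given prompt processing rate."""
    return prompt_tokens / pp_tokens_per_s


def response_seconds(prompt_tokens, pp_tokens_per_s,
                     output_tokens, tg_tokens_per_s):
    """Prompt ingestion time plus time to generate the reply."""
    return (prompt_seconds(prompt_tokens, pp_tokens_per_s)
            + output_tokens / tg_tokens_per_s)


# RTX 3090 at the 57K maximum: 756 t/s prompt processing, 33 t/s generation.
print(round(prompt_seconds(57_000, 756)))               # 75 seconds
print(round(response_seconds(57_000, 756, 500, 33)))    # 91 seconds with a 500-token reply
```

In practice, a runtime that caches the prompt only pays the full ingestion cost once, so follow-up requests over the same context should start much faster.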
When we compare Devstral 2 24B to another strong coding model like Qwen 2.5 Coder 32B on the same RTX 3090, Devstral 2 is smaller and faster. At a 16K context, Devstral 2 processes the prompt at 1297 tokens per second, while Qwen 2.5 manages 826 tokens per second. For token generation, Devstral 2 achieves 45 tokens per second compared to Qwen 2.5's 26 tokens per second.
The RTX Pro 6000, as expected, runs the 24B model very fast. At a 65K context, the prompt loads in around 50 seconds, and it generates new tokens (at the same large context) at 55 tokens per second. This GPU can run the model at its full 256K context. At that maximum length, prompt processing speed is 427 tokens per second, and generation speed is a very usable 25 tokens per second.
Devstral 2 123B Performance
With the 123B model on the RTX Pro 6000, the performance is still impressive for a model of this scale. Prompt processing remains fast, staying above 500 tokens per second even at a 32K context. Token generation speed hovers around 16 tokens per second. While this might feel slow for interactive chat, it is quite effective for tasks like code completion and generation where a slight delay is acceptable.
On unified memory systems, however, the dense nature of this model leads to very low token generation speeds. Early reports indicate an unusable 2-3 tokens per second. MoE models such as GPT-OSS 120B, by contrast, can reach 50 tokens per second on a Strix Halo machine.
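The gap between dense and MoE models on unified memory can be explained with a back-of-the-envelope bandwidth calculation: generating one token requires reading every active weight once, so memory bandwidth divided by the size of the active weights gives a ceiling on tokens per second. The sketch below uses figures that are assumptions rather than measurements from this article: roughly 256 GB/s of theoretical bandwidth for Strix Halo, about 4.5 bits per weight for a Q4_K-style quantization, and about 5.1B active parameters at about 4.25 bits per weight for GPT-OSS 120B.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size in GB of the weights read for each generated token."""
    return params_billion * bits_per_weight / 8


def max_tokens_per_s(bandwidth_gb_s, active_weights_gb):
    """Bandwidth-bound ceiling on generation speed.

    Ignores KV cache reads, compute time, and runtime overhead,
    so real speeds will be lower.
    """
    return bandwidth_gb_s / active_weights_gb


STRIX_HALO_GB_S = 256  # assumed theoretical memory bandwidth

# Dense model: all 123B parameters are read for every token.
dense = max_tokens_per_s(STRIX_HALO_GB_S, weights_gb(123, 4.5))

# MoE model: only the active parameters are read (assumed ~5.1B).
moe = max_tokens_per_s(STRIX_HALO_GB_S, weights_gb(5.1, 4.25))

print(f"dense 123B ceiling: {dense:.1f} t/s")
print(f"MoE ceiling:        {moe:.1f} t/s")
```

Under these assumptions the dense ceiling lands below 4 tokens per second, in line with the reported 2-3, while the MoE ceiling sits well above the reported 50. The same reasoning explains why the RTX Pro 6000, with far higher memory bandwidth, handles the dense 123B model much better.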
RTX 3090 Devstral 2 24B Benchmarks
| Context Length | Prompt Processing (t/s) | Token Generation (t/s) |
|---|---|---|
| 4K | 1652 | 50 |
| 8K | 1499 | 48 |
| 16K | 1297 | 45 |
| 32K | 1008 | 39 |
| 45K | 862 | 36 |
| 57K (Max) | 756 | 33 |
RTX Pro 6000 Devstral 2 Benchmarks
| Context Length | 24B PP (t/s) | 24B TG (t/s) | 123B PP (t/s) | 123B TG (t/s) |
|---|---|---|---|---|
| 4K | 4973 | 90 | 949 | 19 |
| 8K | 4503 | 85 | 844 | 18 |
| 16K | 3810 | 79 | 729 | 18 |
| 32K | 2707 | 69 | 547 | 16 |
| 45K | 2070 | 64 | 442 | 15 |
| 57K | 1560 | 59 | 358 | 14 |
| 65K | 1405 | 55 | 305 | 13 |
| 86K | 1130 | 49 | N/A | N/A |
| 131K | 795 | 40 | N/A | N/A |
| 262K | 427 | 25 | N/A | N/A |
*PP – Prompt processing; TG – Token generation
Conclusion
Devstral 2 is a capable coding model. Our tests show that the 24B version is highly accessible for the local LLM community. It provides excellent performance and a large, usable context on a single 24GB GPU like the RTX 3090, representing fantastic value. The 123B version is a serious tool for enthusiasts with high-end multi-GPU rigs, powerful unified memory systems, or professional workstation cards. It requires a significant hardware investment but delivers the capabilities of a truly massive model right on your desktop.
