Z.ai released GLM 4.7 Flash only a few days ago, but meaningful local testing had to wait. The initial llama.cpp support was incomplete, and without proper fixes it was not possible to measure real performance. Those fixes have now landed, and with the latest llama.cpp build we were finally able to test the model properly on consumer hardware.

This article is not a review of GLM 4.7 Flash’s reasoning or coding quality. We are strictly looking at hardware behavior: VRAM usage, context scaling, prompt processing speed, and token generation speed. The goal is simple: understand what hardware you actually need to run this model locally without guesswork.

What GLM 4.7 Flash Is

GLM 4.7 Flash is a 30B-class Mixture of Experts model designed for local deployment. Despite the headline parameter count, only about 3.6B parameters are active for any given token at inference time. All of the weights still have to sit in memory, but far fewer are used per token, which is why the model behaves very differently from dense 30B models in both VRAM usage and speed.
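As a rough illustration of that split, the sketch below separates the two quantities: the memory footprint follows the total parameter count, while per-token work follows the active count. The 30B and 3.6B figures come from the paragraph above. The 4.5 bits per weight is an assumed, illustrative average for a 4-bit quant, not a measured value for this GGUF file.

```python
TOTAL_PARAMS = 30e9     # headline parameter count ("30B class")
ACTIVE_PARAMS = 3.6e9   # parameters active per token, as stated in the article
BITS_PER_WEIGHT = 4.5   # ASSUMPTION: illustrative average for a 4-bit quant


def weight_gb(params, bits_per_weight=BITS_PER_WEIGHT):
    """Approximate storage for `params` weights in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
resident_gb = weight_gb(TOTAL_PARAMS)   # must be held in memory
touched_gb = weight_gb(ACTIVE_PARAMS)   # roughly what one token uses

print(f"active fraction:          {active_fraction:.0%}")
print(f"weights held in memory:   ~{resident_gb:.1f} GB")
print(f"weights used per token:   ~{touched_gb:.1f} GB")
```

Under these assumptions the model needs the memory of a 30B model but does the per-token work of a much smaller one. The numbers are back-of-the-envelope only and ignore the KV cache and runtime buffers.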

The model supports up to a 200K context window and is positioned for coding, agent workflows, and long-context reasoning. Our tests use the 4-bit Unsloth AI quantized GGUF, specifically GLM-4.7-Flash-UD-Q4_K_XL.gguf.

All benchmarks were run on Ubuntu 24.04 with CUDA 12.8, NVIDIA driver 570, and the latest llama.cpp.

VRAM Requirements by Context Length

VRAM is the primary constraint for this model, and usage grows slowly and predictably with context length. At 4K context, GLM 4.7 Flash uses about 17 GB of VRAM. At 8K it rises to roughly 18 GB, and at 16K it reaches 19 GB. By 32K context, usage is around 20 GB.

Context Length	VRAM Requirement (GB)
4k	17
8k	18
16k	19
32k	20
45k	21
57k	23
65k	23
86k	25
131k	30

The important breakpoint is 65K context. At that point the model uses about 23 GB of VRAM. This is a critical number because it means a single 24 GB GPU can handle very large contexts without any tricks. At 86K context the model needs about 25 GB, which is just past what a 24 GB card can hold. The full 131K context requires around 30 GB of VRAM, which pushes it into 32 GB GPU territory.

For most real workloads 65K context is more than enough. From a practical standpoint, a single 24 GB card already covers almost every reasonable use case.
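For context sizes that fall between the measured points, a simple linear interpolation over the table gives a rough estimate. This is only a sketch: the function name is made up for illustration, the table values are rounded to whole gigabytes, and real usage will also depend on llama.cpp settings that are not modelled here.

```python
# (context in thousands of tokens, VRAM in GB), copied from the table above
VRAM_TABLE = [(4, 17), (8, 18), (16, 19), (32, 20), (45, 21),
              (57, 23), (65, 23), (86, 25), (131, 30)]


def estimate_vram_gb(context_k):
    """Linearly interpolate VRAM use between the measured context sizes."""
    lo_ctx, hi_ctx = VRAM_TABLE[0][0], VRAM_TABLE[-1][0]
    if not lo_ctx <= context_k <= hi_ctx:
        raise ValueError(f"context must be between {lo_ctx}k and {hi_ctx}k")
    for (c0, v0), (c1, v1) in zip(VRAM_TABLE, VRAM_TABLE[1:]):
        if c0 <= context_k <= c1:
            return v0 + (v1 - v0) * (context_k - c0) / (c1 - c0)


# Average cost of extra context across the whole measured range
gb_per_1k = (VRAM_TABLE[-1][1] - VRAM_TABLE[0][1]) / (
    VRAM_TABLE[-1][0] - VRAM_TABLE[0][0])

print(f"~{estimate_vram_gb(100):.1f} GB at 100k context")
print(f"~{gb_per_1k:.2f} GB per extra 1k tokens on average")
```

The average slope works out to roughly a tenth of a gigabyte per additional thousand tokens, which is why going from 4K to 65K costs only about 6 GB.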

Hardware Fit and Practical Recommendations

For GLM 4.7 Flash, the sweet spot is clearly a single 24 GB GPU. Cards like the RTX 3090 can load the model at 65K context in about 23 GB and still leave roughly 1 GB of headroom. There is no need for multi-GPU setups unless you explicitly want to run beyond 100K context.

A 32 GB GPU such as the RTX 5090 extends the usable range up to around 131K context on a single card. That is useful for very large repositories or long-running agent memory, but it is not required for most users.
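The recommendations above amount to a lookup against the VRAM table. A minimal sketch of that lookup follows. The function name and the `headroom_gb` parameter are hypothetical, and because the table figures are rounded, results near a boundary should be treated as approximate.

```python
# (context in thousands of tokens, VRAM in GB), copied from the VRAM table
VRAM_TABLE = [(4, 17), (8, 18), (16, 19), (32, 20), (45, 21),
              (57, 23), (65, 23), (86, 25), (131, 30)]


def largest_context_that_fits(gpu_vram_gb, headroom_gb=0.0):
    """Return the largest measured context (in k tokens) that fits, or None."""
    budget = gpu_vram_gb - headroom_gb
    fitting = [ctx for ctx, vram in VRAM_TABLE if vram <= budget]
    return max(fitting) if fitting else None


for label, vram in [("16 GB card", 16), ("24 GB card", 24), ("32 GB card", 32)]:
    ctx = largest_context_that_fits(vram)
    fit = f"up to {ctx}k context" if ctx is not None else "does not fit"
    print(f"{label}: {fit}")
```

Passing a non-zero `headroom_gb` is a simple way to reserve memory for a desktop session or other processes sharing the GPU.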

Unified memory systems are also viable here. Because the active parameter count is low, GLM 4.7 Flash behaves much better on shared memory machines than dense models of similar size. However, our focus in this article is discrete GPU performance.

RTX 3090 Performance Results

The RTX 3090 remains one of the best value GPUs for local inference, and GLM 4.7 Flash fits it extremely well.

At 4K context, prompt processing runs at roughly 2000 tokens per second, with token generation around 93 tokens per second. As context grows, prompt processing scales down predictably. At 16K context, prompt processing is just over 1000 tokens per second, and generation stays close to 63 tokens per second.

Context	Prompt Processing (t/s)	Token Generation (t/s)
4k	2013.48	93.37
8k	1517.46	82.51
16k	1046.50	62.89
32k	622.14	43.37
45k	474.06	34.93
57k	385.09	30.19
65k	344.51	26.76

At 32K context, prompt processing drops to about 620 tokens per second, while generation remains usable at around 43 tokens per second. Even at 65K context, prompt processing is still around 345 tokens per second, and generation stays near 27 tokens per second.

In practical terms, this means the RTX 3090 can handle very large contexts without becoming frustrating to use. Token generation remains fast enough for interactive coding and chat even at extreme context sizes.
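To translate these throughput figures into wait times, the sketch below divides token counts by the measured rates. It assumes the prompt-processing figure can be treated as the average speed over the whole prompt and the generation figure as the speed at that context depth. The article does not state how the benchmark was configured, so read the results as ballpark estimates. The function name and the example token counts are illustrative.

```python
# RTX 3090 results from the table above:
# context (k tokens) -> (prompt processing t/s, token generation t/s)
RTX_3090 = {
    4: (2013.48, 93.37),
    8: (1517.46, 82.51),
    16: (1046.50, 62.89),
    32: (622.14, 43.37),
    45: (474.06, 34.93),
    57: (385.09, 30.19),
    65: (344.51, 26.76),
}


def estimate_seconds(context_k, prompt_tokens, output_tokens, results=RTX_3090):
    """Return (prefill seconds, generation seconds) at a measured context size."""
    pp_tps, tg_tps = results[context_k]
    return prompt_tokens / pp_tps, output_tokens / tg_tps


# Example: a 32,000-token prompt followed by a 500-token answer
prefill_s, gen_s = estimate_seconds(32, prompt_tokens=32_000, output_tokens=500)
print(f"prefill: ~{prefill_s:.0f} s, generation: ~{gen_s:.0f} s")
```

At large contexts the initial prompt ingestion dominates the wait, so prompt caching or incremental prompts matter more than raw generation speed.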

RTX 5090 Performance Results

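On the RTX 5090, the model is substantially faster across the board. At 4K context, prompt processing exceeds 5000 tokens per second, while token generation is around 158 tokens per second. That is roughly 2.5 times the RTX 3090's prompt processing speed and 1.7 times its generation speed at the same context.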

At 16K context, prompt processing remains above 3000 tokens per second, and generation stays close to 139 tokens per second. Even at 65K context, prompt processing is still over 1100 tokens per second, with generation around 95 tokens per second.

Context	Prompt Processing (t/s)	Token Generation (t/s)
4k	5083.73	157.60
8k	4140.79	150.23
16k	3084.50	138.74
32k	1996.22	120.73
45k	1563.69	109.42
57k	1278.88	99.49
65k	1144.61	94.64
86k	878.36	84.92
131k	587.89	67.42

At the extreme end, 131K context still runs comfortably. Prompt processing is just under 600 tokens per second, and token generation remains around 67 tokens per second. This is unusually good behavior for a model with this context length and highlights the advantage of the MoE design.
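The two result tables can be compared directly by dividing the RTX 5090 figures by the RTX 3090 figures at each shared context size. The sketch below does only that arithmetic on the published numbers. The helper name is illustrative.

```python
# context (k tokens) -> (prompt processing t/s, token generation t/s)
RTX_3090 = {
    4: (2013.48, 93.37), 8: (1517.46, 82.51), 16: (1046.50, 62.89),
    32: (622.14, 43.37), 45: (474.06, 34.93), 57: (385.09, 30.19),
    65: (344.51, 26.76),
}
RTX_5090 = {
    4: (5083.73, 157.60), 8: (4140.79, 150.23), 16: (3084.50, 138.74),
    32: (1996.22, 120.73), 45: (1563.69, 109.42), 57: (1278.88, 99.49),
    65: (1144.61, 94.64), 86: (878.36, 84.92), 131: (587.89, 67.42),
}


def speedups(fast, slow):
    """Per-context (prompt, generation) speed ratios for shared context sizes."""
    shared = sorted(fast.keys() & slow.keys())
    return {ctx: (fast[ctx][0] / slow[ctx][0], fast[ctx][1] / slow[ctx][1])
            for ctx in shared}


ratios = speedups(RTX_5090, RTX_3090)
for ctx, (pp_ratio, tg_ratio) in ratios.items():
    print(f"{ctx:>3}k  prompt x{pp_ratio:.2f}  generation x{tg_ratio:.2f}")
```

The ratios grow with context length, which means the RTX 3090 loses speed faster than the RTX 5090 as the context fills up.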

FP4 and Blackwell-Specific Behavior

On the RTX 5090, we also tested MXFP4 GGUF quantizations. These are slightly faster than Q4_K in several scenarios. The reason is straightforward: Blackwell has native FP4 support, and llama.cpp can take advantage of it.

The improvement is not dramatic in token generation, but prompt processing does see a noticeable uplift, especially at shorter contexts. This makes FP4-based quants a good option on Blackwell-class GPUs, while older cards will see little benefit.

Comparison to Similar MoE Models

It is natural to compare GLM 4.7 Flash to Qwen3 30B A3B and GPT-OSS 20B.

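GLM 4.7 Flash and Qwen3 30B A3B have very similar VRAM behavior. Both load a 65K context in about 23 GB of VRAM, which makes them ideal for 24 GB GPUs. In raw speed, Qwen3 tends to be faster at short contexts on high-end GPUs like the RTX 5090, especially in prompt processing. At 32K context, GLM 4.7 Flash trails Qwen3 in prompt processing on both cards. On the RTX 3090 it also generates at roughly half of Qwen3's speed (about 43 versus 87 tokens per second), while on the RTX 5090 its generation is slightly faster (about 121 versus 111 tokens per second).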

GPU	GLM 4.7 Flash (32k context)	Qwen3 A3B (32k context)
RTX 5090	PP: 1996.22 t/s TG: 120.73 t/s	PP: 2877.53 t/s TG: 110.65 t/s
RTX 3090	PP: 622.14 t/s TG: 43.37 t/s	PP: 1336.79 t/s TG: 87.21 t/s

GPT-OSS 20B is a different class of model. It is much smaller and extremely compact. It can load its full 131K context in about 15 GB of VRAM and is significantly faster in both prompt processing and token generation. On the RTX 5090, GPT-OSS can exceed 290 tokens per second in generation at small contexts. The trade-off is model capacity and reasoning depth, not hardware efficiency.

From a hardware perspective, GLM 4.7 Flash sits cleanly between these models. It offers much larger capacity than GPT-OSS while remaining easy to run on a single consumer GPU, similar to Qwen3 MoE.

Conclusion

GLM 4.7 Flash is an excellent fit for local inference hardware. Its MoE design keeps VRAM requirements low, and its scaling behavior is predictable and forgiving. A single 24 GB GPU can handle up to 65K context comfortably, which covers almost all real-world coding and agent workloads.

The RTX 3090 remains a strong value option, delivering usable speeds even at extreme context sizes. The RTX 5090 pushes performance further and enables very large contexts with high responsiveness, especially when using FP4 quantization.

From a hardware standpoint, GLM 4.7 Flash does exactly what a local-first MoE model should do. It delivers large-context capability without forcing users into multi-GPU setups or expensive workstation cards.

URL: https://www.hardware-corner.net/glm-4-7-flash-llm-hardware/

We Tested GLM-4.7 Flash 30B MoE — Here’s the GPU You Actually Need