Jan. 22, 2026 / Hardware Insights
Z.ai released GLM 4.7 Flash only a few days ago, but meaningful local testing had to wait. The initial llama.cpp support was incomplete, and without proper fixes it was not possible to measure real performance. Those fixes have now landed, and with the latest llama.cpp build we were finally able to test the model properly...
Dec. 11, 2025 / Hardware Insights
Mistral AI has just released its new coding model, Devstral 2. We've been using its predecessor, Devstral Small, locally for code completion and have been very impressed with its performance. Early reports on Devstral 2 put it on par with other top models like Kimi K2 and Deepseek v3.2, so we were eager to get...
Nov. 10, 2025 / Hardware Insights
I've been testing the --n-cpu-moe flag in llama.cpp to see how much it improves performance with large Mixture of Experts models. The standard method of splitting layers between the GPU and CPU can be slow for these models. This flag offers a more targeted approach by moving just the expert layers to system RAM while...
Nov. 2, 2025 / Hardware Insights
At Hardware Corner, we set out to create a data-driven benchmark hierarchy for local LLM inference, focusing on the two workloads that define real-world performance: prompt processing and token generation. Using llama.cpp's latest llama-bench on Ubuntu 24.04 with CUDA 12.8, we measured a wide range of GPUs across model sizes, context lengths, and quantization...
Oct. 28, 2025 / LLM Benchmarks
As someone who runs language models locally, I know that VRAM is the one resource we can never have enough of. Every parameter, every token of context, and the growing KV cache all chip away at that precious memory. To cut through the speculation and get hard data, I decided to benchmark some of today's...
Oct. 17, 2025 / LLM Benchmarks
I tested the RTX 4090 with five quantized models to measure real-world inference performance for local LLM workloads. This is the second article in my GPU benchmark series, following my recent RTX 5090 tests. I ran these benchmarks to provide concrete performance data across different model sizes and context lengths using llama.cpp. Testing Environment My...
Oct. 16, 2025 / LLM Benchmarks
I recently completed extensive local LLM inference benchmarks on the NVIDIA RTX 5090 32 GB. My primary focus was gathering raw performance data on critical metrics for the local enthusiast: prompt processing speed (PP), token generation throughput (TG), and the maximum context window I could sustain using 4-bit quantization (Q4_K_XL). My goal here is to...
Sep. 16, 2025 / LLM Benchmarks
When choosing a local LLM, one of the first specifications to check is its context window. The context size determines how many tokens you can feed into the model at once, which directly affects practical use cases like long-form reasoning, document analysis, or multi-turn conversations. For hardware enthusiasts running quantized models on limited VRAM, knowing...
Sep. 10, 2025 / LLM Benchmarks
After testing the gpt-oss-20B model on a single RTX 3090, I had to push things further and see what the new heavyweight could do. In addition to the 20B model, OpenAI also released gpt-oss-120B, a massive 120-billion parameter open-weight Mixture-of-Experts (MoE) model with 5.1 billion active parameters. I first ran some experiments on an RTX...