Allan Witt is the co-founder and Editor-in-Chief of Hardware-Corner.net. Computers and the web have fascinated him since childhood. In 2011, he began training as an IT specialist at a mid-sized company while launching a tech blog on the side—quickly discovering a passion for writing about hardware and technology.
After completing his training, Allan worked as a system administrator for two years. Alongside that, he started building and upgrading custom gaming PCs at a local hardware shop. What began as a part-time project grew into a full-time career. Today, his work also focuses on building and optimizing PC systems for local AI and LLM workloads, combining hands-on experience with a passion for making complex tech easy to understand.
Nov. 4, 2025 / Hardware Insights
When I recently helped set up an LLM inference server for a client, I ran into a problem that may sound familiar to anyone mixing different GPUs. I had an RTX Pro 6000 Workstation (95 GB VRAM) and an RTX 5090 (32 GB VRAM). The goal was simple: run a vLLM setup without wasting available memory....
Nov. 3, 2025 / Hardware Insights
When one of YouTube’s biggest creators decides to build a personal AI supercomputer, the local LLM scene takes notice. PewDiePie’s journey into AI hardware has produced a multi-GPU, 424GB VRAM workstation that many enthusiasts dream of. While his budget is far beyond the average builder, his component choices and setup offer a valuable blueprint for...
Nov. 2, 2025 / Hardware Insights
At Hardware Corner, we set out to create a data-driven benchmark hierarchy for local LLM inference – focusing on the two workloads that define real-world performance: prompt processing and token generation. Using llama.cpp’s latest llama-bench on Ubuntu 24.04 with CUDA 12.8, we measured a wide range of GPUs across model sizes, context lengths, and quantization...
Oct. 28, 2025 / LLM Benchmarks
As someone who runs language models locally, I know that VRAM is the one resource we can never have enough of. Every parameter, every token of context, and the growing KV cache all chip away at that precious memory. To cut through the speculation and get hard data, I decided to benchmark some of today’s...
Oct. 24, 2025 / Hardware Insights
This guide presents several PC build options at different price points for enthusiasts looking to run large language models (LLMs) on their local machines. These are templates designed for performance and value in LLM inference. You can adjust them based on component availability and your specific budget. At the moment, RAM prices are unusually high,...
Oct. 18, 2025 / LLM Hardware News
Llama.cpp local LLMs on AMD GPUs just got faster - the latest RADV Vulkan driver update delivers up to 13% higher prompt processing performance
Oct. 16, 2025 / LLM Benchmarks
I recently completed extensive local LLM inference benchmarks on the NVIDIA RTX 5090 32 GB. My primary focus was gathering raw performance data on critical metrics for the local enthusiast: prompt processing speed (PP), token generation throughput (TG), and the maximum context window I could sustain using 4-bit quantization (Q4_K_XL). My goal here is to...
Oct. 15, 2025 / LLM Hardware News
The long-awaited Nvidia DGX Spark is finally here, and the first benchmarks for local LLM inference have landed. Georgi Gerganov of ggml-org has put the machine through its paces with the latest llama.cpp, giving us the raw data we need.
Oct. 13, 2025 / LLM Hardware News
For local LLM enthusiasts, the hardware landscape is in constant motion. We are always searching for the next breakthrough that delivers more VRAM and memory bandwidth for our dollar. While multi-GPU setups using used server cards have been the go-to solution, a new class of powerful APUs, or “big APUs,” is emerging. AMD fired the...
Oct. 9, 2025 / How to Run LLMs Locally
For anyone running LLMs locally, the goal is always more performance for less cost. We obsess over VRAM, memory bandwidth, and squeezing every last token per second out of our hardware. While prompt processing (TTFT) is often fast, the token generation that follows can be a bottleneck, especially on memory-bandwidth-limited systems. This one-token-at-a-time process, called...
Oct. 8, 2025 / How to Run LLMs Locally
If you’ve gotten your hands on an AMD Ryzen AI Max+ 395 (Strix Halo) system, you already know the raw hardware is impressive. That massive pool of unified LPDDR5x memory is a game-changer for running large models locally. But unlocking its full potential isn’t just plug-and-play. The key to getting the best possible performance lies...
Sep. 29, 2025 / How to Run LLMs Locally
Unlock significant speed gains for large language models on your own hardware without sacrificing quality. Here’s how it works and how to set it up in popular inference engines. Why Local LLMs Run Slow: If you run large language models on your own hardware, you know the biggest challenge is inference speed. Getting high-quality models...
Sep. 15, 2025 / How to Run LLMs Locally
For local LLM enthusiasts, the race for models with larger “context lengths” feels like the next frontier. While developers boast models that can “remember” entire novels, the practical reality for anyone running hardware at home is that a bigger context window directly translates to a massive hit on your system’s resources, especially your precious VRAM....
Sep. 11, 2025 / Hardware Insights
Let’s be honest: cloud LLMs are incredibly powerful and mostly free. GPT-5, Gemini Pro, Claude Sonnet 4 – you can use them for almost unlimited queries without hitting hard limits. I personally combine Gemini and ChatGPT when one hits a rate limit, and it works perfectly. So why would you want to run models locally?...
Sep. 10, 2025 / LLM Benchmarks
After testing the gpt-oss-20B model on a single RTX 3090, I had to push things further and see what the new heavyweight could do. In addition to the 20B model, OpenAI also released gpt-oss-120B, a massive 120-billion parameter open-weight Mixture-of-Experts (MoE) model with 5.1 billion active parameters. I first ran some experiments on an RTX...
Sep. 5, 2025 / How to Run LLMs Locally
Running large language models locally requires smart resource management. Quantization is the key technique that makes this possible by reducing memory requirements and improving inference speed. This practical guide focuses on what you need to know for local LLM deployment, not the mathematical theory[1] behind it. For the technical mathematical details of quantization, check out...
Sep. 4, 2025 / Local LLM
Large Language Models (LLMs) have rapidly emerged as powerful tools capable of understanding and generating human-like text, translating languages, writing different kinds of creative content, and answering questions in an informative way. You’ve likely interacted with them through services like ChatGPT, Claude, or Gemini. While these cloud-based services offer convenience, there’s a growing interest in...
Aug. 31, 2025 / LLM Hardware News
For local LLM enthusiasts, VRAM has always been the main constraint when choosing hardware. Now, a new option is becoming more accessible at a price point that’s hard to ignore. The Huawei Atlas 300I Duo, an AI inference card from China, is showing up on platforms like Alibaba for under $1500, offering an impressive 96...
Aug. 27, 2025 / LLM Hardware News
The latest rumors around AMD’s upcoming RDNA5 flagship, codenamed AT0, suggest a 512-bit memory bus paired with GDDR7. For anyone running large quantized LLMs locally, this is the part of the leak worth paying attention to – not the shader counts or gaming benchmarks. If the leak is accurate, bandwidth and VRAM capacity could finally...
Aug. 27, 2025 / LLM Hardware News
NVIDIA has just published a paper detailing a new family of language models, Jet-Nemotron, which claims to deliver massive performance gains while maintaining the accuracy of today’s top open-source models. For local LLM users constantly battling VRAM limits and slow inference speeds, this research could point to a significant shift in how we run models...