URL: https://www.hardware-corner.net/pc-builds-for-local-llms/

Best PC Builds for Local LLMs: From 7B to 123B Models

By Allan Witt | Updated: January 14, 2026

This guide presents several PC build options at different price points for enthusiasts looking to run large language models (LLMs) on their local machines. These are templates designed for performance and value in LLM inference. You can adjust them based on component availability and your specific budget.

At the moment, RAM prices are unusually high, and this significantly increases the overall cost of a PC build – especially one intended for running local LLMs. If you’re on a budget, sticking to 32GB of RAM is the most cost-effective choice.

While larger models benefit from more memory, going beyond 32GB can raise the total system price dramatically. For reference, many base-level setups running models up to around 14B parameters can still operate with 16GB.

Keep this in mind when planning your build to avoid unnecessary overspending.

These builds are tailored specifically for running large language models on your own hardware. While they are powerful enough for other tasks like gaming or content creation, every component choice is made with LLM performance as the top priority. Our suggestions focus on new hardware to ensure warranty and availability, but we provide viable second-hand alternatives under each build for the budget-conscious user.

All performance figures for prompt processing and token generation are based on tests conducted on Linux with CUDA 12.8 and the latest version of llama.cpp utilizing flash attention and Q4_K_XL quantization. For now, we are providing only NVIDIA-based solutions. We believe their CUDA ecosystem currently offers the most stable and high-performance experience for local LLM inference. Should AMD or Intel present a more compelling value proposition in the future, we will test their hardware and update our recommendations.

We are focusing on single-GPU builds to reduce complexity for those new to the hardware side of local LLMs. In the future, we will explore multi-GPU configurations. Lastly, all builds except the entry-level option use fast DDR5-6000+ RAM. This provides better performance if you need to offload model layers to system memory when VRAM is full. However, be aware that consumer motherboards use dual-channel memory, which is significantly slower than GPU VRAM. When offloading occurs, inference speed will drop noticeably.
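To put the gap between dual-channel system memory and VRAM into numbers, here is a minimal sketch of the theoretical peak bandwidth calculation. The 64-bit (8-byte) channel width is the standard DDR figure; the 936 GB/s value for an RTX 3090 is an approximate spec-sheet number used here as an assumption, not a measurement from our tests.

```python
def ddr_peak_bandwidth_gbs(mega_transfers_per_s: float,
                           channels: int = 2,
                           bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for DDR system memory.

    Each channel moves `bytes_per_transfer` bytes per transfer (64-bit bus).
    """
    bytes_per_second = mega_transfers_per_s * 1e6 * bytes_per_transfer * channels
    return bytes_per_second / 1e9


# Dual-channel DDR5-6000, as used in most builds in this guide.
system_ram_gbs = ddr_peak_bandwidth_gbs(6000)

# Assumption: approximate spec-sheet VRAM bandwidth of an RTX 3090.
ASSUMED_RTX_3090_VRAM_GBS = 936.0

print(f"Dual-channel DDR5-6000: {system_ram_gbs:.0f} GB/s peak")
print(f"VRAM is roughly {ASSUMED_RTX_3090_VRAM_GBS / system_ram_gbs:.1f}x faster")
```

Real-world throughput is lower than these peak figures on both sides, but the ratio gives a feel for why offloaded layers slow generation so much.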

Tips for Building Your LLM PC

Before you start pricing out components, the most important step is to define your primary use case. Are you looking for a system for casual chatting and experimentation with smaller models, or do you need a workhorse for coding and professional tasks that require large models and significant context windows? The answer will directly influence how much VRAM you need and what kind of performance you can expect.

A critical factor to consider is your tolerance for prompt processing time. Loading a large context into a model takes time, often several minutes for bigger prompts on the builds we’ve listed. If you’re accustomed to the near-instantaneous prompt ingestion of cloud services like Gemini, which can process 60,000 tokens in seconds, it’s important to understand that no consumer-grade local setup can match that speed. Knowing these limitations and how much waiting you’re willing to do will help you choose a build that aligns with your expectations and workflow.
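A quick way to set expectations is to divide the prompt length by the prompt processing rate. The sketch below does exactly that, using rates quoted in the build sections of this guide; the helper names are ours and purely illustrative.

```python
def ingest_seconds(prompt_tokens: int, tokens_per_second: float) -> float:
    """Time to process a prompt at a constant prompt processing rate."""
    return prompt_tokens / tokens_per_second


def as_minutes(seconds: float) -> str:
    """Format a duration as minutes and seconds, rounded to the second."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"


# (prompt tokens, prompt processing rate in tokens/s) taken from this guide.
scenarios = {
    "12GB build, 45k context": (45_000, 358),
    "12GB build, 16k context": (16_000, 1600),
    "16GB build, 120k context": (120_000, 685),
}

for name, (tokens, rate) in scenarios.items():
    print(f"{name}: {as_minutes(ingest_seconds(tokens, rate))}")
```

This assumes the rate stays constant over the whole prompt, which is a simplification: in practice, processing tends to slow down as the context fills.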

  • Buying second-hand GPUs can be a great way to save money. A second-hand RTX 3090, for instance, offers incredible value for running LLMs locally. However, remember that used parts may not come with a warranty, so weigh the risk against the savings.
  • When planning your build, remember that VRAM capacity, memory bandwidth, and GPU compute are the most important metrics. VRAM capacity determines the maximum model size, memory bandwidth determines token generation speed, and compute determines prompt processing speed. Choose the GPU that is strongest in these areas and fits your budget.
  • If you make changes to the builds, ensure your chosen case and power supply unit (PSU) can accommodate your GPU, as high-end cards can be very large and power-hungry.
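The link between VRAM, bandwidth, and model size can be sketched with two rough formulas: weight size is parameters times bits per weight, and generation speed is capped by how often the GPU can read those weights per second. The 4.5 bits per weight and the 360 GB/s bandwidth below are illustrative assumptions, not figures from our tests, and the estimate ignores the KV cache and runtime overhead.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9


def generation_ceiling_tps(bandwidth_gbs: float, model_gb: float) -> float:
    """Upper bound on tokens/s for a dense model.

    Each generated token requires reading all weights from VRAM once,
    so memory bandwidth divided by model size bounds the speed.
    """
    return bandwidth_gbs / model_gb


ASSUMED_BITS_PER_WEIGHT = 4.5   # assumption: rough average for 4-bit quants
ASSUMED_BANDWIDTH_GBS = 360.0   # assumption: a 12GB-class card

for size_b in (8, 14, 32):
    gb = weights_gb(size_b, ASSUMED_BITS_PER_WEIGHT)
    ceiling = generation_ceiling_tps(ASSUMED_BANDWIDTH_GBS, gb)
    print(f"{size_b}B: ~{gb:.1f} GB of weights, at most ~{ceiling:.0f} tokens/s")
```

Measured speeds will land below this ceiling, and the gap widens at long contexts, but the estimate is a useful sanity check when comparing cards.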

Our Build Selection

GPU availability and pricing can fluctuate. The graphics card is the most expensive component in an LLM build, so your choice here will define your system’s capabilities.

  • The Entry Point (12GB VRAM): A budget-friendly build capable of running small to medium models up to 14B parameters, perfect for experimentation and daily tasks.
  • The Capable Workhorse (16GB VRAM): The new sweet spot for value. This build handles models up to 20B parameters and can manage very large context windows with certain model architectures.
  • The Enthusiast’s Sweet Spot (24GB VRAM): A popular choice for serious enthusiasts, this build runs large 30B+ parameter models comfortably, making it a versatile machine for coding, creative writing, and complex tasks.
  • The 32GB Specialist (Under Testing): We are currently evaluating the best component combinations for the 32GB VRAM tier and will update this guide soon.
  • The 48GB Professional Workstation (Under Testing): We are finalizing our tests to ensure we recommend the best value and performance.
  • The 96GB Uncompromising Powerhouse: The ultimate DIY desktop build for running massive models up to 123B parameters without compromise. This is an advanced single-GPU configuration for those who need maximum power.

[Table: prompt processing and token generation throughput of the Qwen3 8B (4-bit) model across several NVIDIA GPUs, measured with the llama.cpp benchmark at a 32K token context length. The results illustrate how GPU performance scales with hardware capability and memory bandwidth under 4-bit quantization.]

The Entry Point (12GB VRAM)

This build is the most affordable entry into the world of local LLMs. With 12GB of VRAM, you will be able to run small to medium models up to 14B parameters. This includes excellent models like Qwen2 (7B), Llama 3.1 (8B), and Phi-3-medium.

This system can run 8B models with a 4-bit quantization and a 45k context window, generating around 35 tokens per second. The prompt processing speed at this context length is about 358 tokens per second, which means it can take a couple of minutes to ingest a large prompt. For more practical everyday use, a 16k context window delivers a much faster experience with prompt processing at 1600 tokens per second and token generation at 60 tokens per second. For 14B models, you can expect to use a 16k context with 1600 t/s processing and 40 t/s generation.
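Context length costs VRAM because the model keeps a key and value vector for every token in every layer. Here is a rough sketch of that KV cache estimate. The layer and head counts are assumptions taken from the published Llama 3.1 8B configuration, and the 16-bit cache is also an assumption; a quantized cache would need less memory.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size in GiB.

    The leading factor of 2 accounts for storing both keys and values.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1024**3


# Assumption: Llama 3.1 8B-style configuration with a 16-bit cache.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

for context in (16_384, 45_000):
    size = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, context)
    print(f"{context} tokens: ~{size:.1f} GiB of KV cache")
```

Under these assumptions the cache alone takes several gigabytes at 45k tokens, which, added to the 4-bit weights, explains why that context length is close to the practical limit on a 12GB card.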

Comments and Alternatives

This build uses the RTX 3060 12GB, an older but still very capable GPU for a basic LLM machine. For a bit more performance, you could look for a used RTX 4070 12GB, which can be found for around $300 to $400. However, at that price, you are getting close to the cost of a new 16GB card, which represents a much more significant leap in capability for running LLMs.

The Capable Workhorse (16GB VRAM)

With 16GB of VRAM, this build opens the door to running larger and more powerful models, including those up to 20B parameters. It can even handle some Mixture-of-Experts (MoE) models with very large context windows. MoE models are efficient because they only activate a fraction of their total parameters for each token generated, making them faster than dense models of a similar size, although all of their weights still need to be held in memory.

For example, this GPU can run a model like GPT-NeoX 20B or efficiently handle a large MoE model with an almost maximum context of 120k tokens at 43 tokens per second. The prompt processing at that massive context size is 685 tokens per second, taking about three minutes to ingest the prompt. At a more common 65k context, the prompt processing time drops to about a minute, and inference speed increases to 58 tokens per second. This build can also handle 8B models with a 65k context and 14B models with a 32k context with ease.

Comments and Alternatives

At the moment, there are very few second-hand options that can compete with the value offered by a new RTX 4060 Ti 16GB. Its combination of modern architecture, power efficiency, and VRAM capacity makes it the clear winner in this price bracket for LLM workloads.

The Enthusiast’s Sweet Spot (24GB VRAM)

The 24GB VRAM tier is often considered the sweet spot for serious local LLM enthusiasts. This is where you gain the ability to run very large and capable models between 20B and 36B parameters without significant compromises. Strong performers in this category include Mistral 7B Instruct v0.2 at high context, Qwen2 32B, and various powerful coding and specialized models.

With an RTX 4090, this system demonstrates impressive performance. For a 30B MoE model, it can process prompts at over 1700 tokens per second and generate text at nearly 75 tokens per second. Even a dense 32B model runs smoothly, processing prompts at 1680 t/s and generating at 34 t/s with a 16k context window. This level of performance makes the machine highly responsive for demanding tasks like coding assistance, document analysis, and advanced role-playing.

Comments and Alternatives

While our build features a new RTX 4090, the undisputed value king in this category is a used RTX 3090. A second-hand 3090 can be acquired for around $750 to $800, drastically reducing the total build cost to a level much closer to our 16GB workhorse build while still providing that crucial 24GB of VRAM. If you are comfortable with used hardware, this is the most cost-effective path to high-end local LLM performance.

The 96GB Uncompromising Powerhouse

This is the ultimate single-GPU desktop build for those who want to run nearly any open-source model available today. The heart of this system is the recently released RTX Pro 6000, a prosumer powerhouse equipped with a massive 96 GB of GDDR7 memory. This configuration allows you to load models up to 123B parameters, such as Mistral Large and OpenAI’s gpt-oss 120b, with their full context window, all on a single card.

The advantage of a single-GPU solution is its simplicity and efficiency. You avoid the complexities of multi-GPU setups, ensuring maximum performance without worrying about interconnect bottlenecks. This build can handle massive models with ease, enabling deep analysis, complex instruction following, and experimentation with state-of-the-art models as soon as they are released.

Comments and Alternatives

At the moment, the RTX Pro 6000 stands in a class of its own for single-card VRAM capacity in the prosumer space. There are no direct single-GPU substitutions that offer this much memory. The only way to achieve similar VRAM is by using a more complex and potentially less efficient dual-GPU setup with two 48GB workstation cards. NVIDIA has announced the RTX Pro 5000 with 72GB of VRAM, which can be a compelling future substitution for those who don’t need the full 96GB, but it is not yet available. For now, if you need this level of VRAM in a single card, this is the build to aim for.

We have also selected components with future upgrades in mind. The chosen case has ample space to accommodate a second large GPU, such as another RTX Pro 6000 or a future RTX 5090. The motherboard supports a dual-GPU configuration. While the CPU’s 24 PCIe lanes mean you cannot run two cards at full x16 bandwidth, the board is designed to split the lanes into an x8/x8 configuration across its two PCIe 5.0 slots, which is more than sufficient for LLM workloads.
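For a sense of what an x8/x8 split means in bandwidth terms, the following sketch computes the theoretical per-direction throughput of a PCIe 5.0 link. It uses the standard 32 GT/s per-lane rate and 128b/130b encoding; the function name is ours, and real transfers will fall somewhat short of these peaks.

```python
def pcie_peak_gbs(gigatransfers_per_s: float, lanes: int,
                  encoding_efficiency: float = 128 / 130) -> float:
    """Theoretical one-direction PCIe bandwidth in GB/s.

    Each transfer carries one bit per lane; 128b/130b encoding means
    128 of every 130 bits are payload.
    """
    return gigatransfers_per_s * encoding_efficiency / 8 * lanes


PCIE5_GT_PER_S = 32.0  # per-lane signalling rate for PCIe 5.0

x8 = pcie_peak_gbs(PCIE5_GT_PER_S, lanes=8)
x16 = pcie_peak_gbs(PCIE5_GT_PER_S, lanes=16)
print(f"PCIe 5.0 x8:  ~{x8:.1f} GB/s per direction")
print(f"PCIe 5.0 x16: ~{x16:.1f} GB/s per direction")
```

During inference, only small activation tensors cross the bus between cards, so halving the link width mainly affects model load time rather than generation speed.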

If you decide to add a second GPU, you must upgrade the power supply. The PSU in this build is sized for a single RTX Pro 6000. A dual-card setup will require a much more powerful unit, likely exceeding 1600 watts. It is also crucial to ensure your home’s electrical circuit can handle such a significant and sustained power draw. A final note on memory: while the motherboard is rated for very high RAM speeds (8400+ MT/s), achieving these speeds with large 96GB kits is not yet guaranteed, so sticking to a stable 6000 MT/s is recommended for now.
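A simple power budget helps when sizing the PSU for a second card. The sketch below sums component draw and adds headroom. Every wattage in it is an illustrative assumption (check the spec sheets for your actual parts), and the circuit check assumes a 120 V, 15 A outlet with an 80% continuous-load margin, which will not apply in every region.

```python
def recommended_psu_watts(component_watts: dict[str, float],
                          headroom: float = 0.2) -> float:
    """Sum of component power draw plus a safety margin."""
    return sum(component_watts.values()) * (1 + headroom)


def circuit_continuous_watts(volts: float, amps: float,
                             continuous_factor: float = 0.8) -> float:
    """Sustained load a household circuit can carry under the assumed margin."""
    return volts * amps * continuous_factor


# Assumptions: 600 W per GPU, 170 W CPU, 100 W for everything else.
single_gpu = {"gpu": 600, "cpu": 170, "rest": 100}
dual_gpu = {"gpu_1": 600, "gpu_2": 600, "cpu": 170, "rest": 100}

print(f"Single GPU: ~{recommended_psu_watts(single_gpu):.0f} W PSU")
print(f"Dual GPU:   ~{recommended_psu_watts(dual_gpu):.0f} W PSU")
print(f"Assumed circuit limit: ~{circuit_continuous_watts(120, 15):.0f} W sustained")
```

With these placeholder numbers, the dual-GPU estimate lands above both a 1600 W PSU and the assumed circuit limit, which is the kind of mismatch worth catching before buying parts.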

Frequently Asked Questions (FAQ)

Q: Why is VRAM so important for LLMs?
A: When you run an LLM, the model’s parameters (its “weights”) must be loaded into memory for the GPU to access them quickly. VRAM is the fastest memory available to the GPU. If a model is too large to fit entirely in VRAM, parts of it must be swapped to your system’s RAM or even your SSD, which are orders of magnitude slower. This causes a dramatic drop in performance. Therefore, having enough VRAM to hold the entire model is the key to fast and responsive inference.

Q: Can I use two different GPUs, like an RTX 4090 and an RTX 3060?
A: Yes, frameworks like llama.cpp support splitting a model across different GPUs. However, the overall speed will be limited by the slowest card in the chain. It’s generally more efficient to use two identical cards, as this allows for a balanced workload. Also, communication between cards happens over the PCIe bus, which is slower than having the entire model on a single GPU’s VRAM.
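As a rough illustration, llama.cpp accepts a comma-separated list of proportions through its --tensor-split option to control how much of the model goes to each GPU. The sketch below derives such a list from each card's VRAM. Splitting purely by VRAM size is a heuristic we assume here, not a tested recommendation, and the helper name is hypothetical.

```python
from functools import reduce
from math import gcd


def tensor_split_by_vram(vram_gb: list[int]) -> str:
    """Build a proportion string such as "2,1" from per-GPU VRAM sizes."""
    divisor = reduce(gcd, vram_gb)
    return ",".join(str(size // divisor) for size in vram_gb)


# Example: an RTX 4090 (24GB) paired with an RTX 3060 (12GB).
split = tensor_split_by_vram([24, 12])

# Arguments to append to a llama.cpp command line.
extra_args = ["--n-gpu-layers", "99", "--tensor-split", split]
print(" ".join(extra_args))
```

In practice you may need to shift the proportions slightly away from the card that also drives your display, since it has less free VRAM than its nominal capacity suggests.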

Q: What about AMD or Intel GPUs for LLMs?
A: While AMD and Intel are making progress with their software stacks (ROCm for AMD, OpenVINO for Intel), the ecosystem is less mature than NVIDIA’s CUDA. For now, CUDA offers broader support, more consistent performance, and easier setup across the majority of LLM projects and tools. We are actively monitoring the landscape and will recommend AMD or Intel solutions if they become a better value.

Q: Does CPU or system RAM speed matter if the model fits in VRAM?
A: Yes, they still matter, but to a lesser degree than the GPU. The initial “prompt processing” phase, where the model ingests your input, can be CPU-intensive, especially with very long contexts. A faster CPU will speed this up. Fast system RAM (like DDR5) is beneficial for quickly loading models into VRAM and is critical if you ever need to offload layers from VRAM to system RAM.

Q: Is a workstation GPU (like an RTX A6000) better than a gaming GPU (like an RTX 4090)?
A: It depends on your needs. Gaming GPUs like the RTX 4090 offer incredible performance-per-dollar and high clock speeds. Workstation GPUs typically offer much more VRAM (48GB vs 24GB), are built for 24/7 reliability, and often use blower-style coolers that are superior for stacking multiple GPUs right next to each other in a single chassis. For single-GPU builds, a gaming card is usually the better value. For multi-GPU builds aiming for maximum VRAM, workstation cards are often the only practical choice.
