Memory Bandwidth: How Does It Boost Tokens per Second in Local LLM Inference?

Updated: October 17, 2025

📃Part of Series:
How to Run LLMs Locally

You’ve spent weeks picking out the parts for a powerful new computer. It has a top-tier CPU, plenty of fast storage, and maybe even a respectable graphics card. You download your first large language model (LLM), excited to run it locally, only to find the experience is agonizingly slow. The text trickles out one word at a time, making any real conversation impossible. If this sounds familiar, you’re not alone. While a fast processor and lots of memory are important, the true bottleneck for a responsive LLM often lies in a specification many overlook: memory bandwidth. This guide will explain what memory bandwidth is, why it’s the most critical factor for inference speed, and how you can get more of it—even on a tight budget.

Why is my local LLM slow?

It’s a common and frustrating problem. You might have a CPU with many cores or a GPU with plenty of VRAM, but the model still feels sluggish. The reason is that running an LLM is less about raw calculation and more about constantly moving massive amounts of data. The model’s parameters—its “knowledge”—are stored in memory. For the LLM to generate a single word, the processor has to read gigabytes of these parameters from memory. The speed at which it can read that data is determined by memory bandwidth. If that connection is too slow, your powerful processor will spend most of its time waiting for data, leading to a slow and choppy user experience.

What is memory bandwidth?

Think of your computer’s memory (VRAM on a GPU or RAM for the system) as a massive library containing all the knowledge the LLM needs. Your processor (the GPU or CPU) is a researcher who needs to constantly run back and forth to this library to look up information to form a response. Memory bandwidth is the width of the highway between the researcher and the library. A wider highway allows for more data to be moved at once.

For a GPU, this highway’s width is determined by two key factors: its memory bus width (measured in bits) and its memory speed (measured in Gbps). The bus width is like the number of lanes on the highway, while the memory speed is how fast the traffic moves in each lane. The final bandwidth is calculated with a simple formula:

Memory Bandwidth (GB/s) = (Memory Bus Width in bits / 8) * Memory Speed in Gbps per pin

System RAM operates on a similar principle, but we describe its components differently: memory channels and memory speed (measured in MT/s, or mega-transfers per second). The number of channels (e.g., dual-channel, quad-channel) defines how many parallel paths exist between the CPU and the RAM. Each channel is 64 bits wide, so it moves 8 bytes per transfer, which is where the 8 in the formula comes from. The calculation is also straightforward:

Memory Bandwidth (GB/s) = Memory Speed (MT/s) * Number of Channels * 8 / 1000

In both cases, a wider path (more bus width or channels) combined with faster data movement results in higher total bandwidth, which allows your processor to access the model’s knowledge faster and, therefore, generate text more quickly.
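Both formulas are simple enough to check by hand, but a small sketch makes it easy to plug in your own hardware. The function names below are illustrative, not part of any library, and the results are theoretical peaks rather than measured throughput.

```python
def gpu_bandwidth_gbs(bus_width_bits: int, speed_gbps_per_pin: float) -> float:
    """Peak VRAM bandwidth in GB/s: bus width in bytes times the per-pin data rate."""
    return bus_width_bits / 8 * speed_gbps_per_pin


def ram_bandwidth_gbs(speed_mts: float, channels: int, bytes_per_transfer: int = 8) -> float:
    """Peak system RAM bandwidth in GB/s; a 64-bit channel moves 8 bytes per transfer."""
    return speed_mts * channels * bytes_per_transfer / 1000


# A 512-bit GPU memory bus running at 28 Gbps per pin
print(f"GPU: {gpu_bandwidth_gbs(512, 28):.1f} GB/s")

# Dual-channel DDR5-5000 system RAM
print(f"RAM: {ram_bandwidth_gbs(5000, 2):.1f} GB/s")
```

With these inputs the GPU case works out to 1,792 GB/s and the dual-channel RAM case to 80 GB/s.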

GPU VRAM vs. system RAM for local LLMs: is there a difference?

There is a massive difference, and it’s the primary reason why GPUs are the undisputed kings of LLM inference speed. Sticking with our highway analogy, this isn’t just about having a fast road; it’s about the entire road system’s design.

The memory bus is the most critical part of this design: it determines the number of lanes on the highway. The memory type, such as GDDR7, sets the speed limit for data traveling in those lanes. Total data throughput, or bandwidth, is the result of multiplying the number of lanes by the speed limit.

Take the NVIDIA RTX 5090 and its massive 512-bit bus, for example. This is the equivalent of a colossal 16-lane superhighway. With data on all 16 of those lanes traveling at 28 Gbps per pin, the result is a total memory bandwidth of 1,792 GB/s (512 / 8 * 28). That’s the ability to move nearly 1,800 gigabytes of model weights and LLM context every single second.

In contrast, your main system RAM, like DDR5, is more like a general-purpose road system designed to handle all sorts of traffic. A typical consumer PC with dual-channel DDR5-5000 has a memory bandwidth of about 80 GB/s. This enormous gap is why running an LLM on your CPU often feels dramatically slower than running it on even a mid-range GPU.

| Feature | System RAM | VRAM |
| --- | --- | --- |
| Primary goal | Low latency (quick response for CPU tasks) | High bandwidth (massive throughput for GPU cores) |
| Speed per pin | 5,000 MT/s = 5 Gbps | 28 Gbps (over 5x faster) |
| Typical bus width | 128-bit (in a dual-channel setup) | 512-bit (for a high-end GPU like an RTX 5090) |
| Physical form | Removable sticks (DIMMs) in motherboard slots | Chips soldered directly onto the graphics card PCB |

How can I increase bandwidth on system RAM?

The secret to unlocking more bandwidth from system RAM lies in memory channels. Think of a channel as a single dedicated lane on that memory highway. The more channels your CPU and motherboard support, the wider the highway becomes.

Consumer platforms, like those using Intel Core or AMD Ryzen CPUs, are almost always limited to dual-channel (two lanes). This is perfectly fine for gaming and everyday tasks, but for the intense data demands of an LLM, it creates a significant bottleneck. This is why running even medium-sized models on a standard desktop can feel painfully slow.

This is where workstation and server platforms shine. Processors like Intel Xeon or AMD Threadripper are built for heavy data workloads and support quad-channel (four lanes), octa-channel (eight lanes), or even dodeca-channel (twelve lanes) memory configurations.

Table: The Impact of Channels and Memory Speed on PC & Server Bandwidth

This table illustrates the theoretical peak memory bandwidth of traditional system RAM, comparing different speeds of DDR4 and DDR5 memory. It highlights the critical role of the memory controller’s channel count, showing how bandwidth scales from standard consumer PCs (2-channel) to high-end desktops (4-channel) and powerful server platforms (8 and 12-channel).

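| Channels | DDR4-2600 (GB/s) | DDR4-3600 (GB/s) | DDR5-5400 (GB/s) | DDR5-6000 (GB/s) |
| --- | --- | --- | --- | --- |
| 2 channels | 41.6 | 57.6 | 86.4 | 96.0 |
| 4 channels | 83.2 | 115.2 | 172.8 | 192.0 |
| 8 channels | 166.4 | 230.4 | 345.6 | 384.0 |
| 12 channels | 249.6 | 345.6 | 518.4 | 576.0 |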
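The table values follow directly from the system RAM formula (MT/s times channels times 8 bytes, divided by 1000). As a rough sketch, the grid can be regenerated for any set of speeds and channel counts; the variable names here are illustrative only.

```python
# Memory kits and channel counts taken from the table above
SPEEDS_MTS = {"DDR4-2600": 2600, "DDR4-3600": 3600, "DDR5-5400": 5400, "DDR5-6000": 6000}
CHANNELS = [2, 4, 8, 12]


def peak_gbs(speed_mts: int, channels: int) -> float:
    """Theoretical peak bandwidth in GB/s for 64-bit (8-byte) channels."""
    return speed_mts * channels * 8 / 1000


# table[channels][kit name] -> bandwidth in GB/s, rounded to one decimal
table = {
    ch: {name: round(peak_gbs(mts, ch), 1) for name, mts in SPEEDS_MTS.items()}
    for ch in CHANNELS
}

print("Channels".ljust(10) + "".join(name.rjust(11) for name in SPEEDS_MTS))
for ch, row in table.items():
    print(str(ch).ljust(10) + "".join(f"{value:11.1f}" for value in row.values()))
```

Swapping in your own kit speed or channel count shows quickly whether a platform upgrade or a faster kit buys more bandwidth.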

What matters more for LLM: channels or faster RAM?

Channels, in most cases. For LLM inference, the number of memory channels is often far more important than the rated speed (MT/s) of the RAM modules themselves. This creates a fascinating opportunity for anyone looking to build a powerful LLM machine without breaking the bank.

Here’s a worked example: a server built with twelve channels of older, cheaper DDR4-3200 RAM can achieve a total memory bandwidth of around 307 GB/s. In contrast, a high-end desktop platform with four channels of newer, more expensive DDR5-6000 RAM tops out at around 192 GB/s. The system with more channels, despite using older and slower RAM, has about 60% more bandwidth.
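To see how lopsided this comparison is, it helps to flip the formula around and ask how fast four-channel RAM would have to be to match the twelve-channel system. The sketch below does that; `required_speed_mts` is a hypothetical helper name, and all figures are theoretical peaks.

```python
def peak_gbs(speed_mts: float, channels: int) -> float:
    """Theoretical peak bandwidth in GB/s for 64-bit (8-byte) channels."""
    return speed_mts * channels * 8 / 1000


def required_speed_mts(target_gbs: float, channels: int) -> float:
    """Transfer rate (MT/s) a given channel count needs to reach a target bandwidth."""
    return target_gbs * 1000 / (channels * 8)


server = peak_gbs(3200, 12)   # twelve channels of DDR4-3200
desktop = peak_gbs(6000, 4)   # four channels of DDR5-6000
advantage_pct = (server / desktop - 1) * 100

print(f"{server:.1f} GB/s vs {desktop:.1f} GB/s ({advantage_pct:.0f}% more)")
print(f"Four channels would need about {required_speed_mts(server, 4):.0f} MT/s to match")
```

The second line works out to roughly 9,600 MT/s, which is well beyond the DDR5-6000 kit used in the comparison.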

The takeaway is clear: for serious CPU-based inference, the number of memory channels supported by your platform is one of the most powerful hardware choices you can make.

How does memory bandwidth affect tokens per second?

The difference in user experience between a low-bandwidth and high-bandwidth system is night and day. More bandwidth directly translates to more tokens per second (t/s), which determines how fast the model generates text.

A standard consumer CPU with dual-channel RAM, offering around 83 GB/s of bandwidth, might only produce about 5 tokens per second. This feels painfully slow and is barely usable for interactive chat. Stepping up to a workstation CPU with octa-channel RAM boosts bandwidth to over 200 GB/s, pushing performance into the 20-30 t/s range. This is a much more usable speed that begins to approach the feel of an entry-level GPU.

Once you move to dedicated GPUs, the performance leaps again. An entry-level GPU like an RTX 3060 provides 360 GB/s of bandwidth, delivering a fluid and conversational experience at around 35 t/s. At the top end, a powerhouse like the RTX 5090 with its 1,792 GB/s of bandwidth makes text generation feel near-instantaneous, reaching 80-100 t/s or more.
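A common rule of thumb makes the link concrete: if generating each token requires reading all of the model’s weights once, then bandwidth divided by model size gives a ceiling on tokens per second. The sketch below assumes a hypothetical 8 GB quantized model (a figure chosen for illustration, not taken from this article) and ignores compute time, context reads, and software overhead, so real-world speeds land below these numbers.

```python
def tokens_per_second_ceiling(bandwidth_gbs: float, gb_read_per_token: float) -> float:
    """Upper bound on generation speed when every token requires one full pass over the weights."""
    return bandwidth_gbs / gb_read_per_token


MODEL_SIZE_GB = 8.0  # hypothetical quantized model size, for illustration only

SYSTEMS = {
    "Dual-channel desktop": 83,
    "RTX 3060": 360,
    "RTX 5090": 1792,  # 512-bit bus at 28 Gbps per pin
}

for name, bandwidth in SYSTEMS.items():
    ceiling = tokens_per_second_ceiling(bandwidth, MODEL_SIZE_GB)
    print(f"{name:22s} {bandwidth:5d} GB/s -> at most {ceiling:6.1f} t/s")
```

The absolute numbers depend entirely on the model size you plug in. The useful part is the ratio: doubling bandwidth doubles the ceiling.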

What about unified memory architectures?

Unified memory architectures are a game-changer because they eliminate the traditional separation between system RAM and VRAM. Instead of the CPU and GPU having their own separate memory pools, they share a single, high-speed pool of memory. This tight integration creates a very wide highway for data, delivering impressive bandwidth without needing a separate, power-hungry graphics card.

Apple popularized this approach with its M-series chips. By placing high-bandwidth memory directly on the processor package, they created a system that functions like a very wide, multi-channel setup by default. A top-tier chip like the M3 Ultra, for instance, offers roughly 819 GB/s of memory bandwidth. This explains why modern Macs are so surprisingly competent at running large models: they deliver the memory bandwidth of a high-end workstation in a consumer package.

Now, this powerful technology is making a significant entry into the PC world with processors like AMD’s Ryzen AI Max+ 395 (Strix Halo). This chip uses a wide 256-bit memory bus with fast LPDDR5X memory to achieve a total bandwidth of 256 GB/s. While not as high as Apple’s top-end offering, 256 GB/s is a massive leap over the ~83 GB/s found in a typical dual-channel desktop. It puts these integrated systems squarely in the performance territory of quad-channel workstation platforms, making them a compelling option for running LLMs efficiently.

Table: Unified Memory Bandwidth in Modern SoCs from Apple and AMD

This table compares the memory bandwidth of modern System on a Chip (SoC) architectures, which utilize a unified memory approach. Unlike traditional PCs with separate RAM slots, these SoCs integrate LPDDR memory directly onto the chip package.

| Model | Memory Type / Speed | Memory Bus (bits) | Bandwidth (GB/s) |
| --- | --- | --- | --- |
| Apple M2 Max | LPDDR5-6400 | 512 | 409.6 |
| Apple M3 Ultra | LPDDR5-6400 | 1024 | 819.2 |
| Apple M4 Max | LPDDR5X-8533 | 512 | 546.1 |
| AMD Ryzen AI Max+ 395 | LPDDR5X-8000 | 256 | 256.0 |
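For these SoCs the memory is described by bus width (like a GPU) and transfer rate in MT/s (like system RAM), so the two earlier formulas merge into one: bus width in bytes times MT/s, divided by 1000. A minimal sketch using the bus widths and speeds listed in the table:

```python
# (model, memory speed in MT/s, bus width in bits), as listed in the table
SOCS = [
    ("Apple M2 Max", 6400, 512),
    ("Apple M3 Ultra", 6400, 1024),
    ("Apple M4 Max", 8533, 512),
    ("AMD Ryzen AI Max+ 395", 8000, 256),
]


def soc_bandwidth_gbs(speed_mts: int, bus_width_bits: int) -> float:
    """Theoretical peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    return speed_mts * bus_width_bits / 8 / 1000


results = {name: round(soc_bandwidth_gbs(mts, bits), 1) for name, mts, bits in SOCS}

for name, gbs in results.items():
    print(f"{name:24s} {gbs:7.1f} GB/s")
```

These are theoretical peaks; vendors often quote rounded figures.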

What should I look for in a GPU for LLM inference?

When comparing GPUs for LLM inference, your first priority is VRAM capacity—you need enough to fit your desired model. But if two cards have the same amount of VRAM, the next specification to check is “Memory Bandwidth,” measured in GB/s. The card with the higher bandwidth rating will almost always deliver more tokens per second and a smoother experience.
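That two-step rule (capacity first, bandwidth second) can be written as a small filter-then-rank sketch. The bandwidth figures are the ones quoted in this article; the VRAM capacities (12 GB for the RTX 3060, 24 GB for the RTX 3090 and RTX 4090) are added here as assumptions, and the check ignores the extra memory that context and runtime overhead need in practice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gpu:
    name: str
    vram_gb: int          # assumed capacities, not stated in the article
    bandwidth_gbs: float  # bandwidth figures quoted in the article


GPUS = [
    Gpu("RTX 3060", 12, 360),
    Gpu("RTX 3090", 24, 936),
    Gpu("RTX 4090", 24, 1008),
]


def pick_gpu(gpus: list[Gpu], model_size_gb: float) -> Optional[Gpu]:
    """Keep only cards whose VRAM fits the model, then prefer the highest bandwidth."""
    fits = [gpu for gpu in gpus if gpu.vram_gb >= model_size_gb]
    return max(fits, key=lambda gpu: gpu.bandwidth_gbs, default=None)


for size_gb in (10, 20, 40):
    choice = pick_gpu(GPUS, size_gb)
    print(f"{size_gb} GB model -> {choice.name if choice else 'no single card fits'}")
```

A real purchasing decision would also weigh price and power draw, which this sketch leaves out on purpose.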

How can I run 70B+ models at decent speed on a budget?

Running a 70B—or even a 200B+—model might sound like it requires a monster GPU, but there’s a smarter way: pair the right model architecture with carefully chosen hardware.

The key is Mixture-of-Experts (MoE) models. Unlike dense models that load all their parameters every step, MoE models only activate a few experts at once. That means far less data pulled from RAM. For example, a 120B model might sit at 65GB quantized, but only ~3B parameters fire per token. Qwen3-235B or GLM-4.5-Air work the same way: massive on paper, lightweight in practice.
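The arithmetic behind that claim is worth spelling out. Using the example above (a 120B model at 65 GB quantized with about 3B active parameters), the sketch below estimates how much data is read per token and what ceiling that puts on speed at 256 GB/s. It assumes every parameter is quantized to the same width and ignores shared layers, routing, and context reads, so treat the result as an optimistic upper bound rather than a prediction.

```python
def gb_read_per_token(model_size_gb: float, total_params_b: float, active_params_b: float) -> float:
    """Approximate weight data touched per token, scaling model size by the active fraction."""
    return model_size_gb * active_params_b / total_params_b


def ceiling_tps(bandwidth_gbs: float, gb_per_token: float) -> float:
    """Bandwidth-bound upper limit on tokens per second."""
    return bandwidth_gbs / gb_per_token


MODEL_SIZE_GB = 65.0   # quantized size from the example above
TOTAL_PARAMS_B = 120.0
ACTIVE_PARAMS_B = 3.0
BANDWIDTH_GBS = 256.0  # e.g. a 256-bit LPDDR5X-8000 system

dense_gb = MODEL_SIZE_GB  # a dense model of the same size reads everything
moe_gb = gb_read_per_token(MODEL_SIZE_GB, TOTAL_PARAMS_B, ACTIVE_PARAMS_B)

print(f"Dense: {dense_gb:.2f} GB/token -> at most {ceiling_tps(BANDWIDTH_GBS, dense_gb):.1f} t/s")
print(f"MoE:   {moe_gb:.3f} GB/token -> at most {ceiling_tps(BANDWIDTH_GBS, moe_gb):.1f} t/s")
```

Under these assumptions the MoE model reads about 1.6 GB per token instead of 65 GB, which is why the same machine goes from a few tokens per second to a comfortably interactive speed.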

Now comes the hardware trick. Used workstations with CPUs like Threadrippers or Xeons give you 8–12 memory channels. Fill them with second-hand DDR4 sticks, and suddenly you’ve got both the capacity to hold the model and the bandwidth to feed the active parameters fast. Token generation—the speed at which text rolls out—is largely bandwidth-bound, and these rigs excel there. The tradeoff? Prompt processing. The initial wait after you hit enter is slower, since CPUs just can’t match GPUs for raw parallel compute.

Budget-friendly hardware picks (model size – machine, memory, bandwidth, approximate speed – approximate price):

  • 70B – MacBook Pro M1 Max, 64 GB, 400 GB/s, ~33 t/s – $1200
  • 100B–120B – Ryzen AI MAX+, 96–128 GB, 256 GB/s, ~40 t/s – $1500–$2000
  • 235B – Mac Studio M1 Ultra, 128 GB, 800 GB/s, ~15 t/s – $2500

In short: with the right model (MoE) and the right memory-heavy machine, you can run massive models smoothly—without breaking the bank.

What hardware gives the fastest LLM performance?

For pure, uncompromising speed, nothing beats a high-end GPU with the most memory bandwidth you can afford. This means prioritizing cards like the NVIDIA RTX 3090 (936 GB/s), the RTX 4090 (1,008 GB/s), or the RTX 5090 discussed earlier. Their large VRAM capacity combined with extreme bandwidth delivers the fastest possible token generation for local LLMs.

Conclusion

If there’s one thing to remember, it’s this: for LLM inference, bandwidth is the king of speed. When you’re choosing a GPU, look beyond the core count and prioritize the card with the highest GB/s rating. When you’re considering a CPU-based system, forget about clock speed and focus on the number of memory channels your platform supports. Understanding this fundamental principle puts you in control, allowing you to build a system that doesn’t just run LLMs, but makes them fly.
