How Memory Chips Determine GPU Memory Bandwidth for Local LLM Inference
By Allan Witt | Updated: February 26, 2026
👁 gddr6 memory chip with solder balls supplying bits to the inference engine for with high bandwidth
If you are running quantized LLMs locally, especially 4-bit models, memory bandwidth usually matters more than raw CUDA core count. Once the model fits in VRAM, inference speed is largely determined by how fast the GPU can stream weights from VRAM into the tensor cores.
For 7B models this is less obvious. For 34B, 70B, the bandwidth becomes one of the main bottlenecks.
This article explains how memory chips determine GPU bandwidth, how per-pin speed interacts with bus width, and what this means for cards like the NVIDIA GeForce RTX 3090, GeForce RTX 5090, and NVIDIA RTX PRO 6000 Blackwell.
The Core Formula: Per-Pin Speed and Bus Width
Every modern discrete GPU follows the same bandwidth formula.
Theoretical memory bandwidth is calculated as:
(Per-pin speed in Gbps × Bus width in bits) / 8 = Bandwidth in GB/s
Memory speed is specified in gigabits per second. We divide by 8 to convert from gigabits to gigabytes.
For example, if memory runs at 21 Gbps on a 384-bit bus:
21 × 384 = 8,064 gigabits per second
8,064 / 8 = 1,008 GB/s
That is how GPUs reach the 1 TB/s class.
Two variables control everything: how fast each pin runs, and how many data lanes exist between the GPU die and the memory chips.
What Bus Width Really Means
Bus width is the total number of parallel data lines between the GPU and its VRAM.
Most modern GDDR chips expose a 32-bit data interface. That means each chip contributes 32 data lanes to the total bus.
RTX 3090 Founders Edition PCB showing 12× 2GB GDDR6X memory chips arranged around the GPU die, forming a 384-bit memory bus and 24GB total VRAM.
If a GPU has a 256-bit memory bus:
256 / 32 = 8 memory chips
If it has a 512-bit bus:
512 / 32 = 16 memory chips
All of these chips operate in parallel. Increasing bus width increases parallelism. Increasing per-pin speed increases how fast each lane transfers data.
It is important to understand that not every solder ball on a memory package is a data pin. A GDDR chip may have well over 100 connections, but only around 32 are data. For inference workloads, the data pins determine usable bandwidth.
GDDR6, GDDR6X, and GDDR7: Why Generation Matters
Memory generation mainly affects per-pin speed.
Typical ranges look like this:
GDDR6 commonly operates between 14 and 20 Gbps per pin.
GDDR6X pushes roughly 19 to 24 Gbps using PAM4 signaling.
GDDR7 starts around 28 Gbps and is moving toward 32 Gbps and beyond.
Recent high-bin GDDR7 parts are rated up to 36 Gbps per pin. New 24 Gb density chips equal 3 GB per module. These higher density modules allow more VRAM without increasing the number of memory channels.
For local LLM users, the implications are direct. Higher per-pin speed increases bandwidth without widening the bus. Higher density increases total VRAM capacity without redesigning the entire memory subsystem.
Bandwidth scales with speed and bus width. Capacity scales with chip density and number of chips.
Real GPU Examples and What the Numbers Mean
RTX 3090: 936 GB/s Class Bandwidth
The NVIDIA GeForce RTX 3090 uses 19.5 Gbps GDDR6X on a 384-bit bus.
19.5 × 384 = 7,488
7,488 / 8 = 936 GB/s
With 24 GB of VRAM and nearly 1 TB/s of bandwidth, it remains strong value on the used market for 13B to 70B q4 models.
In practice, 13B q4 is often compute-bound. Around 30B and above, workloads become more bandwidth-sensitive. The 936 GB/s keeps tensor cores fed more consistently than narrower consumer cards.
RTX 4090: Breaking 1 TB/s
The NVIDIA GeForce RTX 4090 runs 21 Gbps GDDR6X on a 384-bit bus.
21 × 384 = 8,064
8,064 / 8 = 1,008 GB/s
Same 24 GB capacity as the 3090, but higher bandwidth and architectural improvements. For dual-GPU 70B splits, two 4090s scale better than older cards because each GPU has around 1 TB/s of local memory bandwidth.
RTX 5060 Ti 16GB: Fast Memory, Narrow Bus
The NVIDIA GeForce RTX 5060 Ti 16GB uses 28 Gbps GDDR7 on a 128-bit bus.
28 × 128 = 3,584
3,584 / 8 = 448 GB/s
Even with very fast GDDR7, the 128-bit bus caps bandwidth under 500 GB/s.
For 7B or 13B q4 models, this is usually fine. For 34B and larger, especially at higher context lengths, it becomes a limiting factor. The 16 GB VRAM is attractive for budget builds, but tokens per second will lag behind 3090 or 4090 in memory-bound scenarios.
RTX 5090: 512-bit Bus Scaling
The NVIDIA GeForce RTX 5090 uses 28 Gbps GDDR7 on a 512-bit bus.
28 × 512 = 14,336
14,336 / 8 = 1,792 GB/s
That is roughly 1.8 TB/s.
For large quantized models, this significantly reduces memory starvation. Even if you do not use the full VRAM capacity, the bandwidth headroom improves sustained token throughput and multi-GPU tensor parallel scaling.
Clamshell Mode and 96GB: RTX PRO 6000 Blackwell
The NVIDIA RTX PRO 6000 Blackwell reaches 96 GB of VRAM using 32 memory chips while maintaining a 512-bit bus.
A 512-bit bus with 32-bit-wide chips implies:
512 / 32 = 16 memory channels
Instead of using 16 chips, this design uses two chips per channel.
16 channels × 2 chips = 32 chips
This is known as clamshell mode. Two chips share one 32-bit channel, but only one is active at a time. Capacity doubles. Bandwidth does not.
Each chip has 24 Gb density:
24 Gb / 8 = 3 GB per chip
32 × 3 GB = 96 GB total
Bandwidth is still determined only by bus width and per-pin speed. Even with 32 chips physically present, effective bandwidth remains:
(512 × memory speed) / 8
Capacity scales independently from bandwidth in this design.
For local LLM users, this distinction matters. You can increase VRAM size without increasing tokens per second unless bus width or per-pin speed also increases.
Why Bandwidth Directly Impacts Tokens Per Second
During inference, especially with transformer models, the GPU repeatedly streams weights from VRAM into tensor cores.
A 70B model at 4-bit quantization is roughly 35 GB of weights. Even with caching and layer-wise reuse, each token generation requires reading large portions of those weights.
If bandwidth is insufficient, compute units idle while waiting for data. Increasing TFLOPS does not help if memory cannot supply data fast enough.
In practical terms for local LLM users, the hierarchy usually looks like this:
- First constraint is VRAM capacity.
- Second constraint is memory bandwidth.
- Third is raw compute throughput.
This is why an older high-bandwidth card can outperform a newer mid-range card with higher advertised AI TOPS but a narrow memory bus.
What Faster GDDR7 Means for Future Builds
If future GPUs maintain current bus widths but move to 32 to 36 Gbps memory, bandwidth increases linearly.
For a 512-bit GPU at 36 Gbps:
36 × 512 = 18,432
18,432 / 8 = 2,304 GB/s
That equals 2.3 TB/s without widening the bus.
For a 256-bit GPU at 36 Gbps:
36 × 256 = 9,216
9,216 / 8 = 1,152 GB/s
That would exceed current 4090-class bandwidth on a narrower design.
VRAM Density and 3GB Chips
With 3 GB per GDDR7 chip:
8 chips = 24 GB
12 chips = 36 GB
16 chips = 48 GB
This creates a realistic path to 36 GB and 48 GB consumer GPUs without exotic layouts. For single-GPU 70B q4 inference, 48 GB becomes a practical threshold where many dual-GPU splits are no longer necessary.
Practical Buying Guidance for Local LLM Users
If you mainly run 7B to 13B models, bandwidth beyond roughly 400 GB/s matters less than VRAM size and total system cost.
For 34B to 70B q4 models, aim for at least 24 GB VRAM and bandwidth in the 700 GB/s or higher range. Used 3090 cards remain competitive because 936 GB/s still holds up for memory-bound inference.
If you are planning for larger models or future mixture-of-experts architectures, GPUs in the 1.5 TB/s and above class will age better.
When evaluating any GPU for local LLM workloads, always compute:
Per-pin speed × bus width / 8
That number often predicts real-world token throughput more accurately than marketing metrics. For inference, memory chips and bus design matter as much as the GPU die itself.
