Google Says the Quiet Part Out Loud: LLM Inference Is Starved by Memory
Google recently published a hardware-focused paper that says the quiet part out loud: modern LLM inference is bottlenecked by memory bandwidth and memory latency, not compute. This is not news to anyone running models locally, but the paper matters because it confirms the point at datacenter scale and explains why GPUs keep getting faster while real-world token generation barely improves.
The paper focuses almost entirely on the decode phase of inference. Prefill behaves like training and scales well with compute. Decode is autoregressive, one token at a time, and each step pulls the weights and the KV cache from memory again. That makes decode fundamentally memory-bound. More FLOPS do very little if memory cannot keep up.
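A rough way to see why decode is memory-bound is to compute a bandwidth-only ceiling on tokens per second. The sketch below is not from the paper. It assumes a single sequence, a dense model, and that every decode step streams all weights and the full KV cache from memory once. The function name and all numbers are illustrative.

```python
def decode_tokens_per_second_ceiling(weight_bytes, kv_cache_bytes, bandwidth_bytes_per_s):
    """Bandwidth-only ceiling on decode speed for a single sequence.

    Assumes every decode step streams all weights and the whole KV cache
    from memory exactly once, and that compute is never the limit.
    """
    bytes_per_token = weight_bytes + kv_cache_bytes
    return bandwidth_bytes_per_s / bytes_per_token


GB = 1e9

# Illustrative numbers, not from the paper:
# 7e9 parameters at 2 bytes each, a 1 GB KV cache, 1000 GB/s of bandwidth.
weights = 7e9 * 2
kv_cache = 1 * GB
bandwidth = 1000 * GB

ceiling = decode_tokens_per_second_ceiling(weights, kv_cache, bandwidth)
print(f"Upper bound: {ceiling:.1f} tokens/s")

# Compute does not appear in the formula at all, so extra FLOPS cannot
# raise this ceiling. Doubling bandwidth doubles it.
faster = decode_tokens_per_second_ceiling(weights, kv_cache, 2 * bandwidth)
print(f"With 2x bandwidth: {faster:.1f} tokens/s")
```

Real systems land below this ceiling, and batching changes the picture because the weights are read once for many sequences, but the estimate is usually a reasonable first guess for local single-user decode.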
Google’s core point is simple: current GPUs and TPUs were never designed specifically for LLM inference. They inherited training-first designs with massive compute units paired to increasingly expensive High Bandwidth Memory. Compute performance has grown roughly 80x over the last decade, while memory bandwidth has grown closer to 17x. That gap is still widening.
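To put the two growth figures side by side, here is a small back-of-the-envelope calculation. It assumes "the last decade" means exactly 10 years and that growth was smooth, which is a simplification. Only the 80x and 17x figures come from the text above.

```python
compute_growth = 80.0    # figure quoted above
bandwidth_growth = 17.0  # figure quoted above
years = 10               # assumption: "the last decade" taken as 10 years

# How much more compute there is per unit of bandwidth than a decade ago.
gap = compute_growth / bandwidth_growth

# Equivalent compound annual growth factors under the smooth-growth assumption.
compute_per_year = compute_growth ** (1 / years)
bandwidth_per_year = bandwidth_growth ** (1 / years)

print(f"Compute-to-bandwidth gap: {gap:.1f}x")
print(f"Compute growth per year: {compute_per_year:.2f}x")
print(f"Bandwidth growth per year: {bandwidth_per_year:.2f}x")
```

Under these assumptions, each byte per second of bandwidth now has to feed roughly 4.7 times as much compute as it did ten years ago.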
What New Information Does Google Add?
The novel contribution is not identifying the bottleneck. Everyone already knows that. The value is in how clearly Google frames memory cost, memory scaling, and system-level inefficiency as the long-term blocker.
The paper highlights something many enthusiasts feel indirectly but rarely quantify: HBM is getting more expensive per GB and per GB/s over time, while standard DDR keeps getting cheaper (or at least it was until two months ago). This is backwards for inference workloads that want capacity and predictable bandwidth more than peak throughput. The implication is that future GPUs will be even worse value for inference unless the architecture changes.
Google also points out that SRAM-only designs failed. Cerebras and Groq tried avoiding external memory altogether, but LLMs outgrew on-chip SRAM quickly. External memory is unavoidable.
Do They Propose Actual Solutions?
Yes, but they are architectural directions, not shipping products.
The first proposal is High Bandwidth Flash. This is essentially flash memory stacked like HBM, trading write endurance and latency for massive capacity. Google suggests using it for frozen weights and slow-changing context, not KV cache. The key claim is 10x memory capacity per node, which would drastically reduce the number of accelerators needed for large models.
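The effect of a 10x capacity claim on accelerator count can be sketched with a capacity-only calculation. This is a simplification: it treats one node as one accelerator, ignores bandwidth, KV cache, and replication, and uses a hypothetical model footprint and a hypothetical per-accelerator HBM capacity. Only the 10x factor comes from the text above.

```python
import math


def accelerators_needed(model_bytes, capacity_bytes_per_accelerator):
    """Minimum accelerators required just to hold the weights (capacity only)."""
    return math.ceil(model_bytes / capacity_bytes_per_accelerator)


GB = 10**9

# Hypothetical values for illustration only.
model_footprint = 2000 * GB          # frozen weights of a large model
hbm_capacity = 192 * GB              # memory per accelerator today
flash_capacity = 10 * hbm_capacity   # the 10x capacity claim applied

with_hbm = accelerators_needed(model_footprint, hbm_capacity)
with_flash = accelerators_needed(model_footprint, flash_capacity)

print(f"HBM only: {with_hbm} accelerators")
print(f"With 10x capacity: {with_flash} accelerators")
```

Whether the smaller configuration is fast enough is a separate question, since this only counts what is needed to fit the weights.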
The second idea is Processing Near Memory, not classic Processing In Memory. Google is very explicit here. Putting logic inside DRAM dies causes power, thermal, and software problems. Instead, they argue for separate logic dies placed physically close to memory. The claimed benefits are larger shards, simpler software partitioning, and better performance per watt.
The third direction is 3D memory-logic stacking. This shortens the data path between memory and compute using vertical connections instead of board-level traces. The bandwidth gains are real, and the power savings matter for inference, which has low arithmetic intensity anyway.
The fourth area is low-latency interconnects. Inference increasingly spans multiple chips, and decode uses small, frequent messages. Latency matters more than raw bandwidth. Google argues that current interconnects are optimized for training, not inference.
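Why latency beats bandwidth for small messages follows from the usual first-order transfer model, time = latency + size / bandwidth. The sketch below compares two hypothetical links. The link parameters and message sizes are made up for illustration and do not describe any real interconnect.

```python
def transfer_time_s(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of sending one message: fixed latency plus wire time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Two hypothetical links.
high_bandwidth_link = {"latency_s": 5e-6, "bandwidth_bytes_per_s": 100e9}
low_latency_link = {"latency_s": 1e-6, "bandwidth_bytes_per_s": 50e9}

small_message = 4 * 1024   # decode-style message, a few KB
large_message = 10**9      # training-style bulk transfer, 1 GB

small_on_fat = transfer_time_s(small_message, **high_bandwidth_link)
small_on_fast = transfer_time_s(small_message, **low_latency_link)
large_on_fat = transfer_time_s(large_message, **high_bandwidth_link)
large_on_fast = transfer_time_s(large_message, **low_latency_link)

print(f"4 KB: high-bandwidth link {small_on_fat * 1e6:.2f} us, low-latency link {small_on_fast * 1e6:.2f} us")
print(f"1 GB: high-bandwidth link {large_on_fat * 1e3:.2f} ms, low-latency link {large_on_fast * 1e3:.2f} ms")
```

For the small message, the fixed latency dominates and the link with half the bandwidth still wins. For the bulk transfer, the ranking flips.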
Is This Relevant to Local LLM Users?
Directly, no. Indirectly, very much yes.
This paper is written for datacenters. It assumes custom ASICs, advanced packaging, and budgets that do not exist at home. There is no consumer hardware roadmap hidden inside it.
But the conclusions strongly validate local inference reality. Memory bandwidth dominates tokens per second. Capacity dictates whether a model runs at all. Multi-GPU systems exist because memory is fragmented and expensive. Decode speed barely improves generation to generation because memory scaling is slow.
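The "capacity dictates whether a model runs at all" point can be checked with a quick fit estimate. The sketch below uses the common KV cache size formula for a transformer with grouped key/value heads. It ignores activations and runtime overhead, so treat the result as optimistic. The model shapes, quantization level, and VRAM size are hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_tokens, bytes_per_value=2, batch=1):
    """KV cache size: one key and one value vector per layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_value * batch


def weight_bytes(parameters, bits_per_weight):
    """Approximate weight footprint for a given quantization level."""
    return parameters * bits_per_weight / 8


def fits_in_memory(parameters, bits_per_weight, kv_bytes, vram_bytes):
    """Optimistic check: ignores activations and framework overhead."""
    return weight_bytes(parameters, bits_per_weight) + kv_bytes <= vram_bytes


# Hypothetical model shape and a hypothetical 24 GB card.
kv = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_tokens=8192)
vram = 24e9

small_fits = fits_in_memory(8e9, 4, kv, vram)    # 8B parameters at 4 bits
large_fits = fits_in_memory(70e9, 4, kv, vram)   # 70B parameters at 4 bits

print(f"KV cache: {kv / 2**30:.2f} GiB")
print(f"8B @ 4-bit fits: {small_fits}")
print(f"70B @ 4-bit fits: {large_fits}")
```

When the check fails, the options are the familiar ones: heavier quantization, a shorter context, offloading, or splitting the model across more than one GPU.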
The paper also quietly undermines the idea that bigger GPUs automatically mean better inference. If memory cost and bandwidth do not scale, performance per dollar will get worse, not better. That aligns with why older cards with wide memory buses (think RTX 3090) often age surprisingly well for local LLMs.
What This Means Going Forward
Google is effectively saying that inference-friendly hardware does not exist yet. The industry optimized for training, and inference inherited the leftovers. Fixing this requires rethinking memory first, compute second.
For local users, the takeaway is not to wait for Google’s solutions. It is to keep optimizing around memory. Wider buses, more VRAM per dollar, multi-GPU setups, lower clocks with stable bandwidth, and realistic expectations for decode speed all remain the winning strategy.
The market reality looks different, however, at least as of January 2026. GPUs are getting more expensive, not cheaper, and the ongoing memory supply pressure is not easing the problem. NVIDIA is clearly prioritizing its enterprise and datacenter business, while consumer releases continue to offer little improvement in usable memory capacity or bandwidth for inference.
At the same time, some of the last consumer budget GPUs that actually made sense for local LLM workloads, such as the RTX 5060 Ti 16GB and the RTX 5070 Ti class of cards, are quietly being phased out or replaced with configurations that offer less value for memory-bound inference. The result is a widening gap between what local LLM users need and what the mainstream GPU market is willing to ship.
