For decades, upgrading a PC was straightforward: if something felt slow, a faster CPU was usually the solution. In most cases it worked, since most of those earlier tasks were CPU-intensive. As neural processing units (NPUs) gradually became standard consumer hardware, I assumed AI performance would scale the way traditional computing always has. Buying a new “AI-ready” processor would solve all the issues, and my PC would be smarter.
What became clear very quickly is that this assumption doesn’t really hold. In practice, AI workloads are rarely limited by the processor; more often, they are bottlenecked by factors such as memory bandwidth.
The assumption I started with
AI PCs sounded like a simple upgrade
CES concluded this year with a strong, unmissable message: AI PCs are here. New platforms arrived with impressive on-paper specifications, including massive AI performance and dedicated NPUs. I have seen silicon like the Snapdragon X2 Elite Extreme with top-tier on-paper figures, such as Geekbench scores of 4000+ single-core and 23000 multi-core, along with an NPU rated at 80 trillion operations per second (TOPS).
NPUs have quickly become a basic requirement for a modern PC, and the “Copilot+” stickers are everywhere, much like the old "Intel Inside" logo once was. The assumption that followed was natural, and I shared it too. Upgrading a PC in the AI era should work the same way it always has. If I want a faster PC with better AI performance, I would just buy a processor with more TOPS.
But in real-world usage, the opposite is true, and the logic starts to fall apart. AI performance doesn’t scale the way CPU performance once did, and upgrading the processor alone won't deliver a noticeably better experience.
Why AI marketing and real-world usage don’t line up
AI performance stalls long before compute runs out
We have all seen the marketing campaigns from various OEMs for their AI-ready PCs. The most highlighted AI features include live transcription, background noise removal, and small local language models running on the device.
On paper, these features are exactly the kind of tasks that would benefit from a faster processor and dedicated AI hardware. The problem appears when we run them alongside our normal activities, like keeping multiple browser tabs open with media playing in parallel. In those situations, AI features don’t feel as instant as the marketing suggests.
For example, Whisper from OpenAI is one of the most commonly used transcription tools today. It can run real-time transcription on a mid-range CPU without any trouble. Upgrading to a faster processor doesn’t meaningfully reduce the end-to-end latency or improve the transcription, because I/O and decoding dominate the later stages of the process.
What matters more than raw compute power are disk speed, audio preprocessing, and how the workload is split across threads. Once real-time transcription is achieved, additional CPU performance goes unused.
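To make that concrete, here is a rough sketch of how little a faster CPU changes the latency of one live-transcription chunk. The latency model and every number in it are assumptions chosen for illustration, not measurements of Whisper: only the compute portion shrinks with a faster processor, while buffering and fixed overhead stay the same.

```python
def end_to_end_latency(chunk_s, compute_s, overhead_s, cpu_speedup=1.0):
    """Latency of one live-transcription chunk, in seconds.

    chunk_s:     audio that must be buffered before processing can start
    compute_s:   model compute time on the baseline CPU
    overhead_s:  I/O, audio preprocessing, and other fixed costs
    cpu_speedup: how much faster the new CPU is (only compute scales)
    """
    return chunk_s + overhead_s + compute_s / cpu_speedup


def real_time_factor(chunk_s, compute_s, overhead_s, cpu_speedup=1.0):
    """Processing time divided by audio duration; below 1.0 keeps up with live audio."""
    return (overhead_s + compute_s / cpu_speedup) / chunk_s


# Hypothetical numbers: 5 s chunks, 1 s of compute, 0.5 s of fixed overhead.
baseline = end_to_end_latency(chunk_s=5.0, compute_s=1.0, overhead_s=0.5)
upgraded = end_to_end_latency(chunk_s=5.0, compute_s=1.0, overhead_s=0.5, cpu_speedup=2.0)
print(f"baseline: {baseline:.2f} s, 2x faster CPU: {upgraded:.2f} s")
```

With these made-up figures, the baseline already runs well under a real-time factor of 1.0, and doubling CPU speed trims the chunk latency from 6.5 to 6.0 seconds.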
AI inference is a memory bandwidth problem
Why moving data matters more than crunching it
A similar pattern appears when running a local language model. In this case, performance is less about processor speed and more about how quickly the system can move data. When running a local large language model (LLM) like Llama or Mistral, the token generation speed is often limited by memory bandwidth and cache behavior, not clock speed.
Traditional software is often compute-bound. A faster CPU can hold a piece of data in its high-speed cache and perform repeated operations on it efficiently. AI inference works very differently: to generate a single token, it has to read essentially the model's whole library of billions of parameters from memory. That data can't realistically fit into cache.
As a result, the CPU or NPU spends only a fraction of its time computing and the rest waiting for the RAM to deliver the next set of data to work on. This is why memory becomes the hidden limiter for AI performance.
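A back-of-the-envelope way to see this: if every generated token has to stream all the weights through memory once, then memory bandwidth divided by model size puts a ceiling on tokens per second. The sketch below assumes an 8B-parameter model at roughly 4.25 bits per weight and uses round, assumed bandwidth figures; real throughput will be lower, and the function names are my own.

```python
def model_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def max_tokens_per_second(bandwidth_gb_s, weights_gb):
    """Upper bound on decode speed if each token streams all weights once."""
    return bandwidth_gb_s / weights_gb


# Assumed model: 8B parameters at ~4.25 bits per weight (about 4.25 GB).
weights = model_size_gb(params_billion=8, bits_per_weight=4.25)

# Assumed, rounded bandwidth figures for illustration only.
for label, bandwidth in [("system RAM at ~90 GB/s", 90), ("GPU VRAM at ~500 GB/s", 500)]:
    ceiling = max_tokens_per_second(bandwidth, weights)
    print(f"{label}: at most ~{ceiling:.0f} tokens/s")
```

Notice that processor speed doesn't appear anywhere in the estimate. Under these assumptions, the ceiling moves only when the bandwidth or the model size does.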
Where my PC fits into this picture
Even high-end systems hit the same wall
I see this behavior clearly on my own PC. For context, my PC runs a Ryzen 7 7700X, an RTX 4070 Ti OC, and 32GB of DDR5 memory. By no means is it a low-end system. For most everyday tasks, it feels extremely fast, whether it is gaming, multitasking, or content work. Local AI tools work well, but performance doesn’t scale once AI workloads become more demanding.
I notice this most when working with a local LLM as the context window grows. Responses are accurate, but they are often slow and sometimes incomplete. When I work on a project, I usually brainstorm ideas with a local LLM before moving to a cloud-based provider. For that, I often use Jan, running the Llama-3.1-8B-Instruct-IQ4-XS model.
With a simple prompt, the model responds very quickly, with no noticeable delay. This is the experience OEMs market as “on-device AI.” Things change when the prompt is more complex. If I ask the model to reason through a multistep problem, like designing a document compression pipeline, the response is accurate, but it takes longer, and the tokens per second degrade.
It becomes even more obvious when I follow up with a more constraint-heavy revision. Response times increase further. GPU utilization stays relatively flat, but memory pressure increases, both in VRAM for the model weights and in system RAM if the model or context spills over. In my case, running an 8B model with a growing context window eventually exceeds what fits comfortably in my GPU's 12GB of VRAM, forcing partial offloading to system memory, which is significantly slower.
Importantly, this isn't a lack of processing power. There's still plenty of headroom available. The model lags because it has to repeatedly move more data through memory as the context grows. This is when data movement becomes the limiting factor, and memory access turns into the bottleneck.
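To get a feel for how quickly a growing context eats memory, here is a rough estimate of the key-value (KV) cache a transformer keeps for every token in the context. The defaults (32 layers, 8 KV heads, head dimension 128) are what I understand Llama 3.1 8B's published configuration to be, and the 16-bit cache values are an assumption; runtimes may quantize the cache or allocate it differently, so treat this as a sketch rather than what Jan actually does.

```python
def kv_cache_bytes(context_tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Memory for keys and values across all layers at a given context length."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return context_tokens * per_token


GIB = 1024 ** 3
for tokens in (2_048, 32_768, 131_072):
    print(f"{tokens:>7} tokens -> {kv_cache_bytes(tokens) / GIB:.2f} GiB of KV cache")
```

Under these assumptions, each token of context costs 128 KiB, so a 32K-token conversation needs about 4 GiB on top of the model weights, and a full 128K context needs about 16 GiB, more than a 12GB card holds before the weights are even counted.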
The AI PC checklist I care about now
What I’d prioritize before upgrading today
As AI workloads have become more common, and as I’ve seen how local AI behaves in practice, the way I evaluate a PC upgrade has changed entirely. Raw processor speed still matters, but it is no longer the deciding factor for AI responsiveness. The first thing I’d look for is a dual-channel memory configuration, since memory bandwidth matters more than clock speed for these workloads.
In high-end laptops, soldered LPDDR5X, often running at 7500 MT/s or higher, provides the kind of bandwidth that NPUs rely on.
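For a sense of what these configurations mean in numbers, theoretical peak bandwidth is simply the transfer rate multiplied by the bus width in bytes. The sketch below is a minimal illustration; the bus widths and the DDR5-5600 speed are assumptions for comparison, actual laptops vary, and sustained bandwidth is always lower than the theoretical peak.

```python
def peak_bandwidth_gb_s(transfer_rate_mt_s, bus_width_bits):
    """Theoretical peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    return transfer_rate_mt_s * 1e6 * (bus_width_bits / 8) / 1e9


# Assumed example configurations: (transfer rate in MT/s, total bus width in bits).
configs = {
    "DDR5-5600, single channel (64-bit)": (5600, 64),
    "DDR5-5600, dual channel (128-bit)": (5600, 128),
    "LPDDR5X-7500, assumed 128-bit bus": (7500, 128),
}

for name, (rate, width) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(rate, width):.1f} GB/s")
```

Going from one memory channel to two doubles the peak figure without touching the processor at all, which is why the memory configuration sits at the top of my list.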
This doesn’t mean we should skip the NPU entirely. NPUs are designed for sustained inference workloads, lower power consumption, and predictable latency under load. They do not replace CPUs or GPUs; they complement them. Finally, I would pay more attention to a well-balanced PC than to headline specs.
The real takeaway — balance beats benchmarks
The AI PC era exposes weaknesses in system design faster than any workload before it. A faster processor alone doesn’t guarantee a better and more responsive AI experience. What matters now is balance. A well-balanced machine with sufficient memory bandwidth can outperform a faster but poorly configured one, not because it computes more, but because it moves data efficiently.
