
As LLMs have grown in popularity, the ability to run them locally has also become sought after. It's not always easy, though: the raw power required to run many language models isn't something you can typically find in an ordinary laptop, and you'd probably think of discrete GPUs first.

But companies like Qualcomm have been investing heavily in the NPU, starting with the Snapdragon X Elite chipset, and recently, Nexa AI rolled out an update for its SDK allowing various AI models and LLMs to run locally on the Qualcomm NPU. I decided to take them for a spin, and they frankly perform quite well. Relying on LLMs for any critical information is still a bad idea, but let's take a closer look.

NPUs are finally getting put to use

Models are starting to support them

The big news with this Nexa SDK update is the fact that NPUs in processors are finally starting to be used for the AI tasks most people think of these days. Whether it's LLMs or other AI tools, ever since NPUs were introduced, the vast majority of them have run on the CPU or, in some cases, the GPU, which can handle very intense AI workloads but isn't always very efficient at doing so.

With the Nexa SDK, we're finally starting to see some models specifically designed to target the NPU. That includes Nexa's own OmniNeural-4B model, which is the only multimodal offering that can run locally so far, but also dedicated versions of various other LLMs adapted to run locally on the NPU, including Llama-3B, Microsoft's Phi4-mini, and Alibaba's Qwen3-4B, which is even available as a thinking model with deep reasoning capabilities.

The NPU-based models are still somewhat limited, though. Of the options available in Nexa's model hub, the largest offerings top out at around 4 billion parameters, but you can definitely get more powerful models if you have more capable hardware, specifically a large pool of RAM. As a concrete example, the Qwen3 model designed for the NPU hits that 4 billion mark, but a quick look at HuggingFace reveals a version with a whopping 235 billion parameters, as well as a smaller 30-billion-parameter version, both designed to run on GPUs with a lot more memory available. But we can hope that more powerful models will become available as NPUs also progress, especially now that we've seen the Snapdragon X2 Elite can nearly double the performance of the current generation.
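To put those parameter counts in perspective, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need. The bit widths (4-bit and 16-bit) are assumptions chosen for illustration, not the formats Nexa actually ships, and the estimate ignores the context cache and runtime overhead, so real requirements are higher.

```python
def weights_gib(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


# Parameter counts from the article; bit widths are illustrative assumptions.
for name, size in [("4B", 4), ("30B", 30), ("235B", 235)]:
    low = weights_gib(size, 4)    # assumed 4-bit quantization
    high = weights_gib(size, 16)  # assumed 16-bit weights
    print(f"{name}: about {low:.1f} GiB at 4-bit, {high:.1f} GiB at 16-bit")
```

Even under the generous 4-bit assumption, the 235-billion-parameter model needs over 100 GiB for weights alone, which is why it stays out of reach for a laptop while a 4-billion-parameter model fits comfortably.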

Using OmniNeural-4B

A multimodal model that runs locally

OmniNeural-4B is arguably the most capable model designed to run on Qualcomm's Hexagon NPU, considering it's multimodal and supports text, image, and audio input.

I asked it a few random questions, and it generated responses fairly quickly, even more so than what I typically expect from the free versions of LLMs on the internet. You can clearly see a major spike in the NPU usage whenever a query is running, showing that it is, in fact, making full use of the power of this NPU, which will be even more interesting to analyze once the Snapdragon X2 Elite hits the scene with an NPU that's nearly twice as capable.

It can even interpret images decently well. It correctly identified the game Kirby Super Star from a screenshot, though when I used a screenshot of Sonic the Hedgehog 2, it was identified as the original Sonic the Hedgehog.

Audio recognition also works quite well, and again, you can see the NPU work when you ask the model to process an audio file, though the fan on my laptop didn't even spin up. I recorded some of my thoughts on Windows updates compared to macOS updates, and the model was able to repeat my opinion back at me in a more succinct way, which was nice to see.

Other models are supported

There's even a thinking model

While Nexa develops the OmniNeural model specifically for Hexagon NPUs, a few other models from different companies have also been adapted by Nexa AI to run on that NPU, so you can use other popular models like Llama-3B from Meta or Microsoft's Phi4-mini.

Both of these are text-only models, though they at least seem to be more up to date and helpful in terms of the information provided. For example, I asked both OmniNeural-4B and Llama-3B who is the president of the United States and how old they are. Both claimed Joe Biden is still president, but while OmniNeural-4B said he was 77, Llama-3B said he was 81 and specified that the information dated back to late 2023. It also provided me with a birth date I could use to calculate the current age myself.
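As a quick illustration of why the birth date is more useful than a stale age, here is a minimal sketch of the calculation. It assumes the model returned Joe Biden's publicly known birth date of November 20, 1942; the article doesn't quote the exact output, and the helper name `age_on` is made up for this example.

```python
from datetime import date


def age_on(birth: date, on: date) -> int:
    """Whole years elapsed between a birth date and a given day."""
    years = on.year - birth.year
    # Subtract one if the birthday hasn't happened yet in that year.
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years


birth = date(1942, 11, 20)  # assumed to match what the model reported

print(age_on(birth, date(2023, 12, 31)))  # age at the model's stated cutoff
print(age_on(birth, date.today()))        # age today
```

With a late-2023 date, this gives 81, which lines up with Llama-3B's answer for its stated cutoff.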

Another model I found interesting is Nvidia's Parakeet, which is mostly used for speech recognition, meaning it can transcribe audio files. It also runs on the Snapdragon NPU, and it works decently well. I fed it the same recording about Windows updates, and it mostly matched everything I said. There were some errors, but my pronunciation is also far from perfect, and you can easily glean what's being spoken.

Even some advanced reasoning models are available to run locally, specifically Qwen3-4B-Thinking, originally developed by Alibaba and adapted by Nexa to run on the Hexagon NPU. This one is particularly interesting to observe in real-time, as it pushes the NPU extensively. For close to a minute, I was seeing constant NPU usage above 95% in the Windows Task Manager as I saw a response being generated before my very eyes. It didn't feel like it was being slowed down or anything, either, and while I could hear the fan if I listened more closely, it didn't feel like the laptop was getting too hot, which is a testament to the efficiency of the NPU.

Local LLMs are interesting

I've been fairly clear in expressing disinterest in AI-adjacent features, but if you're interested in LLMs at all, running them locally on an NPU seems like one of the best ways to go about it. The models in the Nexa SDK run fairly quickly, and they're extremely efficient, seeing as my laptop stays quiet while running them. With one of the main criticisms against AI being its massive energy use, these power-efficient local implementations are a good way to combat that. Personally, I'm curious to see what variants of these models might be unlocked with the next generation of NPUs, considering Qualcomm is promising 80 TOPS of performance in the Snapdragon X2 Elite.

URL: https://www.xda-developers.com/these-llms-run-locally-snapdragon-x-elite-npu-surprisingly-good/