There's a lot of development going on in the AI space right now, but much of it is happening in ways that typical consumers may not care about. Arm's work stands out as some of the most interesting that consumers should care about, because it's being designed to safeguard privacy and to be genuinely useful.

We sat down with Ronan Naughton, Director of Product Management at Arm and responsible for the development of AI workloads on CPU, to discuss some of the work that Arm has been doing over the last couple of years.

3 It's cross-platform

All thanks to Kleidi

Arm first unveiled its Kleidi AI framework earlier this year alongside its Compute SubSystems (CSS) package. Arm wants to bring AI to the widest possible market, and succeeding in that goal means being able to bring AI to as many devices as possible. That's also why Arm wants to run AI workloads on CPUs, rather than on the GPUs or NPUs that the rest of the industry is targeting.

At first glance, it may sound strange, but it makes a lot of sense. Arm wants developers to be able to "develop once, test once, deploy everywhere," and running on a CPU enables that. Otherwise, developers need to account for the differences between NPUs, each of which is designed differently, and between GPUs, which differ from one another too.

Naughton also told me that Kleidi lets developers deploy on Windows, Android, Linux, anywhere really, as the libraries themselves are mostly written in assembly, with simple C/C++ calls to use them. That means that Kleidi will run pretty much anywhere, and it also means that there is very little overhead. In that sense, it's a bit like how LM Studio enables you to download and run a model basically everywhere.

2 It's focused on practical use cases

Features you might actually use

Arm showed off three demos, and all three of them are things that consumers might actually care about. The first, a chatbot that runs locally, is the least useful in my opinion. We've all seen that before, and MLC Chat is a quick and easy way to get an LLM deployed locally on your smartphone. In the video, Arm demonstrated how Llama2-7B could be deployed on an existing Android phone using three Arm Cortex-A700 series CPU cores, at a token generation rate of 9.6 tokens per second. Smaller models are even faster.

However, things get more interesting very quickly. If you're not interested in local chatbots, OpenAI's Whisper model can also be deployed locally to transcribe voice messages that the user receives. The chatbot can then summarize the transcription into bullet points so that if you can't listen to a voice message, you can see what was said and still be able to reply. LLMs hallucinate, but the transcription is readily available, so you can sanity-check anything in the summary against the transcription if you need to.

Finally, Arm also demonstrated a feature that could summarize a group chat. Have you ever been added to a group chat, and only seen it a few hours later after a few hundred messages were sent? The idea is that you could have the LLM sum up everything that happened so that you get all of the main points. From there, you have a quick idea of what the group chat is about, and you can ask the other participants if the information is correct.

A chatbot and on-device assistant becomes far more powerful when paired with a voice assistant. "I mean, you know, speech is 10, 12 tokens per second, right? So if it's a voice assistant, you know, you don't need to get 60 or 70 tokens per second, which means that a voice assistant is very, very portable to many, many tiers of devices," Naughton tells me.

That focus on practicality extends to how Arm thinks about the deployment of these models on those many tiers of devices. "So you might have an inference on device where it's taking static data or if you want real-time data. So let's say, what's the capital of France? It's Paris. Can you name a landmark there? Oh, it's the Eiffel Tower, right? That's all happening on device. What's the temperature there right now? That's real-time data," Naughton says. A harmony of cloud-based models and on-device models is the future of AI and is already happening with the likes of Copilot+, Samsung AI, and Apple Intelligence.

By Adam Conway

1 Good performance and ease of use are paramount

Developers don't need to compromise

Naughton explained to me that one of the biggest problems when it comes to power draw from an LLM on-device isn't actually the computation, but rather the DRAM power or bandwidth. "The limit is not necessarily the compute power. It's the DRAM power or DRAM bandwidth," he tells me, which is also why these models are able to run on older devices and not just the very best of the best. Yes, there will be a limit to what those devices are capable of, but that limit is found in DRAM sooner than it's found in CPU power.

"A [7B large language model] takes up 3.5 gigs of RAM, and pulling all that data from DRAM for an inference, for an encode, can take a lot of energy. So that's why we see the trend towards more three billion and two billion and even one billion parameter models. As models get smaller and as the accuracy continues to improve, the energy consumption is going to come down because you're pulling less data from DRAM," he points out. In other words, performance isn't really an issue as models improve.
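To put rough numbers on that, here is a minimal back-of-the-envelope sketch. It assumes the 3.5GB figure corresponds to weights stored at 4 bits each (7 billion parameters at half a byte apiece), which the quote doesn't state outright, and it assumes that generating each token means streaming every weight from DRAM once. The function names and the 50GB/s bandwidth value are hypothetical and chosen purely for illustration; they are not figures from Arm.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def bandwidth_bound_tokens_per_s(footprint_gb: float, dram_gb_per_s: float) -> float:
    """Rough ceiling on generation speed if each token reads all weights from DRAM once."""
    return dram_gb_per_s / footprint_gb


if __name__ == "__main__":
    ASSUMED_BITS_PER_WEIGHT = 4      # assumption: 4-bit quantized weights
    ASSUMED_DRAM_GB_PER_S = 50.0     # hypothetical bandwidth, not a figure from the article

    for billions in (7, 3, 2, 1):
        size_gb = weight_footprint_gb(billions * 1e9, ASSUMED_BITS_PER_WEIGHT)
        ceiling = bandwidth_bound_tokens_per_s(size_gb, ASSUMED_DRAM_GB_PER_S)
        print(f"{billions}B parameters: {size_gb:.1f} GB of weights, "
              f"at most {ceiling:.1f} tokens/s at the assumed bandwidth")
```

Under these assumptions, halving the parameter count halves the data pulled from DRAM per token, which is the trend Naughton describes. Real devices will land below this ceiling, since the estimate ignores compute, caches, and everything else competing for memory.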

Developers will also find it easy to make use of Kleidi, which means they may be more likely to implement models on-device. "We have made [Kleidi AI] so lightweight and well-coded, well-documented, good integration and guides. And so that can plug into any framework and be transparent to the app developer. The app developer doesn't need to know about it."

Finally, Naughton also tells me that Arm is building Kleidi AI for classic ML, not just generative AI and LLMs, so even if you aren't interested in generative AI, there are still benefits that you'll see down the road. Arm is also already working with Google to bring Kleidi to devices, with Kleidi now being supported by Google's XNNPACK library. XNNPACK is a library that accelerates inference operations on Arm CPUs, and it also supports WebAssembly and x86.
