Voozh

If you'd asked me a couple of years ago which machine I'd want for running large language models locally, I'd have pointed straight at an Nvidia-based dual-GPU beast with plenty of RAM, storage, and processing power. That was the safe answer, of course. CUDA has been around for a long time with a thriving ecosystem, and much of AI has been built off the back of it. Paired with a ton of VRAM (such as the RTX 3090, which has 24GB of VRAM), it would have been the best option most consumers could even fathom.

However, it appears that the tides are beginning to change. Not only has Google's Gemini 3 model been trained on the company's own TPUs, but I've been using a MacBook Pro with Apple's M4 Pro chip for almost a year, and in that time, I've come to find the company's MLX framework a powerful tool for local large language model deployment. Plus, as someone who uses an AMD Radeon RX 7900 XTX in my home server machine and has watched AMD's HIP (the company's answer to CUDA) mature, it appears that, industry-wide, Nvidia's dominance is slowly coming into question. Google's TPU advancements show that Nvidia isn't needed in the pipeline, and Google is now lending that hardware out to Anthropic, too.

In the consumer space, Apple has quietly built a vertically integrated stack for local AI, with hardware, memory architecture, and a first-party framework all giving it a very real, very practical advantage for on-device LLMs, while making it clear that you don't need Nvidia for high parameter models and fast inference. This is further underpinned by a study conducted by India-based Persistent Systems, which states the following:

Although Apple Silicon inference frameworks still trail NVIDIA GPU-based systems such as vLLM in absolute performance, they are rapidly maturing into viable, production-grade solutions for private, on-device LLM inference.

The most interesting part is that this advantage doesn't come from the marketing bullet points Apple usually pushes, such as the company's "Neural Engine." Instead, it comes from a much less flashy combination of unified memory, a slim machine learning runtime called MLX, and some surprisingly capable tooling around it. Even better, MLX is open-source and integrated in LM Studio, so you don't need to be an engineer to make use of it.

Are Nvidia's cards still the best for local inference? Yes, by far. But for normal people, Apple's MLX might be the most accessible, and it's no good being the best if a person can achieve nearly the same experience with a laptop that costs the same price as one GPU.

What is MLX? How does it differ from GGUF?

MLX is a framework, GGUF is a model format

Apple describes MLX as "an array framework for machine learning on Apple silicon," with a NumPy-like API and a simple, research-friendly design. That makes it sound like just another PyTorch competitor, but that framing undersells what's going on. Under the hood, MLX makes use of the following unique properties of Apple Silicon:

It uses the unified memory architecture of Apple Silicon
It uses Metal as the execution backend and is tuned for M-series CPU, GPU, and (on newer chips) neural acceleration paths
It keeps the Python surface area small and pushes heavy computation onto a C++ core with graph optimizations, kernel fusion, and automatic differentiation

As a result, you get a fairly familiar array API with functions that are intuitive to use for developers who have used NumPy. Not only that, but you also get a higher-level neural network and optimizer modules, with minimal overhead compared to bigger, more general-purpose frameworks.

Apple's own documentation pitches MLX as being built "by machine learning researchers for machine learning researchers," with an emphasis on being easy to extend and experiment with. That's why, even though you can absolutely train and deploy models with it, the early momentum around MLX has largely focused on running and tweaking existing LLMs locally.

Apple's value pitch when it comes to MLX boils down to a declaration of superiority over the community-driven GGUF. The company is basically saying that if you want peak performance on Mac, you should use their format, their runtime, and their execution engine. GGUF, meanwhile, isn't a framework, but a universal, portable model format. It's hardware-agnostic, community-standardized, and constantly updated to support new model architectures.

In other words, GGUF is just the model, and you can execute a GGUF-packed model in a variety of ways. However, MLX-type models require the entire MLX-based runtime to actually execute them. MLX isn't just limited to text generation either, and can be used for Whisper, Stable Diffusion, and so much more.

Unified memory is Apple's real advantage

VRAM is the same as RAM

Source: Unsplash

If you've tried to run LLMs locally on a traditional gaming laptop, or even just a PC with a weaker GPU, then you'll likely already know the limitations of running a local model. Both VRAM capacity and VRAM bandwidth are incredibly important, but even an 8B or 14B model with decent precision requires more VRAM than a lot of GPUs actually have. On top of that, once you add your key-value cache for conversation context, you don't have a whole lot of room left to work with.

Apple Silicon flips that model on its head, and as do some mini PCs that have arrived in recent years off the back of Apple's dominance in this area. M-series chips use a unified memory architecture, where there aren't two separate pools of memory. This means you don't have a "system RAM" pool which is separated from a "GPU VRAM" pool. Instead, the CPU, GPU, and dedicated accelerators all share the same pool of high-bandwidth memory, connected over a wide on-package fabric. There are advantages and disadvantages to this approach, but when it comes to local LLM deployment, it's basically all advantages, and MLX uses the architecture to its fullest.

With MLX, there are two big architectural advantages that it especially makes use of. First, arrays live in unified memory and can be executed on CPU or GPU without explicit "copy to device" or "copy back" calls. As well, you don't have to juggle separate CPU tensors and GPU tensors or worry about accidentally blowing up VRAM, as you're only constrained by total system memory rather than the limited VRAM capacity of your GPU.

For example, a 24GB RAM MacBook Pro is fairly standard (Apple's RAM pricing aside, which doesn't seem as extreme these days), but that RAM is closely comparable to a GPU's 24GB VRAM. Imagine if your PC's pool of RAM could also serve as high-bandwidth memory for an LLM? That's what Apple has basically achieved with its unified memory, and MLX uses it to the fullest. These same properties are also what enables you to deploy huge models on Apple Silicon, like DeepSeek's big 671B model on the 512GB RAM Mac Studio.

MLX LM is Apple's bridge to the LLM world

It utilizes the hardware in the right way

On its own, MLX is just an array library. The interesting bit that we care about is MLX LM, a Python package that sits on top of MLX and turns it into a very approachable LLM runtime. It integrates with Hugging Face so that you can download and run models with simple commands, while also supporting built-in quantization and fine-tuning. And that includes LoRA and QLoRA style low-rank adaptation, too.

MLX has garnered quite a bit of support from Hugging Face, LM Studio, and other developers looking to integrate their software with the framework. You can filter for MLX-compatible models on Hugging Face, browse the official mlx-community hub for models tuned for Apple Silicon, and run them from within LM Studio.

In other words, you're not stuck in some Apple-sanctioned ecosystem when using MLX. Most popular models can be found in MLX format, including the likes of Qwen, Mistral, any of the various Llama derivatives, and even GPT-OSS, and you can run them locally on an Apple Silicon machine. Of course, you need to make sure the model fits in memory, but so long as it does, you can benefit from the performance gains of MLX.

Credit: Source: Apple

Apple made several improvements to MLX starting with the M5, and there are some rather big leaps made from a hardware point of view that will improve things significantly. The company focused pretty heavily on GPU neural accelerators with dedicated matrix mulitplcation units, which MLX can target. Apple, somewhat silently, published a document comparing local LLM performance between a 24GB M4 MacBook Pro to a similarly configured MacBook Pro running an M5 chip instead. There are some notable improvements:

Time-to-first-token (TTFT) improved by at least three times across the board, comparing Qwen 1.7B, 8B, 14B, GPT-OSS 20B, and a Qwen 30B MoE when run via MLX LM
Subsequent token generation speeds improved by between 19% to 27%, which maps very closely to the M5's roughly 28% higher memory bandwidth (153GB/s vs 120GB/s)

The split makes it pretty clear where the bottlenecks are, and what MLX improved. The first token is very often compute-bound, and the neural accelerators offer a significant speed-up. The rest is fundamentally memory bandwidth related, and that's why it scales linearly with an increase in memory bandwidth. More memory bandwidth means more calculations, and a local LLM will gladly use those resources if and when they're available.

As I already mentioned, though, I'm on an M4 Pro-based MacBook, so those M5-specific performance uplifts don't apply to me. Even still, all of this matters because it signals Apple's direction of travel. MLX is being actively tuned to take advantage of new hardware features as they appear, future chips are likely to keep adding more specialized compute blocks for exactly this type of workload, and because MLX is Apple's own framework, it'll be first in line to benefit from those changes.

While Nvidia may be the top dog in the AI space, companies like Google and Apple are starting to make moves in the hardware side of things, but with unique approaches that don't involve just throwing more compute at the problem. I don't think Apple will come anywhere close to that level of dominance or competitiveness in terms of enterprise-targeted AI hardware, but the point remains that the advancements in this area serve to break down the monopoly Nvidia holds over the AI industry. What Apple has shown is that you don't get the kind of tight, one-vendor optimization loop that it benefits from with a generic CUDA stack that has to support almost a decade of cards and configurations. And similar can be said of Google's approach to training models, too.

CUDA is still the king of raw throughput, especially on desktop. The ecosystem is deeper and has been around for a lot longer, and developers looking to train models or fine-tune models will look to high-end Nvidia cards, not a Mac Mini or a MacBook. But that's exactly why I think the company has a sleeper advantage. Apple isn't trying to replace your A100; instead, it's making a strong argument that a thin, quiet laptop can be good enough for a surprising range of local LLM workloads, like coding assistants, document analysis, offline chatbots, and lightweight fine-tuning, with almost zero configuration.

And all of that is more exciting than yet another $2,000 GPU that pulls 700W under load.