Voozh

Intel just launched the Arc Pro B70, and on paper, it looks like a local AI dream. You get 32GB of GDDR6 memory, 256 Xe Matrix Extensions engines, and PCIe 5.0 support, all for $949. That's the same VRAM capacity as an RTX 5090, albeit GDDR6 instead of GDDR7, at less than half the price. If you're someone who runs local LLMs and has been watching Nvidia's pricing climb year after year, that probably sounds almost too good to be true.

Unfortunately, it kind of is... at least for now. The hardware is real, and there are ways to make it work for local AI, but the experience getting there makes it harder than it should be. I've tested local LLMs on an Intel Arc A770, and it doesn't help that Intel hasn't really worked out its AI strategy yet when it comes to local tooling. Even its AI Super Builder, available on some Intel laptops, felt like a half-baked experience. The Arc Pro B70 should be a good card for local AI, but Intel's current software stack will make you work for it.

The hardware makes a strong case on paper

32GB of VRAM for under a grand is hard to argue with

Credit: Intel

Before we get into the details, there's one thing it's important to be clear about: the Arc Pro B70's raw specs are hard to fault. At $949, you're getting 32GB of GDDR6 on a 256-bit bus with 608 GB/s of memory bandwidth. Intel is positioning this against Nvidia's RTX Pro 4000 Blackwell, which costs around $1,800-$2,000 and only offers 24GB. The B70 gives you 33% more VRAM at roughly half the price.

For local LLM inference, VRAM is arguably the most important spec. It determines how large a model you can load entirely into GPU memory, and running a model from VRAM is dramatically faster than spilling over into system RAM. With 32GB, you can run heavily quantized 70B parameter models, or fit something like Qwen3.5-27B at a decent quantization. 32GB of VRAM is a lot, and it's a step above what most other high-end GPUs on the market can do right now.

On top of that, Intel also claims up to 2.2x larger context windows compared to the competition, and up to 6.2x faster responses in multi-user workloads. Those numbers come from Intel's own llm-scaler benchmarks running vLLM, and they look great if you're deploying in that specific environment. But that's a big "if."

The reality is that there's more to local AI than the hardware... and in Intel's case, the software is particularly underdeveloped.

Intel's AI software stack is a mess for regular users

Enterprise tools don't help hobbyists

I want to be fair to Intel here, because the narrative that "nothing works on Intel GPUs" isn't quite as accurate as it once was. Some of the foundational pieces are actually in good shape these days.

Intel's XPU backend has been upstreamed into mainline PyTorch, meaning any Arc GPU can run PyTorch workloads natively without a separate extension. vLLM also has an official XPU backend that supports Intel Arc GPUs, and the performance numbers can be genuinely good. In my research, I found a benchmark that showed dual Arc B580s hitting 83.5 tokens per second on a 20B model through vLLM and XPU, compared to just 15 tokens per second through llama.cpp on the same hardware using Vulkan. When I used ipex-llm on my Arc A770 to run local models through a custom Docker setup, the performance was pretty decent too, and that was before many optimizations that Intel has made.

But, and this is a big but, getting vLLM running on Intel hardware is an ordeal. You need a specific Python version (3.12, nothing else), a specific Ubuntu version (24.04.3, according to testers), careful library path management, and a wrapper script to handle multi-GPU setups. As the person who ran those benchmarks put it: "Installing for Intel XPU backend is really hard in my opinion. I just don't think we are there yet." The performance is there once you clear the setup hurdles. The problem is that the hurdles are high enough to turn away most people who aren't already comfortable debugging library path conflicts.

Then there's the community side. A project called OpenArc has quietly become one of the better ways to run local AI on Intel hardware. It's an inference engine built on OpenVINO that serves LLMs, vision models, Whisper, and TTS over OpenAI-compatible endpoints, and it supports Intel GPUs, NPUs, and CPUs. It sidesteps the whole SYCL mess by using Intel's own OpenVINO runtime, which is arguably the one part of Intel's AI stack that actually works well. OpenArc supports multi-GPU setups, speculative decoding, and runs models like Qwen3 and Gemma-3 out of the box. It's the closest thing to a clean "install and go" experience that exists for Intel GPUs right now, and it's being built by the community, not Intel.

Intel pulled the rug out

Don't get used to Intel's software

What really holds Intel back isn't that the software doesn't work, but the company itself can't seem to pick a path and actively commit to it.

For example, when I tested the Arc A770 for local LLMs, I used ipex-llm through a custom Docker setup. It worked, and the performance was decent too, netting approximately 23 tokens per second on Qwen3's 14B parameter model. Having said that, getting there was painful. You needed specific Docker images, particular environment variables, and more patience than most people are willing to give it. Especially when compared to the setup process on Nvidia or even on AMD.

To make matters worse, since then, Intel has entirely archived the ipex-llm repository as of January 2026, citing "known security issues." That repository is now read-only. Intel's own suggested path forward is llm-scaler, a vLLM-based solution that runs through Docker containers, but even that only recently added support for the Arc Pro B70, and previous consumer GPUs that worked with ipex-llm aren't supported at all. When users asked Intel directly whether llm-scaler replaces ipex-llm for consumer GPUs like the A770 or B580, the answer was essentially "not yet." If you're a hobbyist with a single GPU who just wants to run Ollama, Intel doesn't have an official answer for you. You can certainly try running llm-scaler with one of those older GPUs, but there's no guarantee that it will work.

Unfortunately, it's hard to trust Intel here. It's not quite a pattern yet, but it's a frustrating experience for end-users when Intel builds a tool, gets people invested in it, and then archives or pivots away from it before the ecosystem has a chance to mature. It makes it hard to trust that whatever Intel recommends today will still be supported in a year.

Funnily enough, mere hours ago (at the time of writing), Intel archived the Intel Extension for PyTorch GitHub repository, tacking on the same security notice that they also attached to ipex-llm. The XPU backend is in PyTorch already, and the extension isn't strictly needed anymore, but it's a weird look that two of its AI-related repositories were archived for security reasons in the span of just a few months.

The Ollama problem

The tool most people actually want doesn't work

Credit: Ollama

For all the progress with PyTorch and vLLM, there's one glaring gap that matters more than anything else for most local LLM users: Ollama still doesn't have proper native Intel Arc GPU support.

Ollama is, for better or worse, how a lot of people interact with local LLMs for the first time. It's the tool that makes running models as simple as ollama run llama3. On Nvidia, it just works. On AMD with ROCm, it just works. On Intel, it doesn't.

There have been multiple pull requests over the past two years attempting to add SYCL backend support to Ollama, and while some early work was merged back in 2024, it was incomplete. The most active PR has been open since June 2025 with reports of models still falling back to CPU despite the SYCL backend loading. Ollama does have experimental Vulkan support now, which can technically work with Arc GPUs on Windows and Linux, but it's not the polished experience you'd get with CUDA or even ROCm, and as we've shown, performance is significantly worse when comparing XPU with Vulkan. You can also use Intel's custom Docker image with a compiled version of Ollama, but that's yet another layer of friction.

The llama.cpp situation is similar. The SYCL backend exists and technically works, but performance benchmarks have shown it hitting roughly a third of the theoretical memory bandwidth in some configurations. For comparison, Nvidia GPUs typically achieve 90% or higher in the same workloads. Flash attention support was added to the SYCL backend in March 2026, which is a step forward, but it's still buggy on certain Intel architectures and produces corrupted output in some cases. Recent optimizations have improved MUL_MAT performance depending on the GPU, so progress is being made, but the moat between Nvidia and Intel still exists, and there hasn't been much done to narrow the gap.

If you're willing to use vLLM or OpenArc and handle the setup complexity, Intel's hardware can deliver real performance for the money. But if you want the experience that most people expect from local AI in 2026, where you install a tool and point it at your GPU, Intel isn't there yet. AMD's ROCm, which was the punchline of GPU software stacks for years, now has native Ollama support that just works. You install Ollama, it detects your AMD GPU, and you're running models. Intel can't match that simplicity, and until it can, most people are going to have a bad time.

Worth watching, not worth buying yet

Unless you're willing to do the work

I want the Arc Pro B70 to be good for local AI. 32GB of VRAM at $949 undercuts Nvidia by a wide margin, and the silicon itself can calculate tokens quickly when the software cooperates. Projects like OpenArc and the vLLM XPU backend prove that this hardware can deliver. And Intel deserves credit for the PyTorch work, as that's real foundational progress that will hopefully pay off way down the line.

But Intel keeps getting in its own way. It archives the tools people rely on, limits the replacements to specific hardware, and leaves the community to fill the gaps. The fragmentation of software is the real problem: ipex-llm (archived), llm-scaler (limited GPU support), SYCL in llama.cpp (improving but slow), experimental Vulkan in Ollama, OpenArc (community-built on OpenVINO), and vLLM XPU (works but painful to set up). There's no single, clean path from "I bought an Intel GPU" to "I'm running a local LLM" like there is on Nvidia... or even AMD, at this point.

If you're technical enough to set up vLLM or OpenArc and you want 32GB of VRAM without paying Nvidia prices, the Arc Pro B70 might actually be worth considering. But if you want the easy path, where Ollama detects your GPU and you're running models in five minutes, Nvidia's CUDA ecosystem is still the only one that truly delivers that. From my experience with the 7900 XTX, AMD's ROCm is actually pretty close behind Nvidia from a plug-and-play perspective.

But Intel? It's not even in the picture yet, and it's through no fault of the hardware. Intel just won't get out of its own way.

URL: https://www.xda-developers.com/intel-gpu-32gb-vram-local-ai-software-nvidia-keeps-winning/

⇱ Intel's $949 GPU has 32GB of VRAM for local AI, but the software is why Nvidia keeps winning

The hardware makes a strong case on paper

32GB of VRAM for under a grand is hard to argue with

Intel's AI software stack is a mess for regular users

Enterprise tools don't help hobbyists

Intel pulled the rug out

Don't get used to Intel's software

The Ollama problem

The tool most people actually want doesn't work

Worth watching, not worth buying yet

Unless you're willing to do the work

URL: https://www.xda-developers.com/intel-gpu-32gb-vram-local-ai-software-nvidia-keeps-winning/

⇱ Intel's $949 GPU has 32GB of VRAM for local AI, but the software is why Nvidia keeps winning

The hardware makes a strong case on paper

32GB of VRAM for under a grand is hard to argue with

Intel's AI software stack is a mess for regular users

Enterprise tools don't help hobbyists

Intel pulled the rug out

Don't get used to Intel's software

The Ollama problem

The tool most people actually want doesn't work

Subscribe to the newsletter for hands‑on local AI GPU insights

Worth watching, not worth buying yet

Unless you're willing to do the work