New Intel B70 GPU for local LLM: first benchmarks and RTX 3090 comparison
Intel is entering the local LLM space more seriously with the Arc B70, a 32 GB VRAM GPU aimed directly at inference workloads. The card is expected to release on April 2, with preorders already appearing on Newegg around the $949 mark.
For local LLM users, this is one of the first sub-$1000 options with enough VRAM to comfortably run 27B models beyond aggressive Q4 quantization. The question is not just capacity, but whether the performance and software stack are good enough to compete with used NVIDIA hardware.
Specs that matter for LLM workloads
The Arc B70 is clearly designed around VRAM capacity rather than raw bandwidth.
It ships with 32 GB of GDDR6 on a 256-bit bus, delivering 608 GB/s memory bandwidth. TDP is rated at 230 W, and the card uses a dual-slot blower design, which makes it easier to stack in multi-GPU builds.
From a local inference perspective, the 32 GB VRAM is the main selling point. It allows running Qwen 3.5 27B at Q6 or even Q8, something a 24 GB card like the RTX 3090 cannot do without offloading.
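To make the capacity argument concrete, here is a rough sketch of how weight size scales with quantization. The bits-per-weight figures (8.5 for a Q8_0-style format, 4.5 for a Q4_0-style format) and the 2 GB allowance for KV cache and runtime buffers are assumptions for illustration, not measurements from this article; real GGUF files vary with the exact quantization mix and context length.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 weights * bits / 8 bits-per-byte / 1e9 bytes-per-GB
    return params_billion * bits_per_weight / 8.0


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, overhead_gb: float = 2.0) -> bool:
    """True if weights plus an ASSUMED fixed overhead fit in VRAM.

    overhead_gb is a placeholder for KV cache and runtime buffers; the
    real value depends on context length and backend.
    """
    needed = weight_footprint_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= vram_gb


if __name__ == "__main__":
    # Assumed bits-per-weight values, for illustration only.
    for label, bpw in [("Q4 (4.5 bpw)", 4.5), ("Q8 (8.5 bpw)", 8.5)]:
        size = weight_footprint_gb(27, bpw)
        print(f"27B at {label}: ~{size:.1f} GB weights, "
              f"fits 24 GB: {fits_in_vram(27, bpw, 24)}, "
              f"fits 32 GB: {fits_in_vram(27, bpw, 32)}")
```

Under these assumptions a 27B model at 8.5 bits per weight needs roughly 28.7 GB for weights alone, which is over the 3090's 24 GB but inside the B70's 32 GB.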
Level1Techs benchmarks: single request vs concurrent load
Level1Techs tested the B70 using vLLM with Qwen 3.5 27B across four GPUs. While their main focus was a 50 concurrent request scenario, they also provide insight into single-request behavior through latency and throughput numbers.
In the 50 request test with only 1024 tokens context, the system achieved about 369 tokens/sec output throughput, peaking at 550 tokens/sec. Time to first token averaged around 11.4 seconds.
For comparison, a 4x RTX 3090 setup under the same conditions delivered about 348 tokens/sec output throughput, with a significantly worse time to first token at roughly 18.7 seconds.
This suggests that in a server-style workload, the four-GPU B70 setup is competitive with four RTX 3090s: output throughput is about 6% higher (369 vs 348 tokens/sec) and average time to first token is roughly 39% lower (11.4 vs 18.7 seconds). However, this test uses short context and heavy batching, which does not reflect typical single-user local inference.
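The arithmetic behind the comparison can be sketched as follows. The throughput and latency figures are the ones quoted above; the per-request and per-GPU averages assume the load is spread evenly across all 50 requests and all four cards, which is a simplification.

```python
def relative_change_pct(new: float, baseline: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Figures quoted in the article for the 50-request, 1024-token-context test.
B70_TPS, B70_TTFT_S = 369.0, 11.4
RTX3090_TPS, RTX3090_TTFT_S = 348.0, 18.7
CONCURRENT_REQUESTS = 50
NUM_GPUS = 4

throughput_gain_pct = relative_change_pct(B70_TPS, RTX3090_TPS)
ttft_change_pct = relative_change_pct(B70_TTFT_S, RTX3090_TTFT_S)

# Simplifying assumption: throughput is shared evenly.
b70_tps_per_request = B70_TPS / CONCURRENT_REQUESTS
b70_tps_per_gpu = B70_TPS / NUM_GPUS

if __name__ == "__main__":
    print(f"Throughput, B70 vs 3090: {throughput_gain_pct:+.1f}%")
    print(f"Time to first token, B70 vs 3090: {ttft_change_pct:+.1f}%")
    print(f"B70 average per request: {b70_tps_per_request:.2f} tokens/sec")
    print(f"B70 average per GPU: {b70_tps_per_gpu:.2f} tokens/sec")
```

The per-request average of about 7.4 tokens/sec shows why aggregate server throughput says little about how fast a single chat session feels.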
Single user performance: where things get clearer
For a single user or single request, memory bandwidth becomes the limiting factor. At 608 GB/s, the B70 sits far below the RTX 3090, which offers around 936 GB/s.
That gap shows up clearly in llama.cpp style workloads.
From our internal benchmarks, an RTX 3090 running Qwen 3.5 27B at 4k context delivers about 1104 tokens/sec prefill and 33 tokens/sec generation using Q4 GGUF.
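To translate those rates into wall-clock time, a simple two-phase model is usually enough: the prompt is processed at the prefill rate, then the answer is generated at the decode rate. The sketch below assumes "4k context" means a 4096-token prompt and picks a 512-token reply purely for illustration; it also ignores model load time and sampling overhead.

```python
def response_time_s(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Estimated seconds to process the prompt and generate the reply."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s + decode_s


# RTX 3090 rates quoted in the article (Qwen 3.5 27B, Q4 GGUF).
RTX3090_PREFILL_TPS = 1104.0
RTX3090_DECODE_TPS = 33.0

# Assumed workload: 4096-token prompt, 512-token reply.
total_s = response_time_s(4096, 512, RTX3090_PREFILL_TPS, RTX3090_DECODE_TPS)

if __name__ == "__main__":
    print(f"Estimated end-to-end time: {total_s:.1f} s")
```

With these numbers prefill takes under 4 seconds and generation about 15.5 seconds, so for a single user the decode rate dominates the experience.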
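The B70, in contrast, can fit much larger quantizations, including near-lossless Q8 variants around 30 GB. The tradeoff is lower token throughput. The B70 has about 65% of the 3090's memory bandwidth (608 vs 936 GB/s), so at the same quantization single-stream generation should land below 3090 performance, and early results point the same way. Moving to a larger quantization widens the gap further, because more bytes have to be read for every generated token.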
This aligns with early impressions that the B70 is slightly weaker per GPU than a 3090, but compensates with VRAM headroom.
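The bandwidth argument can be sketched as a simple ceiling: if every generated token requires reading all weights from VRAM once, tokens/sec cannot exceed bandwidth divided by weight size. This is a rough upper bound under that assumption, not a measurement, and the weight sizes below (about 15.2 GB for a Q4-style file, about 28.7 GB for a Q8-style file) are illustrative estimates rather than figures from the benchmarks.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream tokens/sec if decode is purely
    memory-bandwidth bound (all weights read once per token)."""
    return bandwidth_gb_s / weights_gb


B70_BANDWIDTH_GB_S = 608.0      # from the article
RTX3090_BANDWIDTH_GB_S = 936.0  # from the article

# ASSUMED weight sizes for a 27B model, for illustration only.
Q4_WEIGHTS_GB = 15.1875   # 27B at 4.5 bits per weight
Q8_WEIGHTS_GB = 28.6875   # 27B at 8.5 bits per weight

bandwidth_ratio = B70_BANDWIDTH_GB_S / RTX3090_BANDWIDTH_GB_S

if __name__ == "__main__":
    print(f"B70 / 3090 bandwidth ratio: {bandwidth_ratio:.2f}")
    print(f"3090 ceiling at Q4: "
          f"{decode_ceiling_tps(RTX3090_BANDWIDTH_GB_S, Q4_WEIGHTS_GB):.1f} tokens/sec")
    print(f"B70 ceiling at Q4: "
          f"{decode_ceiling_tps(B70_BANDWIDTH_GB_S, Q4_WEIGHTS_GB):.1f} tokens/sec")
    print(f"B70 ceiling at Q8: "
          f"{decode_ceiling_tps(B70_BANDWIDTH_GB_S, Q8_WEIGHTS_GB):.1f} tokens/sec")
```

Real throughput sits well under these ceilings (the 3090's measured 33 tokens/sec is below its theoretical Q4 bound), but the ratios show the direction: less bandwidth and a larger quantization both push single-stream speed down.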
Practical comparison: B70 vs RTX 3090
For local LLM users, the decision mostly comes down to model size.
If your workload fits inside 24 GB, the RTX 3090 still offers better raw performance, higher bandwidth, and a mature CUDA stack. It remains one of the best performance-per-dollar options on the used market.
If you need more than 24 GB VRAM, the B70 becomes interesting. It allows running higher quality quantizations without splitting across GPUs, which simplifies setup and improves stability.
However, the tradeoff is software maturity. The Intel stack relies on SYCL, oneAPI, or Vulkan backends. While vLLM support is improving, real-world usability still depends heavily on drivers and tooling, which remain less polished than CUDA.
Software and ecosystem reality
The biggest concern is not hardware. It is the software stack.
There are still gaps in multi-GPU support, documentation, and ease of setup. Getting consistent performance often requires manual tuning or alternative backends like Vulkan in llama.cpp. Even then, performance can vary significantly depending on configuration.
There is progress. Upstream vLLM support exists, and Vulkan backends are improving. But this is not yet a plug-and-play experience.
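Because results vary with backend and configuration, it helps to run the same benchmark against each build you try. The sketch below only assembles a command line for llama.cpp's `llama-bench` tool; the model path is a hypothetical placeholder, the token counts are arbitrary choices, and it assumes `llama-bench` is on your PATH. In llama.cpp the backend (Vulkan, SYCL, CUDA) is chosen when the binaries are built, so the same command can be pointed at different builds via the `binary` argument.

```python
import shlex


def llama_bench_command(model_path: str, prompt_tokens: int = 4096,
                        gen_tokens: int = 128, gpu_layers: int = 99,
                        binary: str = "llama-bench") -> list[str]:
    """Build an argument list for llama.cpp's llama-bench.

    -m: model file, -p: prompt tokens to process, -n: tokens to generate,
    -ngl: layers to offload to the GPU, -o: output format.
    """
    return [
        binary,
        "-m", model_path,
        "-p", str(prompt_tokens),
        "-n", str(gen_tokens),
        "-ngl", str(gpu_layers),
        "-o", "json",
    ]


# Hypothetical model path, for illustration only.
cmd = llama_bench_command("models/example-27b-q4.gguf")

if __name__ == "__main__":
    # Print the command; run it yourself or pass `cmd` to subprocess.run.
    print(shlex.join(cmd))
```

Keeping the prompt and generation lengths fixed across runs makes it easier to tell whether a driver update or backend switch actually changed performance.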
Final take for local LLM builders
The Arc B70 is a VRAM-first GPU. It does not beat the RTX 3090 in raw speed, but it opens up higher quality inference for models in the 27B to 34B range without multi-GPU setups.
For single-user inference, expect lower tokens/sec than a 3090. For batched workloads, it can compete or even edge ahead in some scenarios.
The value depends entirely on your constraints. If VRAM is your bottleneck, this is one of the cheapest ways to reach 32 GB. If bandwidth and ecosystem matter more, older NVIDIA cards still hold the lead.
For now, the B70 is a practical option for experimentation and specific builds, but not yet a clear replacement for CUDA-based setups.
