If you’ve gotten your hands on an AMD Ryzen AI Max+ 395 (Strix Halo) system, you already know the raw hardware is impressive. That massive pool of unified LPDDR5x memory is a game-changer for running large models locally. But unlocking its full potential isn’t just plug-and-play. The key to getting the best possible performance lies in the software configuration, specifically how you leverage different backends in llama.cpp.

This guide is a deep dive into those optimizations. My analysis is based on a comprehensive set of benchmarks, which I’ve distilled into actionable advice for fellow enthusiasts. Let’s get your APU running as fast as possible.

The Hardware I’m Analyzing

The data I’m working with comes from a system equipped with the Ryzen AI Max+ 395 and 128GB of LPDDR5x-8000 memory. On this setup, I’ve seen real-world memory bandwidth hit around 215 GB/s. This firehose of data is what feeds the 40 RDNA 3.5 compute units in the integrated GPU. But to really use that power, we need the right software stack. And no matter what, enabling Flash Attention is your non-negotiable first step.

The Software Battleground: Choosing Your Backend

From my experience, nearly all performance tuning on Strix Halo comes down to selecting the right backend for the job. You have two main choices: Vulkan and ROCm/HIP. Vulkan is the stable, cross-platform option, and you can run it with either the open-source Mesa RADV driver or AMD’s official AMDVLK driver. ROCm is AMD’s answer to CUDA, and while it’s been less mature historically, recent versions have become a powerful contender.
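
If both Vulkan drivers are installed side by side, one way to pick between them per command is the Vulkan loader’s VK_ICD_FILENAMES variable. The sketch below is a minimal helper under stated assumptions: the name with_vulkan_driver is hypothetical, and the two manifest paths are common defaults that may differ on your distribution.

```shell
# Assumed ICD manifest locations (not taken from the benchmarks); check
# /usr/share/vulkan/icd.d/ and /etc/vulkan/icd.d/ on your own system.
RADV_ICD="${RADV_ICD:-/usr/share/vulkan/icd.d/radeon_icd.x86_64.json}"
AMDVLK_ICD="${AMDVLK_ICD:-/usr/share/vulkan/icd.d/amd_icd64.json}"

# Hypothetical helper: run a command with one Vulkan driver selected.
# $1 = radv | amdvlk, remaining arguments = the command to run.
with_vulkan_driver() {
  case "$1" in
    radv)   icd="$RADV_ICD" ;;
    amdvlk) icd="$AMDVLK_ICD" ;;
    *)      echo "unknown driver: $1" >&2; return 1 ;;
  esac
  shift
  # The Vulkan loader only considers the ICD manifests listed here.
  VK_ICD_FILENAMES="$icd" "$@"
}

# Usage (model path is a placeholder):
#   with_vulkan_driver radv   ./build/bin/llama-bench -m model.gguf -fa 1
#   with_vulkan_driver amdvlk ./build/bin/llama-bench -m model.gguf -fa 1
```

Running vulkaninfo --summary through the same helper is a quick way to confirm which driver was actually picked up.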

After digging through the numbers, it’s clear that no single backend wins every time. The best choice for processing your initial prompt is often different from the best choice for generating the response that follows.

Optimizing for Sheer Token Generation Speed

When your priority is pure token generation speed, which is essential for a snappy, responsive chatbot, my testing points to two clear winners: the Vulkan backend with the Mesa RADV driver, and ROCm 6.4.4 with hipBLASLt. They consistently delivered the highest tokens per second across a wide range of models.

The reason comes down to the nature of the task. Token generation is primarily a memory-bandwidth-bound operation, because every generated token requires reading the model’s active weights from memory. The mature, low-overhead nature of these drivers seems to let them saturate that 215 GB/s of available memory bandwidth more effectively. If you want the fastest chat experience, configure llama.cpp to use Vulkan with RADV or ROCm 6.4.4 with hipBLASLt. On a 70B model, I’ve seen this setup hit around 5 tokens per second, which is perfectly usable.
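
The bandwidth argument can be turned into a rough back-of-the-envelope ceiling: if every generated token has to stream the full set of weights from memory once, tokens per second cannot exceed bandwidth divided by model size. The sketch below assumes a dense 70B model quantized to roughly 40 GB (an illustrative figure, not a measured one), and tg_ceiling is a hypothetical helper name.

```shell
# Rough upper bound on token generation for a dense model:
#   tokens/s <= memory bandwidth (GB/s) / weight bytes read per token (GB)
# $1 = bandwidth in GB/s, $2 = size of the weights in GB
tg_ceiling() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

# 215 GB/s measured bandwidth, ~40 GB of weights (assumed 4-bit-class 70B quant)
tg_ceiling 215 40   # prints 5.4
```

Under that assumption the estimate lands close to the roughly 5 tokens per second seen on a 70B model. Mixture-of-experts models read only their active experts for each token, so they can run well above this dense-model estimate.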

Optimizing for Prompt Processing (Small to Medium Context)

Ingesting the initial prompt is a different kind of workload, and here the performance race is much tighter. For typical prompt lengths, I found that both the Vulkan backend with the AMDVLK driver and the ROCm 6.4.4 backend can deliver top-tier performance.

However, getting ROCm to compete requires one critical tweak. Out of the box, its prompt processing speed can be disappointingly slow. The secret is to enable hipBLASLt by setting an environment variable before running your command. Prompt processing is a compute-bound task dominated by large matrix multiplications, and specialized compute libraries like hipBLASLt are designed for exactly these kinds of parallel calculations. To enable it, set the variable in your shell:

# To enable hipBLASLt for the current terminal session
export ROCBLAS_USE_HIPBLASLT=1
# Now run your llama.cpp command as usual
# (the binary is named llama-cli in current builds and main in older ones)
./llama-cli -m <model_path> -p "My prompt..."

With this flag, ROCm’s speed jumps dramatically, putting it on par with the best Vulkan configurations.
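
To check the effect on your own machine, a minimal sketch like the following runs the same prompt-processing benchmark with and without the variable. The helper names and the default llama-bench path are assumptions; the flags are standard llama-bench options (512-token prompt, no generation, Flash Attention on, all layers offloaded).

```shell
# Path to llama-bench; the default below is an assumed build location.
LLAMA_BENCH="${LLAMA_BENCH:-./build/bin/llama-bench}"

# Prompt-processing-only run: 512-token prompt, no generation, Flash Attention on.
run_pp512() {
  "$LLAMA_BENCH" -m "$1" -p 512 -n 0 -fa 1 -ngl 99
}

# Hypothetical helper: run the same benchmark without and with hipBLASLt.
# Each run happens in a subshell so the variable never leaks into your session.
compare_hipblaslt() {
  echo "== default (ROCBLAS_USE_HIPBLASLT unset) =="
  ( unset ROCBLAS_USE_HIPBLASLT; run_pp512 "$1" )
  echo "== ROCBLAS_USE_HIPBLASLT=1 =="
  ( export ROCBLAS_USE_HIPBLASLT=1; run_pp512 "$1" )
}

# Usage (model path is a placeholder):
#   compare_hipblaslt /path/to/model.gguf
```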

Table: Prompt Processing Throughput (PP512) Across ROCm and Vulkan Backends

This table summarizes prompt processing throughput with a 512-token prompt (PP512) for various large language models on two backends: rocm6_4_4 and vulkan_amdvlk. Models are listed in ascending order of size (in billions of parameters), and throughput is measured in tokens per second. A dash means no result is listed for that backend. The table highlights how backend choice affects performance across model sizes and architectures. The benchmarks come from the kyuz0/amd-strix-halo-toolboxes repository.

Model	Size	rocm6_4_4 (t/s)	vulkan_amdvlk (t/s)
gemma-3-4b-it Q3_K_S	4B	2262.00	1149.64
llama-2-7b Q4_0	7B	1117.04	1380.42
gemma-3-12b-it Q8_K_XL	12B	814.18	659.67
gpt-oss-20b MXFP4	20B	1533.65	1914.72
gemma-3-27b-it BF16	27B	472.28	—
Qwen3-30B-A3B BF16	30B	489.49	140.62
Qwen3-30B-A3B-Instruct-2507 Q6_K_XL	30B	632.12	1005.86
Llama-3.3-70B-Instruct Q8_K_XL	70B	104.93	99.04
Llama-4-Scout-17B-16E-Instruct Q4_K_XL	107B	311.26	193.39
GLM-4.5-Air Q4_K_XL	110B	136.15	219.61
gpt-oss-120b MXFP4	120B	773.25	790.49
Qwen3-235B Q3_K_XL	235B	144.31	133.32
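
To make the per-model winner explicit, here is a small sketch that reads the rows above and prints the faster backend with its speed-up factor. The numbers are copied from the table; the names PP512_DATA and pp512_winners are hypothetical, and "NA" stands in for the one missing Vulkan result.

```shell
# PP512 rows copied from the table: model|rocm6_4_4 t/s|vulkan_amdvlk t/s
PP512_DATA='gemma-3-4b-it Q3_K_S|2262.00|1149.64
llama-2-7b Q4_0|1117.04|1380.42
gemma-3-12b-it Q8_K_XL|814.18|659.67
gpt-oss-20b MXFP4|1533.65|1914.72
gemma-3-27b-it BF16|472.28|NA
Qwen3-30B-A3B BF16|489.49|140.62
Qwen3-30B-A3B-Instruct-2507 Q6_K_XL|632.12|1005.86
Llama-3.3-70B-Instruct Q8_K_XL|104.93|99.04
Llama-4-Scout-17B-16E-Instruct Q4_K_XL|311.26|193.39
GLM-4.5-Air Q4_K_XL|136.15|219.61
gpt-oss-120b MXFP4|773.25|790.49
Qwen3-235B Q3_K_XL|144.31|133.32'

# Print the faster backend per model and by what factor it leads.
pp512_winners() {
  printf '%s\n' "$PP512_DATA" | awk -F'|' '
    $3 == "NA"      { printf "%s: rocm6_4_4 (only result)\n", $1; next }
    $2 + 0 > $3 + 0 { printf "%s: rocm6_4_4 (%.2fx)\n", $1, $2 / $3; next }
                    { printf "%s: vulkan_amdvlk (%.2fx)\n", $1, $3 / $2 }
  '
}

pp512_winners
```

On these rows ROCm leads for seven models (one of them unopposed) and Vulkan with AMDVLK leads for five, which is why neither backend can be called the overall winner for prompt processing.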

Handling Large Contexts: Where ROCm and ROCWMMA Excel

The game changes again when you’re working with massive contexts for tasks like document analysis or RAG. This is where a specialized ROCm configuration truly shines. By compiling llama.cpp with support for ROCWMMA (AMD’s library for wave matrix operations) and enabling Flash Attention, the ROCm backend becomes incredibly resilient. I found that this setup maintains its high performance with almost no drop-off as the context window grows to 8,000 tokens and beyond, all while using less memory than Vulkan. If you work with huge prompts, this HIP + WMMA + FA configuration is what you want.
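
As a rough sketch of that HIP + WMMA + FA setup, the helpers below configure a llama.cpp build with the rocWMMA Flash Attention option and then sweep prompt sizes with llama-bench. The helper names are hypothetical, gfx1151 is assumed to be the right GPU target for Strix Halo, and depending on your llama.cpp revision the target option may be spelled GPU_TARGETS instead of AMDGPU_TARGETS.

```shell
# Hypothetical helper: configure and build llama.cpp with HIP + rocWMMA Flash Attention.
# $1 = path to a llama.cpp checkout. Requires ROCm (hipconfig) and the rocWMMA headers.
build_llama_hip_rocwmma() {
  HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S "$1" -B "$1/build" \
      -DGGML_HIP=ON \
      -DGGML_HIP_ROCWMMA_FATTN=ON \
      -DAMDGPU_TARGETS=gfx1151 \
      -DCMAKE_BUILD_TYPE=Release &&
    cmake --build "$1/build" --config Release -j "$(nproc)"
}

# Hypothetical helper: prompt-processing sweep over growing prompt sizes,
# with Flash Attention on. $1 = llama.cpp checkout, $2 = model file.
pp_context_sweep() {
  "$1/build/bin/llama-bench" -m "$2" -p 512,2048,8192 -n 0 -fa 1 -ngl 99
}

# Usage (paths are placeholders):
#   build_llama_hip_rocwmma ~/src/llama.cpp
#   pp_context_sweep ~/src/llama.cpp /path/to/model.gguf
```

Comparing the throughput at 512 and 8192 tokens in the sweep output shows how much a given build degrades as the prompt grows.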

Speeding Up Prompts Further with the NPU and Lemonade

For those of you on Windows, there’s another tool you should look into: the Lemonade Server SDK. It enables what’s called “hybrid execution” by using the Strix Halo’s NPU. In this mode, the initial prompt processing is offloaded to the NPU. Once the first token is generated, the system hands off token generation to the iGPU. This method gave me the absolute fastest time-to-first-token. Just be aware that it requires models in the ONNX format, not the typical GGUF.

While this hybrid approach is impressive for its time-to-first-token speed, it’s important to be aware of the current limitations. As of now, support is focused on a smaller set of models, generally capping out around the 8-billion parameter mark. Furthermore, the context lengths for these NPU-accelerated models are presently capped between 2,000 and 3,000 tokens, depending on the specific model. This effectively prevents testing with the much larger contexts that are becoming increasingly common, positioning the hybrid NPU method as a specialized tool for fast, short interactions rather than deep document analysis.

A Note on Reproducibility

Before we get to the conclusions, a quick word of caution. The world of local LLM inference moves fast. The performance numbers I’ve analyzed are specific to the drivers, kernel versions, and llama.cpp builds used during testing. Your own results might vary based on your BIOS version, Linux distribution, and software updates. I strongly encourage you to do your own benchmarking.
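
A minimal sketch for your own runs, assuming llama-bench lives at the default path below (bench_snapshot is a hypothetical helper name): it records the kernel version and date next to one standard run, so results from different driver or llama.cpp versions can be compared later.

```shell
# Path to llama-bench; the default below is an assumed build location.
LLAMA_BENCH="${LLAMA_BENCH:-./build/bin/llama-bench}"

# Hypothetical helper: print kernel version and date, then one standard run
# (512-token prompt, 128 generated tokens, 5 repetitions) as a Markdown table.
bench_snapshot() {
  echo "kernel: $(uname -r)"
  echo "date:   $(date -u +%Y-%m-%d)"
  "$LLAMA_BENCH" -m "$1" -p 512 -n 128 -fa 1 -ngl 99 -r 5 -o md
}

# Usage (model path is a placeholder):
#   bench_snapshot /path/to/model.gguf | tee "bench-$(date -u +%Y%m%d).md"
```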

Strix Halo LLM Optimization Cheat Sheet

Based on my testing and analysis, here’s a breakdown of the best configurations for your Strix Halo system, tailored to your specific needs:

Best for Prompt Processing (Small to Medium Context):

It’s a tie! You can’t go wrong with either the Vulkan backend using the AMDVLK driver or the ROCm backend, provided you enable ROCBLAS_USE_HIPBLASLT=1. The choice comes down to your preference and what’s easiest to get running smoothly on your setup.

Best for Prompt Processing with Smaller Models:

If you’re on Windows and you are using models up to 8B, Lemonade Server’s hybrid execution mode is the winner. It uses the NPU for prompt processing, but be aware that at the moment it only supports ONNX models and context length up to 3K.

Best for Token Generation:

If you’re prioritizing smooth, responsive token generation, especially in a conversational setting, the clear winner is the Vulkan backend with the Mesa RADV driver.

Best for Prompt Processing (Large Contexts):

To handle those monster prompts, your go-to configuration should be the ROCm backend with ROCWMMA enabled and Flash Attention turned on. This combination delivers the best performance and memory efficiency when you’re pushing the limits of your context window.
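
For quick reference from a terminal, the cheat sheet can be encoded as a small lookup. This is only an illustrative sketch: recommend_backend and its workload keywords are hypothetical names, and the answers simply restate the recommendations above.

```shell
# Hypothetical lookup encoding the cheat sheet. $1 = workload keyword.
recommend_backend() {
  case "$1" in
    prompt)       echo "Vulkan (AMDVLK) or ROCm with ROCBLAS_USE_HIPBLASLT=1" ;;
    prompt-small) echo "Lemonade Server hybrid NPU mode (Windows, ONNX, models up to 8B, context up to 3K)" ;;
    generation)   echo "Vulkan (Mesa RADV)" ;;
    long-context) echo "ROCm with ROCWMMA and Flash Attention" ;;
    *)            echo "usage: recommend_backend prompt|prompt-small|generation|long-context" >&2
                  return 1 ;;
  esac
}

recommend_backend generation   # prints: Vulkan (Mesa RADV)
```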

URL: https://www.hardware-corner.net/strix-halo-llm-optimization/

I optimized my Strix Halo for local LLMs: Here are the benchmarks and findings.