DiffusionGemma 26B A4B IT for AMD Strix Halo
This is an AMD Strix Halo focused build of google/diffusiongemma-26B-A4B-it. It includes locally converted GGUF files plus the benchmark settings that worked best on the tested Strix Halo machine.
The goal is simple: make DiffusionGemma easier to run and measure on Ryzen AI Max hardware, with clear numbers for the baseline path and the faster local settings.
This is an independent personal release. It is not an official Corsair, AMD, Google DeepMind, Hugging Face, vLLM, or llama.cpp build.
tested system
| Component | Value |
|---|---|
| APU | AMD Ryzen AI Max+ 395 |
| GPU | Radeon 8060S Graphics |
| ROCm target | gfx1151 |
| Memory | 124 GiB system memory |
| OS | Fedora 43 |
model files
| File | Size | Use |
|---|---|---|
| weights/diffusiongemma-26B-A4B-it-Q5_K_M-self.gguf | 19.15 GB | Primary GGUF file for local Strix Halo testing. |
| weights/diffusiongemma-26B-A4B-it-Q4_K_M-self.gguf | 16.81 GB | Smaller GGUF file. Faster to move around, but weaker in the current quality gate. |
| weights/diffusiongemma-26B-A4B-it-BF16.gguf | 50.54 GB | Full precision GGUF conversion artifact. |
The GGUF files were converted from the upstream Google checkpoint and quantized locally with llama.cpp tooling. During quantization, llama-quantize reported fallback quantization for 61 of 692 tensors.
After download, verify the files:
sha256sum -c manifests/WEIGHTS_SHA256SUMS.txt
Checksum and provenance files live in manifests/.
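If only one of the GGUF files was downloaded, the full check will likely report the missing files as failures. The following is a minimal sketch for checking a single entry. `verify_one` is a hypothetical helper, not part of this release, and it assumes the manifest is in standard `sha256sum` format with paths relative to the repository root.

```shell
# Hypothetical helper: verify only the manifest entries that contain a given string.
# Usage: verify_one MANIFEST PATTERN
# Assumes MANIFEST is standard `sha256sum` output and is run from the repo root.
verify_one() {
  # grep -F matches the pattern as a fixed string;
  # `sha256sum -c -` reads the selected checksum lines from stdin.
  grep -F -- "$2" "$1" | sha256sum -c -
}

# Example (run from the repository root):
# verify_one manifests/WEIGHTS_SHA256SUMS.txt Q5_K_M
```

If the pattern matches no manifest line, `sha256sum` receives no checksum lines and exits non-zero, so a typo in the pattern does not pass silently.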
performance
The cleanest matched comparison is the FP16 Transformers path. Both columns use the same prompt, the same 1536-token target, direct generation timing only, and five measured runs.
| Path | Basic local result | Faster local result | Speedup |
|---|---|---|---|
| FP16 Transformers, mean throughput | 115.42 tok/s | 134.65 tok/s | 16.66% faster |
| FP16 Transformers, median throughput | 115.54 tok/s | 135.85 tok/s | 17.58% faster |
The faster FP16 run used:
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
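The speedup column in the FP16 table is the relative throughput gain, (fast / base - 1) × 100. A small sketch for reproducing it from the table values follows. `speedup_pct` is a hypothetical helper name, and `awk` is used only for the floating-point arithmetic.

```shell
# speedup_pct BASE FAST: percent gain of FAST over BASE, rounded to two decimals.
speedup_pct() {
  awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", (fast / base - 1) * 100 }'
}

speedup_pct 115.42 134.65   # FP16 mean row   -> 16.66
speedup_pct 115.54 135.85   # FP16 median row -> 17.58
```

The same helper works for your own measurements. Feeding it throughput figures that were already rounded can shift the last digit of the percentage by 0.01.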
The GGUF path saw a larger jump in the interactive llama.cpp setup. This is useful for local use, but it is not the same benchmark shape as the FP16 table above.
| GGUF path | Basic matched CLI row | Fast conversation-mode row | Speedup |
|---|---|---|---|
| Q5_K_M | 78.36 estimated tok/s | 124.49 estimated tok/s | 58.86% faster |
| Q4_K_M | 77.37 estimated tok/s | 122.40 estimated tok/s | 58.21% faster |
The fast GGUF command shape was:
RUNNER=/path/to/llama-diffusion-cli
MODEL=weights/diffusiongemma-26B-A4B-it-Q5_K_M-self.gguf
"$RUNNER" \
  -m "$MODEL" \
  -p "Explain text diffusion in three concise bullets." \
  -n 2048 \
  --perf \
  -ngl 99 \
  -cnv \
  --diffusion-eb auto \
  --diffusion-kv-cache auto
For Q4_K_M, use the Q4 file in MODEL with the same flags.
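To avoid retyping the flags when switching between the two quantized files, the command shape above can be wrapped in a function. This is a sketch only: `run_fast_gguf` is a hypothetical name, and the flags are copied unchanged from the command above rather than checked against any particular llama.cpp build.

```shell
# Hypothetical wrapper around the fast GGUF command shape.
# RUNNER must point at a DiffusionGemma-capable llama-diffusion-cli build.
# Usage: run_fast_gguf MODEL_PATH
run_fast_gguf() {
  # ${RUNNER:?...} aborts with a message if RUNNER is unset or empty.
  "${RUNNER:?set RUNNER to the llama-diffusion-cli path}" \
    -m "$1" \
    -p "Explain text diffusion in three concise bullets." \
    -n 2048 \
    --perf \
    -ngl 99 \
    -cnv \
    --diffusion-eb auto \
    --diffusion-kv-cache auto
}

# Example: run both quantized files with identical settings.
# RUNNER=/path/to/llama-diffusion-cli
# for q in Q5_K_M Q4_K_M; do
#   run_fast_gguf "weights/diffusiongemma-26B-A4B-it-${q}-self.gguf"
# done
```

Keeping every flag identical across the two files means any throughput difference comes from the quantization rather than from the runner settings.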
what changed
This release does not change the original Google model architecture. The work here is packaging and runtime tuning for AMD Strix Halo:
- Converted the upstream checkpoint into BF16 GGUF.
- Produced local Q5_K_M and Q4_K_M GGUF files.
- Validated DiffusionGemma generation on a Strix Halo ROCm stack.
- Measured a faster FP16 path with ROCm AOTriton enabled.
- Measured faster interactive GGUF settings through llama-diffusion-cli.
recommended use
Use Q5_K_M first if you want the best local GGUF candidate from this release. Use Q4_K_M when the smaller file matters more than the current quality gate. Use the upstream HF/safetensors model with Transformers when you want the strict matched FP16 result.
For the fastest measured FP16 path on the tested machine, enable ROCm AOTriton before generation:
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
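`export` keeps the variable set for the rest of the shell session. For a baseline-versus-AOTriton comparison in one session, it can help to scope the variable to a single command instead. A minimal sketch, where `with_aotriton` and `bench_fp16.py` are hypothetical names and not part of this release:

```shell
# Run one command with ROCm AOTriton enabled, without changing the current shell.
# `env VAR=value cmd` sets the variable only for that child process.
with_aotriton() {
  env TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 "$@"
}

# Example A/B run (bench_fp16.py stands in for your own generation script):
# python bench_fp16.py                 # baseline
# with_aotriton python bench_fp16.py   # AOTriton enabled
```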
For GGUF, use a DiffusionGemma-capable llama-diffusion-cli HIP build for gfx1151 and the conversation-mode command above.
quality notes
The Q5_K_M file passed 5 of 5 deterministic FP16-comparison prompt gates in the limited local eval. Q4_K_M passed 3 of 5. Treat Q5_K_M as the primary GGUF candidate from this release. Treat Q4_K_M as a smaller experimental option.
These checks are not a substitute for task-specific evaluation. DiffusionGemma can still produce wrong, unsafe, biased, or unusable output.
limitations
- The benchmark numbers are local to the tested Strix Halo host, ROCm stack, prompts, and runner settings.
- The validation is text-only. Image and video inputs were not validated in this release.
- Quality review so far is limited to a small local prompt gate. Run your own evaluation before relying on the quantized files for a real workflow.
- Q4_K_M is not at quality parity with the FP16 baseline in the current prompt gate.
- The standard ROCm vLLM image did not serve DiffusionGemma on this host without a branch overlay.
- The standard llama-server path loaded the GGUF, but generation endpoints returned a logits-context error.
- The local resident HTTP wrapper was used for demos and smoke tests only. It is not a production endpoint.
- This repo does not claim official Corsair, AMD, Google DeepMind, Hugging Face, vLLM, or llama.cpp support, endorsement, or production readiness.
repository contents
The public repo is intentionally small:
| Path | Contents |
|---|---|
| weights/ | BF16, Q5_K_M, and Q4_K_M GGUF files |
| manifests/ | Checksums, file metadata, and checkpoint selection notes |
| LICENSE, NOTICE | License and attribution files |
| README.md | Model card, setup notes, measured performance, and limitations |
Detailed benchmark logs and internal review notes are not required to run the model, so they are not part of the public release package.
license and attribution
Documentation and helper metadata in this repository are Apache-2.0. See LICENSE, apache_2.0_license.md, and NOTICE.
The model weights are derived from google/diffusiongemma-26B-A4B-it. Use and redistribution must follow the upstream model card, Gemma terms, and applicable law.
responsible use
Evaluate the model before using it in any product or workflow where mistakes matter, especially code, security, medical, legal, financial, or other high-impact tasks.
