DiffusionGemma 26B A4B IT for AMD Strix Halo

This is an AMD Strix Halo-focused build of google/diffusiongemma-26B-A4B-it. It includes locally converted GGUF files plus the benchmark settings that worked best on the tested Strix Halo machine.

The goal is simple: make DiffusionGemma easier to run and measure on Ryzen AI Max hardware, with clear numbers for the baseline path and the faster local settings.

This is an independent personal release. It is not an official Corsair, AMD, Google DeepMind, Hugging Face, vLLM, or llama.cpp build.

tested system

Component	Value
APU	AMD Ryzen AI Max+ 395
GPU	Radeon 8060S Graphics
ROCm target	`gfx1151`
Memory	124 GiB system memory
OS	Fedora 43
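
To check whether another machine exposes the same ROCm target as the tested system, a small filter over the agent listing may be enough. This is a sketch, not part of the release: it assumes `rocminfo` from the ROCm install is on PATH and that its output names the target, and the helper name `has_rocm_target` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given ROCm target name appears as a
# whole word in the text on stdin (for example, the output of `rocminfo`).
has_rocm_target() {
  grep -qw -- "$1"
}

# Usage (assumes rocminfo is installed and on PATH):
#   rocminfo | has_rocm_target gfx1151 && echo "gfx1151 present"
```

A match only shows that the target is visible to ROCm. It says nothing about whether the rest of the stack matches the tested host.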

model files

File	Size	Use
`weights/diffusiongemma-26B-A4B-it-Q5_K_M-self.gguf`	19.15 GB	Primary GGUF file for local Strix Halo testing.
`weights/diffusiongemma-26B-A4B-it-Q4_K_M-self.gguf`	16.81 GB	Smaller GGUF file. Faster to move around, but weaker in the current quality gate.
`weights/diffusiongemma-26B-A4B-it-BF16.gguf`	50.54 GB	Unquantized BF16 GGUF conversion artifact.

The GGUF files were converted from the upstream Google checkpoint and quantized locally with llama.cpp tooling. During quantization, llama-quantize reported fallback quantization for 61 of 692 tensors (about 8.8%).
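
The general shape of such a conversion can be sketched as a dry run that only prints the commands. This is an assumption-heavy illustration, not the exact procedure used for this release. It assumes llama.cpp's `convert_hf_to_gguf.py` and `llama-quantize` from a checkout that understands this architecture. All paths and the function name `print_gguf_pipeline` are placeholders, and the output names omit the `-self` suffix used in `weights/`.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the conversion and quantization commands instead of
# running them. All paths are placeholders, not the paths used for this release.
print_gguf_pipeline() {
  local hf_dir=$1 out_dir=$2 name=$3
  local bf16="$out_dir/$name-BF16.gguf"
  # Step 1: HF checkpoint -> BF16 GGUF (llama.cpp's convert_hf_to_gguf.py).
  printf 'python convert_hf_to_gguf.py %s --outtype bf16 --outfile %s\n' "$hf_dir" "$bf16"
  # Step 2: BF16 GGUF -> each quantized type via llama-quantize.
  local qtype
  for qtype in Q5_K_M Q4_K_M; do
    printf 'llama-quantize %s %s %s\n' "$bf16" "$out_dir/$name-$qtype.gguf" "$qtype"
  done
}

# Usage:
#   print_gguf_pipeline /path/to/hf-checkpoint weights diffusiongemma-26B-A4B-it
```

Printing the commands first makes it easy to review them before committing to a conversion that writes tens of gigabytes.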

After download, verify the files:

sha256sum -c manifests/WEIGHTS_SHA256SUMS.txt

Checksum and provenance files live in manifests/.
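
If only one of the GGUF files was downloaded, checking the whole manifest reports the missing files as failures. A possible workaround is to check only the matching entries. The sketch below assumes the manifest uses the standard `sha256sum` line format with paths relative to the repository root. The helper name `verify_matching` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check only the manifest entries whose path contains
# the given substring. Run from the repository root so relative paths resolve.
verify_matching() {
  local pattern=$1 manifest=${2:-manifests/WEIGHTS_SHA256SUMS.txt}
  local entries
  # -F treats the pattern as a fixed string, not a regex.
  entries=$(grep -F -- "$pattern" "$manifest") || {
    echo "no manifest entry matches: $pattern" >&2
    return 1
  }
  # sha256sum -c reads "<hash>  <path>" lines from stdin and reports OK/FAILED.
  printf '%s\n' "$entries" | sha256sum -c -
}

# Usage:
#   verify_matching Q5_K_M
```

The function returns non-zero when no entry matches or when any matching file fails its checksum, so it can gate a later step in a script.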

performance

The cleanest like-for-like comparison is the FP16 Transformers path. Both columns use the same prompt and the same 1536-token target, time direct generation only, and cover five measured runs.

Path	Basic local result	Faster local result	Speedup
FP16 Transformers, mean throughput	115.42 tok/s	134.65 tok/s	16.66% faster
FP16 Transformers, median throughput	115.54 tok/s	135.85 tok/s	17.58% faster
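
The speedup column is the ratio of the two throughput figures minus one, expressed as a percentage. A minimal sketch of that arithmetic, using the rounded values from the table (the function name `speedup_pct` is made up for this example):

```shell
#!/usr/bin/env bash
# Percentage speedup of a faster throughput over a baseline throughput:
#   speedup% = (fast / base - 1) * 100
speedup_pct() {
  awk -v base="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", (fast / base - 1) * 100 }'
}

speedup_pct 115.42 134.65   # mean throughput row, prints 16.66
speedup_pct 115.54 135.85   # median throughput row, prints 17.58
```

The same function applies to any other pair of throughput figures measured under matching conditions.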

The faster FP16 run used:

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
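
When comparing baseline and faster runs from the same terminal, it can help to set the variable for one command only instead of exporting it into the shell. The wrapper below is a sketch: `with_aotriton` is a made-up name and `bench.py` stands in for whatever generation script you use.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run one command with the AOTriton flag set, leaving
# the calling shell's environment unchanged.
with_aotriton() {
  env TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 "$@"
}

# Usage (bench.py is a placeholder for your own generation script):
#   python bench.py                  # baseline run
#   with_aotriton python bench.py    # run with the flag enabled
```

Because `env` applies the variable only to the child process, later baseline runs in the same shell are not affected.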

The GGUF path saw a larger jump in the interactive llama.cpp setup. This is useful for local use, but it is not a like-for-like comparison in the way the FP16 table above is: the baseline is a matched CLI run, while the fast row uses conversation mode.

GGUF path	Basic matched CLI row	Fast conversation-mode row	Speedup
Q5_K_M	78.36 estimated tok/s	124.49 estimated tok/s	58.86% faster
Q4_K_M	77.37 estimated tok/s	122.40 estimated tok/s	58.21% faster

The fast GGUF command shape was:

RUNNER=/path/to/llama-diffusion-cli
MODEL=weights/diffusiongemma-26B-A4B-it-Q5_K_M-self.gguf

"$RUNNER" \
 -m "$MODEL" \
 -p "Explain text diffusion in three concise bullets." \
 -n 2048 \
 --perf \
 -ngl 99 \
 -cnv \
 --diffusion-eb auto \
 --diffusion-kv-cache auto

For Q4_K_M, use the Q4 file in MODEL with the same flags.
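
To switch between the two quantized files without retyping the flags, the command above can be wrapped in a small function. This is a sketch using only the flags already shown. The function name `run_fast_gguf` is made up, and `RUNNER` must point at your own llama-diffusion-cli build.

```shell
#!/usr/bin/env bash
# Sketch: the same flag set as the command above, with the model path and
# prompt passed as arguments. RUNNER must point at your llama-diffusion-cli.
run_fast_gguf() {
  local model=$1 prompt=$2
  "$RUNNER" \
    -m "$model" \
    -p "$prompt" \
    -n 2048 \
    --perf \
    -ngl 99 \
    -cnv \
    --diffusion-eb auto \
    --diffusion-kv-cache auto
}

# Usage:
#   RUNNER=/path/to/llama-diffusion-cli
#   run_fast_gguf weights/diffusiongemma-26B-A4B-it-Q4_K_M-self.gguf \
#     "Explain text diffusion in three concise bullets."
```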

what changed

This release does not change the original Google model architecture. The work here is packaging and runtime tuning for AMD Strix Halo:

Converted the upstream checkpoint into BF16 GGUF.
Produced local Q5_K_M and Q4_K_M GGUF files.
Validated DiffusionGemma generation on a Strix Halo ROCm stack.
Measured a faster FP16 path with ROCm AOTriton enabled.
Measured faster interactive GGUF settings through llama-diffusion-cli.

recommended use

Use Q5_K_M first if you want the best local GGUF candidate from this release. Use Q4_K_M when the smaller file matters more than the current quality gate. Use the upstream HF/safetensors model with Transformers when you want the strict matched FP16 result.

For the fastest measured FP16 path on the tested machine, enable ROCm AOTriton before generation:

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

For GGUF, use a DiffusionGemma-capable llama-diffusion-cli HIP build for gfx1151 and the conversation-mode command above.

quality notes

The Q5_K_M file passed 5 of 5 deterministic FP16-comparison prompt gates in the limited local eval. Q4_K_M passed 3 of 5. Treat Q5_K_M as the primary GGUF candidate from this release. Treat Q4_K_M as a smaller experimental option.

These checks are not a substitute for task-specific evaluation. DiffusionGemma can still produce wrong, unsafe, biased, or unusable output.

limitations

The benchmark numbers are local to the tested Strix Halo host, ROCm stack, prompts, and runner settings.
The validation is text-only. Image and video inputs were not validated in this release.
Quality review so far is limited to a small local prompt gate. Run your own evaluation before relying on the quantized files for a real workflow.
Q4_K_M does not reach quality parity with the FP16 baseline in the current prompt gate.
The standard ROCm vLLM image did not serve DiffusionGemma on this host without a branch overlay.
The standard llama-server path loaded the GGUF, but generation endpoints returned a logits-context error.
The local resident HTTP wrapper was used for demos and smoke tests only. It is not a production endpoint.
This repo does not claim official Corsair, AMD, Google DeepMind, Hugging Face, vLLM, or llama.cpp support, endorsement, or production readiness.

repository contents

The public repo is intentionally small:

Path	Contents
`weights/`	BF16, Q5_K_M, and Q4_K_M GGUF files
`manifests/`	Checksums, file metadata, and checkpoint selection notes
`LICENSE`, `NOTICE`	License and attribution files
`README.md`	Model card, setup notes, measured performance, and limitations

Detailed benchmark logs and internal review notes are not required to run the model, so they are not part of the public release package.

license and attribution

Documentation and helper metadata in this repository are Apache-2.0. See LICENSE, apache_2.0_license.md, and NOTICE.

The model weights are derived from google/diffusiongemma-26B-A4B-it. Use and redistribution must follow the upstream model card, Gemma terms, and applicable law.

responsible use

Evaluate the model before using it in any product or workflow where mistakes matter, especially code, security, medical, legal, financial, or other high-impact tasks.
