
URL: https://huggingface.co/corsairnui/diffusiongemma-26b-a4b-it-strix-halo-fp16



DiffusionGemma 26B A4B IT for AMD Strix Halo

This is an AMD Strix Halo-focused build of google/diffusiongemma-26B-A4B-it. It includes locally converted GGUF files plus the benchmark settings that worked best on the tested Strix Halo machine.

The goal is simple: make DiffusionGemma easier to run and measure on Ryzen AI Max hardware, with clear numbers for the baseline path and the faster local settings.

This is an independent personal release. It is not an official Corsair, AMD, Google DeepMind, Hugging Face, vLLM, or llama.cpp build.

tested system

| Component | Value |
| --- | --- |
| APU | AMD Ryzen AI Max+ 395 |
| GPU | Radeon 8060S Graphics |
| ROCm target | gfx1151 |
| Memory | 124 GiB system memory |
| OS | Fedora 43 |

model files

| File | Size | Use |
| --- | --- | --- |
| weights/diffusiongemma-26B-A4B-it-Q5_K_M-self.gguf | 19.15 GB | Primary GGUF file for local Strix Halo testing. |
| weights/diffusiongemma-26B-A4B-it-Q4_K_M-self.gguf | 16.81 GB | Smaller GGUF file. Faster to move around, but weaker in the current quality gate. |
| weights/diffusiongemma-26B-A4B-it-BF16.gguf | 50.54 GB | Full precision GGUF conversion artifact. |

The GGUF files were converted from the upstream Google checkpoint and quantized locally with llama.cpp tooling. During quantization, llama-quantize reported fallback quantization for 61 of 692 tensors.

After download, verify the files:

sha256sum -c manifests/WEIGHTS_SHA256SUMS.txt

Checksum and provenance files live in manifests/.
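If only one of the GGUF files was downloaded, a plain `sha256sum -c` run reports the files that are absent as failures. The sketch below is one way around that. It assumes GNU coreutils `sha256sum` (which provides `--ignore-missing`) and a manifest in the usual `<hash>  <path>` format with paths relative to the repository root; the manifest layout is not shown in this card, so treat that as an assumption. The function name `verify_present_weights` is hypothetical.

```shell
# Hypothetical helper: verify only the weight files that are present locally.
# Assumes GNU coreutils sha256sum and a manifest whose paths are relative
# to the current directory (an assumption about this repo's manifest).
verify_present_weights() {
  manifest="${1:-manifests/WEIGHTS_SHA256SUMS.txt}"
  if [ ! -f "$manifest" ]; then
    echo "manifest not found: $manifest" >&2
    return 2
  fi
  # --ignore-missing skips entries whose files were not downloaded.
  # A present file with the wrong hash still makes the check fail.
  sha256sum -c --ignore-missing "$manifest"
}

# Usage, from the repository root:
#   verify_present_weights
```

A non-zero exit status means at least one downloaded file did not match its recorded hash.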

performance

The only like-for-like comparison is the FP16 Transformers path. Both configurations used the same prompt and the same 1536-token target, timed generation only, and were measured over five runs.

| Path | Basic local result | Faster local result | Speedup |
| --- | --- | --- | --- |
| FP16 Transformers, mean throughput | 115.42 tok/s | 134.65 tok/s | 16.66% faster |
| FP16 Transformers, median throughput | 115.54 tok/s | 135.85 tok/s | 17.58% faster |
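The speedup column is the relative throughput gain, (faster / baseline - 1) × 100. A small sketch of that arithmetic, using `awk` for the floating-point math; the function name `speedup_pct` is illustrative and not part of this release:

```shell
# Percent speedup of a faster throughput over a baseline throughput.
# Usage: speedup_pct BASELINE_TOK_S FASTER_TOK_S
speedup_pct() {
  awk -v base="$1" -v fast="$2" 'BEGIN {
    if (base <= 0) { print "baseline must be positive" > "/dev/stderr"; exit 1 }
    printf "%.2f\n", (fast / base - 1) * 100
  }'
}

speedup_pct 115.42 134.65   # mean throughput row
speedup_pct 115.54 135.85   # median throughput row
```

With the rounded table values as input, this prints 16.66 and 17.58, matching the speedup column.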

The faster FP16 run used:

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
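For A/B timing it can be convenient to set the variable for a single process instead of exporting it into the whole shell session, so the baseline and the faster run can be launched from the same terminal. A minimal sketch, assuming the benchmark is started as an external command; `with_aotriton` and `bench.py` are hypothetical names, and no benchmark script ships with this release:

```shell
# Hypothetical helper: run one command with the AOTriton experimental path
# enabled, leaving the calling shell's environment untouched.
with_aotriton() {
  env TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 "$@"
}

# Usage (bench.py is a placeholder for your own timing script):
#   python bench.py                 # baseline run
#   with_aotriton python bench.py   # same script, AOTriton enabled
```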

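The GGUF path showed a larger jump in the interactive llama.cpp setup. This is useful for local use, but the two columns differ in run mode (matched CLI versus conversation mode) and the throughput figures are estimates, so this is not the same benchmark shape as the FP16 table above.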

| GGUF path | Basic matched CLI row | Fast conversation-mode row | Speedup |
| --- | --- | --- | --- |
| Q5_K_M | 78.36 estimated tok/s | 124.49 estimated tok/s | 58.86% faster |
| Q4_K_M | 77.37 estimated tok/s | 122.40 estimated tok/s | 58.21% faster |

The fast GGUF command shape was:

RUNNER=/path/to/llama-diffusion-cli
MODEL=weights/diffusiongemma-26B-A4B-it-Q5_K_M-self.gguf

"$RUNNER" \
 -m "$MODEL" \
 -p "Explain text diffusion in three concise bullets." \
 -n 2048 \
 --perf \
 -ngl 99 \
 -cnv \
 --diffusion-eb auto \
 --diffusion-kv-cache auto

For Q4_K_M, use the Q4 file in MODEL with the same flags.
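Switching between the two quantized files can be wrapped in a small helper so the flags stay identical across runs. This is only a sketch: `gguf_path` and `run_fast_gguf` are hypothetical names, the flags are copied from the command above, and `RUNNER` still has to point at your own llama-diffusion-cli build.

```shell
# Hypothetical helper: map a quantization name to the GGUF path in this repo.
gguf_path() {
  case "$1" in
    Q5_K_M|Q4_K_M)
      printf 'weights/diffusiongemma-26B-A4B-it-%s-self.gguf\n' "$1" ;;
    *)
      echo "unknown quantization: $1" >&2
      return 1 ;;
  esac
}

# Run the conversation-mode command shape from above for one quantization.
run_fast_gguf() {
  model="$(gguf_path "$1")" || return 1
  "${RUNNER:-/path/to/llama-diffusion-cli}" \
    -m "$model" \
    -p "Explain text diffusion in three concise bullets." \
    -n 2048 --perf -ngl 99 -cnv \
    --diffusion-eb auto \
    --diffusion-kv-cache auto
}

# Usage:
#   RUNNER=/path/to/llama-diffusion-cli run_fast_gguf Q4_K_M
```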

what changed

This release does not change the original Google model architecture. The work here is packaging and runtime tuning for AMD Strix Halo:

  • Converted the upstream checkpoint into BF16 GGUF.
  • Produced local Q5_K_M and Q4_K_M GGUF files.
  • Validated DiffusionGemma generation on a Strix Halo ROCm stack.
  • Measured a faster FP16 path with ROCm AOTriton enabled.
  • Measured faster interactive GGUF settings through llama-diffusion-cli.

recommended use

Use Q5_K_M first if you want the best local GGUF candidate from this release. Use Q4_K_M when the smaller file matters more than the current quality gate. Use the upstream HF/safetensors model with Transformers when you want the strict matched FP16 result.

For the fastest measured FP16 path on the tested machine, enable ROCm AOTriton before generation:

export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1

For GGUF, use a DiffusionGemma-capable llama-diffusion-cli HIP build for gfx1151 and the conversation-mode command above.
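Before building or running a HIP binary, it may help to confirm that the ROCm stack actually reports the gfx1151 target. A rough sketch, assuming `rocminfo` is installed and lists each agent by its gfx name, as it typically does; the helper name `has_gfx_target` is hypothetical:

```shell
# Hypothetical helper: succeed if the given ROCm target name appears as a
# whole word in rocminfo-style output read from stdin.
has_gfx_target() {
  grep -qw -- "$1"
}

# Usage (requires a working ROCm install that provides rocminfo):
#   rocminfo | has_gfx_target gfx1151 && echo "gfx1151 agent found"
```

Reading from stdin keeps the check independent of how the agent list is obtained, so saved `rocminfo` output from another machine can be checked the same way.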

quality notes

The Q5_K_M file passed 5 of 5 deterministic FP16-comparison prompt gates in the limited local eval. Q4_K_M passed 3 of 5. Treat Q5_K_M as the primary GGUF candidate from this release. Treat Q4_K_M as a smaller experimental option.

These checks are not a substitute for task-specific evaluation. DiffusionGemma can still produce wrong, unsafe, biased, or unusable output.

limitations

  • The benchmark numbers are local to the tested Strix Halo host, ROCm stack, prompts, and runner settings.
  • The validation is text-only. Image and video inputs were not validated in this release.
  • Quality review so far is limited to a small local prompt gate. Run your own evaluation before relying on the quantized files for a real workflow.
  • Q4_K_M does not reach quality parity with the FP16 baseline in the current prompt gate (3 of 5 prompts passed).
  • The standard ROCm vLLM image did not serve DiffusionGemma on this host without a branch overlay.
  • The standard llama-server path loaded the GGUF, but generation endpoints returned a logits-context error.
  • The local resident HTTP wrapper was used for demos and smoke tests only. It is not a production endpoint.
  • This repo does not claim official Corsair, AMD, Google DeepMind, Hugging Face, vLLM, or llama.cpp support, endorsement, or production readiness.

repository contents

The public repo is intentionally small:

| Path | Contents |
| --- | --- |
| weights/ | BF16, Q5_K_M, and Q4_K_M GGUF files |
| manifests/ | Checksums, file metadata, and checkpoint selection notes |
| LICENSE, NOTICE | License and attribution files |
| README.md | Model card, setup notes, measured performance, and limitations |

Detailed benchmark logs and internal review notes are not required to run the model, so they are not part of the public release package.

license and attribution

Documentation and helper metadata in this repository are Apache-2.0. See LICENSE, apache_2.0_license.md, and NOTICE.

The model weights are derived from google/diffusiongemma-26B-A4B-it. Use and redistribution must follow the upstream model card, Gemma terms, and applicable law.

responsible use

Evaluate the model before using it in any product or workflow where mistakes matter, especially code, security, medical, legal, financial, or other high-impact tasks.

evaluation results

All values are self-reported and were measured on the local Strix Halo benchmark prompts.

| Metric | Value |
| --- | --- |
| Matched FP16 baseline mean tok/s | 115.421 |
| Matched FP16 AOTriton mean tok/s | 134.646 |
| Matched FP16 AOTriton mean gain percent | 16.656 |
| Q5_K_M conversation 2048-token mean tok/s estimate | 124.491 |
| Q4_K_M conversation 2048-token mean tok/s estimate | 122.401 |
| Q5_K_M deterministic FP16-comparison gates passed | 5.000 |
| Q5_K_M deterministic FP16-comparison gates total | 5.000 |
| Q4_K_M deterministic FP16-comparison gates passed | 3.000 |