gemma-4-31B-it-qat-FP8

google/gemma-4-31B-it-qat-q4_0-unquantized is a 31-billion-parameter, instruction-tuned multimodal model from Google DeepMind, optimized with Quantization-Aware Training (QAT) and released as an unquantized Q4_0 checkpoint for research, custom compilation, and downstream quantization workflows. The model accepts text and image inputs and generates text. It offers a 256K-token context window, native reasoning ("thinking"), function calling, and support for 140+ languages, with strong performance in coding, reasoning, document understanding, and long-context tasks. Unlike the GGUF release, this checkpoint preserves the QAT-trained weights before final deployment quantization. That makes it well suited to experimentation with custom inference engines, FP8/NVFP4 quantization, and production optimization frameworks, while keeping quality close to the original high-precision model.

recipe.yaml

default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*vision_tower.*', 're:.*embed_vision.*']
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false
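
To illustrate how the `ignore` list in the recipe above selects which layers stay unquantized, here is a minimal standard-library sketch. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the module name and that all other entries must equal the name exactly. This is a simplified imitation, not llm-compressor's own matching code. The module names in `EXAMPLE_MODULES` and the helper `is_ignored` are hypothetical and chosen only for illustration.

```python
import re

# Ignore list copied from recipe.yaml.
IGNORE = ["lm_head", "re:.*vision_tower.*", "re:.*embed_vision.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if module_name matches any entry of the ignore list.

    Assumption: 're:'-prefixed entries are regexes matched from the start
    of the name; every other entry must equal the name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
EXAMPLE_MODULES = [
    "lm_head",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.embed_vision.embedding_projection",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
]

# Names that would remain candidates for FP8 quantization in this sketch.
quantized = [name for name in EXAMPLE_MODULES if not is_ignored(name)]

if __name__ == "__main__":
    for name in EXAMPLE_MODULES:
        status = "ignored" if is_ignored(name) else "quantized"
        print(f"{status:>9}  {name}")
```

Under these assumptions, only the two language-model projections remain candidates for quantization. The output head and everything under the vision tower and vision embedding stay in their original precision.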

llm-compressor

llm-compressor is an open-source library developed by the vLLM team for optimizing Large Language Models (LLMs) for production deployment: https://github.com/vllm-project/llm-compressor
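
The following is a hedged sketch of how a recipe file like the one in this card might be applied with llm-compressor's `oneshot` entry point. It is not necessarily the exact procedure used to produce this checkpoint. The helper name `apply_recipe`, its default paths, and the assumption that the caller has already loaded the model with a suitable `transformers` class are all illustrative. The import path `from llmcompressor import oneshot` reflects recent releases and may differ in older versions.

```python
def apply_recipe(model, recipe_path="recipe.yaml", save_dir="gemma-4-31B-it-qat-FP8"):
    """Apply a quantization recipe to an already-loaded model and save it.

    `model` is assumed to be a Hugging Face transformers model that the
    caller loaded beforehand; the default paths are illustrative.
    """
    # Imported lazily so the helper can be defined without the library installed.
    from llmcompressor import oneshot

    # FP8_DYNAMIC computes activation scales at runtime, so no calibration
    # dataset is passed here.
    oneshot(model=model, recipe=recipe_path)

    # Write the quantized weights and config to disk.
    model.save_pretrained(save_dir)
    return save_dir
```

The helper takes a loaded model rather than a model ID so that the choice of model class and device placement stays with the caller.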