gemma-4-31B-it-qat-FP8
google/gemma-4-31B-it-qat-q4_0-unquantized is a 31-billion-parameter, instruction-tuned multimodal model from Google DeepMind. It was optimized with Quantization-Aware Training (QAT) and released as an unquantized checkpoint (QAT-trained for Q4_0) for research, custom compilation, and downstream quantization workflows. The model accepts text and image inputs and generates text. It has a 256K-token context window, native reasoning ("thinking") capabilities, function calling, and multilingual support across 140+ languages, with strong performance in coding, reasoning, document understanding, and long-context tasks. Unlike the GGUF release, this checkpoint preserves the QAT-trained weights before final deployment quantization. That makes it particularly suitable for experimentation with custom inference engines, FP8/NVFP4 quantization, and production optimization frameworks, while keeping quality close to the original high-precision model.
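As a rough illustration of why an FP8 export is attractive at this size, the sketch below estimates weight storage from the 31-billion parameter count. It assumes every parameter is stored at a single width (2 bytes for BF16, 1 byte for FP8) and ignores quantization scales, modules left unquantized, activations, and the KV cache, so the real footprint will differ.

```python
# Back-of-envelope weight-storage estimate.
# Assumption: all parameters share one storage width; scales, ignored
# modules, activations, and the KV cache are not counted.
PARAMS = 31e9  # "31-billion-parameter" from the description

BYTES_PER_PARAM = {
    "bf16": 2.0,  # 16-bit weights
    "fp8": 1.0,   # 8-bit weights
}


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {name: weight_gb(PARAMS, width) for name, width in BYTES_PER_PARAM.items()}

for name, gigabytes in estimates.items():
    print(f"{name}: ~{gigabytes:.0f} GB of weights")
```

Under these assumptions, FP8 halves weight storage relative to BF16.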
recipe.yaml
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore: [lm_head, 're:.*vision_tower.*', 're:.*embed_vision.*']
      scheme: FP8_DYNAMIC
      bypass_divisibility_checks: false
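To make the effect of `targets` and `ignore` concrete, here is a minimal sketch of how such a recipe selects modules. It is a simplified stand-in, not llm-compressor's actual matching code. It assumes that an entry prefixed with `re:` is a regular expression matched against the module name, and that any other entry must equal the module name or its class name. The module names in `EXAMPLE_MODULES` are hypothetical and chosen only for illustration.

```python
import re

# Values copied from the recipe.
TARGETS = ["Linear"]
IGNORE = ["lm_head", "re:.*vision_tower.*", "re:.*embed_vision.*"]


def matches(pattern: str, module_name: str, class_name: str) -> bool:
    """Simplified matching rule (an assumption, see the note above the code)."""
    if pattern.startswith("re:"):
        # Regex entries are matched against the full module name.
        return re.match(pattern[3:], module_name) is not None
    # Plain entries must equal the module name or the class name.
    return pattern in (module_name, class_name)


def is_quantized(module_name: str, class_name: str) -> bool:
    """A module is quantized if it matches a target and no ignore entry."""
    if any(matches(p, module_name, class_name) for p in IGNORE):
        return False
    return any(matches(p, module_name, class_name) for p in TARGETS)


# Hypothetical (module name, class name) pairs, for illustration only.
EXAMPLE_MODULES = [
    ("model.language_model.layers.0.self_attn.q_proj", "Linear"),
    ("model.language_model.layers.0.input_layernorm", "RMSNorm"),
    ("model.vision_tower.encoder.layers.0.mlp.fc1", "Linear"),
    ("model.embed_vision.embedding_projection", "Linear"),
    ("lm_head", "Linear"),
]

selected = [name for name, cls in EXAMPLE_MODULES if is_quantized(name, cls)]
print(selected)
```

In this sketch only the language-model projection is selected. The vision tower, the vision embedding projection, and `lm_head` are ignored and stay in their original precision, and the normalization layer is skipped because it is not a `Linear` module.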
llm-compressor
An open-source library developed by the vLLM team for optimizing large language models (LLMs) for production deployment: https://github.com/vllm-project/llm-compressor
Model tree for prithivMLmods/gemma-4-31B-it-qat-FP8
Base model
google/gemma-4-31B