URL: https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block

gemma-4-31B-it-FP8-block

Model Overview

  • Model Architecture: google/gemma-4-31B-it
    • Input: Text / Image
    • Output: Text
  • Model Optimizations:
    • Weight quantization: FP8
    • Activation quantization: FP8
  • Release Date: 2026-04-04
  • Version: 1.0
  • Model Developers: RedHatAI

This model is a quantized version of google/gemma-4-31B-it. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

Model Optimizations

This model was obtained by quantizing the weights and activations of google/gemma-4-31B-it to the FP8 data type, making it ready for inference with vLLM. This optimization reduces the number of bits per quantized parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.
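The "approximately 50%" figure follows from simple arithmetic on bits per parameter. The sketch below works it through, assuming a nominal 31e9 parameters (taken from the model name) and a purely hypothetical `quantized_fraction` of 0.95 for the share of parameters that actually end up in FP8; neither number is published on this card, and per-block scale factors are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def mixed_precision_gb(num_params: float, quantized_fraction: float) -> float:
    """Storage when only part of the model is FP8 and the rest stays 16-bit."""
    fp8_part = weight_memory_gb(num_params * quantized_fraction, 8)
    bf16_part = weight_memory_gb(num_params * (1.0 - quantized_fraction), 16)
    return fp8_part + bf16_part


NUM_PARAMS = 31e9          # assumption: nominal size from the model name
QUANTIZED_FRACTION = 0.95  # hypothetical: share of parameters stored in FP8

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
fp8_gb = weight_memory_gb(NUM_PARAMS, 8)
mixed_gb = mixed_precision_gb(NUM_PARAMS, QUANTIZED_FRACTION)

print(f"all 16-bit weights:  {bf16_gb:.1f} GB")
print(f"all FP8 weights:     {fp8_gb:.1f} GB")
print(f"mixed (hypothetical): {mixed_gb:.2f} GB")
```

Because some layers stay in their original precision, the real saving is somewhat below the ideal 50%, which is consistent with the card's "approximately".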

Only the weights and activations of the linear operators within transformer blocks are quantized, using LLM Compressor. The vision tower, embedding, and output head layers are kept in their original precision.

Deployment

Use with vLLM

This model can be deployed using vLLM. For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the Gemma 4 vLLM usage guide.

  1. Start the vLLM server:
vllm serve RedHatAI/gemma-4-31B-it-FP8-block --max-model-len 32768

To enable thinking/reasoning and tool calling:

vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
 --max-model-len 32768 \
 --reasoning-parser gemma4 \
 --tool-call-parser gemma4 \
 --enable-auto-tool-choice

Tip: For text-only workloads, pass --limit-mm-per-prompt image=0 to skip vision encoder memory allocation. Set --gpu-memory-utilization 0.90 to maximize KV cache capacity.

  2. Send requests to the server:
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
 api_key=openai_api_key,
 base_url=openai_api_base,
)

model = "RedHatAI/gemma-4-31B-it-FP8-block"

messages = [
 {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
 model=model,
 messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
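Since the model accepts image input as well as text, a request can also carry an image. The sketch below builds such a message in the OpenAI chat format with `image_url` content parts, which vLLM's OpenAI-compatible server accepts for multimodal models. The helper name `build_image_messages` and the example URL are illustrative placeholders, not part of the original card.

```python
def build_image_messages(prompt: str, image_url: str) -> list[dict]:
    """Build a single-turn chat message that pairs one image with a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                # The image is referenced by URL; the server fetches and encodes it.
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Placeholder URL for illustration only.
image_messages = build_image_messages(
    "Describe this image in one sentence.",
    "https://example.com/picture.png",
)
print(image_messages)
```

The resulting list can be passed as `messages=image_messages` to `client.chat.completions.create(...)` exactly as in the text-only request. If the server was started with images disabled via `--limit-mm-per-prompt`, a request like this would presumably be rejected.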

Creation

This model was created by applying data-free FP8 block quantization with LLM Compressor, as presented in the code snippet below.
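To illustrate what block-wise FP8 quantization does numerically, here is a small pure-Python sketch. It is not the LLM Compressor recipe used for this model: it works on a flat list with a tiny block size, and the function names are hypothetical. It assumes the E4M3 format (3 mantissa bits, maximum finite value 448) and one scale per block derived from the block's largest magnitude, which is why no calibration data is needed.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    exponent = max(math.frexp(a)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)              # 3 mantissa bits: 8 steps per binade
    # Python's round() uses ties-to-even, matching IEEE round-to-nearest-even.
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize_blockwise(values, block_size):
    """Return (codes, scales): E4M3-rounded values and one scale per block."""
    codes, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(v) for v in block)
        scale = amax / E4M3_MAX if amax > 0 else 1.0  # map block max onto 448
        scales.append(scale)
        codes.extend(round_to_e4m3(v / scale) for v in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size):
    """Recover approximate values by multiplying each code by its block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# Two blocks with very different magnitudes; each gets its own scale.
weights = [0.01, -0.02, 0.015, 0.005, 3.0, -1.5, 2.25, 0.75]
codes, scales = quantize_blockwise(weights, block_size=4)
restored = dequantize_blockwise(codes, scales, block_size=4)
for w, r in zip(weights, restored):
    print(f"{w:+.4f} -> {r:+.4f}")
```

Per-block scales keep small-magnitude regions from being crushed by a single large outlier elsewhere in the tensor, which is the main motivation for block-wise over per-tensor scaling.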

Evaluation

This model was evaluated on GSM8k-Platinum, MMLU-CoT, MMLU-Pro, and IFEval using lm-evaluation-harness, with the model served through vLLM's OpenAI-compatible API. All evaluations were performed with thinking turned off.

Accuracy

| Category | Benchmark | google/gemma-4-31B-it | RedHatAI/gemma-4-31B-it-FP8-block | Recovery |
|---|---|---|---|---|
| Instruction Following | GSM8k-Platinum (5-shot, strict-match) | 97.60 | 97.82 | 100.2% |
| | MMLU-CoT (5-shot, strict_match) | 90.53 | 90.70 | 100.2% |
| | MMLU-Pro (5-shot, custom-extract) | 85.03 | 84.92 | 99.9% |
| | IFEval (0-shot, prompt-level strict) | 91.07 | 91.31 | 100.3% |
| | IFEval (0-shot, inst-level strict) | 93.76 | 93.84 | 100.1% |
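The Recovery column is consistent with the quantized score divided by the baseline score, expressed as a percentage. The card does not state the formula, so treat that definition as an assumption; the short check below recomputes the column from the two score columns.

```python
# (baseline score, quantized score) pairs copied from the accuracy table.
RESULTS = {
    "GSM8k-Platinum": (97.60, 97.82),
    "MMLU-CoT": (90.53, 90.70),
    "MMLU-Pro": (85.03, 84.92),
    "IFEval (prompt-level strict)": (91.07, 91.31),
    "IFEval (inst-level strict)": (93.76, 93.84),
}


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score (assumed definition)."""
    return 100.0 * quantized / baseline


recoveries = {name: recovery(b, q) for name, (b, q) in RESULTS.items()}
for name, value in recoveries.items():
    print(f"{name}: {value:.1f}%")
```

Values slightly above 100% mean the quantized model scored marginally higher than the baseline on that benchmark; differences of this size are typically within run-to-run noise.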

Reproduction

The results were obtained using the following commands:

Safetensors model size: 31B params. Tensor types: BF16, F8_E4M3.