gemma-4-31B-it-NVFP4

Model Overview

Model Architecture: google/gemma-4-31B-it
- Input: Text / Image
- Output: Text
Model Optimizations:
- Weight quantization: FP4
- Activation quantization: FP4
Release Date: 2026-04-04
Version: 1.0
Model Developers: RedHatAI

This model is a quantized version of google/gemma-4-31B-it. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

Model Optimizations

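This model was obtained by quantizing the weights and activations of google/gemma-4-31B-it to the FP4 data type using the NVFP4 format, ready for inference with vLLM. This optimization reduces the number of bits per quantized parameter from 16 to 4, cutting disk size and GPU memory requirements for the quantized layers by approximately 75%. The overall saving is somewhat smaller, because per-group scales are stored alongside the weights and some layers remain in their original precision.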
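
As a rough check on the 75% figure, the arithmetic can be sketched as below. This is an estimate, not a measurement: it assumes the usual NVFP4 layout of one 8-bit (FP8 E4M3) scale per group of 16 values, which is consistent with the group_size=16 stated in this card, and it ignores the per-tensor scale and the layers that stay unquantized.

```python
# Back-of-the-envelope storage estimate for NVFP4-quantized weights.
# Assumption: each group of 16 FP4 values shares one 8-bit scale.

BASELINE_BITS = 16  # BF16 weights before quantization
FP4_BITS = 4        # bits per quantized value
SCALE_BITS = 8      # assumed FP8 scale stored once per group
GROUP_SIZE = 16     # number of values sharing one scale


def effective_bits_per_weight(value_bits=FP4_BITS,
                              scale_bits=SCALE_BITS,
                              group_size=GROUP_SIZE):
    """Bits per weight once the shared group scale is amortized."""
    return value_bits + scale_bits / group_size


def reduction_vs_baseline(effective_bits, baseline_bits=BASELINE_BITS):
    """Fractional size reduction relative to the 16-bit baseline."""
    return 1.0 - effective_bits / baseline_bits


bits = effective_bits_per_weight()
print(f"effective bits per weight: {bits}")
print(f"reduction vs 16-bit: {reduction_vs_baseline(bits):.1%}")
```

Under these assumptions the quantized layers take about 4.5 bits per weight, so the saving for those layers is a little under the nominal 75%.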

Weights are quantized with FP4 (group_size=16), and activations are quantized with FP4 using local per-group scaling. Only the weights and activations of the linear operators within transformer blocks are quantized using LLM Compressor. Vision tower, embedding, and output head layers are kept in their original precision.
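
To make per-group FP4 scaling concrete, here is a minimal pure-Python sketch of "fake" quantization (quantize, then dequantize) over groups of 16 values. It is a simplification and not the LLM Compressor implementation: the scale is kept in full precision, whereas NVFP4 stores it in a low-precision format, and the function names are hypothetical. The grid is the set of magnitudes representable in the 4-bit E2M1 format.

```python
import math

# Magnitudes representable by FP4 (E2M1); the sign is stored separately.
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
GROUP_SIZE = 16


def fake_quantize_group(values):
    """Quantize one group to the FP4 grid and map it back to real values.

    Returns (dequantized_values, scale). The scale maps the largest
    magnitude in the group onto the largest grid value (6.0).
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0.0] * len(values), 0.0
    scale = max_abs / FP4_E2M1_GRID[-1]
    dequantized = []
    for v in values:
        # Snap the scaled magnitude to the nearest representable grid point.
        magnitude = min(FP4_E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        dequantized.append(math.copysign(magnitude * scale, v))
    return dequantized, scale


def fake_quantize(values, group_size=GROUP_SIZE):
    """Apply fake_quantize_group independently to each consecutive group."""
    result = []
    for start in range(0, len(values), group_size):
        dequantized, _ = fake_quantize_group(values[start:start + group_size])
        result.extend(dequantized)
    return result


weights = [i / 10 for i in range(-8, 8)]  # one group of 16 toy values
approx = fake_quantize(weights)
worst = max(abs(a - w) for a, w in zip(approx, weights))
print(f"max absolute error in the group: {worst:.4f}")
```

Because each group gets its own scale, a single outlier only degrades the resolution of the 16 values it shares a group with, rather than that of the whole tensor.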

Deployment

Use with vLLM

This model can be deployed using vLLM. For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the Gemma 4 vLLM usage guide.

Start the vLLM server:

vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
 --max-model-len 32768 \
 --gpu-memory-utilization 0.90

To enable thinking/reasoning and tool calling:

vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
 --max-model-len 32768 \
 --gpu-memory-utilization 0.90 \
 --enable-auto-tool-choice \
 --reasoning-parser gemma4 \
 --tool-call-parser gemma4 \
 --chat-template examples/tool_chat_template_gemma4.jinja \
 --limit-mm-per-prompt '{"image": 4, "audio": 1}' \
 --async-scheduling

Tip: For text-only workloads, pass --limit-mm-per-prompt '{"image": 0, "audio": 0}' to skip vision encoder memory allocation and free up GPU memory for a longer context window.
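
If the server is launched from a script, the JSON value for --limit-mm-per-prompt can be generated rather than hand-quoted. The helper below is a hypothetical convenience wrapper, not part of vLLM; it only assembles the flags already shown in this section and prints the resulting command.

```python
import json
import shlex


def build_serve_command(model, max_model_len=32768,
                        gpu_memory_utilization=0.90, mm_limits=None):
    """Return the `vllm serve` argument list for the given settings.

    mm_limits is an optional dict such as {"image": 0, "audio": 0};
    it is serialized to JSON for --limit-mm-per-prompt.
    """
    args = [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if mm_limits is not None:
        args += ["--limit-mm-per-prompt", json.dumps(mm_limits)]
    return args


# Text-only configuration from the tip above.
command = build_serve_command(
    "RedHatAI/gemma-4-31B-it-NVFP4",
    mm_limits={"image": 0, "audio": 0},
)
print(shlex.join(command))  # shell-quoted, ready to paste or log
```

The list form can be passed directly to subprocess.run without a shell, which avoids quoting problems with the embedded JSON.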

Send requests to the server:

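from openai import OpenAI

# vLLM does not check the API key by default, so any placeholder works.
openai_api_key = "EMPTY"
# Replace <your-server-host> with the host running `vllm serve`.
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Must match the model name passed to `vllm serve`.
model = "RedHatAI/gemma-4-31B-it-NVFP4"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

# Send a single chat completion request.
outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

# Print the text of the first (and only) returned choice.
generated_text = outputs.choices[0].message.content
print(generated_text)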
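
Since the model also accepts image input, a multimodal request can be sketched with the OpenAI-style content-part format, which vLLM's chat completions endpoint accepts for vision models. The helper name and the image URL below are hypothetical placeholders, and the server must be started with a non-zero image limit for this to work.

```python
def build_image_messages(prompt, image_url):
    """Build a single-turn chat message combining one image and a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Hypothetical image location; substitute a URL the server can reach.
messages = build_image_messages(
    "Describe this image in one sentence.",
    "https://example.com/sample.jpg",
)

# With the client from the previous example, the request would be sent as:
# outputs = client.chat.completions.create(
#     model="RedHatAI/gemma-4-31B-it-NVFP4",
#     messages=messages,
# )
# print(outputs.choices[0].message.content)
print(messages[0]["content"][0]["type"])
```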

Creation

This model was created by applying NVFP4 quantization with LLM Compressor.

Evaluation

This model was evaluated on GSM8K Platinum, MMLU-Pro, IFEval, MATH-500, AIME 2025, GPQA Diamond, and LiveCodeBench v6 using lm-evaluation-harness and lighteval, served with vLLM (OpenAI-compatible API). All evaluations were performed with thinking enabled.

Accuracy

Category	Benchmark	google/gemma-4-31B-it	RedHatAI/gemma-4-31B-it-NVFP4	Recovery
Instruction Following	IFEval (0-shot, prompt-level strict)	90.70	89.77	99.0%
Instruction Following	IFEval (0-shot, inst-level strict)	93.45	93.05	99.6%
Reasoning	GSM8K Platinum (0-shot, strict-match)	95.78	95.70	99.9%
Reasoning	MMLU-Pro (0-shot, custom-extract)	85.41	84.50	98.9%
Reasoning	MATH-500 (0-shot, pass@1)	89.40	85.07	95.2%
Reasoning	AIME 2025 (0-shot, pass@1)	65.83	65.00	98.7%
Reasoning	GPQA Diamond (0-shot, pass@1)	77.44	76.60	98.9%
Coding	LiveCodeBench v6 (0-shot, pass@1)	71.43	70.67	98.9%
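
The Recovery column is not defined explicitly in this card; the listed values are consistent with the quantized score expressed as a percentage of the baseline score. A minimal sketch under that assumption, using three rows from the table:

```python
def recovery(baseline_score, quantized_score):
    """Quantized score as a percentage of the unquantized baseline."""
    return quantized_score / baseline_score * 100.0


# (benchmark, baseline, quantized) taken from the table above.
rows = [
    ("IFEval (prompt-level strict)", 90.70, 89.77),
    ("MATH-500", 89.40, 85.07),
    ("LiveCodeBench v6", 71.43, 70.67),
]

for name, baseline, quantized in rows:
    print(f"{name}: {recovery(baseline, quantized):.1f}%")
```

MATH-500 shows the largest drop, at roughly 95% recovery, while the other benchmarks stay at about 99% or higher.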

Reproduction

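The results were obtained with lm-evaluation-harness and lighteval against a vLLM server, as described in the Evaluation section; see the model page for the exact commands.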

URL: https://huggingface.co/RedHatAI/gemma-4-31B-it-NVFP4
