Mistral-Small-3.2-24B-Instruct-2506-NVFP4

Model Overview

Model Architecture: unsloth/Mistral-Small-3.2-24B-Instruct-2506
- Input: Text
- Output: Text
Model Optimizations:
- Weight quantization: FP4
- Activation quantization: FP4
Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
Release Date: 10/29/2025
Version: 1.0
Model Developers: RedHatAI

This model is a quantized version of unsloth/Mistral-Small-3.2-24B-Instruct-2506. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

Model Optimizations

This model was obtained by quantizing the weights and activations of unsloth/Mistral-Small-3.2-24B-Instruct-2506 to the FP4 data type, ready for inference with vLLM>=0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.
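The 75% figure follows from the ratio of bit widths. The sketch below works through that arithmetic. The parameter count of 24e9 is an assumption taken from the "24B" in the model name. The group size of 16 and the 8-bit per-group scale are assumptions about the NVFP4 layout, and the function names are illustrative only. Because only some layers are quantized, the saving on a real checkpoint will be somewhat smaller than the idealized number.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GiB for a given bit width."""
    return num_params * bits_per_param / 8 / 2**30


def effective_bits(bits: float, group_size: int, scale_bits: float) -> float:
    """Bits per parameter once one scale per group is amortized over the group."""
    return bits + scale_bits / group_size


def reduction(bits_before: float, bits_after: float) -> float:
    """Fractional size reduction when going from bits_before to bits_after."""
    return 1 - bits_after / bits_before


NUM_PARAMS = 24e9  # assumption: nominal size from the model name

ideal = reduction(16, 4)
# Assumption: one 8-bit scale shared by each group of 16 values.
with_scales = reduction(16, effective_bits(4, group_size=16, scale_bits=8))

print(f"16-bit weights: {weight_memory_gib(NUM_PARAMS, 16):.1f} GiB")
print(f"4-bit weights:  {weight_memory_gib(NUM_PARAMS, 4):.1f} GiB")
print(f"ideal reduction: {ideal:.1%}, with per-group scales: {with_scales:.1%}")
```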

Only the weights and activations of the linear operators within transformer blocks are quantized, using LLM Compressor.
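To give a rough intuition for what FP4 quantization does to the values of a linear layer, here is a simplified simulation in plain Python. It is not the LLM Compressor implementation. It assumes the E2M1 value grid for FP4 and one scale per group of 16 values, and it keeps the scales in full precision instead of a low-precision format. The function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # assumption: one scale per 16 consecutive values


def fake_quantize_group(values):
    """Quantize one group to the FP4 grid and map it back to real values."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    # Scale the group so its largest magnitude lands on the largest FP4 value.
    scale = max_abs / FP4_E2M1_MAGNITUDES[-1]
    out = []
    for v in values:
        scaled = abs(v) / scale
        # Nearest grid point; ties go to the smaller magnitude.
        nearest = min(FP4_E2M1_MAGNITUDES, key=lambda m: abs(m - scaled))
        out.append(math.copysign(nearest * scale, v))
    return out


def fake_quantize(values, group_size=GROUP_SIZE):
    """Apply group-wise fake quantization to a flat list of values."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize_group(values[start:start + group_size]))
    return out


weights = [0.02 * i - 0.3 for i in range(32)]  # toy data, not real weights
dequantized = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max absolute rounding error: {max_error:.4f}")
```

Each group gets its own scale, so a single large value only degrades the precision of the 16 values that share its group and leaves the other groups unaffected.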

Deployment

Use with vLLM

Initialize vLLM server:

vllm serve RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 --tensor-parallel-size 1 --tokenizer-mode mistral

Send requests to the server:

from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible API server.
# vLLM does not check the key by default, so any placeholder works.
openai_api_key = "EMPTY"
# Replace <your-server-host> with the host running `vllm serve`.
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

# The reply text is in the first choice of the chat completion.
generated_text = outputs.choices[0].message.content
print(generated_text)

Creation

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

Evaluation

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval_64 benchmarks using lm-evaluation-harness.

Accuracy

Category	Metric	unsloth/Mistral-Small-3.2-24B-Instruct-2506	RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4	Recovery
OpenLLM V1	arc_challenge	68.52	66.98	97.75
	gsm8k	89.61	87.11	97.21
	hellaswag	85.70	85.11	99.31
	mmlu	81.06	79.43	97.99
	truthfulqa_mc2	61.35	60.34	98.35
	winogrande	83.27	81.61	98.01
	Average	78.25	76.76	98.10
OpenLLM V2	BBH (3-shot)	65.86	64.05	97.25
	MMLU-Pro (5-shot)	50.84	48.45	95.30
	MuSR (0-shot)	39.15	40.21	102.71
	IFEval (0-shot)	84.05	84.41	100.43
	GPQA (0-shot)	33.14	32.55	98.22
	Math-lvl-5 (4-shot)	41.69	37.76	90.57
	Average	52.46	51.24	97.68
Coding	HumanEval_64 pass@2	88.88	88.84	99.95
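The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. The sketch below recomputes it for a few rows of the table. The helper name `recovery` is illustrative and is not part of lm-evaluation-harness.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score, to 2 decimals."""
    return round(100.0 * quantized / baseline, 2)


# (baseline, quantized) pairs copied from the table above.
scores = {
    "arc_challenge": (68.52, 66.98),
    "gsm8k": (89.61, 87.11),
    "MuSR (0-shot)": (39.15, 40.21),
}

for task, (base, quant) in scores.items():
    print(f"{task}: {recovery(base, quant):.2f}%")
```

Values above 100 mean the quantized model scored higher than the baseline on that task. For the Average rows, recomputing from the rounded averages shown in the table can differ from the reported recovery in the last digit, which suggests those rows were computed from unrounded scores.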

Reproduction

The results were obtained using the following commands:

URL: https://huggingface.co/RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4