URL: https://huggingface.co/RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4

Mistral-Small-3.2-24B-Instruct-2506-NVFP4

Model Overview

  • Model Architecture: unsloth/Mistral-Small-3.2-24B-Instruct-2506
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: FP4
    • Activation quantization: FP4
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 10/29/2025
  • Version: 1.0
  • Model Developers: RedHatAI

This model is a quantized version of unsloth/Mistral-Small-3.2-24B-Instruct-2506. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

Model Optimizations

This model was obtained by quantizing the weights and activations of unsloth/Mistral-Small-3.2-24B-Instruct-2506 to the FP4 data type, ready for inference with vLLM>=0.9.1. This optimization reduces the number of bits per parameter from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%.

Only the weights and activations of the linear operators within transformer blocks are quantized, using LLM Compressor.
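
The "approximately 75%" figure follows from the bit widths alone. A minimal sketch of that arithmetic, assuming a round 24 billion parameters (taken from the model name, not a measured count) and ignoring quantization scales and any layers left in higher precision, so the real saving is likely somewhat smaller:

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~24B parameters, read off the model name rather than measured.
NUM_PARAMS = 24e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per parameter
fp4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4 bits per parameter
reduction = 1 - fp4_gb / bf16_gb            # fraction of storage saved

print(f"16-bit weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:  {fp4_gb:.1f} GB")
print(f"Reduction:      {reduction:.0%}")
```

This is an idealized estimate for the weights only; activation memory and the KV cache at inference time are not covered by it.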

Deployment

Use with vLLM

  1. Initialize vLLM server:
vllm serve RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 --tensor_parallel_size 1 --tokenizer_mode mistral
  2. Send requests to the server:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
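
The same request can also be sent without the `openai` package, since the server exposes an OpenAI-compatible `/v1/chat/completions` endpoint. The sketch below uses only the Python standard library. The host `localhost`, the sampling values, and the helper names `build_chat_request` and `send_chat_request` are illustrative assumptions, not part of the model card:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 256,
                       temperature: float = 0.2) -> urllib.request.Request:
    """Build (but do not send) a chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,      # illustrative value
        "temperature": temperature,    # illustrative value
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer EMPTY",  # same placeholder key as above
        },
        method="POST",
    )


def send_chat_request(request: urllib.request.Request,
                      timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Assumption: the server runs on localhost; replace with your own host.
request = build_chat_request(
    "http://localhost:8000/v1",
    "RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4",
    "Explain quantum mechanics clearly and concisely.",
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Once the server is up, calling `send_chat_request(request)` returns the same text as `outputs.choices[0].message.content` in the client example.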

Creation

This model was created by applying LLM Compressor with calibration samples from UltraChat, as presented in the code snippet below.

Evaluation

This model was evaluated on the well-known OpenLLM v1, OpenLLM v2 and HumanEval_64 benchmarks using lm-evaluation-harness.

Accuracy

| Category | Metric | unsloth/Mistral-Small-3.2-24B-Instruct-2506 | RedHatAI/Mistral-Small-3.2-24B-Instruct-2506-NVFP4 | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | arc_challenge | 68.52 | 66.98 | 97.75 |
| | gsm8k | 89.61 | 87.11 | 97.21 |
| | hellaswag | 85.70 | 85.11 | 99.31 |
| | mmlu | 81.06 | 79.43 | 97.99 |
| | truthfulqa_mc2 | 61.35 | 60.34 | 98.35 |
| | winogrande | 83.27 | 81.61 | 98.01 |
| | Average | 78.25 | 76.76 | 98.10 |
| OpenLLM V2 | BBH (3-shot) | 65.86 | 64.05 | 97.25 |
| | MMLU-Pro (5-shot) | 50.84 | 48.45 | 95.30 |
| | MuSR (0-shot) | 39.15 | 40.21 | 102.71 |
| | IFEval (0-shot) | 84.05 | 84.41 | 100.43 |
| | GPQA (0-shot) | 33.14 | 32.55 | 98.22 |
| | Math-lvl-5 (4-shot) | 41.69 | 37.76 | 90.57 |
| | Average | 52.46 | 51.24 | 97.68 |
| Coding | HumanEval_64 pass@2 | 88.88 | 88.84 | 99.95 |
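
The Recovery column appears to be the quantized score expressed as a percentage of the baseline score. A short sketch that recomputes it for three rows of the table (the `recovery` helper is a name introduced here for illustration):

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score, to 2 decimals."""
    return round(quantized / baseline * 100, 2)


# (baseline, quantized) pairs copied from the accuracy table.
scores = {
    "arc_challenge": (68.52, 66.98),
    "MuSR (0-shot)": (39.15, 40.21),
    "Math-lvl-5 (4-shot)": (41.69, 37.76),
}

for metric, (baseline, quantized) in scores.items():
    print(f"{metric}: {recovery(baseline, quantized):.2f}%")
```

A value above 100 means the quantized model scored higher than the baseline on that task. Recomputing from the rounded averages can differ in the last digit from the reported figure (51.24 / 52.46 gives 97.67 rather than 97.68), which is consistent with the averages having been computed from unrounded scores.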

Reproduction

The results were obtained using the following commands:
