Phi-4-reasoning-FP8-dynamic

Model Overview

Model Architecture: Phi3ForCausalLM
- Input: Text
- Output: Text
Model Optimizations:
- Activation quantization: FP8
- Weight quantization: FP8
Intended Use Cases: This model is designed to accelerate research on language models and to serve as a building block for generative AI powered features. It is intended for general purpose AI systems and applications (primarily in English) that require:
- Memory/compute constrained environments.
- Latency bound scenarios.
- Math reasoning and logic.

Release Date: 01/26/2026
Version: 1.0
Model Developers: Red Hat
ModelCar Storage URI: oci://registry.redhat.io/rhai/modelcar-phi-4-reasoning-fp8-dynamic:3.0
Validated on vLLM: 0.13.0
Validated on RHAIIS: 3.3
Validated on RHOAI: 3.3

Model Optimizations

This model was obtained by quantizing the activations and weights of Phi-4-reasoning to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements by approximately 50% and increasing matrix-multiply compute throughput by approximately 2x. Weight quantization also reduces disk size requirements by approximately 50%.
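The "approximately 50%" figure follows from simple arithmetic on the bit width. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes roughly 15 billion parameters (the model size shown on this card) and treats every parameter as stored at the stated width. The helper name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~15 billion parameters, all stored at the same bit width.
NUM_PARAMS = 15e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)
fp8_gb = weight_memory_gb(NUM_PARAMS, 8)
savings = 1 - fp8_gb / bf16_gb

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f"FP8 weights:    {fp8_gb:.0f} GB")
print(f"Reduction:      {savings:.0%}")
```

Under these assumptions the weights shrink from about 30 GB to about 15 GB. The real saving is somewhat smaller, because only the linear operators inside the transformer blocks are quantized and the quantization scales add a small overhead. The estimate also ignores KV cache and other runtime memory.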

Only the weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.
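To make the two schemes concrete, here is a minimal pure-Python sketch of a quantized linear layer. It is an illustration only, not the llm-compressor or vLLM implementation: it assumes "per-channel" means one scale per output channel (one per weight row), it simulates FP8 E4M3 rounding in software, and all function names are hypothetical. Weight scales are computed once ahead of time (static), while activation scales are recomputed for each token on every call (dynamic).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (saturating, ties to even)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exponent = max(math.frexp(mag)[1] - 1, -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits -> 8 steps per power of two
    return sign * round(mag / step) * step


def quantize_rows(matrix):
    """Symmetric quantization with one scale per row.

    Returns (quantized, scales) such that row[i] ~= quantized[i] * scales[i].
    """
    quantized, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def fp8_linear(x, w_q, w_scales):
    """Compute x @ W^T from quantized weights and dynamically quantized inputs."""
    x_q, x_scales = quantize_rows(x)  # dynamic: one scale per token, per call
    return [
        [sum(a * b for a, b in zip(x_row, w_row)) * x_s * w_s
         for w_row, w_s in zip(w_q, w_scales)]
        for x_row, x_s in zip(x_q, x_scales)
    ]


# Toy data: 2 output channels x 3 input features, and 2 tokens.
weight = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
tokens = [[1.0, 2.0, 3.0], [-0.5, 0.0, 4.0]]

w_q, w_scales = quantize_rows(weight)  # static: done once, offline
y_fp8 = fp8_linear(tokens, w_q, w_scales)
y_ref = [[sum(a * b for a, b in zip(t, w)) for w in weight] for t in tokens]
max_err = max(abs(a - b) for ra, rb in zip(y_fp8, y_ref) for a, b in zip(ra, rb))
print("max abs error vs. full precision:", max_err)
```

Both schemes are symmetric: each uses a single scale with no zero point, so the largest magnitude in a row maps to the edge of the FP8 range. Real deployments run the matrix multiply in hardware FP8 kernels rather than Python loops.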

Deployment

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1

from openai import OpenAI
# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

generated_text = client.chat.completions.create(
    model="RedHatAI/Phi-4-reasoning-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
)
print(generated_text.choices[0].message.content)
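Because the server above is started with `--reasoning-parser deepseek_r1`, the reasoning trace is separated from the final answer, and `message.content` holds only the answer. The sketch below shows one way to read both parts. It rests on an assumption: the reasoning field is exposed as either `reasoning` or `reasoning_content`, depending on the vLLM version, so check the field name against the server you run. The helper `split_reasoning` is hypothetical, and a stand-in object replaces the real response so the snippet runs without a server.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the reasoning trace, when present, is exposed as either
    `reasoning` or `reasoning_content`, depending on the vLLM version.
    """
    reasoning = getattr(message, "reasoning", None) or getattr(
        message, "reasoning_content", None
    )
    return reasoning, message.content


# Stand-in for `generated_text.choices[0].message`, so the sketch runs offline.
fake_message = SimpleNamespace(
    reasoning_content="placeholder reasoning trace",
    content="placeholder final answer",
)

reasoning, answer = split_reasoning(fake_message)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

With a live server, pass `generated_text.choices[0].message` in place of `fake_message`. If neither attribute is present, `reasoning` is `None` and only the answer is available.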

Creation

Evaluation

The model was evaluated on the AIME25, GPQA Diamond and Math 500 benchmarks using lighteval, and on MMLU-Pro using lm-evaluation-harness. In both cases, vLLM was used as the backend.

Accuracy

Benchmark	Phi-4-reasoning	Phi-4-reasoning-FP8-dynamic (this model)	Recovery
AIME25	61.25	64.58	105.4%
GPQA Diamond	64.65	66.50	102.9%
Math 500	90.01	88.60	98.4%
MMLU-Pro	76.49	76.85	100.5%
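The table does not define the Recovery column. Assuming it is the quantized model's score as a percentage of the unquantized baseline score, the short check below reproduces the listed values. The function name `recovery_pct` is hypothetical.

```python
# (baseline score, quantized score) copied from the accuracy table above.
scores = {
    "AIME25": (61.25, 64.58),
    "GPQA Diamond": (64.65, 66.50),
    "Math 500": (90.01, 88.60),
    "MMLU-Pro": (76.49, 76.85),
}


def recovery_pct(baseline: float, quantized: float) -> float:
    """Quantized score expressed as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


recoveries = {
    name: round(recovery_pct(base, quant), 1)
    for name, (base, quant) in scores.items()
}
for name, value in recoveries.items():
    print(f"{name}: {value}%")
```

A value above 100% means the quantized model scored higher than the baseline on that benchmark.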

Safetensors

Model size: 15B params
Tensor types: BF16, F8_E4M3

Model tree for RedHatAI/Phi-4-reasoning-FP8-dynamic
- Base model: microsoft/phi-4
- Finetuned: microsoft/Phi-4-reasoning
- Quantized (32): this model

URL: https://huggingface.co/RedHatAI/Phi-4-reasoning-FP8-dynamic
