Phi-4-reasoning-FP8-dynamic
Model Overview
- Model Architecture: Phi3ForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Activation quantization: FP8
  - Weight quantization: FP8
- Intended Use Cases: This model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It provides uses for general purpose AI systems and applications (primarily in English) which require:
  - Memory/compute constrained environments.
  - Latency bound scenarios.
  - Math reasoning and logic.
- Release Date: 01/26/2026
- Version: 1.0
- Model Developers: Red Hat
- ModelCar Storage URI: oci://registry.redhat.io/rhai/modelcar-phi-4-reasoning-fp8-dynamic:3.0
- Validated on vLLM: 0.13.0
- Validated on RHAIIS: 3.3
- Validated on RHOAI: 3.3
Model Optimizations
This model was obtained by quantizing the activations and weights of Phi-4-reasoning to the FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%.
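The "approximately 50%" figure follows directly from halving the bits per weight. The short calculation below is only an illustrative sketch: `NUM_PARAMS` is a hypothetical round number, not a figure taken from this model card, and the helper `weight_memory_gib` is introduced here for the example.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Return the memory needed to store num_params weights, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 14e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
fp8_gib = weight_memory_gib(NUM_PARAMS, 8)
saving = 1 - fp8_gib / bf16_gib

print(f"16-bit weights: {bf16_gib:.1f} GiB")
print(f"FP8 weights:    {fp8_gib:.1f} GiB")
print(f"Reduction:      {saving:.0%}")
```

In practice the saving is slightly below one half, because layers that are not quantized stay at 16 bits and the quantization scales need a small amount of extra storage.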
Only weights and activations of the linear operators within transformer blocks are quantized. Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme. The llm-compressor library is used for quantization.
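To make the two schemes concrete, here is a minimal pure-Python sketch. It is not the llm-compressor implementation. It assumes the FP8 E4M3 format (largest finite value 448) and simulates its rounding grid in software; the helper names `round_to_e4m3`, `quantize_rows`, and `dequantize_rows` are hypothetical. "Symmetric" means the zero-point is fixed at 0, so each row needs only a scale.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed target format)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest value on a simulated FP8 E4M3 grid."""
    if x == 0.0:
        return 0.0
    magnitude = min(abs(x), FP8_E4M3_MAX)
    # 3 mantissa bits give 8 steps per power of two; exponents below -6 are subnormal.
    exponent = max(math.floor(math.log2(magnitude)), -6)
    step = 2.0 ** (exponent - 3)
    rounded = min(round(magnitude / step) * step, FP8_E4M3_MAX)
    return math.copysign(rounded, x)


def quantize_rows(rows):
    """Symmetric quantization with one scale per row (zero-point is always 0)."""
    scales, quantized = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map quantized values back to the original range."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Weights: each row is one output channel. The scales are computed once,
# offline, and stored with the checkpoint ("static per-channel").
weights = [[1.0, -2.0, 0.5], [0.25, 4.0, -3.0]]
w_q, w_scales = quantize_rows(weights)

# Activations: each row is one token. The scales are recomputed at inference
# time for every input ("dynamic per-token").
activations = [[3.0, -1.0, 0.1], [0.02, 0.5, -0.25]]
a_q, a_scales = quantize_rows(activations)

print("weight scales:    ", w_scales)
print("activation scales:", a_scales)
print("dequantized weights:", dequantize_rows(w_q, w_scales))
```

The same scaling rule serves both cases; what differs is when the scale is computed and along which axis. Dynamic activation scales avoid the need for calibration data, at the cost of a small amount of extra work per token.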
Deployment
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
vllm serve RedHatAI/Phi-4-reasoning-FP8-dynamic --reasoning-parser deepseek_r1
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

generated_text = client.chat.completions.create(
    model="RedHatAI/Phi-4-reasoning-FP8-dynamic",
    messages=[
        {"role": "user", "content": "Give me a short introduction to large language model."},
    ],
)
print(generated_text.choices[0].message.content)
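Because the server is started with `--reasoning-parser`, the reasoning trace is typically returned separately from the final answer. The sketch below shows one way to read both. It assumes the trace is exposed on the message as `reasoning_content` (or `reasoning` in some server versions); this field name is an assumption and should be checked against the vLLM version in use. The helper `split_reasoning` is hypothetical, and a stand-in object replaces the real response so the example runs without a server.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the reasoning trace is exposed as `reasoning_content`
    (or `reasoning` in some server versions); neither is guaranteed.
    """
    reasoning = getattr(message, "reasoning_content", None)
    if reasoning is None:
        reasoning = getattr(message, "reasoning", None)
    return reasoning, message.content


# Stand-in for `generated_text.choices[0].message`, so the sketch runs offline.
fake_message = SimpleNamespace(
    reasoning_content="The user wants a brief overview of language models.",
    content="Large language models are neural networks trained on text.",
)

reasoning, answer = split_reasoning(fake_message)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Against a live server, the same helper would be called as `split_reasoning(generated_text.choices[0].message)`.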
Creation
Evaluation
The model was evaluated on the AIME25, GPQA Diamond, and Math 500 benchmarks using lighteval, and on MMLU-Pro using lm-evaluation-harness. In both cases, vLLM is used as the backend.
Accuracy
| Benchmark | Phi-4-reasoning | Phi-4-reasoning-FP8-dynamic (this model) | Recovery |
| --- | --- | --- | --- |
| AIME25 | 61.25 | 64.58 | 105.4% |
| GPQA Diamond | 64.65 | 66.50 | 102.9% |
| Math 500 | 90.01 | 88.60 | 98.4% |
| MMLU-Pro | 76.49 | 76.85 | 100.5% |
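The Recovery column is consistent with the quantized score divided by the baseline score, expressed as a percentage. The card does not state this formula explicitly, so the check below is a sketch under that assumption; the helper `recovery_percent` is hypothetical.

```python
# Scores copied from the accuracy table: (baseline, quantized).
scores = {
    "AIME25": (61.25, 64.58),
    "GPQA Diamond": (64.65, 66.50),
    "Math 500": (90.01, 88.60),
    "MMLU-Pro": (76.49, 76.85),
}


def recovery_percent(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline, rounded to one decimal."""
    return round(quantized / baseline * 100, 1)


recovery = {name: recovery_percent(b, q) for name, (b, q) in scores.items()}
for name, value in recovery.items():
    print(f"{name}: {value}%")
```

A value above 100% means the quantized model scored higher than the baseline on that benchmark.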
