Llama-4-Scout-17B-16E-Instruct-FP8-dynamic 👁 Model Icon

Model Overview

Model Architecture: Llama4ForConditionalGeneration
- Input: Text / Image
- Output: Text
Model Optimizations:
- Activation quantization: FP8
- Weight quantization: FP8
Release Date: 04/15/2025
Version: 1.0
Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5
Model Developers: Red Hat (Neural Magic)

Model Optimizations

This model was obtained by quantizing activations and weights of Llama-4-Scout-17B-16E-Instruct to FP8 data type. This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x). Weight quantization also reduces disk size requirements by approximately 50%. The llm-compressor library is used for quantization.

Deployment

This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Openshift AI, as shown in the example below.

Deploy on vLLM

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic"
number_gpus = 4

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Creation

Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long context RULER, multimodal MMMU, and multimodal ChartQA. All evaluations are obtained through lm-evaluation-harness.

Accuracy

	Recovery (%)	meta-llama/Llama-4-Scout-17B-16E-Instruct	RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic (this model)
ARC-Challenge 25-shot	100.36	69.37	69.62
GSM8k 5-shot	99.24	90.45	89.76
HellaSwag 10-shot	99.94	85.23	85.18
MMLU 5-shot	99.94	80.54	80.49
TruthfulQA 0-shot	99.17	61.41	60.90
WinoGrande 5-shot	98.88	77.90	77.03
OpenLLM v1 Average Score	99.59	77.48	77.16
IFEval 0-shot avg of inst and prompt acc	100.91	86.90	87.69
Big Bench Hard 3-shot	99.82	65.13	65.01
Math Lvl 5 4-shot	98.82	57.78	57.10
GPQA 0-shot	100.53	31.88	32.05
MuSR 0-shot	102.18	42.20	43.12
MMLU-Pro 5-shot	99.82	55.70	55.60
OpenLLM v2 Average Score	100.28	56.60	56.76
RULER seqlen = 131072 niah_multikey_1	101.36	88.20	89.40
RULER seqlen = 131072 niah_multikey_2	100.72	83.60	84.20
RULER seqlen = 131072 niah_multikey_3	96.19	78.80	75.80
RULER seqlen = 131072 niah_multiquery	100.79	95.40	96.15
RULER seqlen = 131072 niah_multivalue	97.22	73.75	71.70
RULER seqlen = 131072 niah_single_1	100.00	100.00	100.00
RULER seqlen = 131072 niah_single_2	100.00	99.80	99.80
RULER seqlen = 131072 niah_single_3	100.00	99.80	99.80
RULER seqlen = 131072 ruler_cwe	96.19	39.42	37.92
RULER seqlen = 131072 ruler_fwe	98.86	92.93	91.87
RULER seqlen = 131072 ruler_qa_hotpot	100.00	48.20	48.20
RULER seqlen = 131072 ruler_qa_squad	98.81	53.57	52.93
RULER seqlen = 131072 ruler_qa_vt	100.35	92.28	92.60
RULER seqlen = 131072 Average Score	99.49	80.44	80.03
MMMU 0-shot	97.92	53.44	52.33
ChartQA 0-shot exact_match	100.12	65.88	65.96
ChartQA 0-shot relaxed_accuracy	99.69	88.92	88.64
Multimodal Average Score	99.38	69.41	68.98

Downloads last month: 8,777

Safetensors

Model size

109B params

Tensor type

BF16

F8_E4M3

Model tree for RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

Base model

meta-llama/Llama-4-Scout-17B-16E

Finetuned

meta-llama/Llama-4-Scout-17B-16E-Instruct

Quantized

(36)

this model

Collections including RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

Quantized variants of the Llama 4 release by Meta. • 4 items • Updated Apr 30 • 2

May 2025 Collection of third-party generative AI models validated by Red Hat AI for use across the Red Hat AI Product Portfolio. • 39 items • Updated Apr 30 • 20

URL: https://huggingface.co/RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic

⇱ RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic · Hugging Face