Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
👁 Model Icon
👁 Validated BadgeModel Overview
- Model Architecture: Llama4ForConditionalGeneration
- Input: Text / Image
- Output: Text
- Model Optimizations:
- Activation quantization: None
- Weight quantization: INT4
- Release Date: 04/25/2025
- Version: 1.0
- Validated on: RHOAI 2.20, RHAIIS 3.0, RHELAI 1.5
- Model Developers: Red Hat (Neural Magic)
Model Optimizations
This model was obtained by quantizing weights of Llama-4-Scout-17B-16E-Instruct to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements by approximately 75%. Weight quantization also reduces disk size requirements by approximately 75%. The llm-compressor library is used for quantization.
Deployment
This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Openshift AI, as shown in the example below.
Deploy on vLLM
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
model_id = "RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16"
number_gpus = 4
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Give me a short introduction to large language model."
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
outputs = llm.generate(prompt, sampling_params)
generated_text = outputs[0].outputs[0].text
print(generated_text)
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (v1 and v2), long context RULER, multimodal MMMU, and multimodal ChartQA. All evaluations are obtained through lm-evaluation-harness.
Accuracy
| Recovery (%) | meta-llama/Llama-4-Scout-17B-16E-Instruct | RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 (this model) |
|
|---|---|---|---|
| ARC-Challenge 25-shot |
98.51 | 69.37 | 68.34 |
| GSM8k 5-shot |
100.4 | 90.45 | 90.90 |
| HellaSwag 10-shot |
99.67 | 85.23 | 84.95 |
| MMLU 5-shot |
99.75 | 80.54 | 80.34 |
| TruthfulQA 0-shot |
99.82 | 61.41 | 61.30 |
| WinoGrande 5-shot |
98.98 | 77.90 | 77.11 |
| OpenLLM v1 Average Score |
99.59 | 77.48 | 77.16 |
| IFEval 0-shot avg of inst and prompt acc |
99.51 | 86.90 | 86.47 |
| Big Bench Hard 3-shot |
99.46 | 65.13 | 64.78 |
| Math Lvl 5 4-shot |
99.22 | 57.78 | 57.33 |
| GPQA 0-shot |
100.0 | 31.88 | 31.88 |
| MuSR 0-shot |
100.9 | 42.20 | 42.59 |
| MMLU-Pro 5-shot |
98.67 | 55.70 | 54.96 |
| OpenLLM v2 Average Score |
99.54 | 56.60 | 56.34 |
| MMMU 0-shot |
100.6 | 53.44 | 53.78 |
| ChartQA 0-shot exact_match |
100.1 | 65.88 | 66.00 |
| ChartQA 0-shot relaxed_accuracy |
99.55 | 88.92 | 88.52 |
| Multimodal Average Score | 100.0 | 69.41 | 69.43 |
| RULER seqlen = 131072 niah_multikey_1 |
98.41 | 88.20 | 86.80 |
| RULER seqlen = 131072 niah_multikey_2 |
94.73 | 83.60 | 79.20 |
| RULER seqlen = 131072 niah_multikey_3 |
96.44 | 78.80 | 76.00 |
| RULER seqlen = 131072 niah_multiquery |
98.79 | 95.40 | 94.25 |
| RULER seqlen = 131072 niah_multivalue |
101.6 | 73.75 | 74.95 |
| RULER seqlen = 131072 niah_single_1 |
100.0 | 100.00 | 100.0 |
| RULER seqlen = 131072 niah_single_2 |
100.0 | 99.80 | 99.80 |
| RULER seqlen = 131072 niah_single_3 |
100.2 | 99.80 | 100.0 |
| RULER seqlen = 131072 ruler_cwe |
87.39 | 39.42 | 33.14 |
| RULER seqlen = 131072 ruler_fwe |
98.13 | 92.93 | 91.20 |
| RULER seqlen = 131072 ruler_qa_hotpot |
100.4 | 48.20 | 48.40 |
| RULER seqlen = 131072 ruler_qa_squad |
96.22 | 53.57 | 51.55 |
| RULER seqlen = 131072 ruler_qa_vt |
98.82 | 92.28 | 91.20 |
| RULER seqlen = 131072 Average Score |
98.16 | 80.44 | 78.96 |
- Downloads last month
- 6,644
Model tree for RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16
Base model
meta-llama/Llama-4-Scout-17B-16E