Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16

Model Overview

Model Architecture: Llama4ForConditionalGeneration
- Input: Text / Image
- Output: Text
Model Optimizations:
- Weight quantization: INT4
Release Date: 06/12/2025
Version: 1.0
Model Developers: Red Hat (Neural Magic)

Model Optimizations

This model was obtained by quantizing weights of Llama-4-Maverick-17B-128E-Instruct to INT4 data type. This optimization reduces the number of bits used to represent weights from 16 to 4, reducing GPU memory requirements by approximately 75%. Weight quantization also reduces disk size requirements by approximately 75%. The llm-compressor library is used for quantization.

Deployment

This model can be deployed efficiently on vLLM.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16"
number_gpus = 8

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "Give me a short introduction to large language model."

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.

Creation

Evaluation

The model was evaluated on the OpenLLM v1 leaderboard task, using lm-evaluation-harness. More evaluations are under way.

Accuracy

	Recovery (%)	meta-llama/Llama-4-Maverick-17B-128E-Instruct	RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 (this model)
ARC-Challenge 25-shot	96.6	73.55	71.08
GSM8k 5-shot	99.7	93.18	92.87
HellaSwag 10-shot	99.6	87.27	86.95
MMLU 5-shot	99.8	85.98	85.78
TruthfulQA 0-shot	100.0	62.81	62.85
WinoGrande 5-shot	100.5	78.53	78.93
OpenLLM v1 Average Score	99.4	80.22	79.74

Downloads last month: 3,364

Safetensors

Model size

405B params

Tensor type

BF16

I64

I32

Model tree for RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16

Base model

meta-llama/Llama-4-Maverick-17B-128E

Finetuned

meta-llama/Llama-4-Maverick-17B-128E-Instruct

Quantized

(18)

this model

Collection including RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16

Quantized variants of the Llama 4 release by Meta. • 4 items • Updated Apr 30 • 2

URL: https://huggingface.co/RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16

⇱ RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 · Hugging Face