Granite-4.0-h-tiny-FP8-dynamic
Model Overview
- Model Architecture: GraniteMoeHybridForCausalLM
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP8
  - Activation quantization: FP8
- Release Date:
- Version: 1.0
- Model Developers: Red Hat
- ModelCar Storage URI: oci://registry.redhat.io/rhai/modelcar-granite-4-0-h-tiny-fp8-dynamic:3.0
- Validated on vLLM: 0.13.0
- Validated on RHAIIS: 3.3
- Validated on RHOAI: 3.3
Quantized version of ibm-granite/granite-4.0-h-tiny.
Model Optimizations
This model was obtained by quantizing the weights and activations of ibm-granite/granite-4.0-h-tiny to the FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, which cuts disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
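The "approximately 50%" figure follows from simple arithmetic on bits per parameter. The sketch below is a rough back-of-the-envelope estimate, not a measurement: it assumes a rounded count of 7 billion parameters and treats every parameter as quantized, whereas only the linear operators inside the transformer blocks are stored in FP8, so the real saving is somewhat smaller. The helper name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: rounded parameter count; every parameter treated as quantized.
NUM_PARAMS = 7e9

bf16_gb = weight_footprint_gb(NUM_PARAMS, 16)  # 16 bits per parameter
fp8_gb = weight_footprint_gb(NUM_PARAMS, 8)    # 8 bits per parameter
reduction_pct = 100 * (1 - fp8_gb / bf16_gb)

print(f"BF16 weights: ~{bf16_gb:.1f} GB")
print(f"FP8 weights:  ~{fp8_gb:.1f} GB")
print(f"Reduction:    ~{reduction_pct:.0f}%")
```

The estimate covers weights only; activation memory and the KV cache depend on batch size and sequence length and are not modeled here.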
Deployment
Use with vLLM
- Install vLLM from main:
uv pip install -U git+https://github.com/vllm-project/vllm.git \
    --extra-index-url https://wheels.vllm.ai/nightly \
    --no-deps \
    --no-cache
uv pip install compressed-tensors==0.12.3a20251114 --no-cache
uv pip install --upgrade torchvision --break-system-packages --no-cache
uv pip install cloudpickle msgspec zmq blake3 cachetools prometheus_client fastapi openai openai_harmony pybase64 llguidance diskcache xgrammar lm-format-enforcer partial-json-parser cbor2 einops gguf numba --no-cache
- Initialize vLLM server:
vllm serve RedHatAI/granite-4.0-h-tiny-FP8-dynamic --tensor_parallel_size 1
- Send requests to the server:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/granite-4.0-h-tiny-FP8-dynamic"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
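For repeated queries, the request above can be wrapped in a small helper. This is a minimal sketch rather than part of the model card's official instructions: the function name `ask` and the default values for `max_tokens` and `temperature` are illustrative assumptions, and the `client` argument is expected to be an `OpenAI` client configured as shown above.

```python
def ask(client, model: str, prompt: str,
        max_tokens: int = 256, temperature: float = 0.0) -> str:
    """Send a single-turn chat request and return the generated text.

    `client` is any object exposing the OpenAI chat-completions interface,
    for example OpenAI(api_key="EMPTY", base_url="http://<your-server-host>:8000/v1").
    The defaults below are illustrative assumptions, not recommended settings.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,      # cap on the number of generated tokens
        temperature=temperature,    # 0.0 requests greedy decoding
    )
    return response.choices[0].message.content


# Example call (requires a running vLLM server):
# text = ask(client, "RedHatAI/granite-4.0-h-tiny-FP8-dynamic",
#            "Explain quantum mechanics clearly and concisely.")
```

Passing the client in as an argument keeps the helper independent of how the server address and API key are configured.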
Creation
This model was quantized using the llm-compressor library.
Evaluation
The model was evaluated on the OpenLLM leaderboard tasks (V1 and V2) using lm-evaluation-harness. vLLM was used as the inference backend for all evaluations.
Accuracy Comparison
| Category | Benchmark | ibm-granite/granite-4.0-h-tiny | RedHatAI/granite-4.0-h-tiny-FP8-dynamic | Recovery (%) |
|---|---|---|---|---|
| OpenLLM V1 | ARC-Challenge (Acc, 25-shot) | 62.97 | 62.37 | 99.05 |
| | GSM8K (Strict-Match, 5-shot) | 80.44 | 79.83 | 99.24 |
| | HellaSwag (Acc-Norm, 10-shot) | 61.75 | 61.56 | 99.69 |
| | MMLU (Acc, 5-shot) | 66.46 | 66.33 | 99.80 |
| | TruthfulQA (MC2, 0-shot) | 58.48 | 58.11 | 99.37 |
| | Winogrande (Acc, 5-shot) | 71.43 | 72.30 | 101.22 |
| | Average | 66.92 | 66.75 | 99.73 |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 70.62 | 71.10 | 100.68 |
| | MMLU-Pro (Acc, 5-shot) | 46.24 | 46.05 | 99.59 |
| | Average | 58.43 | 58.58 | 100.13 |
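The Recovery (%) column can be reproduced from the two score columns. The sketch below assumes recovery is the quantized score divided by the baseline score, times 100, and that the "Average" recovery is the mean of the per-benchmark recoveries rather than the ratio of the two average scores. Both rules are inferred from the numbers in the table, not stated in the source; the function name `recovery` is hypothetical.

```python
def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


# (baseline, quantized) scores for the OpenLLM V1 rows of the table.
openllm_v1 = {
    "ARC-Challenge": (62.97, 62.37),
    "GSM8K": (80.44, 79.83),
    "HellaSwag": (61.75, 61.56),
    "MMLU": (66.46, 66.33),
    "TruthfulQA": (58.48, 58.11),
    "Winogrande": (71.43, 72.30),
}

per_task = {name: recovery(b, q) for name, (b, q) in openllm_v1.items()}

# Assumption: the table's average recovery is the mean of per-task recoveries.
mean_recovery = sum(per_task.values()) / len(per_task)

for name, value in per_task.items():
    print(f"{name}: {value:.2f}")
print(f"Average recovery: {mean_recovery:.2f}")
```

A recovery above 100, as for Winogrande, means the quantized model scored slightly higher than the baseline on that benchmark, which is within the run-to-run variation expected of these evaluations.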
Safetensors
- Model size: 7B params
- Tensor type: BF16, F8_E4M3