URL: https://huggingface.co/RedHatAI/granite-4.0-h-tiny-FP8-dynamic

Granite-4.0-h-tiny-FP8-dynamic

Model Overview

  • Model Architecture: GraniteMoeHybridForCausalLM
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: FP8
    • Activation quantization: FP8
  • Release Date:
  • Version: 1.0
  • Model Developers: Red Hat
  • ModelCar Storage URI: oci://registry.redhat.io/rhai/modelcar-granite-4-0-h-tiny-fp8-dynamic:3.0
  • Validated on vLLM: 0.13.0
  • Validated on RHAIIS: 3.3
  • Validated on RHOAI: 3.3

Quantized version of ibm-granite/granite-4.0-h-tiny.

Model Optimizations

This model was obtained by quantizing the weights and activations of ibm-granite/granite-4.0-h-tiny to the FP8 data type. This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks of the language model are quantized.
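The "approximately 50%" figure follows from simple arithmetic on bits per parameter. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes a rounded count of 7 billion parameters (the size listed for this checkpoint) and treats every parameter as quantized. Because only the linear operators inside the transformer blocks are actually stored in FP8, the real saving should be somewhat smaller than this upper bound. The helper name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in GiB, ignoring scales and other overhead."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: rounded parameter count, with every parameter quantized.
NUM_PARAMS = 7e9

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
fp8_gib = weight_memory_gib(NUM_PARAMS, 8)

print(f"16-bit weights: {bf16_gib:.2f} GiB")
print(f"FP8 weights:    {fp8_gib:.2f} GiB")
print(f"FP8 / 16-bit:   {fp8_gib / bf16_gib:.0%}")
```

The estimate covers weights only; the KV cache and runtime activations add to the GPU memory actually needed at serving time.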

Deployment

Use with vLLM

  1. Install vLLM from main:
uv pip install -U git+https://github.com/vllm-project/vllm.git \
 --extra-index-url https://wheels.vllm.ai/nightly \
 --no-deps \
 --no-cache


uv pip install compressed-tensors==0.12.3a20251114 --no-cache
uv pip install --upgrade torchvision --break-system-packages --no-cache
uv pip install cloudpickle msgspec zmq blake3 cachetools prometheus_client fastapi openai openai_harmony pybase64 llguidance diskcache xgrammar lm-format-enforcer partial-json-parser cbor2 einops gguf numba --no-cache
  2. Start the vLLM server:
vllm serve RedHatAI/granite-4.0-h-tiny-FP8-dynamic --tensor_parallel_size 1
  3. Send requests to the server:
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/granite-4.0-h-tiny-FP8-dynamic"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
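
Before sending chat requests, it can help to confirm that the server is up and serving the expected model. The following is a minimal sketch using only the standard library; it assumes the server exposes the OpenAI-compatible model listing at `<base_url>/models`, and the helper names `extract_model_ids` and `served_model_ids` are hypothetical. The demonstration at the bottom runs offline against a hand-written payload, so no server is contacted.

```python
import json
from urllib.request import urlopen


def extract_model_ids(payload: dict) -> list[str]:
    """Return the model IDs from an OpenAI-style model list response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def served_model_ids(base_url: str, timeout: float = 5.0) -> list[str]:
    """Query the server's model list; raises URLError if it is unreachable."""
    with urlopen(f"{base_url}/models", timeout=timeout) as response:
        return extract_model_ids(json.load(response))


# Offline demonstration with a hand-written payload in the OpenAI list format.
sample = {
    "object": "list",
    "data": [{"id": "RedHatAI/granite-4.0-h-tiny-FP8-dynamic", "object": "model"}],
}
print(extract_model_ids(sample))
```

Against a running server, `served_model_ids("http://<your-server-host>:8000/v1")` should include the model name passed to `vllm serve`.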

Creation

This model was quantized using the llm-compressor library.

Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (V1 and V2) using lm-evaluation-harness. vLLM was used as the inference backend for all evaluations.

Accuracy Comparison

| Category | Benchmark | ibm-granite/granite-4.0-h-tiny | RedHatAI/granite-4.0-h-tiny-FP8-dynamic | Recovery (%) |
| --- | --- | --- | --- | --- |
| OpenLLM V1 | ARC-Challenge (Acc, 25-shot) | 62.97 | 62.37 | 99.05 |
| | GSM8K (Strict-Match, 5-shot) | 80.44 | 79.83 | 99.24 |
| | HellaSwag (Acc-Norm, 10-shot) | 61.75 | 61.56 | 99.69 |
| | MMLU (Acc, 5-shot) | 66.46 | 66.33 | 99.80 |
| | TruthfulQA (MC2, 0-shot) | 58.48 | 58.11 | 99.37 |
| | Winogrande (Acc, 5-shot) | 71.43 | 72.30 | 101.22 |
| | Average | 66.92 | 66.75 | 99.73 |
| OpenLLM V2 | IFEval (Inst Level Strict Acc, 0-shot) | 70.62 | 71.10 | 100.68 |
| | MMLU-Pro (Acc, 5-shot) | 46.24 | 46.05 | 99.59 |
| | Average | 58.43 | 58.58 | 100.13 |
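
The card does not state how the Recovery column is computed. The sketch below assumes it is the quantized score divided by the baseline score, times 100, and that the "Average" recovery is the mean of the per-task recoveries rather than the ratio of the two average scores. Both assumptions are inferred from the numbers in the table, which they reproduce for OpenLLM V1; the function name `recovery` is hypothetical.

```python
from statistics import mean


def recovery(baseline: float, quantized: float) -> float:
    """Quantized score as a percentage of the baseline score."""
    return quantized / baseline * 100


# (baseline, quantized) scores copied from the OpenLLM V1 rows of the table.
openllm_v1 = {
    "ARC-Challenge": (62.97, 62.37),
    "GSM8K": (80.44, 79.83),
    "HellaSwag": (61.75, 61.56),
    "MMLU": (66.46, 66.33),
    "TruthfulQA": (58.48, 58.11),
    "Winogrande": (71.43, 72.30),
}

per_task = {name: recovery(b, q) for name, (b, q) in openllm_v1.items()}

for name, value in per_task.items():
    print(f"{name}: {value:.2f}")
print(f"Average recovery: {mean(per_task.values()):.2f}")
```

A recovery above 100, as for Winogrande, means the quantized model scored slightly higher than the baseline on that task, which is within the run-to-run variation one would expect for differences this small.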
Safetensors

  • Model size: 7B params
  • Tensor types: BF16, F8_E4M3
