ibm-research/granite-4.0-h-3b-ar
granite-4.0-h-3b-ar: A multilingual LLM for English and Arabic, including MSA and its regional dialects.
Model Summary
granite-4.0-h-3b-ar is a lightweight 3-billion-parameter instruct model developed through a collaboration between IBM Research and the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) as part of the IBM–MBZUAI AI Center of Excellence. Despite its compact size, the model delivers strong performance, making it well-suited for efficient deployment in resource-constrained environments.
The model builds upon the capabilities of granite-4.0-h-micro-base and uses the same architecture (GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings), delivering enhanced performance across Modern Standard Arabic (MSA, ar) and multiple Arabic dialects, including Egyptian (arz), Moroccan (ary), Syrian/Levantine (apc), Saudi/Najdi (ars), and Emirati/Gulf (afb). At the same time, it retains strong proficiency in English (en), making it a robust and efficient multilingual model suitable for a wide range of cross-lingual and dialectal applications.
Key Technical Specifications
- Model Developers: IBM Research and MBZUAI (IBM–MBZUAI AI Center of Excellence)
- Languages: Arabic (MSA & dialects) and English
- Architecture: Decoder-only dense transformer architecture
- Parameters: 3 Billion
- Context Length: 8,192
- Vocabulary Size: 100,352
- Core architecture components: GQA, Mamba2, MLP with SwiGLU, RMSNorm, and shared input/output embeddings
Training Procedure
The enhanced Arabic and dialect-capable model is built starting from the granite-4.0-h-micro-base, following a carefully designed multi-stage pipeline. First, continued pretraining (CPT) adapts the base model to Arabic and its dialects by exposing it to large-scale bilingual data. This stage enables the model to acquire language-specific features and broaden its knowledge across multiple aspects of Arabic contexts, including culture, religion, and regional variations. Next, instruction fine-tuning (IFT) is applied to improve the model’s ability to follow Arabic and dialectal instructions precisely and consistently. Afterward, a layer-wise model merging strategy combines the fine-tuned model with the granite-4.0-h-micro model. This step is critical to ensure that the model retains its strong English capabilities while gaining enhanced Arabic performance. Finally, a second round of instruction fine-tuning is conducted on the merged model. This stage stabilizes the merged weights and further refines overall performance, resulting in a balanced, high-quality bilingual model that performs well across English, standard Arabic, and dialectal Arabic settings.
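The layer-wise merging step can be pictured as a per-layer interpolation between two checkpoints. The card does not specify the actual merging rule or coefficients, so the following is only a minimal sketch under stated assumptions: linear interpolation, plain Python lists standing in for weight tensors, and hypothetical names (`merge_layerwise`, `alpha_per_layer`) and values.

```python
import re

def layer_index(param_name):
    """Return the layer index encoded in a parameter name, or None if absent."""
    match = re.search(r"\.layers\.(\d+)\.", param_name)
    return int(match.group(1)) if match else None

def merge_layerwise(tuned, reference, alpha_per_layer, default_alpha=0.5):
    """Linearly interpolate two checkpoints with one mixing weight per layer.

    alpha is the share taken from `tuned`; (1 - alpha) comes from `reference`.
    Parameters outside any numbered layer (e.g. embeddings) use default_alpha.
    """
    merged = {}
    for name, tuned_values in tuned.items():
        alpha = alpha_per_layer.get(layer_index(name), default_alpha)
        merged[name] = [
            alpha * t + (1.0 - alpha) * r
            for t, r in zip(tuned_values, reference[name])
        ]
    return merged

# Toy checkpoints: parameter name -> flat list of weights (hypothetical values).
arabic_ift = {
    "model.layers.0.mlp.weight": [1.0, 1.0],
    "model.layers.1.mlp.weight": [1.0, 1.0],
}
english_instruct = {
    "model.layers.0.mlp.weight": [0.0, 0.0],
    "model.layers.1.mlp.weight": [0.0, 0.0],
}

# Assumed schedule: layer 0 leans on the English model, layer 1 on the Arabic one.
merged = merge_layerwise(arabic_ift, english_instruct, {0: 0.25, 1: 0.75})
print(merged)
```

With real checkpoints the same loop would run over `state_dict()` tensors, and the per-layer coefficients would be chosen by validation rather than fixed by hand.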
Training Data
The training data consists of a mixture of public English and Arabic datasets with permissive licenses, crawled MSA and dialectal data collected in accordance with IBM Data Acquisition guidelines, and synthetic data generated using open-weight large language models.
Rigorous data filtering and quality control procedures were applied to the crawled data. This included removing non-Arabic content using language identification, followed by both exact and fuzzy deduplication. Finally, additional filtering was performed using IBM’s Data Prep Kit to eliminate low-quality samples.
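The filtering order described above (language identification, then exact and fuzzy deduplication) can be sketched as follows. This is an illustrative toy, not the pipeline actually used: a character-script ratio stands in for a real language identifier, pairwise Jaccard similarity over character trigrams stands in for scalable fuzzy deduplication such as MinHash, and all thresholds and function names are assumptions.

```python
import hashlib

def arabic_ratio(text):
    """Fraction of alphabetic characters in the basic Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF") / len(letters)

def shingles(text, n=3):
    """Set of character n-grams over whitespace-normalized text."""
    norm = " ".join(text.split())
    return {norm[i:i + n] for i in range(max(len(norm) - n + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def clean_corpus(docs, min_arabic=0.5, fuzzy_threshold=0.8):
    """Apply language filter, exact dedup, then fuzzy dedup, in that order."""
    kept, seen_hashes, kept_shingles = [], set(), []
    for doc in docs:
        if arabic_ratio(doc) < min_arabic:
            continue  # 1) drop non-Arabic content
        digest = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # 2) drop exact duplicates
        doc_shingles = shingles(doc)
        if any(jaccard(doc_shingles, prev) >= fuzzy_threshold for prev in kept_shingles):
            continue  # 3) drop near duplicates
        seen_hashes.add(digest)
        kept_shingles.append(doc_shingles)
        kept.append(doc)
    return kept

docs = [
    "مرحبا بالعالم",      # kept
    "مرحبا بالعالم",      # exact duplicate
    "مرحبا بالعالم!",     # near duplicate (one extra character)
    "Hello world",        # non-Arabic
    "الطقس جميل اليوم",   # kept
]
cleaned = clean_corpus(docs)
print(len(cleaned))
```

The pairwise comparison is quadratic in the number of kept documents, so it only illustrates the logic; a crawl-scale pipeline would need an approximate method.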
Given the scarcity of high-quality regional dialectal data, a large portion of FineWeb-Edu English educational content was translated into five Arabic dialects using GPT-OSS-120B as the teacher model. The generated translations were further filtered using an LLM-as-a-judge framework with multiple models to ensure both knowledge accuracy and fluency.
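One way to combine several judges is to keep a translation only when every judge rates it acceptable on both criteria. The card does not describe the aggregation rule, the score scale, or the judge prompts, so everything below is an assumption: the 1 to 5 scale, the unanimity rule, and the two stub judges, which are stand-ins for real model calls.

```python
def keep_translation(sample, judges, min_score=4):
    """Keep a sample only if every judge scores accuracy and fluency >= min_score."""
    for judge in judges:
        scores = judge(sample)  # expected shape: {"accuracy": int, "fluency": int}
        if scores["accuracy"] < min_score or scores["fluency"] < min_score:
            return False
    return True

def lenient_judge(sample):
    """Stub judge that accepts everything (placeholder for an LLM call)."""
    return {"accuracy": 5, "fluency": 5}

def length_judge(sample):
    """Stub judge that penalizes translations much shorter than the source."""
    ratio = len(sample["translation"]) / max(len(sample["source"]), 1)
    return {"accuracy": 5 if ratio >= 0.5 else 2, "fluency": 5}

samples = [
    {"source": "Water boils at 100 degrees Celsius at sea level.",
     "translation": "<full dialect translation of the sentence>"},
    {"source": "Water boils at 100 degrees Celsius at sea level.",
     "translation": "<cut>"},
]
kept = [s for s in samples if keep_translation(s, [lenient_judge, length_judge])]
print(len(kept))
```

Requiring unanimity trades recall for precision; a majority vote or an averaged score would be an equally plausible design.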
Continued Pre-Training Data
In the Continued Pre-Training (CPT) stage, approximately 100B tokens of English and standard/dialectal Arabic data were used, with a roughly balanced distribution between the two languages. The training data spans diverse domains and is sourced from public datasets such as FineWeb, DCLM, HPLT, Webhose, and Wikipedia, as well as crawled and translated data. Extensive experiments were conducted to optimize the data mixture across these sources, with weights tuned on a held-out development set.
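To make the idea of mixture weights concrete, the sketch below splits a token budget across sources in proportion to their weights. The source names and weights are hypothetical placeholders chosen only so that English and Arabic come out balanced; the tuned weights themselves are not published in this card.

```python
def allocate_tokens(total_tokens, weights):
    """Split a token budget across sources in proportion to integer weights."""
    norm = sum(weights.values())
    return {source: total_tokens * w // norm for source, w in weights.items()}

# Hypothetical mixture: half English, half Arabic (MSA plus translated dialects).
weights = {"english_web": 5, "msa_web": 3, "dialect_translated": 2}
budget = allocate_tokens(100_000_000_000, weights)

for source, tokens in budget.items():
    print(f"{source}: {tokens / 1e9:.0f}B tokens")
```

Tuning the mixture then amounts to searching over `weights` and comparing development-set results for each candidate.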
Instruction Fine-Tuning Data
Due to the limited availability of Arabic and dialectal datasets with permissive licenses, we rely on English datasets and translated data for instruction fine-tuning. Approximately one million examples are used in this stage. The data is mainly curated to support general-domain applications, while also including domain-specific examples for math, coding, and reasoning tasks.
Infrastructure
The granite-4.0-h-3b-ar language model was trained using an NVIDIA H100 cluster hosted in an IBM data center. The training infrastructure is designed to support large-scale distributed workloads and can scale to thousands of GPUs.
How to Use
Here is a simple example of how to use granite-4.0-h-3b-ar for text generation:
- Prompt: "أخبرني عن شركة أي بي إم وعن جامعة محمد بن زايد للذكاء الاصطناعي" (Tell me about IBM company and Mohamed bin Zayed University of Artificial Intelligence.)
Using Hugging Face Transformers
Install the following libraries:
pip install torch torchvision torchaudio
pip install accelerate
pip install transformers
Then, use the following code:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
device = "cuda"
model_id = "ibm-research/granite-4.0-h-3b-ar"
messages = [
    {"role": "user", "content": "أخبرني عن شركة أي بي إم وعن جامعة محمد بن زايد للذكاء الاصطناعي"},
]
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device,
    torch_dtype=torch.bfloat16,
    # Requires the flash-attn package; remove this argument to use the default attention.
    attn_implementation="flash_attention_2",
)
model.eval()
# Render the chat messages into a single prompt string using the model's chat template.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
input_tokens = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**input_tokens, max_new_tokens=1024)
# The decoded text contains the prompt followed by the generated reply.
outputs = tokenizer.batch_decode(outputs)
print(outputs[0])
Expected output (actual output can vary):
<|start_of_role|>user<|end_of_role|>أخبرني عن شركة أي بي إم وعن جامعة محمد بن زايد للذكاء الاصطناعي<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|> 🇦🇪 أي بي إم، شركة عالمية كبيرة، تقدم خدمات تكنولوجيا معلومات وبرمجيات ومنتجات متنوعة. هي معروفة بتقنيتها المتقدمة وحلولها المبتكرة.
جامعة محمد بن زايد للذكاء الاصطناعي، مؤسسة رائدة في مجال الذكاء الاصطناعي، تهدف إلى تعزيز البحث والتطوير في هذا المجال. تسعى لتكون مركزًا للابتكار والتعاون في مجال الذكاء الاصطناعي.
<|end_of_text|>
Using vLLM (if a GPU is available)
from vllm import LLM, SamplingParams
model_id = "ibm-research/granite-4.0-h-3b-ar"
messages = [
    {"role": "user", "content": "أخبرني عن شركة أي بي إم وعن جامعة محمد بن زايد للذكاء الاصطناعي"},
]
llm = LLM(
    model=model_id,
    dtype="bfloat16",
    enforce_eager=False,
    gpu_memory_utilization=0.6,
    tensor_parallel_size=1,  # Num. of GPUs
)
# Get tokenizer and format messages
tokenizer = llm.get_tokenizer()
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
sampling_params = SamplingParams(max_tokens=1024)
outputs = llm.generate(prompt, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
Expected output (actual output can vary):
كشركة عالمية مشهورة في مجال التكنولوجيا، تقدم شركة أي بي إم حلولًا متقدمة للشركات والحكومات حول العالم. كما تقوم بتوفير خدماتها للعديد من المؤسسات في دولة الإمارات العربية المتحدة.
فيما يتعلق بجامعة محمد بن زايد للذكاء الاصطناعي، فهي مبادرة رائدة أطلقتها الحكومة الإماراتية بهدف خلق مجتمع معرفي وتعزيز التقنيات المتقدمة. تسعى الجامعة إلى جذب أفضل العقول في مجال الذكاء الاصطناعي من جميع أنحاء العالم وتوفير الدعم البحثي والتعليمي لهم.
Evaluation
We evaluate all models with LM-Evaluation-Harness in a 5-shot setting on the multilingual benchmarks below. The table lists the task type and the languages covered by each benchmark.
| Benchmarks | Type | Languages | Notes |
|---|---|---|---|
| Flores-200 | Translation | MSA <-> English, English <-> [Egyptian, Iraqi, Levantine, Moroccan, Najdi] | - |
| Alyah | QA | Emirati | UAE culture |
| INCLUDE 44 | QA | MSA | - |
| PalmX 2025 | QA | MSA | Arabic and Islamic culture |
| MMLU-ProX | QA | English, MSA | - |
| Belebele | QA | English, MSA, Egyptian, Iraqi, Levantine, Moroccan, Najdi | - |
| DialectalArabicMMLU | QA | English, MSA, Egyptian, Emirati, Moroccan, Saudi, Syrian | - |
| Global PIQA | QA | English, MSA, Egyptian, Iraqi, Levantine, Moroccan, Najdi | - |
Translation (5-shot)
The models are evaluated on the Flores-200 dataset, with performance measured by BLEU score. Here, Dial. represents the average performance over the dialects.
| Rank | Model | Size (B) | Avg ↓ | En→MSA | MSA→En | Dial.→MSA | Dial.→En |
|---|---|---|---|---|---|---|---|
| 1 | 🟢 granite-4.0-h-3b-ar | 3.2 | 28.32 | 27.64 | 36.54 | 15.23 | 33.86 |
| 2 | UBC-NileChat-3B | 3.1 | 26.30 | 22.06 | 37.00 | 13.58 | 32.58 |
| 3 | gemma-3-4b-it | 4.3 | 25.06 | 22.68 | 35.72 | 12.53 | 29.31 |
| 4 | jais-fam-2p7b-chat | 2.7 | 21.84 | 19.79 | 32.07 | 9.94 | 25.58 |
| 5 | Nile-Chat-4B | 3.9 | 21.54 | 9.32 | 37.12 | 8.42 | 31.28 |
| 6 | Qwen3-4B-Instruct-2507 | 4.0 | 20.96 | 14.64 | 33.12 | 9.42 | 26.68 |
| 7 | g4.0-h-micro_hf | 3.2 | 20.70 | 16.52 | 30.87 | 9.25 | 26.17 |
| 8 | Falcon-H1-3B-Instruct | 3.1 | 19.52 | 12.40 | 32.31 | 7.28 | 26.11 |
| 9 | Atlas-Chat-2B | 2.6 | 19.27 | 10.00 | 32.38 | 7.01 | 27.71 |
| 10 | g3.3-2b-instruct | 2.5 | 16.64 | 12.11 | 26.80 | 7.37 | 20.30 |
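The Avg column appears to be the unweighted mean of the four direction scores. The card does not state this explicitly, so treat it as an assumption; the short check below reproduces the reported average for the granite-4.0-h-3b-ar row from the values in the table.

```python
# BLEU scores for granite-4.0-h-3b-ar, copied from the table above.
row = {"En→MSA": 27.64, "MSA→En": 36.54, "Dial.→MSA": 15.23, "Dial.→En": 33.86}

# Unweighted mean over the four translation directions, rounded like the table.
avg = round(sum(row.values()) / len(row), 2)
print(avg)  # 28.32, matching the Avg column
```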
Understanding and Cultural Knowledge (5-shot)
The models are evaluated on a suite of QA tasks in generative mode. Performance is reported as exact match under the strict-match criterion: the model must follow the task's instructions exactly for its answer to be counted. Below, (Dial) represents the average performance over the dialects.
| Rank | Model | Size (B) | Avg ↓ | Belebele (En) | DialArMMLU (En) | MMLU-ProX (En) | Global PIQA (En) | Belebele (MSA) | DialArMMLU (MSA) | PalmX 2025 (MSA) | INCLUDE-base (MSA) | MMLU-ProX (MSA) | Global PIQA (MSA) | Alyah (Dial) | Belebele (Dial) | DialArMMLU (Dial) | Global PIQA (Dial) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-4B-Instruct-2507 | 4.0 | 67.16 | 91.67 | 75.41 | 60.77 | 81.0 | 85.22 | 61.15 | 62.91 | 54.35 | 46.17 | 64.0 | 61.72 | 70.67 | 56.01 | 69.2 |
| 2 | 🟢 granite-4.0-h-3b-ar | 3.2 | 65.8 | 88.0 | 69.98 | 36.94 | 82.0 | 78.89 | 61.21 | 64.02 | 59.96 | 28.23 | 81.0 | 66.24 | 68.76 | 59.11 | 76.8 |
| 3 | UBC-NileChat-3B | 3.1 | 65.35 | 89.22 | 67.27 | 33.39 | 76.0 | 83.44 | 58.44 | 69.05 | 57.97 | 25.59 | 77.0 | 71.7 | 74.4 | 55.49 | 76.0 |
| 4 | gemma-3-4b-it | 4.3 | 62.68 | 87.78 | 64.69 | 39.8 | 78.0 | 76.89 | 54.13 | 62.21 | 54.35 | 30.15 | 77.0 | 64.96 | 63.67 | 49.96 | 74.0 |
| 5 | Falcon-H1-3B-Instruct | 3.1 | 61.58 | 90.44 | 71.48 | 52.06 | 74.0 | 80.0 | 52.63 | 58.39 | 50.36 | 25.44 | 66.0 | 58.91 | 63.62 | 47.92 | 70.8 |
| 6 | g4.0-h-micro_hf | 3.2 | 58.82 | 87.78 | 69.09 | 43.36 | 80.0 | 72.67 | 52.89 | 54.84 | 46.74 | 28.26 | 65.0 | 55.16 | 52.2 | 47.69 | 67.8 |
| 7 | Nile-Chat-4B | 3.9 | 58.68 | 83.67 | 61.82 | 26.69 | 73.0 | 74.89 | 51.55 | 58.39 | 50.91 | 19.35 | 72.0 | 61.3 | 66.38 | 48.77 | 72.8 |
| 8 | Atlas-Chat-2B | 2.6 | 53.46 | 82.33 | 59.01 | 25.52 | 73.0 | 67.56 | 44.37 | 61.07 | 39.49 | 12.42 | 58.0 | 62.92 | 56.29 | 42.7 | 63.8 |
| 9 | g3.3-2b-instruct | 2.5 | 51.07 | 80.44 | 58.28 | 34.2 | 78.0 | 58.0 | 41.98 | 49.95 | 37.14 | 18.42 | 65.0 | 48.93 | 43.29 | 38.91 | 62.4 |
| 10 | jais-fam-2p7b-chat | 2.7 | 48.82 | 67.89 | 40.19 | 18.39 | 71.0 | 64.33 | 42.23 | 55.38 | 28.99 | 16.06 | 62.0 | 63.0 | 50.96 | 37.82 | 65.2 |
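The strict-match scoring used for the QA results can be illustrated with a small scorer. The answer format below (`Answer: <letter>`) and the regular expression are hypothetical; the real per-task formats and filters come from the LM-Evaluation-Harness task configurations, which this card does not reproduce.

```python
import re

# Hypothetical required output format: exactly "Answer: <A-D>" and nothing else.
ANSWER_PATTERN = re.compile(r"Answer: ([A-D])")

def strict_exact_match(generation, gold):
    """Return 1 only if the output follows the format exactly and the letter is right."""
    match = ANSWER_PATTERN.fullmatch(generation.strip())
    return int(match is not None and match.group(1) == gold)

predictions = ["Answer: B", "The answer is B.", "Answer: C"]
golds = ["B", "B", "B"]

score = 100 * sum(strict_exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)
print(f"{score:.2f}")
```

The second prediction names the right option but is scored as wrong because it ignores the format, which is why strict-match results also reflect instruction following.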
Ethical Considerations and Limitations
granite-4.0-h-3b-ar was trained on a mixture of English and Arabic data, including data covering five regional Arabic dialects. Although the model is designed to support Arabic and dialectal dialogue use cases, its performance may vary depending on the dialect, domain, prompt style, and task complexity, and may not always be comparable to its performance on English-language tasks. For dialect-specific or domain-specific applications, providing a small number of examples through few-shot prompting can help improve output accuracy and consistency. While granite-4.0-h-3b-ar has been developed with safety considerations in mind, it may still occasionally generate inaccurate, biased, inappropriate, or unsafe responses. Users and developers should conduct task-specific evaluation, safety testing, and additional tuning before deploying the model in production environments. For enterprise deployments, especially in safety-sensitive or high-risk settings, we recommend pairing granite-4.0-h-3b-ar with appropriate input and output moderation systems, such as Granite Guardian, to help detect and flag risks across relevant dimensions described in the IBM AI Risk Atlas.
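Few-shot prompting, as suggested above, can be done by placing example exchanges before the real request in the chat message list. The helper below is a minimal sketch; `build_few_shot_messages` is a hypothetical name and the example texts are placeholders to be replaced with real dialect or domain examples.

```python
def build_few_shot_messages(examples, query, system_prompt=None):
    """Build a chat message list with (user, assistant) demonstrations before the query."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages

# Placeholder demonstrations; substitute real in-dialect request/response pairs.
examples = [
    ("<example request 1>", "<desired response 1>"),
    ("<example request 2>", "<desired response 2>"),
]
messages = build_few_shot_messages(examples, "<actual request>")
print(len(messages))
```

The resulting `messages` list has the same shape as the one in the usage examples, so it can be passed to `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` unchanged.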
Citation
@misc{dialLLM,
title={granite-4.0-h-3b-ar},
author={IBM–MBZUAI AI Center of Excellence},
year={2026},
}
