VOOZH about

URL: https://huggingface.co/ManarAlrabie/arabic-llm-curated-500k

⇱ ManarAlrabie/arabic-llm-curated-500k · Hugging Face


ALLaM-7B-Instruct — Curated 500K

LoRA adapter for ALLaM-7B-Instruct-preview, fine-tuned on the human-curated Arabic instruction dataset CIDAR under a fixed budget of 500K training tokens. One of six adapters from a controlled study comparing human-curated versus synthetic Arabic instruction data under matched token budgets.

Model Details

  • Base model: humain-ai/ALLaM-7B-Instruct-preview
  • Adapter type: LoRA (QLoRA, 4-bit NF4)
  • Training data: CIDAR (human-curated)
  • Token budget: 500K tokens
  • Language: Arabic

Training Configuration

Setting Value
Quantization 4-bit NF4 (QLoRA)
LoRA rank / alpha 16 / 32
LoRA dropout 0.05
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Optimizer Paged AdamW (8-bit)
Learning rate 2e-4
LR scheduler cosine, 100 warmup steps
Epochs 3
Effective batch size 16 (2 × 8 grad. accum.)
Max sequence length 512
Precision fp16
Seed 42
Hardware NVIDIA A100 (40GB)

Evaluation

Evaluated with lm-evaluation-harness on seven Arabic benchmarks (ACVA 5-shot; others zero-shot). Accuracy:

Benchmark Score
Arab Culture 0.361
AlGhafa 0.594
AraDiCE 0.590
ACVA 0.770
Arabic Exams 0.508
ArabicMMLU 0.645
OpenAI MMLU (Ar) 0.412

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "humain-ai/ALLaM-7B-Instruct-preview"
tok = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(model, "ManarAlrabie/arabic-llm-curated-500k")

Intended Use & Limitations

Research artifact for studying instruction-data quality vs. quantity in Arabic LLM fine-tuning. Not intended for production. As a single-seed fine-tune of a 7B model, outputs may contain inaccuracies.

Citation

Associated paper is under review; citation will be added upon publication. Until then, please link to this repository.

Downloads last month
30

Model tree for ManarAlrabie/arabic-llm-curated-500k

Adapter
(18)
this model

Dataset used to train ManarAlrabie/arabic-llm-curated-500k