VOOZH about

URL: https://huggingface.co/ManarAlrabie/arabic-llm-curated-1m

⇱ ManarAlrabie/arabic-llm-curated-1m · Hugging Face


ALLaM-7B-Instruct — Curated 1M

LoRA adapter for ALLaM-7B-Instruct-preview, fine-tuned on the human-curated Arabic instruction dataset CIDAR under a fixed budget of 1M training tokens. One of six adapters from a controlled study comparing human-curated versus synthetic Arabic instruction data under matched token budgets.

Model Details

  • Base model: humain-ai/ALLaM-7B-Instruct-preview
  • Adapter type: LoRA (QLoRA, 4-bit NF4)
  • Training data: CIDAR (human-curated)
  • Token budget: 1M tokens
  • Language: Arabic

Training Configuration

Setting Value
Quantization 4-bit NF4 (QLoRA)
LoRA rank / alpha 16 / 32
LoRA dropout 0.05
Target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Optimizer Paged AdamW (8-bit)
Learning rate 2e-4
LR scheduler cosine, 100 warmup steps
Epochs 3
Effective batch size 16 (2 × 8 grad. accum.)
Max sequence length 512
Precision fp16
Seed 42
Hardware NVIDIA A100 (40GB)

Evaluation

Evaluated with lm-evaluation-harness on seven Arabic benchmarks (ACVA 5-shot; others zero-shot). Accuracy:

Benchmark Score
Arab Culture 0.355
AlGhafa 0.584
AraDiCE 0.590
ACVA 0.775
Arabic Exams 0.514
ArabicMMLU 0.644
OpenAI MMLU (Ar) 0.426

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "humain-ai/ALLaM-7B-Instruct-preview"
tok = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True, device_map="auto")
model = PeftModel.from_pretrained(model, "ManarAlrabie/arabic-llm-curated-1m")

Intended Use & Limitations

Research artifact for studying instruction-data quality vs. quantity in Arabic LLM fine-tuning. Not intended for production. As a single-seed fine-tune of a 7B model, outputs may contain inaccuracies.

Citation

Associated paper is under review; citation will be added upon publication. Until then, please link to this repository.

Downloads last month
13

Model tree for ManarAlrabie/arabic-llm-curated-1m

Adapter
(18)
this model

Dataset used to train ManarAlrabie/arabic-llm-curated-1m