Kabyle LoRA Adapter for AfriqueQwen3.5-4B
Fine-tuned LoRA adapter for Kabyle (kab) text generation, based on McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs.
This adapter removes the language confusion present in the base model, where Swahili, Somali, and Igbo text bleeds into Kabyle completions, and produces coherent, grammatically correct Kabyle sentences.
Training Details
| Parameter | Value |
|---|---|
| Base model | McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs |
| Training data | 326,147 filtered Kabyle sentences from Tatoeba |
| Raw data source | Tatoeba sentences dump (sentences.tar.bz2) |
| Fine-tuning steps | 1000 |
| Final training loss | 0.665 |
| LoRA rank (r) | 8 |
| LoRA alpha | 16 |
| Target modules | q_proj, v_proj |
| Max sequence length | 64 |
| Per-device batch size | 1 |
| Gradient accumulation | 4 (effective batch = 4) |
| Learning rate | 2e-4 |
| Optimizer | adamw_8bit |
| Quantization | 4-bit (NF4) with double quantization |
| Gradient checkpointing | Enabled |
| Hardware | NVIDIA T4 (Google Colab Free Tier) |
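
As a rough sanity check on the table above, the number of training sequences the adapter saw can be derived from the step count and the effective batch size. The sketch below assumes a single GPU, that every optimizer step consumes exactly one effective batch, and that sentences are neither packed nor repeated; the variable names are illustrative.

```python
# Values taken from the training table above.
per_device_batch_size = 1
gradient_accumulation = 4
steps = 1000
dataset_size = 326_147

# Assumption: one GPU, so effective batch = batch size x accumulation steps.
effective_batch = per_device_batch_size * gradient_accumulation
sequences_seen = steps * effective_batch
coverage = sequences_seen / dataset_size

print(f"effective batch: {effective_batch}")
print(f"sequences seen:  {sequences_seen}")
print(f"dataset covered: {coverage:.1%}")
```

Under these assumptions the run sees 4,000 sentences, roughly 1.2% of the filtered corpus, so the adapter was trained on well under one epoch.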
Dataset Filtering
- Source: Tatoeba sentences dump - 787,648 raw Kabyle sentences (lang == "kab")
- Quality filters applied:
  - Sentence length: 3-50 words
  - Minimum 2 Kabyle-specific characters: ɣ ṭ ḍ č ǧ ɛ ḥ ṛ ṣ ẓ
  - Contamination removal: excluded sentences containing Greek/Cyrillic look-alikes:
    - Greek epsilon ε (U+03B5)
    - Cyrillic reversed ze (an epsilon look-alike) Ԑ (U+0510), ԑ (U+0511)
    - Greek gamma γ (U+03B3), Γ (U+0393)
- Result: 326,147 clean, filtered Kabyle sentences
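
The three filters above can be sketched as a single predicate. This is a minimal reconstruction from the description, not the script actually used: it assumes words are whitespace-separated, that "minimum 2" counts occurrences rather than distinct characters, and that matching is case-insensitive after NFC normalization. The names `keep_sentence`, `KABYLE_CHARS`, and `LOOKALIKES` are hypothetical.

```python
import unicodedata

# Kabyle-specific characters listed above: ɣ ṭ ḍ č ǧ ɛ ḥ ṛ ṣ ẓ
KABYLE_CHARS = set("\u0263\u1e6d\u1e0d\u010d\u01e7\u025b\u1e25\u1e5b\u1e63\u1e93")

# Greek/Cyrillic look-alikes listed above: ε, Ԑ, ԑ, γ, Γ
LOOKALIKES = {"\u03b5", "\u0510", "\u0511", "\u03b3", "\u0393"}


def keep_sentence(text: str) -> bool:
    """Return True if a sentence passes all three quality filters."""
    # Assumption: normalize so letters typed with combining marks still match.
    text = unicodedata.normalize("NFC", text)

    # Contamination removal: reject any Greek/Cyrillic look-alike.
    if any(ch in LOOKALIKES for ch in text):
        return False

    # Sentence length: 3-50 whitespace-separated words.
    if not 3 <= len(text.split()) <= 50:
        return False

    # Assumption: count occurrences (not distinct characters), ignoring case.
    n_kabyle = sum(ch in KABYLE_CHARS for ch in text.lower())
    return n_kabyle >= 2


sample = "Ur b\u0263i\u0263 ara"  # "Ur bɣiɣ ara" with Latin gamma (U+0263)
print(keep_sentence(sample))              # True: 3 words, two Kabyle characters
print(keep_sentence("Aqcic i"))           # False: only 2 words
print(keep_sentence(sample + " \u03b5"))  # False: contains Greek epsilon
```

Rejecting look-alikes outright, rather than mapping them to their Latin counterparts, trades corpus size for certainty that every retained sentence uses the intended code points.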
Results
Before vs After Fine-Tuning
| Prompt | Base Model (Before) | Fine-Tuned (Step 1000) |
|---|---|---|
| Ur bɣiɣ ara | ad aɣ-yeǧǧen ad d-nernu iḍelli, ma d am (nonsense) | ad t-ttɣeṣbeḍ. |
| Anwa i | iji ihe eji eme ihe na-eme (Igbo/Yoruba!) | d-yeǧǧan Ḥmed ad yeddu? |
| Lemmer ad | ziwa kuti adzafunikanso (Swahili!) | yuɣal ɣer-sen, ad d-yuɣal deg-sen. |
| Aqcic i | iyo Faysal oo ka hadlay Dastuurka Soomaaliya (Somali!) | iḍelli. |
| Aselkim-iw ur | d-yeɣli yiwen (nonsense) | yelli ara deg wexxam. |
| Ass-a d | ass n 15 di yennayer 2019 (date regurgitation) | dɣa, ur d-ḥewwweṣeɣ ara. |
Loss Progression
| Step | Training Loss |
|---|---|
| 600 | 0.712 |
| 850 | 0.706 |
| 950 | 0.677 |
| 1000 | 0.665 |
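
For readers who prefer perplexity, the losses above can be converted with a one-line calculation. The sketch assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not stated in this card.

```python
import math

# Training losses from the table above (step -> loss).
losses = {600: 0.712, 850: 0.706, 950: 0.677, 1000: 0.665}

# Assumption: loss is mean per-token cross-entropy in nats,
# so perplexity = exp(loss).
perplexities = {step: math.exp(loss) for step, loss in losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step:>4}: loss {losses[step]:.3f} -> perplexity {ppl:.2f}")
```

These are training-set figures over subword tokens, so they indicate fit to the Tatoeba data rather than held-out quality.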
Key improvements:
- Zero language confusion - no more Swahili, Somali, or Igbo bleeding
- Grammatically correct - proper preverbs (ad, d-), clitic pronouns (t-, i), possessives (-iw)
- Semantically coherent - sentences make sense in context
- Natural endings - completes with periods and logical conclusions
- Cultural references - recognizes Kabyle names (Ḥmed) and places (Iɣil Azwaw)
Usage
Requirements
pip install transformers peft accelerate bitsandbytes torch
Load and Generate
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import torch

BASE_MODEL = "McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs"
ADAPTER_MODEL = "boffire/AfriqueQwen3.5-4B-Kabyle-LoRA"

# 4-bit quantization config (NF4 with double quantization, matching training)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # the library default is "fp4", so set NF4 explicitly
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

# Load tokenizer and base model
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, trust_remote_code=True)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)

# Load LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, ADAPTER_MODEL)

# Generate Kabyle text
prompt = "Taqbaylit d"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=25,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(completion)
# Example output (sampling is random, so results vary): Taqbaylit d tamaziɣt.
Example Prompts
test_prompts = [
    "Ur bɣiɣ ara",
    "Taqbaylit-iw d",
    "Anwa i",
    "Nekkni n",
    "Aselkim-iw ur",
    "Iḍelli tella",
    "Lemmer ad",
    "Aqcic i",
    "Ur tett ara",
    "Ass-a d",
]
Repository Contents
| File | Description |
|---|---|
| adapter_config.json | LoRA hyperparameters |
| adapter_model.safetensors | Trained LoRA weights (~2-4 MB) |
| tokenizer_config.json | Tokenizer configuration |
| tokenizer.json | Tokenizer vocabulary |
| special_tokens_map.json | Special token mappings |
| README.md | This file |
Base Model
This adapter is designed to be used with:
- Base: McGill-NLP/AfriqueQwen3.5-4B-ExtendedCM-ExtendedLangs
- Type: Causal Language Model (Base/Pre-trained)
- Parameters: 4B
- Context Length: 32,768 tokens
- African Languages: 50 (including Kabyle)
Citation
If you use this model, please cite:
AfriqueLLM paper:
@misc{yu2026afriquellmdatamixingmodel,
title={AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages},
author={Hao Yu and Tianyi Xu and Michael A. Hedderich and Wassim Hamidouche and Syed Waqas Zamir and David Ifeoluwa Adelani},
year={2026},
eprint={2601.06395},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.06395}
}
Tatoeba project:
Limitations
Model Architecture
- Base model, not chat: This adapter performs text completion only. It does not follow instructions, answer questions, or engage in conversation. For chat capabilities, further supervised fine-tuning (SFT) with instruction-response pairs is required.
- No chat template: The model does not recognize system/user/assistant roles. Inputs are treated as raw text to be continued.
- Small LoRA rank (r=8): Only ~917K parameters are trainable (0.02% of total). Complex reasoning, multi-step logic, or rare morphological patterns may be beyond capacity.
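
To put the trainable-parameter figure in context: LoRA adds two low-rank matrices to each adapted projection, A of shape (r x d_in) and B of shape (d_out x r), for r * (d_in + d_out) extra parameters per layer. The sketch below shows that formula and relates the reported ~917K parameters to the share of the model and the adapter file size. The dimensions passed to `lora_params` are placeholders for illustration, not the base model's actual layer sizes, and the function name is hypothetical.

```python
def lora_params(r: int, d_in: int, d_out: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical layer dimensions, for illustration only.
example = lora_params(r=8, d_in=1024, d_out=1024)
print(f"one 1024x1024 projection at r=8: {example} parameters")

# Figures reported in this card.
trainable = 917_000
total = 4_000_000_000  # "4B", rounded

fraction = trainable / total
size_fp32_mb = trainable * 4 / 1e6  # 4 bytes per parameter
size_fp16_mb = trainable * 2 / 1e6  # 2 bytes per parameter

print(f"trainable share: {fraction:.3%}")
print(f"adapter size: ~{size_fp16_mb:.1f} MB (FP16) to ~{size_fp32_mb:.1f} MB (FP32)")
```

The resulting size range is consistent with the ~2-4 MB listed for adapter_model.safetensors under Repository Contents.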
Training and Data
- Short context window: Training used a maximum sequence length of 64 tokens, so generation beyond roughly 64 tokens may degrade in quality or repeat phrases.
- Domain limited: Training data (Tatoeba) consists primarily of short, general-domain sentences. The model lacks exposure to technical, legal, medical, or academic Kabyle.
- No dialect awareness: Does not distinguish between Kabyle sub-dialects (At Mengellat, At Weɣlis, Tasaḥlit, etc.). Output may blend dialectal features.
- Script limited: Trained exclusively on Latin-script Kabyle. Tifinagh and Arabic-script Kabyle are not supported.
Semantic Coherence
- Grammatically correct but semantically inconsistent: The model reliably generates morphologically valid Kabyle (proper preverbs, clitics, possessives) but may produce sentences that are grammatically well-formed yet semantically odd or factually nonsensical. This is expected behavior for a base causal language model fine-tuned on text completion rather than instruction-following or reasoning tasks.
- Examples of semantic drift:
- "Aselkim-iw ur yelli ara deg wexxam" (grammatical but semantically odd: "My computer is not at home")
- "Taqbaylit-iw d taɛrabt i tt-yeččan" (contradiction: "My Kabyle was eaten by Arabic")
- Cause: The model predicts the most statistically likely next token based on training patterns. It does not understand meaning, fact-check, or reason about the real world. Lower sampling temperature or greedy decoding (do_sample=False) improves consistency at the cost of creativity.
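
The two mitigations mentioned above can be expressed as alternative sets of generation arguments. This is a sketch only: the dictionary names are hypothetical, and the temperature of 0.3 is an arbitrary illustrative value, not a tuned recommendation.

```python
# Sampling settings used in the Load and Generate example.
sampling_kwargs = dict(max_new_tokens=25, do_sample=True, temperature=0.7, top_p=0.9)

# Greedy decoding: disable sampling and drop the sampling-only parameters,
# which are not used when do_sample=False.
greedy_kwargs = {
    k: v for k, v in sampling_kwargs.items() if k not in ("temperature", "top_p")
}
greedy_kwargs["do_sample"] = False

# Lower-temperature sampling: keeps some variety but sharpens the distribution.
# Assumption: 0.3 is an illustrative value only.
cooler_kwargs = {**sampling_kwargs, "temperature": 0.3}

print(greedy_kwargs)
print(cooler_kwargs)

# With `model`, `tokenizer`, and `inputs` prepared as in Load and Generate:
# outputs = model.generate(**inputs, **greedy_kwargs,
#                          pad_token_id=tokenizer.eos_token_id)
```

Greedy decoding makes output deterministic for a given prompt, which is convenient when comparing checkpoints, while the cooler sampling variant is a middle ground.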
Content and Safety
- No safety filtering: The model has no built-in guardrails. It may generate toxic, biased, harmful, or culturally inappropriate content if prompted.
- Hallucination risk: As a base language model, it has no factual grounding. It may invent false information about Kabyle history, people, places, or events.
- Gender bias: Training data may reflect historical gender stereotypes present in Tatoeba contributions.
- Temporal cutoff: Factual knowledge is limited to the base model's training cutoff. No awareness of events after approximately 2024.
Multilingual Behavior
- French interference: Despite fine-tuning, the base model's strong French knowledge may cause French words to appear in Kabyle completions, especially for modern/technical concepts.
- Code-switching untested: Natural Kabyle-French code-switching (common in Algeria) is not explicitly handled and may produce unpredictable results.
- No translation capability: The model was not trained on parallel data. English to Kabyle or Kabyle to French translation will be unreliable.
Compute and Deployment
- 4-bit quantization assumed: The adapter was trained against a 4-bit (NF4) quantized base model and is intended to be loaded the same way. Loading the base weights in FP16 requires ~8GB VRAM; full precision (FP32) requires ~16GB.
- GPU recommended: CPU inference is possible but extremely slow (~10-30x slower than T4 GPU).
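
A back-of-the-envelope estimate of weight memory at each precision follows from the parameter count times bytes per parameter. The sketch assumes exactly 4 billion parameters and counts weights only; activations, the KV cache, quantization constants, and any layers kept in higher precision are ignored, so real usage will be higher. The helper name is hypothetical.

```python
PARAMS = 4_000_000_000  # "4B" from the Base Model section, rounded

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "nf4": 0.5}


def weight_memory_gb(n_params: int, dtype: str) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: ~{weight_memory_gb(PARAMS, dtype):.0f} GB")
```

On this estimate only the 4-bit configuration leaves comfortable headroom on a 16 GB T4 once activations and the KV cache are added.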
License
This model is released under the CC BY 4.0 License (same as the base model).
Acknowledgments
- McGill-NLP for the AfriqueLLM suite
- Tatoeba contributors for the Kabyle sentence corpus
- Qwen team for the base architecture
- Hugging Face for the PEFT and Transformers libraries