DistilBERT Kabyle-Tachelhit Language Classifier
A fine-tuned DistilBERT model for binary classification between Kabyle (kab) and Tachelhit (shi) languages, two major Berber/Tamazight languages of North Africa.
Model Details
| Attribute | Value |
|---|---|
| Architecture | DistilBERT (distilbert-base-multilingual-cased) |
| Parameters | ~135M total (the multilingual vocabulary embeddings of the base checkpoint dominate; the classification head adds under 1M) |
| Task | Text Classification (Binary: Kabyle vs Tachelhit) |
| Languages | Kabyle (kab), Tachelhit (shi) |
| Fine-tuned from | distilbert-base-multilingual-cased |
| Training data | 31,822 sentences (15,911 per class) |
| Accuracy | 98.86% on the held-out test set; 91.9% on 37 real-world sentences |
Training Data
Kabyle (kab)
- Source: Mozilla Common Voice Kabyle corpus (cleaned)
- Size: 15,911 training sentences
- Preprocessing: Greek epsilon (ε) normalized to Latin open E (ɛ), Greek gamma (γ) to Latin gamma (ɣ), Turkish ğ to Kabyle ǧ
- Splits: Train/Dev/Test from Common Voice validated clips
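The character normalization listed under Preprocessing can be sketched with `str.translate`. This is a minimal illustration, not the actual preprocessing script used for this model: the name `normalize_kabyle` is hypothetical, and only the three lowercase mappings named above are covered.

```python
# Hypothetical helper: maps the three look-alike characters named in the
# preprocessing note onto their standard Kabyle Latin counterparts.
LOOKALIKE_MAP = str.maketrans({
    "\u03b5": "\u025b",  # Greek epsilon -> Latin open E
    "\u03b3": "\u0263",  # Greek gamma -> Latin gamma
    "\u011f": "\u01e7",  # g with breve -> g with caron
})


def normalize_kabyle(text: str) -> str:
    """Return text with the look-alike characters replaced (lowercase only)."""
    return text.translate(LOOKALIKE_MAP)


# The Greek gamma in the input is replaced by the Latin gamma.
print(normalize_kabyle("Acu\u03b3er ur d-yusi ara?"))
```

Uppercase variants would need their own entries in the map if they occur in the data.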
Tachelhit (shi)
- Primary source: Tatoeba corpus (22,673 sentences)
- Secondary source: Mozilla Data Collective (230 clips)
- Size: 15,911 training sentences (balanced with Kabyle)
- Total unique: 22,730 sentences after deduplication
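The Tachelhit figures above imply two steps: deduplicating the merged sources, then downsampling to match the Kabyle class size. A minimal sketch of those steps, assuming exact-match deduplication after whitespace trimming and uniform random sampling. The actual criteria are not stated in this card, and `dedupe` and `balance` are hypothetical names.

```python
import random


def dedupe(sentences):
    """Drop empty strings and exact duplicates (after trimming), keeping first-seen order."""
    seen = set()
    unique = []
    for s in sentences:
        key = s.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def balance(sentences, target_size, seed=0):
    """Randomly downsample to target_size; return everything if already small enough."""
    if len(sentences) <= target_size:
        return list(sentences)
    return random.Random(seed).sample(sentences, target_size)


# Placeholder strings standing in for the merged Tachelhit sources.
merged = ["sentence a", "sentence a ", "sentence b", "", "sentence c"]
unique = dedupe(merged)
print(len(unique))              # 3
print(len(balance(unique, 2)))  # 2
```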
Performance
Test Set Results (6,820 sentences)
| Metric | Value |
|---|---|
| Accuracy | 98.86% |
| F1 Score | 98.85% |
| Precision | 99.14% |
| Recall | 98.56% |
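As a consistency check, F1 is the harmonic mean of precision and recall, and the reported values agree with each other:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = f1_score(0.9914, 0.9856)
print(f"{f1:.2%}")  # 98.85%
```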
Real-World Evaluation (37 diverse sentences)
| Metric | v1 (Original) | v2 (Updated) | Improvement |
|---|---|---|---|
| Accuracy | 75.7% | 91.9% | +16.2 pp |
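With only 37 sentences, the reported accuracies are consistent with 28 correct predictions for v1 and 34 for v2. These counts are inferred from the percentages and are not stated in this card.

```python
TOTAL = 37
v1_correct, v2_correct = 28, 34  # inferred from the percentages, not reported

v1_acc = v1_correct / TOTAL
v2_acc = v2_correct / TOTAL
print(f"{v1_acc:.1%} -> {v2_acc:.1%}")       # 75.7% -> 91.9%
print(f"+{(v2_acc - v1_acc) * 100:.1f} pp")  # +16.2 pp
```

At this sample size a single sentence shifts accuracy by about 2.7 percentage points, so the comparison is indicative rather than precise.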
Key Improvements Over v1
- Fixed encoding errors (Greek ε → Latin ɛ, γ → ɣ, ğ → ǧ)
- Added 22,673 Tachelhit sentences from Tatoeba
- Proper label mapping (kab/tach instead of generic LABEL_0/LABEL_1)
- Fixed notorious "Ihi" false positive (72% → 99.8% confidence)
- Eliminated high-confidence errors (v1 made wrong predictions with >90% confidence)
Usage
Basic Classification
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="boffire/distilbert-kabyle-tachelhit-classifier-v2",
    tokenizer="boffire/distilbert-kabyle-tachelhit-classifier-v2",
)

# Kabyle sentence
result = classifier("Acuɣer ur d-yusi ara?")
print(result)  # [{'label': 'kab', 'score': 1.000}]

# Tachelhit sentence
result = classifier("Ifl Ṭum Mary.")
print(result)  # [{'label': 'tach', 'score': 1.000}]
Robust Classification with Confidence Thresholding
For production use, apply confidence thresholds based on sentence length:
def classify_robust(sentence, classifier, min_words=4):
    if not sentence or len(sentence.strip()) == 0:
        return "EMPTY", 0.0, "REJECT"

    word_count = len(sentence.split())
    result = classifier(sentence)[0]
    label = result['label']
    confidence = result['score']

    # Stricter thresholds for short sentences
    if word_count < min_words:
        high_threshold = 0.95
        low_threshold = 0.85
    else:
        high_threshold = 0.90
        low_threshold = 0.70

    if confidence >= high_threshold:
        status = "HIGH_CONF"
    elif confidence >= low_threshold:
        status = "LOW_CONF"
    else:
        status = "REJECT"
        label = "ambiguous"

    return label, confidence, status
# Example usage
classifier = pipeline(
    "text-classification",
    model="boffire/distilbert-kabyle-tachelhit-classifier-v2",
    tokenizer="boffire/distilbert-kabyle-tachelhit-classifier-v2",
)

sentences = [
    "Acuɣer ur d-yusi ara?",
    "Ifl Ṭum Mary.",
    "Ihi",
    "Aql-aɣ nettaẓ ar zdat.",
]

for sent in sentences:
    label, conf, status = classify_robust(sent, classifier)
    print(f"[{status}] {label} ({conf:.3f}): {sent}")
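The thresholding logic can be exercised without downloading the model. The sketch below isolates it in a hypothetical helper, `status_for`, which mirrors the thresholds in `classify_robust` and takes the confidence and word count directly. The scores passed in are illustrative values, not model outputs.

```python
def status_for(confidence: float, word_count: int, min_words: int = 4) -> str:
    """Mirror of the threshold logic in classify_robust, minus the model call."""
    if word_count < min_words:
        high, low = 0.95, 0.85  # stricter for short sentences
    else:
        high, low = 0.90, 0.70
    if confidence >= high:
        return "HIGH_CONF"
    if confidence >= low:
        return "LOW_CONF"
    return "REJECT"


# The same score lands in different buckets depending on sentence length.
print(status_for(0.92, word_count=1))  # LOW_CONF
print(status_for(0.92, word_count=6))  # HIGH_CONF
print(status_for(0.80, word_count=1))  # REJECT
```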
Recommended Pipeline for Common Voice Data Cleaning
def cv_filter_pipeline(sentences, classifier):
    auto_accept = []
    manual_review = []

    for sent in sentences:
        label, conf, status = classify_robust(sent, classifier)
        if status == "HIGH_CONF":
            auto_accept.append({
                "sentence": sent,
                "label": label,
                "confidence": conf
            })
        else:
            manual_review.append({
                "sentence": sent,
                "predicted": label,
                "confidence": conf,
                "status": status
            })

    return auto_accept, manual_review
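One way to hand the `manual_review` list to annotators is a tab-separated file. A minimal sketch using the standard `csv` module: the helper name `write_review_tsv` and the column order are assumptions, and the row below is a placeholder in the shape `cv_filter_pipeline` returns, not a real model output.

```python
import csv
import io

FIELDS = ["sentence", "predicted", "confidence", "status"]


def write_review_tsv(rows, handle):
    """Write manual-review rows (dicts keyed by FIELDS) as tab-separated text."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Round the confidence so the file stays readable for reviewers.
        writer.writerow({**row, "confidence": f"{row['confidence']:.3f}"})


# Placeholder row; real rows come from cv_filter_pipeline.
rows = [
    {"sentence": "Ihi", "predicted": "ambiguous", "confidence": 0.61, "status": "REJECT"}
]
buffer = io.StringIO()  # swap for open("review.tsv", "w", encoding="utf-8", newline="")
write_review_tsv(rows, buffer)
print(buffer.getvalue())
```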
Limitations and Biases
Known Limitations
- Short sentences: Very short utterances (< 3 words) like "Ihi", "Wah", "Azul" are inherently ambiguous and may be misclassified even with high confidence
- Sentence length bias: The model performs better on longer sentences (5+ words) with clear morphological markers
- Domain mismatch: Training data is mostly written text (Tatoeba) and read speech (Common Voice). Spontaneous speech may differ.
- Loanwords: Sentences with heavy French/Arabic loanwords may be ambiguous
- Dialectal variation: Tachelhit has significant regional variation not fully captured
⚠️ Out-of-Distribution Detection
This model is a binary classifier (kab vs tach). It has no "other" or "non-Berber" class.
Do not use this model on text other than Kabyle or Tachelhit without a pre-filter. As noted in v1, the model will "forcefully route" any input into one of its two classes. This applies to other Amazigh varieties (Chaoui, Rifian, etc.) and to any non-Berber language: French, English, Arabic, or other text will be misclassified, often with misleadingly high confidence.
Recommended pipeline:
def classify_with_prefilter(text, fasttext_model, v2_classifier):
    # Stage 1: General LID pre-filter (FastText/OpenLID-v3)
    general_pred = fasttext_model.predict(text, k=1)
    lang = general_pred[0][0]

    # Stage 2: Only run v2 on Berber-like text
    if lang in ["__label__kab_Latn", "__label__shi_Latn", "__label__ber_Latn"]:
        return classify_robust(text, v2_classifier)  # kab vs tach
    return "NON_BERBER", 0.0, "REJECT"
Example failure:
- Input: "Bonjour à tous comment allez vous ?" (French)
- v2 output: kab (0.997), wrong with near-certain confidence
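FastText's `predict` returns a tuple of labels and a matching sequence of probabilities, so the pre-filter can also require a minimum confidence before text reaches v2. A sketch under assumptions: `is_berber_candidate` and the 0.5 cutoff are hypothetical, the label set is the one used in the recommended pipeline above, and `StubLID` only imitates the return shape (with made-up probabilities) so the example runs without a model file.

```python
BERBER_LABELS = {"__label__kab_Latn", "__label__shi_Latn", "__label__ber_Latn"}


def is_berber_candidate(text, lid_model, min_prob=0.5):
    """True if the top LID label is Berber-like and reasonably confident."""
    # fastText's predict() rejects newlines, so flatten the input first.
    labels, probs = lid_model.predict(text.replace("\n", " "), k=1)
    return labels[0] in BERBER_LABELS and float(probs[0]) >= min_prob


class StubLID:
    """Stand-in with fastText's (labels, probabilities) return shape."""

    def __init__(self, label, prob):
        self.label, self.prob = label, prob

    def predict(self, text, k=1):
        return (self.label,), [self.prob]


french = StubLID("__label__fra_Latn", 0.99)   # made-up probability
kabyle = StubLID("__label__kab_Latn", 0.97)   # made-up probability
print(is_berber_candidate("Bonjour à tous comment allez vous ?", french))  # False
print(is_berber_candidate("Acuɣer ur d-yusi ara?", kabyle))                # True
```

Text that fails this check would be returned as `NON_BERBER` instead of being passed to the kab/tach classifier.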
Biases
- Gender: Training data skews male (55.4% male vs 18.2% female in Kabyle Common Voice)
- Age: Overrepresented: thirties (27.4%), fifties (19.8%); Underrepresented: teens, elderly
- Geographic: Tachelhit data from Tatoeba may not represent all Tachelhit-speaking regions
- Orthographic: Model assumes standard Kabyle orthography (Mammeri/INALCO). Non-standard spellings may fail.
Ethical Considerations
- This model is intended for language identification and data filtering, not for making decisions about individuals
- Low-resource language context: Both Kabyle and Tachelhit are under-resourced languages. Errors have real impact on dataset quality and downstream ASR/TTS systems.
- Community involvement: Model development should involve native speaker communities for validation and feedback.
Citation
If you use this model, please cite:
@misc{boffire2026kabyletachelhit,
author = {boffire},
title = {DistilBERT Kabyle-Tachelhit Language Classifier},
year = {2026},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/boffire/distilbert-kabyle-tachelhit-classifier-v2}}
}
Related Models
- Original v1 model - Predecessor with generic labels and corrupted training data
- Kabyle ASR models - NVIDIA Conformer for Kabyle speech recognition
- Kabyle POS tagger - Part-of-speech tagging for Kabyle
Contact
For issues, improvements, or contributions, please open an issue on the HuggingFace repository or contact the maintainer.
This model was developed as part of Kabyle NLP and Common Voice community efforts to improve digital language resources for Berber languages.
