DistilBERT Kabyle-Tachelhit Language Classifier

A fine-tuned DistilBERT model for binary classification between Kabyle (kab) and Tachelhit (shi) languages, two major Berber/Tamazight languages of North Africa.

Model Details

Attribute	Value
Architecture	DistilBERT (distilbert-base-multilingual-cased)
Parameters	~135M total, including the classification head (most are multilingual vocabulary embeddings)
Task	Text Classification (Binary: Kabyle vs Tachelhit)
Languages	Kabyle (kab), Tachelhit (shi)
Fine-tuned from	`distilbert-base-multilingual-cased`
Training data	31,822 sentences (15,911 per class)
Test accuracy	98.86% on the held-out test set (6,820 sentences); 91.9% on 37 real-world sentences

Training Data

Kabyle (kab)

Source: Mozilla Common Voice Kabyle corpus (cleaned)
Size: 15,911 training sentences
Preprocessing: Greek epsilon (ε) normalized to Latin open E (ɛ), Greek gamma (γ) to Latin gamma (ɣ), Turkish ğ to Kabyle ǧ
Splits: Train/Dev/Test from Common Voice validated clips
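
The character normalization listed under Preprocessing can be sketched with `str.translate`. This is a minimal illustration, not the author's actual preprocessing script: the helper name `normalize_kabyle` is hypothetical, the NFC step is an added assumption (so that a decomposed g + combining breve is also caught), and only the three lowercase mappings named above are included.

```python
import unicodedata

# Map look-alike characters to the code points used in Kabyle Latin orthography.
# Only the three substitutions named in the model card are covered here.
KABYLE_CHAR_MAP = str.maketrans({
    "\u03b5": "\u025b",  # Greek small epsilon -> Latin small open E
    "\u03b3": "\u0263",  # Greek small gamma -> Latin small gamma
    "\u011f": "\u01e7",  # g with breve (Turkish) -> g with caron
})

def normalize_kabyle(text: str) -> str:
    """Compose to NFC, then replace look-alike characters (hypothetical helper)."""
    composed = unicodedata.normalize("NFC", text)  # assumption: not stated in the card
    return composed.translate(KABYLE_CHAR_MAP)

# "Acu" + Greek gamma + "er" comes back with a Latin gamma
print(normalize_kabyle("Acu\u03b3er"))
```

Uppercase variants would need their own entries in the map if they occur in the corpus.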

Tachelhit (shi)

Primary source: Tatoeba corpus (22,673 sentences)
Secondary source: Mozilla Data Collective (230 clips)
Size: 15,911 training sentences (balanced with Kabyle)
Total unique: 22,730 sentences after deduplication
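
The card does not say how deduplication or class balancing was carried out. The figures above are consistent with removing exact duplicates and then downsampling Tachelhit to the Kabyle training size. A minimal sketch of that idea follows; the helper names `dedupe` and `balance`, the fixed seed, and the toy data are all assumptions for illustration.

```python
import random

def dedupe(sentences):
    """Drop exact duplicates (after trimming whitespace), keeping first occurrences."""
    seen = set()
    unique = []
    for s in sentences:
        key = s.strip()
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

def balance(class_a, class_b, seed=0):
    """Randomly downsample the larger class to the size of the smaller one."""
    n = min(len(class_a), len(class_b))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(class_a, n), rng.sample(class_b, n)

# Toy data standing in for the real corpora
kab = ["a", "b", "c"]
shi = dedupe(["x", "y", "x ", "z", "w", "v"])
kab_bal, shi_bal = balance(kab, shi)
print(len(kab_bal), len(shi_bal))  # both classes now have the same size
```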

Performance

Test Set Results (6,820 sentences)

Metric	Value
Accuracy	98.86%
F1 Score	98.85%
Precision	99.14%
Recall	98.56%
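
As a quick consistency check, the reported F1 score can be recomputed from the precision and recall in the table. This assumes the three figures refer to the same positive class and are not macro-averaged, which the card does not state.

```python
precision = 0.9914
recall = 0.9856

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")  # F1 = 0.9885
```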

Real-World Evaluation (37 diverse sentences)

Metric	v1 (Original)	v2 (Updated)	Improvement
Accuracy	75.7%	91.9%	+16.2 pp
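
With only 37 sentences, each sentence is worth about 2.7 percentage points. The percentages above are consistent with 28 correct predictions for v1 and 34 for v2; these counts are inferred from the percentages and are not stated in the card.

```python
n = 37  # size of the real-world evaluation set

# Inferred counts of correct predictions (v1, v2)
for correct in (28, 34):
    print(f"{correct}/{n} = {correct / n:.1%}")
# 28/37 = 75.7%
# 34/37 = 91.9%

gain_pp = (34 - 28) / n * 100
print(f"gain = {gain_pp:.1f} pp")  # gain = 16.2 pp
```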

Key Improvements Over v1

Fixed encoding errors (Greek ε → Latin ɛ, γ → ɣ, ğ → ǧ)
Added 22,673 Tachelhit sentences from Tatoeba
Proper label mapping (kab/tach instead of generic LABEL_0/LABEL_1)
Fixed notorious "Ihi" false positive (72% → 99.8% confidence)
Eliminated high-confidence errors (v1 made wrong predictions with >90% confidence)
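
In `transformers`, readable labels normally come from the `id2label` and `label2id` entries of the model config, which can be passed to `from_pretrained` when fine-tuning. For outputs that still carry generic names, such as v1's, a small remapping helper is enough. The sketch below is illustrative only: the index order (0 for kab, 1 for tach) is an assumption that should be checked against the model's `config.json`, and `relabel` is a hypothetical helper.

```python
# Assumed index order; check the model's config.json before relying on it
ID2LABEL = {0: "kab", 1: "tach"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

def relabel(prediction, id2label=ID2LABEL):
    """Convert a generic 'LABEL_<n>' prediction dict to a named label."""
    label = prediction["label"]
    if label.startswith("LABEL_"):
        label = id2label[int(label.split("_", 1)[1])]
    return {"label": label, "score": prediction["score"]}

print(relabel({"label": "LABEL_1", "score": 0.98}))
```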

Usage

Basic Classification

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="boffire/distilbert-kabyle-tachelhit-classifier-v2",
    tokenizer="boffire/distilbert-kabyle-tachelhit-classifier-v2",
)

# Kabyle sentence
result = classifier("Acuɣer ur d-yusi ara?")
print(result) # [{'label': 'kab', 'score': 1.000}]

# Tachelhit sentence
result = classifier("Ifl Ṭum Mary.")
print(result) # [{'label': 'tach', 'score': 1.000}]

Robust Classification with Confidence Thresholding

For production use, apply confidence thresholds based on sentence length:

def classify_robust(sentence, classifier, min_words=4):
    """Return (label, confidence, status) using length-aware thresholds."""
    if not sentence or not sentence.strip():
        return "EMPTY", 0.0, "REJECT"

    word_count = len(sentence.split())
    result = classifier(sentence)[0]
    label = result["label"]
    confidence = result["score"]

    # Stricter thresholds for short sentences
    if word_count < min_words:
        high_threshold = 0.95
        low_threshold = 0.85
    else:
        high_threshold = 0.90
        low_threshold = 0.70

    if confidence >= high_threshold:
        status = "HIGH_CONF"
    elif confidence >= low_threshold:
        status = "LOW_CONF"
    else:
        # Below the low threshold: discard the predicted label
        status = "REJECT"
        label = "ambiguous"

    return label, confidence, status

# Example usage
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="boffire/distilbert-kabyle-tachelhit-classifier-v2",
    tokenizer="boffire/distilbert-kabyle-tachelhit-classifier-v2",
)

sentences = [
    "Acuɣer ur d-yusi ara?",
    "Ifl Ṭum Mary.",
    "Ihi",
    "Aql-aɣ nettaẓ ar zdat.",
]

for sent in sentences:
    label, conf, status = classify_robust(sent, classifier)
    print(f"[{status}] {label} ({conf:.3f}): {sent}")

Recommended Pipeline for Common Voice Data Cleaning

def cv_filter_pipeline(sentences, classifier):
    auto_accept = []
    manual_review = []

    for sent in sentences:
        label, conf, status = classify_robust(sent, classifier)

        if status == "HIGH_CONF":
            auto_accept.append({
                "sentence": sent,
                "label": label,
                "confidence": conf
            })
        else:
            manual_review.append({
                "sentence": sent,
                "predicted": label,
                "confidence": conf,
                "status": status
            })

    return auto_accept, manual_review

Limitations and Biases

Known Limitations

Short sentences: Very short utterances (< 3 words) like "Ihi", "Wah", "Azul" are inherently ambiguous and may be misclassified even with high confidence
Sentence length bias: The model performs better on longer sentences (5+ words) with clear morphological markers
Domain mismatch: Training data is mostly written text (Tatoeba) and read speech (Common Voice). Spontaneous speech may differ.
Loanwords: Sentences with heavy French/Arabic loanwords may be ambiguous
Dialectal variation: Tachelhit has significant regional variation not fully captured

⚠️ Out-of-Distribution Detection

This model is a binary classifier (kab vs tach). It has no "other" or "non-Berber" class.

Do not use on non-Berber text without a pre-filter. As noted for v1, the model "forcefully routes" any input into one of its two classes. This applies not only to other Amazigh varieties (Chaoui, Rifian, etc.) but to any non-Berber language: French, English, Arabic, or other text will be labeled kab or tach, often with misleadingly high confidence.

Recommended pipeline:

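# Stage 1: general LID pre-filter (FastText/OpenLID-v3).
# `fasttext_model` is a loaded fastText LID model; `v2_classifier` is the pipeline above.
BERBER_LABELS = {"__label__kab_Latn", "__label__shi_Latn", "__label__ber_Latn"}

def classify_with_prefilter(text, fasttext_model, v2_classifier):
    # fastText's predict() rejects newlines, so flatten the input first
    labels, _probs = fasttext_model.predict(text.replace("\n", " "), k=1)
    lang = labels[0]

    # Stage 2: only run v2 on Berber-like text
    if lang in BERBER_LABELS:
        return classify_robust(text, v2_classifier)  # kab vs tach
    return "NON_BERBER", 0.0, "REJECT"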

Example failure:

Input: "Bonjour à tous comment allez vous ?" (French)
v2 output: kab (0.997) — wrong with near-certain confidence

Biases

Gender: Training data skews male (55.4% male vs 18.2% female in Kabyle Common Voice)
Age: Thirties (27.4%) and fifties (19.8%) are overrepresented; teens and the elderly are underrepresented
Geographic: Tachelhit data from Tatoeba may not represent all Tachelhit-speaking regions
Orthographic: Model assumes standard Kabyle orthography (Mammeri/INALCO). Non-standard spellings may fail.

Ethical Considerations

Intended use: This model is intended for language identification and data filtering, not for making decisions about individuals.
Low-resource language context: Both Kabyle and Tachelhit are under-resourced languages. Errors have real impact on dataset quality and downstream ASR/TTS systems.
Community involvement: Model development should involve native speaker communities for validation and feedback.

Citation

If you use this model, please cite:

@misc{boffire2026kabyletachelhit,
 author = {boffire},
 title = {DistilBERT Kabyle-Tachelhit Language Classifier},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {\url{https://huggingface.co/boffire/distilbert-kabyle-tachelhit-classifier-v2}}
}

Related Models

Original v1 model - Predecessor with generic labels and corrupted training data
Kabyle ASR models - NVIDIA Conformer for Kabyle speech recognition
Kabyle POS tagger - Part-of-speech tagging for Kabyle

Contact

For issues, improvements, or contributions, please open an issue on the HuggingFace repository or contact the maintainer.

This model was developed as part of Kabyle NLP and Common Voice community efforts to improve digital language resources for Berber languages.
