URL: https://huggingface.co/boffire/distilbert-kabyle-tachelhit-classifier-v2

DistilBERT Kabyle-Tachelhit Language Classifier

A fine-tuned DistilBERT model for binary language identification: it distinguishes Kabyle (kab) from Tachelhit (shi), two major Berber/Tamazight languages of North Africa.

Model Details

| Attribute | Value |
|---|---|
| Architecture | DistilBERT (distilbert-base-multilingual-cased) |
| Parameters | ~135M total, including the classifier head |
| Task | Text classification (binary: Kabyle vs Tachelhit) |
| Languages | Kabyle (kab), Tachelhit (shi) |
| Fine-tuned from | distilbert-base-multilingual-cased |
| Training data | 31,822 sentences (15,911 per class) |
| Held-out test accuracy | 98.86% (6,820 sentences) |
| Real-world accuracy | 91.9% (37 diverse sentences) |

Training Data

Kabyle (kab)

  • Source: Mozilla Common Voice Kabyle corpus (cleaned)
  • Size: 15,911 training sentences
  • Preprocessing: Greek epsilon (ε) normalized to Latin open E (ɛ), Greek gamma (γ) to Latin gamma (ɣ), Turkish ğ to Kabyle ǧ
  • Splits: Train/Dev/Test from Common Voice validated clips
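
The character normalization listed above can be sketched with a translation table. This is a minimal illustration, not the author's actual cleaning script: the function name `normalize_kabyle` is hypothetical, and the uppercase mappings are an assumption added for completeness, since the card names only the lowercase forms.

```python
import unicodedata

# Look-alike characters mapped to the Kabyle letters named in the model card.
_LOOKALIKES = str.maketrans({
    "\u03b5": "\u025b",  # Greek epsilon ε -> Latin open E ɛ
    "\u03b3": "\u0263",  # Greek gamma γ   -> Latin gamma ɣ
    "\u011f": "\u01e7",  # Turkish ğ       -> Kabyle ǧ
    # Assumption: uppercase counterparts, not mentioned in the card.
    "\u0395": "\u0190",  # Ε -> Ɛ
    "\u0393": "\u0194",  # Γ -> Ɣ
    "\u011e": "\u01e6",  # Ğ -> Ǧ
})


def normalize_kabyle(text: str) -> str:
    """Replace Greek/Turkish look-alikes with the intended Kabyle letters."""
    # NFC first, so "g" + combining breve becomes the single code point ğ
    # and is caught by the table.
    return unicodedata.normalize("NFC", text).translate(_LOOKALIKES)


print(normalize_kabyle("Acu\u03b3er ur d-yusi ara?"))  # Acuɣer ur d-yusi ara?
```

Applying the same function at inference time keeps user input consistent with the cleaned training text.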

Tachelhit (shi)

  • Primary source: Tatoeba corpus (22,673 sentences)
  • Secondary source: Mozilla Data Collective (230 clips)
  • Size: 15,911 training sentences (balanced with Kabyle)
  • Total unique: 22,730 sentences after deduplication
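
The figures above imply a deduplicate-then-downsample step. The sketch below shows one way to do that; the card does not describe the actual procedure, so the function name `dedupe_and_balance`, the exact-match deduplication, and the seeded random sampling are all assumptions. The arithmetic also assumes the two source counts are raw totals.

```python
import random


def dedupe_and_balance(sentences, target_size, seed=0):
    """Drop exact duplicates, then downsample to at most target_size sentences."""
    # dict.fromkeys keeps the first occurrence of each sentence, in order.
    unique = list(dict.fromkeys(s.strip() for s in sentences if s.strip()))
    if len(unique) <= target_size:
        return unique
    # A seeded generator keeps the sampled subset reproducible.
    return random.Random(seed).sample(unique, target_size)


# Counts reported in the model card.
raw_total = 22_673 + 230                 # Tatoeba + Mozilla Data Collective
duplicates_removed = raw_total - 22_730  # reported unique total
print(raw_total, duplicates_removed)     # 22903 173

# Toy run with placeholder strings standing in for real sentences.
toy = ["s1", "s2", "s1 ", "s3", "s4"]
print(len(dedupe_and_balance(toy, target_size=3)))  # 3
```

With the real data, `target_size` would be 15,911 to match the Kabyle class.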

Performance

Test Set Results (6,820 sentences)

| Metric | Value |
|---|---|
| Accuracy | 98.86% |
| F1 Score | 98.85% |
| Precision | 99.14% |
| Recall | 98.56% |
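
As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall. The sketch assumes F1 is the usual harmonic mean for a single positive class; the card does not say which class was treated as positive or whether averaging was used.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.9914
reported_recall = 0.9856

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2%}")  # F1 = 98.85%
```

The recomputed value matches the table, so the three figures are mutually consistent.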

Real-World Evaluation (37 diverse sentences)

| Metric | v1 (Original) | v2 (Updated) | Improvement |
|---|---|---|---|
| Accuracy | 75.7% | 91.9% | +16.2 pp |
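
With only 37 sentences, each prediction moves accuracy by about 2.7 percentage points. The sketch below recovers the sentence counts that reproduce the reported percentages; the counts 28 and 34 are inferred from the percentages and are not stated in the card.

```python
n_sentences = 37
v1_correct, v2_correct = 28, 34  # inferred from the percentages, not stated in the card

v1_acc = v1_correct / n_sentences
v2_acc = v2_correct / n_sentences
gain_pp = (v2_acc - v1_acc) * 100

print(f"v1: {v1_acc:.1%}, v2: {v2_acc:.1%}, gain: {gain_pp:.1f} pp")
# v1: 75.7%, v2: 91.9%, gain: 16.2 pp
```

Under this reading, v2 gets six more sentences right than v1 and still misses three.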

Key Improvements Over v1

  • Fixed encoding errors (Greek ε → Latin ɛ, γ → ɣ, ğ → ǧ)
  • Added 22,673 Tachelhit sentences from Tatoeba
  • Proper label mapping (kab/tach instead of generic LABEL_0/LABEL_1)
  • Fixed notorious "Ihi" false positive (72% → 99.8% confidence)
  • Eliminated the high-confidence errors seen in v1 (wrong predictions made with >90% confidence)
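
The label-mapping fix is easiest to see in how logits are decoded. When a checkpoint carries no label names, Transformers falls back to `LABEL_0`/`LABEL_1`; a stored `id2label` mapping yields readable names instead. The sketch below mimics that decoding in plain Python. The id order (`0 → kab`, `1 → tach`) is an assumption and should be checked against the model's `config.json`.

```python
import math

# Assumed order; verify against the id2label entry in the model's config.json.
ID2LABEL = {0: "kab", 1: "tach"}


def decode(logits, id2label=None):
    """Softmax over the logits, then name the top class."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    # Without a mapping, fall back to the generic LABEL_<id> naming.
    name = id2label[best] if id2label else f"LABEL_{best}"
    return {"label": name, "score": probs[best]}


print(decode([2.0, 0.0])["label"])            # LABEL_0
print(decode([2.0, 0.0], ID2LABEL)["label"])  # kab
```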

Usage

Basic Classification

from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="boffire/distilbert-kabyle-tachelhit-classifier-v2",
    tokenizer="boffire/distilbert-kabyle-tachelhit-classifier-v2",
)

# Kabyle sentence
result = classifier("Acuɣer ur d-yusi ara?")
print(result) # [{'label': 'kab', 'score': 1.000}]

# Tachelhit sentence
result = classifier("Ifl Ṭum Mary.")
print(result) # [{'label': 'tach', 'score': 1.000}]

Robust Classification with Confidence Thresholding

For production use, apply confidence thresholds based on sentence length:

def classify_robust(sentence, classifier, min_words=4):
    # Reject empty or whitespace-only input without calling the model
    if not sentence or len(sentence.strip()) == 0:
        return "EMPTY", 0.0, "REJECT"

    word_count = len(sentence.split())
    result = classifier(sentence)[0]
    label = result['label']
    confidence = result['score']

    # Stricter thresholds for short sentences
    if word_count < min_words:
        high_threshold = 0.95
        low_threshold = 0.85
    else:
        high_threshold = 0.90
        low_threshold = 0.70

    if confidence >= high_threshold:
        status = "HIGH_CONF"
    elif confidence >= low_threshold:
        status = "LOW_CONF"
    else:
        status = "REJECT"
        label = "ambiguous"

    return label, confidence, status

# Example usage
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="boffire/distilbert-kabyle-tachelhit-classifier-v2",
    tokenizer="boffire/distilbert-kabyle-tachelhit-classifier-v2",
)

sentences = [
    "Acuɣer ur d-yusi ara?",
    "Ifl Ṭum Mary.",
    "Ihi",
    "Aql-aɣ nettaẓ ar zdat.",
]

for sent in sentences:
    label, conf, status = classify_robust(sent, classifier)
    print(f"[{status}] {label} ({conf:.3f}): {sent}")
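
To see how the length-dependent thresholds behave without downloading the model, the logic can be exercised with a stand-in classifier. In the sketch below, `make_stub` is a hypothetical helper that always returns one fixed prediction, and the scores are invented for illustration. `classify_robust` is repeated in compact form, with the same behavior as above, only so the block runs on its own.

```python
def classify_robust(sentence, classifier, min_words=4):
    # Compact restatement of the function above, same behavior.
    if not sentence or not sentence.strip():
        return "EMPTY", 0.0, "REJECT"
    result = classifier(sentence)[0]
    label, confidence = result["label"], result["score"]
    short = len(sentence.split()) < min_words
    high, low = (0.95, 0.85) if short else (0.90, 0.70)
    if confidence >= high:
        return label, confidence, "HIGH_CONF"
    if confidence >= low:
        return label, confidence, "LOW_CONF"
    return "ambiguous", confidence, "REJECT"


def make_stub(label, score):
    """Hypothetical stand-in for the pipeline: always returns one fixed prediction."""
    return lambda text: [{"label": label, "score": score}]


stub = make_stub("kab", 0.93)  # invented score, for illustration only

# One word: judged against the stricter 0.95 / 0.85 thresholds.
print(classify_robust("Ihi", stub))                    # ('kab', 0.93, 'LOW_CONF')
# Four words: judged against the 0.90 / 0.70 thresholds.
print(classify_robust("Acuɣer ur d-yusi ara?", stub))  # ('kab', 0.93, 'HIGH_CONF')
```

The same score lands in different buckets depending on sentence length, which is the intended effect of the stricter short-sentence thresholds.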

Recommended Pipeline for Common Voice Data Cleaning

def cv_filter_pipeline(sentences, classifier):
    auto_accept = []
    manual_review = []

    for sent in sentences:
        label, conf, status = classify_robust(sent, classifier)

        # Only high-confidence predictions skip human review
        if status == "HIGH_CONF":
            auto_accept.append({
                "sentence": sent,
                "label": label,
                "confidence": conf
            })
        else:
            manual_review.append({
                "sentence": sent,
                "predicted": label,
                "confidence": conf,
                "status": status
            })

    return auto_accept, manual_review
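
One possible way to hand the `manual_review` list to human reviewers is a tab-separated file sorted by confidence. This is a sketch of an assumed workflow, not part of the original pipeline: the helper name `write_review_queue` is hypothetical and the example rows are invented.

```python
import csv
import io

FIELDS = ["sentence", "predicted", "confidence", "status"]


def write_review_queue(manual_review, handle):
    """Write review items as TSV, least confident first."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t")
    writer.writeheader()
    for row in sorted(manual_review, key=lambda r: r["confidence"]):
        writer.writerow(row)


# Invented items in the shape produced by cv_filter_pipeline.
manual_review = [
    {"sentence": "Ihi", "predicted": "kab", "confidence": 0.91, "status": "LOW_CONF"},
    {"sentence": "Wah", "predicted": "ambiguous", "confidence": 0.62, "status": "REJECT"},
]

# In practice: open("review.tsv", "w", newline="", encoding="utf-8")
buffer = io.StringIO()
write_review_queue(manual_review, buffer)
print(buffer.getvalue())
```

Sorting by confidence puts the most doubtful sentences at the top of the reviewer's queue.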

Limitations and Biases

Known Limitations

  1. Short sentences: Very short utterances (< 3 words) like "Ihi", "Wah", "Azul" are inherently ambiguous and may be misclassified even with high confidence
  2. Sentence length bias: The model performs better on longer sentences (5+ words) with clear morphological markers
  3. Domain mismatch: Training data is mostly written text (Tatoeba) and read speech (Common Voice). Spontaneous speech may differ.
  4. Loanwords: Sentences with heavy French/Arabic loanwords may be ambiguous
  5. Dialectal variation: Tachelhit has significant regional variation not fully captured

⚠️ Out-of-Distribution Detection

This model is a binary classifier (kab vs tach). It has no "other" or "non-Berber" class.

Do not use on non-Berber text without a pre-filter. As noted in v1, the model will "forcefully route" any input into one of two classes. This applies not only to other Amazigh varieties (Chaoui, Rifian, etc.) but to any non-Berber language — French, English, Arabic, or others will be misclassified, often with misleadingly high confidence.

Recommended pipeline:

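# Stage 1: general LID pre-filter (FastText/OpenLID-v3).
# fasttext_model and v2_classifier are assumed to be loaded elsewhere.
BERBER_LABELS = {"__label__kab_Latn", "__label__shi_Latn", "__label__ber_Latn"}

def classify_with_prefilter(text, fasttext_model, v2_classifier):
    # fastText's predict() rejects newlines, so flatten the input first
    labels, _scores = fasttext_model.predict(text.replace("\n", " "), k=1)
    lang = labels[0]

    # Stage 2: only run v2 on Berber-like text
    if lang not in BERBER_LABELS:
        return "NON_BERBER", 0.0, "REJECT"
    # Reuse classify_robust so both branches return (label, confidence, status)
    return classify_robust(text, v2_classifier)  # kab vs tach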

Example failure:

  • Input: "Bonjour à tous comment allez vous ?" (French)
  • v2 output: kab (0.997), wrong with near-certain confidence
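
The effect of the pre-filter on this failure case can be checked with stand-ins for both models. In the sketch below, `make_lid` and `fake_v2` are hypothetical stubs: the first mimics the `(labels, scores)` return shape of fastText's `predict`, the second reproduces the erroneous `kab (0.997)` output. The `"PASSED_PREFILTER"` status is an illustrative name, not something the model card defines.

```python
BERBER_LABELS = {"__label__kab_Latn", "__label__shi_Latn", "__label__ber_Latn"}


def gated_classify(text, lid_predict, v2_classifier):
    """Run the binary classifier only if the general LID says the text is Berber-like."""
    labels, _scores = lid_predict(text.replace("\n", " "), k=1)
    if labels[0] not in BERBER_LABELS:
        return "NON_BERBER", 0.0, "REJECT"
    result = v2_classifier(text)[0]
    return result["label"], result["score"], "PASSED_PREFILTER"


def make_lid(label):
    """Hypothetical stub with the same return shape as fastText's predict()."""
    return lambda text, k=1: ((label,), (0.99,))


def fake_v2(text):
    """Hypothetical stub reproducing the failure: everything is 'kab' at 0.997."""
    return [{"label": "kab", "score": 0.997}]


french = "Bonjour à tous comment allez vous ?"
print(gated_classify(french, make_lid("__label__fra_Latn"), fake_v2))
# ('NON_BERBER', 0.0, 'REJECT')
```

The French sentence never reaches the binary classifier, so its overconfident `kab` prediction is never produced.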

Biases

  • Gender: Training data skews male (55.4% male vs 18.2% female in Kabyle Common Voice)
  • Age: Speakers in their thirties (27.4%) and fifties (19.8%) are overrepresented; teens and the elderly are underrepresented
  • Geographic: Tachelhit data from Tatoeba may not represent all Tachelhit-speaking regions
  • Orthographic: Model assumes standard Kabyle orthography (Mammeri/INALCO). Non-standard spellings may fail.

Ethical Considerations

  • This model is intended for language identification and data filtering, not for making decisions about individuals
  • Low-resource language context: Both Kabyle and Tachelhit are under-resourced languages. Errors have real impact on dataset quality and downstream ASR/TTS systems.
  • Community involvement: Model development should involve native speaker communities for validation and feedback.

Citation

If you use this model, please cite:

@misc{boffire2026kabyletachelhit,
 author = {boffire},
 title = {DistilBERT Kabyle-Tachelhit Language Classifier},
 year = {2026},
 publisher = {HuggingFace},
 howpublished = {\url{https://huggingface.co/boffire/distilbert-kabyle-tachelhit-classifier-v2}}
}

Contact

For issues, improvements, or contributions, please open an issue on the HuggingFace repository or contact the maintainer.


This model was developed as part of Kabyle NLP and Common Voice community efforts to improve digital language resources for Berber languages.
