URL: https://huggingface.co/katoernest/afro-xlmr-swahili-news-classifier


AfroXLMR Swahili News Classifier

Fine-tuned Davlan/afro-xlmr-mini on the MasakhaNEWS Swahili dataset for multi-class news and community report classification.

Built for civic reporting platforms operating across East Africa where community reports arrive in Swahili and need to be categorised for decision-making.

Model Summary

Property            Value
Base model          Davlan/afro-xlmr-mini
Language            Swahili (sw)
Task                Multi-class text classification
Dataset             MasakhaNEWS Swahili
Training samples    1,658
Validation samples  237
Test samples        476
Epochs              5
Best F1             0.4736
Best accuracy       59.49%

Training Results

Epoch  Train Loss  Val Loss  Accuracy  F1
1      1.9317      1.8947    32.49%    0.2068
2      1.8470      1.8108    41.35%    0.2943
3      1.7536      1.7107    55.27%    0.4362
4      1.6792      1.6551    59.49%    0.4705
5      1.6402      1.6337    59.49%    0.4736
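As a sanity check on the table, the accuracy column can be converted back into whole numbers of correct predictions. The sketch below assumes the per-epoch accuracy was measured on the 237 validation samples; the card does not say so explicitly, but every reported percentage rounds to a whole count on that split.

```python
# Validation set size and per-epoch accuracy, copied from the card.
VAL_SAMPLES = 237
ACCURACY_PCT = {1: 32.49, 2: 41.35, 3: 55.27, 4: 59.49, 5: 59.49}


def correct_count(accuracy_pct: float, n_samples: int) -> int:
    """Nearest whole number of correct predictions for a given accuracy."""
    return round(accuracy_pct / 100 * n_samples)


counts = {
    epoch: correct_count(acc, VAL_SAMPLES)
    for epoch, acc in ACCURACY_PCT.items()
}

for epoch, n_correct in counts.items():
    # Recompute the percentage from the whole count to confirm it matches.
    recomputed = 100 * n_correct / VAL_SAMPLES
    print(f"epoch {epoch}: {n_correct}/{VAL_SAMPLES} correct = {recomputed:.2f}%")
```

Under that assumption, epochs 4 and 5 both correspond to 141 correct predictions out of 237, which is why the accuracy column does not move between them.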

Training and validation loss both decrease across all 5 epochs, with no sign of overfitting. Accuracy plateaus at 59.49% from epoch 4, while F1 still edges up from 0.4705 to 0.4736.
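F1 trails accuracy in every row of the table (0.4736 against 59.49% at epoch 5). The card does not state which F1 averaging was used. Assuming it is macro-averaged, a gap like this typically appears when some classes are predicted well and others poorly. The toy example below uses made-up labels (not MasakhaNEWS data) purely to illustrate the effect.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, with F1 = 2TP / (2TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up toy labels: class "a" dominates and class "c" is never predicted.
y_true = ["a"] * 6 + ["b"] * 2 + ["c"] * 2
y_pred = ["a"] * 6 + ["a", "b"] + ["a", "a"]

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Here the classifier is right 7 times out of 10, but the class it never predicts contributes an F1 of zero and pulls the macro average well below the accuracy.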

Why These Results Are Expected

MasakhaNEWS Swahili is a genuinely hard classification task:

  • 7 categories with overlapping vocabulary (politics vs elections, health vs science)
  • Small dataset: only 1,658 training samples for 7 classes
  • Low-resource language: limited pre-training data for Swahili even in AfroXLMR
  • Random guessing would give ~14% accuracy (1 in 7); this model reaches 59.49%
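The chance baseline in the last bullet is simple arithmetic, sketched below. It assumes uniform random guessing over 7 classes. A majority-class baseline could be higher if the classes are imbalanced, and the card does not report the class distribution.

```python
NUM_CLASSES = 7          # number of categories, per the card
MODEL_ACCURACY = 0.5949  # best reported accuracy, per the card


def chance_accuracy(num_classes: int) -> float:
    """Expected accuracy of uniform random guessing."""
    return 1.0 / num_classes


baseline = chance_accuracy(NUM_CLASSES)
lift = MODEL_ACCURACY / baseline

print(f"chance baseline: {baseline:.1%}")
print(f"model accuracy:  {MODEL_ACCURACY:.1%} ({lift:.1f}x chance)")
```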

Accuracy is expected to improve with more training data from the target platform.

Usage

from transformers import pipeline

classifier = pipeline(
    'text-classification',
    model='katoernest/afro-xlmr-swahili-news-classifier',
    tokenizer='katoernest/afro-xlmr-swahili-news-classifier',
)

result = classifier("mafuriko makubwa yameharibu mazao shambani")
print(result)
# [{'label': 'environment', 'score': 0.62}]

Full Pipeline Usage

from transformers import pipeline
from langdetect import detect
import re

classifier = pipeline(
    'text-classification',
    model='katoernest/afro-xlmr-swahili-news-classifier',
)


def classify_report(text):
    # Step 1: detect language
    language = detect(text)

    # Step 2: clean text (strip URLs and @mentions, collapse whitespace)
    clean = re.sub(r'http\S+|@\w+', '', text)
    clean = re.sub(r'\s+', ' ', clean).strip()

    # Step 3: classify
    result = classifier(clean)[0]

    return {
        'text': text,
        'language': language,
        'category': result['label'],
        'confidence': round(result['score'] * 100, 1),
    }


# Test
reports = [
    'mafuriko makubwa yameharibu mazao shambani mashariki',
    'wapigakura wanakataliwa kupiga kura kituo namba nne',
    'mlipuko wa ugonjwa umethibitishwa kaskazini mwa nchi',
]

for r in reports:
    print(classify_report(r))
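To make the cleaning step easier to reason about, the sketch below isolates the two regular expressions from classify_report in a helper and runs them on made-up inputs. The name clean_text and the sample strings are illustrative assumptions, not part of the published pipeline.

```python
import re


def clean_text(text: str) -> str:
    """Strip URLs and @mentions, then collapse runs of whitespace."""
    without_links = re.sub(r'http\S+|@\w+', '', text)
    return re.sub(r'\s+', ' ', without_links).strip()


# Made-up sample inputs for illustration.
sample = "Angalia http://example.com @mwananchi  mafuriko   makubwa"
only_noise = "@mwananchi https://example.com"

print(clean_text(sample))            # Angalia mafuriko makubwa
print(repr(clean_text(only_noise)))  # ''
```

A message that contains only links or mentions cleans down to an empty string, so a caller may want to skip classification in that case rather than pass empty text to the model.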

Pipeline Position

This model sits at the classification stage of the Distant Voices data pipeline:

Community report (SMS / WhatsApp / voice note)
 ↓
 Whisper transcription
 ↓
 Language detection
 ↓
 Text cleaning
 ↓
[This model]: category classification
 ↓
 Confidence routing
   > 0.80 → auto-approve
   0.50–0.80 → human review
   < 0.50 → flag for manual classification
 ↓
 Dashboard
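The confidence routing stage of the diagram could be sketched as a small function. This is an assumption about how the thresholds might be applied, not the Distant Voices implementation. In particular, the diagram does not say which bucket a score of exactly 0.80 or 0.50 falls into, so both are sent to human review here. The function name route is hypothetical.

```python
AUTO_APPROVE_ABOVE = 0.80  # scores strictly above this are auto-approved
MANUAL_BELOW = 0.50        # scores strictly below this are flagged


def route(score: float) -> str:
    """Map a classifier score in [0, 1] to a routing decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score > AUTO_APPROVE_ABOVE:
        return "auto-approve"
    if score >= MANUAL_BELOW:
        return "human review"
    return "manual classification"


for score in (0.91, 0.62, 0.31):
    print(f"{score:.2f} -> {route(score)}")
```

Note that classify_report above returns confidence as a percentage (for example 62.0), so it would need to be divided by 100 before being passed to a function like this one.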

Improving Accuracy

This model was trained on general Swahili news. Accuracy is expected to improve when it is fine-tuned further on:

  • Platform-specific community reports
  • Domain vocabulary (elections, climate, crisis, humanitarian)
  • More training samples per category

The model is designed as a starting point: it can be retrained as real platform data accumulates through the human review feedback loop.
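One way to act on the list above is to track how many human-reviewed reports have accumulated per category before fine-tuning again. The sketch below is illustrative only: the record format, the category labels attached to the sample texts, and the min_samples threshold are all assumptions, not part of the published pipeline.

```python
from collections import Counter

# Hypothetical human-reviewed reports; the labels are made up for illustration.
reviewed = [
    {"text": "mafuriko makubwa yameharibu mazao shambani mashariki",
     "category": "environment"},
    {"text": "wapigakura wanakataliwa kupiga kura kituo namba nne",
     "category": "elections"},
    {"text": "mlipuko wa ugonjwa umethibitishwa kaskazini mwa nchi",
     "category": "health"},
    {"text": "mafuriko makubwa yameharibu mazao shambani",
     "category": "environment"},
]


def samples_per_category(records):
    """Count reviewed reports for each category label."""
    return Counter(record["category"] for record in records)


def needs_more_data(records, min_samples):
    """Categories seen in the records with fewer than min_samples reports."""
    counts = samples_per_category(records)
    return sorted(cat for cat, n in counts.items() if n < min_samples)


print(dict(samples_per_category(reviewed)))
print(needs_more_data(reviewed, min_samples=2))
```

Categories with no reviewed reports at all never show up in the counts, so a real check would compare against the full label set as well.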

Framework Versions

  • Transformers: 4.44.0
  • PyTorch: 2.3.0+cu121
  • Datasets: 2.14.0
  • Tokenizers: 0.19.1
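To compare a local environment against the versions listed above, a small check with the standard library's importlib.metadata might look like the sketch below. It assumes the usual PyPI distribution names; PyTorch is published as torch. A differing version is only reported here and does not necessarily mean the model will fail to load.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed on the card, keyed by PyPI distribution name.
EXPECTED = {
    "transformers": "4.44.0",
    "torch": "2.3.0+cu121",
    "datasets": "2.14.0",
    "tokenizers": "0.19.1",
}


def installed_versions(expected):
    """Return {name: (expected, installed or None)} for each package."""
    report = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[name] = (wanted, found)
    return report


for name, (wanted, found) in installed_versions(EXPECTED).items():
    status = "ok" if found == wanted else "differs"
    print(f"{name}: expected {wanted}, installed {found} ({status})")
```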

Author

Kato Ernest Henry, AI Research and MLOps Engineer, Kampala, Uganda (henry38ernest@gmail.com)

Model size: 0.1B params (Safetensors, tensor type F32)
