🌍 AfroXLMR Swahili News Classifier

A fine-tuned version of Davlan/afro-xlmr-mini, trained on the MasakhaNEWS Swahili dataset for multi-class classification of news and community reports.

Built for civic reporting platforms operating across East Africa where community reports arrive in Swahili and need to be categorised for decision-making.

Model Summary

Property	Value
Base model	Davlan/afro-xlmr-mini
Language	Swahili (sw)
Task	Multi-class text classification
Dataset	MasakhaNEWS Swahili
Training samples	1,658
Validation samples	237
Test samples	476
Epochs	5
Best F1	0.4736
Best Accuracy	59.49%

Training Results

Epoch	Train Loss	Val Loss	Accuracy	F1
1	1.9317	1.8947	32.49%	0.2068
2	1.8470	1.8108	41.35%	0.2943
3	1.7536	1.7107	55.27%	0.4362
4	1.6792	1.6551	59.49%	0.4705
5	1.6402	1.6337	59.49%	0.4736

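Training and validation loss both decrease across all 5 epochs and stay close to each other, so there is no sign of overfitting. Accuracy plateaus at 59.49% from epoch 4 to epoch 5, while F1 still rises slightly (0.4705 to 0.4736).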
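The card does not say how F1 was averaged. As a minimal sketch, assuming macro-averaged F1 (an assumption, not confirmed by the source), the two metrics in the table could be computed from gold and predicted labels as follows. The toy labels are invented for illustration.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging; not stated in the card)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # gold class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels, invented for illustration only
gold = ["health", "politics", "sports", "health"]
pred = ["health", "health", "sports", "health"]
print(accuracy(gold, pred))
print(macro_f1(gold, pred))
```

In this toy case the "politics" class is never predicted, so its F1 is 0 and the macro average falls below accuracy. Under the macro-averaging assumption, a similar effect could explain why F1 (0.4736) trails accuracy (59.49%) in the table.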

Why These Results Are Expected

MasakhaNEWS Swahili is a genuinely hard classification task:

7 categories with overlapping vocabulary (politics vs elections, health vs science)
Small dataset — only 1,658 training samples for 7 classes
Low-resource language — limited pre-training data for Swahili even in AfroXLMR
A uniform random baseline over 7 classes would reach about 14% accuracy (1/7); this model reaches 59.49%
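The arithmetic behind these points can be checked with a short sketch. It uses only figures from this card and assumes a uniform random guesser; a majority-class baseline could score higher if the classes are imbalanced, which the card does not report.

```python
NUM_CLASSES = 7          # from the model card
TRAIN_SAMPLES = 1658     # from the model card
MODEL_ACCURACY = 0.5949  # best accuracy reported in the card

# Uniform random guessing picks each class with equal probability
random_baseline = 1 / NUM_CLASSES
# Average only; the actual per-class counts are not given in the card
avg_per_class = TRAIN_SAMPLES / NUM_CLASSES
lift = MODEL_ACCURACY / random_baseline

print(f"Random baseline: {random_baseline:.1%}")
print(f"Average training samples per class: {avg_per_class:.0f}")
print(f"Accuracy relative to random: {lift:.1f}x")
```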

Accuracy is expected to improve with more training data from the target platform.

Usage

from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
classifier = pipeline(
    'text-classification',
    model='katoernest/afro-xlmr-swahili-news-classifier',
    tokenizer='katoernest/afro-xlmr-swahili-news-classifier'
)

result = classifier("mafuriko makubwa yameharibu mazao shambani")
print(result)
# Example output: [{'label': 'environment', 'score': 0.62}]

Full Pipeline Usage

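import re

from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException
from transformers import pipeline

# langdetect is non-deterministic by default; fix the seed for repeatable results
DetectorFactory.seed = 0

classifier = pipeline(
    'text-classification',
    model='katoernest/afro-xlmr-swahili-news-classifier'
)

def classify_report(text):
    # Step 1: detect language (langdetect raises on empty or symbol-only text)
    try:
        language = detect(text)
    except LangDetectException:
        language = 'unknown'

    # Step 2: clean text by removing URLs and @mentions, then collapsing whitespace
    clean = re.sub(r'http\S+|@\w+', '', text)
    clean = re.sub(r'\s+', ' ', clean).strip()

    # Step 3: classify, truncating inputs longer than the model's maximum length
    result = classifier(clean, truncation=True)[0]

    return {
        'text': text,
        'language': language,
        'category': result['label'],
        'confidence': round(result['score'] * 100, 1)  # percentage, 0-100
    }

# Test
reports = [
    'mafuriko makubwa yameharibu mazao shambani mashariki',
    'wapigakura wanakataliwa kupiga kura kituo namba nne',
    'mlipuko wa ugonjwa umethibitishwa kaskazini mwa nchi',
]

for r in reports:
    print(classify_report(r))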
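The cleaning step inside classify_report can be tried on its own, without loading the model. The sketch below factors the same two regular expressions into a helper; the name clean_text and the sample input are hypothetical, introduced only for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Remove URLs and @mentions, then collapse runs of whitespace."""
    no_links = re.sub(r'http\S+|@\w+', '', text)
    return re.sub(r'\s+', ' ', no_links).strip()

# Invented example input, not taken from the dataset
raw = "Mafuriko  makubwa @habari https://example.com/x yameharibu mazao\n"
print(clean_text(raw))
```

Note that the pattern only matches links starting with "http", so bare "www." links pass through unchanged.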

Pipeline Position

This model sits at the classification stage of the Distant Voices data pipeline:

Community report (SMS / WhatsApp / voice note)
    ↓
Whisper transcription
    ↓
Language detection
    ↓
Text cleaning
    ↓
[This model]: category classification
    ↓
Confidence routing
    > 0.80     → auto-approve
    0.50–0.80  → human review
    < 0.50     → flag for manual classification
    ↓
Dashboard
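The confidence routing stage could be sketched as a small function. The thresholds come from the diagram above; the function name and the returned strings are hypothetical, and treating exactly 0.80 and exactly 0.50 as "human review" is one reading of the diagram's boundaries.

```python
def route_by_confidence(score: float) -> str:
    """Map a classifier score in [0, 1] to a routing decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score > 0.80:
        return "auto-approve"
    if score >= 0.50:
        return "human review"
    return "manual classification"

for s in (0.93, 0.80, 0.62, 0.41):
    print(s, route_by_confidence(s))
```

classify_report returns confidence as a percentage, so divide it by 100 before passing it to a function like this one.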

Improving Accuracy

This model was trained on general Swahili news. Accuracy is expected to improve when it is fine-tuned further on:

Platform-specific community reports
Domain vocabulary (elections, climate, crisis, humanitarian)
More training samples per category

The model is designed as a starting point: it is meant to be retrained as real platform data accumulates through the human review feedback loop.
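One way the feedback loop might collect training data is to append each human-reviewed report to a JSON Lines file that a later fine-tuning run can read. This is only a sketch under that assumption; the function names, field names and file layout are hypothetical and not part of the Distant Voices pipeline as described here.

```python
import json
from pathlib import Path

def record_review(path, text, predicted, corrected):
    """Append one human-reviewed example as a JSON line for later fine-tuning."""
    row = {
        "text": text,
        "predicted": predicted,  # label the model proposed
        "label": corrected,      # label the reviewer confirmed or corrected
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

def load_reviews(path):
    """Read the reviewed examples back as a list of dicts."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Keeping both the predicted and the corrected label makes it possible to see which categories the model confuses most often before retraining.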

Framework Versions

Transformers: 4.44.0
PyTorch: 2.3.0+cu121
Datasets: 2.14.0
Tokenizers: 0.19.1
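To compare a local environment against these versions, a short check along the following lines may help. The distribution names are the usual PyPI names (an assumption for your environment), and the function name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the model card, keyed by PyPI distribution name
EXPECTED = {
    "transformers": "4.44.0",
    "torch": "2.3.0",  # card lists 2.3.0+cu121; the CUDA suffix depends on the build
    "datasets": "2.14.0",
    "tokenizers": "0.19.1",
}

def compare_versions(expected=EXPECTED):
    """Return {package: (expected, installed or None)} for each listed package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report

for name, (wanted, installed) in compare_versions().items():
    same = installed is not None and installed.split("+")[0] == wanted
    status = "ok" if same else "differs or missing"
    print(f"{name}: expected {wanted}, installed {installed} ({status})")
```

Other versions may well work; the card only records what was used for training.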

Author

Kato Ernest Henry
AI Research and MLOps Engineer, Kampala, Uganda
henry38ernest@gmail.com

URL: https://huggingface.co/katoernest/afro-xlmr-swahili-news-classifier
