AfroXLMR Swahili News Classifier
Fine-tuned Davlan/afro-xlmr-mini on the MasakhaNEWS Swahili dataset for multi-class news and community report classification.
Built for civic reporting platforms operating across East Africa where community reports arrive in Swahili and need to be categorised for decision-making.
Model Summary
| Property | Value |
|---|---|
| Base model | Davlan/afro-xlmr-mini |
| Language | Swahili (sw) |
| Task | Multi-class text classification |
| Dataset | MasakhaNEWS Swahili |
| Training samples | 1,658 |
| Validation samples | 237 |
| Test samples | 476 |
| Epochs | 5 |
| Best F1 | 0.4736 |
| Best Accuracy | 59.49% |
Training Results
| Epoch | Train Loss | Val Loss | Accuracy | F1 |
|---|---|---|---|---|
| 1 | 1.9317 | 1.8947 | 32.49% | 0.2068 |
| 2 | 1.8470 | 1.8108 | 41.35% | 0.2943 |
| 3 | 1.7536 | 1.7107 | 55.27% | 0.4362 |
| 4 | 1.6792 | 1.6551 | 59.49% | 0.4705 |
| 5 | 1.6402 | 1.6337 | 59.49% | 0.4736 |
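Training and validation loss both decrease across all 5 epochs, and validation loss stays below training loss throughout, so there is no sign of overfitting. Accuracy plateaus at 59.49% from epoch 4 to epoch 5, while F1 still edges up from 0.4705 to 0.4736.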
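The loss trend can be checked directly from the table. The following is a minimal sketch that uses only the per-epoch values reported above; the variable names are illustrative, and treating "validation loss above training loss" as an overfitting signal is a common rule of thumb rather than a formal test.

```python
# Per-epoch losses copied from the Training Results table.
train_loss = [1.9317, 1.8470, 1.7536, 1.6792, 1.6402]
val_loss = [1.8947, 1.8108, 1.7107, 1.6551, 1.6337]

# Validation loss should fall every epoch if the model is still learning.
val_decreasing = all(
    later < earlier for earlier, later in zip(val_loss, val_loss[1:])
)

# A validation loss that climbs above training loss is a common overfitting signal.
gaps = [round(v - t, 4) for t, v in zip(train_loss, val_loss)]
val_below_train = all(gap < 0 for gap in gaps)

print("validation loss decreasing every epoch:", val_decreasing)
print("validation loss below training loss:", val_below_train)
print("val - train gap per epoch:", gaps)
```

The gap between the two losses narrows toward zero by epoch 5, so any longer training run would be worth monitoring with the same check.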
Why These Results Are Expected
MasakhaNEWS Swahili is a genuinely hard classification task:
- 7 categories with overlapping vocabulary (politics vs elections, health vs science)
- Small dataset: only 1,658 training samples for 7 classes
- Low-resource language: limited pre-training data for Swahili, even in AfroXLMR
- Random guessing across 7 classes would give ~14% accuracy; this model reaches 59.49%
Accuracy is expected to improve with more training data from the target platform.
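For reference, the chance-level figures for a 7-class task can be worked out in a few lines. This sketch assumes balanced classes and a cross-entropy loss measured with the natural logarithm (the PyTorch default); the constant names are illustrative.

```python
import math

NUM_CLASSES = 7          # number of categories, as stated above
MODEL_ACCURACY = 0.5949  # best accuracy from the model summary

# Accuracy of uniform random guessing over balanced classes.
chance_accuracy = 1 / NUM_CLASSES

# Cross-entropy of a model that assigns equal probability to every class.
uniform_loss = math.log(NUM_CLASSES)

print(f"chance accuracy: {chance_accuracy:.2%}")
print(f"uniform-guess cross-entropy: {uniform_loss:.4f}")
print(f"accuracy relative to chance: {MODEL_ACCURACY / chance_accuracy:.1f}x")
```

Under these assumptions the uniform-guess loss is about 1.95, close to the epoch 1 training loss of 1.9317, which is consistent with the classification head starting from near-random predictions.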
Usage
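from transformers import pipeline

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
classifier = pipeline(
    'text-classification',
    model='katoernest/afro-xlmr-swahili-news-classifier',
    tokenizer='katoernest/afro-xlmr-swahili-news-classifier'
)

# English: "major floods have destroyed crops in the fields"
result = classifier("mafuriko makubwa yameharibu mazao shambani")
print(result)
# Example output (label and score are illustrative):
# [{'label': 'environment', 'score': 0.62}]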
Full Pipeline Usage
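import re

from langdetect import DetectorFactory, detect
from transformers import pipeline

# langdetect is non-deterministic by default; fix the seed for repeatable results
DetectorFactory.seed = 0

classifier = pipeline(
    'text-classification',
    model='katoernest/afro-xlmr-swahili-news-classifier'
)

def classify_report(text):
    # Step 1: detect language (returns an ISO 639-1 code such as 'sw')
    language = detect(text)
    # Step 2: clean text by removing URLs and @mentions, then collapsing whitespace
    clean = re.sub(r'http\S+|@\w+', '', text)
    clean = re.sub(r'\s+', ' ', clean).strip()
    # Step 3: classify and keep the top prediction
    result = classifier(clean)[0]
    return {
        'text': text,
        'language': language,
        'category': result['label'],
        'confidence': round(result['score'] * 100, 1)  # percentage, 0-100
    }

# Test
reports = [
    # "major floods have destroyed crops in the fields in the east"
    'mafuriko makubwa yameharibu mazao shambani mashariki',
    # "voters are being turned away from voting at station number four"
    'wapigakura wanakataliwa kupiga kura kituo namba nne',
    # "a disease outbreak has been confirmed in the north of the country"
    'mlipuko wa ugonjwa umethibitishwa kaskazini mwa nchi',
]
for r in reports:
    print(classify_report(r))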
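The cleaning step can be tried on its own without downloading the model. This is a small sketch that reuses the two regular expressions from the pipeline; the helper name `clean_text` and the sample message are made up for illustration.

```python
import re

def clean_text(text: str) -> str:
    """Strip URLs and @mentions, then collapse runs of whitespace."""
    no_links = re.sub(r'http\S+|@\w+', '', text)
    return re.sub(r'\s+', ' ', no_links).strip()

# Hypothetical raw message containing a mention, a link, and extra spaces.
raw = 'Mafuriko makubwa @ripota https://example.org/habari   yameharibu mazao'
cleaned = clean_text(raw)
print(cleaned)
```

One caveat with this pattern: `@\w+` also matches inside email addresses, so `a@b.com` would come out as `a.com`. If reports are likely to contain email addresses, the pattern may need tightening.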
Pipeline Position
This model sits at the classification stage of the Distant Voices data pipeline:
Community report (SMS / WhatsApp / voice note)
        ↓
Whisper transcription
        ↓
Language detection
        ↓
Text cleaning
        ↓
[This model] → category classification
        ↓
Confidence routing
    > 0.80     → auto-approve
    0.50-0.80  → human review
    < 0.50     → flag for manual classification
        ↓
Dashboard
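The confidence routing stage could look roughly like the sketch below. The thresholds come from the diagram; the function name `route_by_confidence` is hypothetical, and the score is assumed to be the raw pipeline score in the range 0 to 1 (the `confidence` field returned by `classify_report` is a percentage, so it would need dividing by 100 first).

```python
def route_by_confidence(score: float) -> str:
    """Map a classifier score in [0, 1] to a review action."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score > 0.80:
        return 'auto-approve'
    if score >= 0.50:
        # Covers the 0.50-0.80 band, including both endpoints.
        return 'human review'
    return 'flag for manual classification'

for s in (0.92, 0.80, 0.62, 0.31):
    print(s, '->', route_by_confidence(s))
```

A score of exactly 0.80 falls into human review here, since the diagram auto-approves only scores strictly above 0.80.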
Improving Accuracy
This model was trained on general Swahili news. Accuracy is expected to improve when the model is fine-tuned further on:
- Platform-specific community reports
- Domain vocabulary (elections, climate, crisis, humanitarian)
- More training samples per category
The model is designed as a starting point: it is meant to improve as real platform data accumulates through the human review feedback loop.
Framework Versions
- Transformers: 4.44.0
- PyTorch: 2.3.0+cu121
- Datasets: 2.14.0
- Tokenizers: 0.19.1
Author
Kato Ernest Henry
AI Research and MLOps Engineer, Kampala, Uganda
henry38ernest@gmail.com
HuggingFace | GitHub