Kabyle Sentence Transformer (MPNet)
A sentence embedding model fine-tuned for Kabyle (Taqbaylit)–English cross-lingual semantic similarity.
Model Details
| Attribute | Value |
|---|---|
| Base model | sentence-transformers/paraphrase-multilingual-mpnet-base-v2 |
| Fine-tuning data | ~2.5M unique EN–KAB parallel sentences |
| Embedding dimension | 768 |
| Training framework | SentenceTransformers |
| Training time | ~1h 16min (1 epoch, 15,593 steps) |
| Final loss | 0.043 (started at 0.278) |
Training Data
| Source | Pairs | Description |
|---|---|---|
| NLLB (cleaned) | ~2.35M | Diverse domain parallel corpus |
| Tatoeba + CS | ~202K | Community translations + software localization |
| Weblate | ~9K | FLOSS UI strings |
| LibreTranslate | ~449 | User-reviewed translations |
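As a quick check on the data mix, the sketch below sums the approximate pair counts from the table and prints each source's share. The counts are the rounded figures listed above, so the total is only approximate; `source_shares` is an illustrative helper name, not part of any library.

```python
# Approximate pair counts, copied from the table above (rounded figures).
SOURCES = {
    "NLLB (cleaned)": 2_350_000,
    "Tatoeba + CS": 202_000,
    "Weblate": 9_000,
    "LibreTranslate": 449,
}


def source_shares(counts):
    """Return each source's share of the total as a fraction in [0, 1]."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = source_shares(SOURCES)
for name, share in shares.items():
    print(f"{name:<16} {share:6.2%}")
```

With these figures NLLB accounts for roughly 92% of the pairs, so its domain distribution dominates what the model sees during fine-tuning.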
Performance
Compared to the base paraphrase-multilingual-mpnet-base-v2 (without Kabyle fine-tuning):
| Metric | Base | This Model | Gain |
|---|---|---|---|
| Avg. cosine similarity (EN<->KAB) | 0.278 | 0.857 | +0.579 |
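The metric in the table can be read as the mean cosine similarity between each English embedding and the embedding of its Kabyle translation. The sketch below assumes that definition, since the evaluation script itself is not shown here. `avg_pair_cosine` is an illustrative name, and the toy 2-d vectors stand in for real `model.encode(...)` output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def avg_pair_cosine(en_embeddings, kab_embeddings):
    """Mean cosine similarity over aligned (EN, KAB) embedding pairs."""
    if len(en_embeddings) != len(kab_embeddings):
        raise ValueError("embedding lists must be aligned one-to-one")
    sims = [cosine(e, k) for e, k in zip(en_embeddings, kab_embeddings)]
    return sum(sims) / len(sims)


# Toy vectors standing in for model.encode(...) output (assumption).
en = [[1.0, 0.0], [0.0, 1.0]]
kab = [[1.0, 0.0], [1.0, 1.0]]
score = avg_pair_cosine(en, kab)
print(round(score, 3))
```

Under this reading, the reported gain is simply the difference between the two averages: 0.857 - 0.278 = 0.579.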
Usage
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("boffire/kabyle-sentence-transformer-mpnet")
# Embed English and Kabyle
sentences = ["Hello!", "Azul!"]
embeddings = model.encode(sentences)
# Cross-lingual similarity (cosine_similarity returns a 1x1 matrix here)
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(sim)
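Beyond scoring a single pair, a typical cross-lingual use is ranking several Kabyle candidates against one English query. The following is a minimal sketch of that pattern. `rank_candidates` is a hypothetical helper, and it accepts any `encode` callable, such as `model.encode` from the snippet above. A toy encoder with made-up vectors is used so the example runs without downloading the model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(encode, query, candidates):
    """Rank candidate sentences by cosine similarity to the query.

    `encode` is any callable mapping a list of strings to a list of
    vectors, for example `model.encode`.
    """
    vectors = encode([query] + list(candidates))
    query_vec, cand_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, v), c) for v, c in zip(cand_vecs, candidates)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy stand-in encoder (NOT the real model): made-up vectors per sentence.
TOY_VECTORS = {
    "Hello!": [1.0, 0.1],
    "Azul!": [0.9, 0.2],
    "Tanemmirt!": [0.1, 1.0],
}


def toy_encode(sentences):
    return [TOY_VECTORS[s] for s in sentences]


ranked = rank_candidates(toy_encode, "Hello!", ["Tanemmirt!", "Azul!"])
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

To use the real model, pass `model.encode` in place of `toy_encode`; the ranking logic is unchanged.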
Limitations
- Trained primarily on parallel data; monolingual Kabyle similarity not explicitly optimized
- Best for EN<->KAB cross-lingual tasks; Kabyle<->Kabyle may work but is untested
- Religious text overrepresented in NLLB portion; may underperform on highly technical/modern domains
- The evaluator used constant labels (all 1.0) because every pair was positive, so correlation metrics were undefined
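The last limitation follows from how Pearson correlation is defined: it divides by the spread of each variable, and that spread is zero when every gold label is 1.0. The sketch below shows this with a plain-Python `pearson` helper. The helper and the similarity values are illustrative, not the evaluator or the scores from the actual run.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant series
    return cov / math.sqrt(var_x * var_y)


predicted = [0.91, 0.84, 0.79, 0.88]  # illustrative model similarities
all_positive = [1.0, 1.0, 1.0, 1.0]   # constant labels: every pair is a translation
mixed = [1.0, 1.0, 0.0, 1.0]          # hypothetical labels with one negative pair

print(pearson(predicted, all_positive))  # prints nan
print(pearson(predicted, mixed))
```

Once the labels contain at least one non-translation pair, the correlation becomes well defined, which is why an evaluation set with negatives (or a metric that needs no labels) sidesteps the problem.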
Future Work
- Train v2 with Davlan/afro-xlmr-large backbone for African-specific pretraining
- Add monolingual Kabyle data for better Kabyle<->Kabyle similarity
- Fix evaluator to use AvgCosineEvaluator instead of correlation-based metrics
- Evaluate against LASER on a proper benchmark
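One label-free score that could serve the evaluation items above is top-1 translation retrieval accuracy: for each English sentence, check whether its nearest Kabyle embedding is the aligned translation. This is an assumption about what such a benchmark might measure, not the author's stated plan. `top1_retrieval_accuracy` is a hypothetical helper and the vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top1_retrieval_accuracy(src_embeddings, tgt_embeddings):
    """Fraction of source sentences whose nearest target (by cosine
    similarity) is the aligned translation at the same index."""
    hits = 0
    for i, src in enumerate(src_embeddings):
        sims = [cosine(src, tgt) for tgt in tgt_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(src_embeddings)


# Toy aligned embeddings: row i of `en` is meant to translate to row i of `kab`.
en = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kab = [[0.9, 0.1, 0.0], [0.0, 0.7, 0.7], [0.5, 0.0, 0.1]]
accuracy = top1_retrieval_accuracy(en, kab)
print(accuracy)
```

In this toy set the third English vector is closer to the second Kabyle vector than to its own translation, so two of three queries are retrieved correctly. Unlike an average over positive pairs, this score drops when unrelated sentences are also embedded close together.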
Citation
If you use this model, please cite:
@misc{kabyle-st-mpnet,
title={Kabyle Sentence Transformer},
author={boffire},
year={2026},
howpublished={\url{https://huggingface.co/boffire/kabyle-sentence-transformer-mpnet}}
}
Acknowledgments
- Imsidag-community for the cleaned parallel corpora
- Tatoeba contributors for community translations
- Meta AI for LASER and NLLB datasets
- boffire community for Kabyle NLP tooling
Model size: 0.3B params (Safetensors, tensor type F32)
