Sudanese Arabic Machine Translation Benchmark
Dataset Description
This benchmark addresses a documented gap in Arabic NLP research: the exclusion of Sudanese Arabic from dialectal machine translation evaluation.
Recent research (Alabdullah et al., 2025) states:
"The focus on three dialects (Levantine, Gulf, Egyptian) constrained generalization, leaving the proposed translation techniques untested on varieties such as Maghrebi and Sudanese Arabic."
This dataset provides the first benchmark specifically designed for evaluating Sudanese Arabic → English translation quality.
Supported Tasks
- Translation: Sudanese Arabic → English
- Dialect Identification: Tests whether models distinguish Sudanese Arabic from other Arabic dialects
- Error Analysis: Sentences are annotated for linguistic challenges specific to Sudanese Arabic
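The translation task can be scored per difficulty level. The sketch below is only an illustration: the card does not name an evaluation metric, so a simple unigram F1 stands in for an established metric such as BLEU or chrF. The function names and the two placeholder records are assumptions, not part of the dataset.

```python
from collections import Counter, defaultdict

def unigram_f1(hypothesis: str, reference: str) -> float:
    """Token-overlap F1 between a system output and the reference.

    Stand-in metric for illustration only; not the benchmark's official score.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def mean_score_by_difficulty(records, hypotheses):
    """Average the score within each difficulty bucket (easy/medium/hard).

    `records` is a list of dicts with `id`, `reference`, and `difficulty`;
    `hypotheses` maps record id to system output. A missing output scores 0.
    """
    buckets = defaultdict(list)
    for rec in records:
        hyp = hypotheses.get(rec["id"], "")
        buckets[rec["difficulty"]].append(unigram_f1(hyp, rec["reference"]))
    return {level: sum(scores) / len(scores) for level, scores in buckets.items()}

# Placeholder records for illustration; not actual benchmark rows.
records = [
    {"id": "sud_mt_01", "reference": "hey man", "difficulty": "easy"},
    {"id": "sud_mt_02", "reference": "gets married", "difficulty": "hard"},
]
scores = mean_score_by_difficulty(records, {"sud_mt_01": "hey man"})
```

Reporting scores per difficulty bucket, rather than a single average, makes it easier to see whether a system fails specifically on the harder dialectal sentences.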
Languages
- Source: Arabic (Sudanese dialect)
- Target: English
Dataset Structure
Data Fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier (sud_mt_XX) |
| source | string | Sudanese Arabic text |
| reference | string | English reference translation |
| difficulty | string | easy/medium/hard |
| notes | string | Linguistic annotation |
| dialect_markers | list | Sudanese-specific features present |
| confusion_risk | string | Potential for dialect confusion |
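As a rough sketch of how a row with these fields might be represented and sanity-checked in Python: the class name, the validation rules, and the example values below are assumptions made for illustration. In particular, the allowed values of `confusion_risk` are not specified here, so it is left as free text.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_DIFFICULTY = {"easy", "medium", "hard"}

@dataclass
class BenchmarkRecord:
    """One benchmark row, mirroring the fields in the table above."""
    id: str
    source: str
    reference: str
    difficulty: str
    notes: str = ""
    dialect_markers: List[str] = field(default_factory=list)
    confusion_risk: str = ""  # value set not documented; kept as free text

    def validate(self) -> None:
        """Raise ValueError if the row breaks the documented conventions."""
        if not self.id.startswith("sud_mt_"):
            raise ValueError(f"unexpected id format: {self.id!r}")
        if self.difficulty not in ALLOWED_DIFFICULTY:
            raise ValueError(f"unknown difficulty: {self.difficulty!r}")
        if not self.source.strip() or not self.reference.strip():
            raise ValueError("source and reference must be non-empty")

# Illustrative placeholder row, not an actual dataset entry.
example = BenchmarkRecord(
    id="sud_mt_01",
    source="يا زول",
    reference="hey man",
    difficulty="easy",
    notes="vocative",
    dialect_markers=["يا زول"],
    confusion_risk="low",  # assumed value
)
example.validate()
```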
Data Splits
- test: 20 sentences (10 Sudanese Arabic + 10 MSA control)
Linguistic Features
Unique Sudanese Features
| Feature (Arabic) | Transliteration | Meaning | MT Impact |
|---|---|---|---|
| يا زول | ya zol | "hey man" | High - unique vocative |
| حي- | hay- | future prefix | High - unique morphology |
| يتعرس | yit'arris | gets married | High - unique verb |
| حتة | hita | room | Medium - context-dependent |
Features Shared with Egyptian (Confusion Risk)
| Feature | Shared Meaning | Confusion Potential |
|---|---|---|
| دي | "this" | High |
| عايز | "want" | High |
| مش | "not" | Medium |
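The two feature tables can be turned into a simple lexical check that flags which markers occur in a sentence. This is a minimal sketch under stated assumptions: matching is on whitespace-separated tokens only, punctuation is not stripped, and the future prefix حي- is left out because naive prefix matching would also hit unrelated words. The function and variable names are hypothetical, and this is not how the `dialect_markers` field was necessarily produced.

```python
# Marker lists copied from the two tables above.
UNIQUE_SUDANESE = ["يا زول", "يتعرس", "حتة"]
SHARED_WITH_EGYPTIAN = ["دي", "عايز", "مش"]

def _contains(tokens, marker):
    """True if the (possibly multi-word) marker occurs as a token sequence."""
    marker_tokens = marker.split()
    n = len(marker_tokens)
    return any(
        tokens[i:i + n] == marker_tokens
        for i in range(len(tokens) - n + 1)
    )

def tag_markers(text):
    """Return the unique and shared markers found in `text`."""
    tokens = text.split()
    return {
        "unique_sudanese": [m for m in UNIQUE_SUDANESE if _contains(tokens, m)],
        "shared_with_egyptian": [m for m in SHARED_WITH_EGYPTIAN if _contains(tokens, m)],
    }
```

A sentence that contains only shared markers and no unique ones is the case where a model is most likely to treat the input as Egyptian Arabic, so such sentences are natural candidates for closer error analysis.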
Citation
@dataset{mihaysi2026sudanese_mt_benchmark,
title = {Sudanese Arabic Machine Translation Benchmark},
author = {Mihaysi, Aamer},
year = {2026},
publisher = {HuggingFace},
note = {Addresses gap identified in Alabdullah et al. (2025)}
}
References
- Alabdullah, A., Han, L., & Lin, C. (2025). Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation. arXiv:2507.20301.
- Beidas, A., et al. (2025). Cross-dialectal Arabic translation: comparative analysis on large language models. Frontiers in Artificial Intelligence.
Related Datasets
- O96a/sudanese-arabic-dialect-benchmark - Dialect identification benchmark
- O96a/opus-mt-arabic-benchmark-2026-03-28 - Arabic MT benchmark
License
Creative Commons Attribution 4.0 International (CC BY 4.0)
