URL: https://huggingface.co/datasets/O96a/sudanese-mt-benchmark


Sudanese Arabic Machine Translation Benchmark

Dataset Description

This benchmark addresses a documented gap in Arabic NLP research: the exclusion of Sudanese Arabic from dialectal machine translation evaluation.

Recent research (Alabdullah et al., 2025) states:

"The focus on three dialects (Levantine, Gulf, Egyptian) constrained generalization, leaving the proposed translation techniques untested on varieties such as Maghrebi and Sudanese Arabic."

This dataset provides the first benchmark specifically designed for evaluating Sudanese Arabic → English translation quality.

Supported Tasks

  • Translation: Sudanese Arabic → English
  • Dialect Identification: tests whether models distinguish Sudanese Arabic from other Arabic dialects
  • Error Analysis: sentences are annotated for linguistic challenges specific to Sudanese Arabic

Languages

  • Source: Arabic (Sudanese dialect)
  • Target: English

Dataset Structure

Data Fields

Field | Type | Description
id | string | Unique identifier (sud_mt_XX)
source | string | Sudanese Arabic text
reference | string | English reference translation
difficulty | string | easy/medium/hard
notes | string | Linguistic annotation
dialect_markers | list | Sudanese-specific features present
confusion_risk | string | Potential for dialect confusion
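
The fields above map onto a simple record check. The following is a minimal sketch, not part of the dataset tooling: `FIELD_TYPES`, `validate_record`, and the sample row are hypothetical, the id pattern assumes that `XX` stands for digits, and the `confusion_risk` value shown is a placeholder because the card does not list the allowed values.

```python
import re

# Field names and types as listed in the Data Fields table.
FIELD_TYPES = {
    "id": str,
    "source": str,
    "reference": str,
    "difficulty": str,
    "notes": str,
    "dialect_markers": list,
    "confusion_risk": str,
}
ALLOWED_DIFFICULTY = {"easy", "medium", "hard"}
# Assumption: the XX in sud_mt_XX is one or more digits.
ID_PATTERN = re.compile(r"sud_mt_\d+")


def validate_record(record):
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for name, expected_type in FIELD_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"{name} should be {expected_type.__name__}")
    record_id = record.get("id")
    if isinstance(record_id, str) and not ID_PATTERN.fullmatch(record_id):
        problems.append("id does not match sud_mt_<digits>")
    difficulty = record.get("difficulty")
    if isinstance(difficulty, str) and difficulty not in ALLOWED_DIFFICULTY:
        problems.append("difficulty must be easy, medium, or hard")
    return problems


# Illustrative row only; it is not copied from the dataset.
example = {
    "id": "sud_mt_01",
    "source": "يا زول",
    "reference": "hey man",
    "difficulty": "easy",
    "notes": "illustrative row only",
    "dialect_markers": ["يا زول"],
    "confusion_risk": "low",  # placeholder value (assumption)
}

print(validate_record(example))
```

Returning a list of problems rather than raising on the first one makes it easier to review every issue in a row at once.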

Data Splits

  • test: 20 sentences (10 Sudanese Arabic + 10 MSA control)

Linguistic Features

Unique Sudanese Features

Feature | Transliteration | Meaning | MT Impact
يا زول | ya zol | "hey man" | High - unique vocative
حي- | hay- | future prefix | High - unique morphology
يتعرس | yit'arris | gets married | High - unique verb
حتة | hita | room | Medium - context-dependent
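
A rough way to spot these features in a source sentence is plain string matching. The sketch below is an assumption about how one might do it, not the procedure used to fill `dialect_markers`: `find_sudanese_markers` is a hypothetical helper, and matching the future prefix at the start of a token will also catch unrelated words that happen to begin with the same letters.

```python
# Markers copied from the table above; the matching rules are assumptions.
PHRASE_MARKERS = ["يا زول", "يتعرس", "حتة"]
PREFIX_MARKERS = ["حي"]  # future prefix, matched only at the start of a token


def find_sudanese_markers(text):
    """Return the markers found in text, phrases first, then prefixes."""
    found = [marker for marker in PHRASE_MARKERS if marker in text]
    for token in text.split():
        for prefix in PREFIX_MARKERS:
            # Require characters after the prefix so the bare string is not counted.
            if (
                token.startswith(prefix)
                and len(token) > len(prefix)
                and prefix not in found
            ):
                found.append(prefix)
    return found


print(find_sudanese_markers("يا زول"))
```

Because this is substring matching, it is best treated as a first-pass filter whose hits are then checked by hand.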

Features Shared with Egyptian (Confusion Risk)

Feature | Shared Meaning | Confusion Potential
دي | "this" | High
عايز | "want" | High
مش | "not" | Medium
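
The shared features can be checked in the same spirit. This is a hedged sketch under stated assumptions: `shared_feature_risk` is a hypothetical helper, it matches whole whitespace-separated tokens only (so a word with attached punctuation is missed), and taking the highest level among the hits is an illustrative rule rather than how `confusion_risk` was assigned.

```python
# Tokens and levels copied from the table above.
SHARED_FEATURES = {"دي": "High", "عايز": "High", "مش": "Medium"}
LEVEL_ORDER = {"Medium": 1, "High": 2}


def shared_feature_risk(text):
    """Return (shared tokens found, highest confusion level or None)."""
    hits = [token for token in text.split() if token in SHARED_FEATURES]
    if not hits:
        return hits, None
    # Pick the hit whose confusion level ranks highest.
    top = max(hits, key=lambda token: LEVEL_ORDER[SHARED_FEATURES[token]])
    return hits, SHARED_FEATURES[top]


print(shared_feature_risk("مش عايز"))
```

A sentence that contains shared tokens but none of the Sudanese-specific markers is a natural candidate for manual review when a model labels it Egyptian.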

Citation

@dataset{mihaysi2026sudanese_mt_benchmark,
 title = {Sudanese Arabic Machine Translation Benchmark},
 author = {Mihaysi, Aamer},
 year = {2026},
 publisher = {HuggingFace},
 note = {Addresses gap identified in Alabdullah et al. (2025)}
}

References

  • Alabdullah, A., Han, L., & Lin, C. (2025). Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation. arXiv:2507.20301.
  • Beidas, A., et al. (2025). Cross-dialectal Arabic translation: comparative analysis on large language models. Frontiers in Artificial Intelligence.

License

Creative Commons Attribution 4.0 (CC-BY-4.0)
