
URL: https://huggingface.co/datasets/castorini/wura

castorini/wura · Datasets at Hugging Face



Dataset Summary

WURA is a document-level dataset covering 16 African languages and 4 high-resource languages widely spoken in Africa (English, French, Arabic and Portuguese). The dataset was created by auditing mC4 and crawling additional verified news sources. It was first used to train AfriTeVa V2.

Dataset Structure

>>> from datasets import load_dataset

The document-level dataset is loaded by default. You may optionally load the passage-level dataset instead, as follows:

>>> data = load_dataset("castorini/wura", "yor", level="passage", verification_mode="no_checks")

Note that verification_mode="no_checks" must be passed to prevent Hugging Face from verifying checksums against the document-level checksum info.
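The two loading modes described above can be wrapped in a small helper. This is a minimal sketch, not part of the dataset card: the names `wura_load_kwargs` and `load_wura` are hypothetical, and the label "document" is used only inside this helper to mean "pass no `level` argument", since the card does not say whether `level="document"` is accepted. Depending on the installed `datasets` version, script-based repositories like this one may also need `trust_remote_code=True`; that is an assumption to check against your environment.

```python
from typing import Any, Dict


def wura_load_kwargs(language: str, level: str = "document") -> Dict[str, Any]:
    """Build keyword arguments for datasets.load_dataset (hypothetical helper).

    "document" here is this helper's own label for the default mode; in that
    case no `level` argument is forwarded to the loader at all.
    """
    if level not in ("document", "passage"):
        raise ValueError(f"level must be 'document' or 'passage', got {level!r}")

    # `path` and `name` are the first two parameters of load_dataset.
    kwargs: Dict[str, Any] = {"path": "castorini/wura", "name": language}

    if level == "passage":
        # Passage-level loading, as shown in the card: checksum verification
        # is disabled because the stored checksums are for the document level.
        kwargs["level"] = "passage"
        kwargs["verification_mode"] = "no_checks"
    return kwargs


def load_wura(language: str, level: str = "document"):
    """Load one WURA language configuration (requires network access)."""
    # Imported lazily so that building the arguments works without `datasets`.
    from datasets import load_dataset

    return load_dataset(**wura_load_kwargs(language, level))
```

With this in place, `load_wura("yor", level="passage")` issues the same call as the passage-level example above, while `load_wura("yor")` falls back to the default document-level load.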

Citation

@inproceedings{oladipo-etal-2023-better,
 title = "Better Quality Pre-training Data and T5 Models for {A}frican Languages",
 author = "Oladipo, Akintunde and
 Adeyemi, Mofetoluwa and
 Ahia, Orevaoghene and
 Owodunni, Abraham and
 Ogundepo, Odunayo and
 Adelani, David and
 Lin, Jimmy",
 editor = "Bouamor, Houda and
 Pino, Juan and
 Bali, Kalika",
 booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
 month = dec,
 year = "2023",
 address = "Singapore",
 publisher = "Association for Computational Linguistics",
 url = "https://aclanthology.org/2023.emnlp-main.11",
 pages = "158--168",
 abstract = "In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new multilingual pretraining corpus for 16 African languages, designed by carefully auditing existing pretraining corpora to understand and rectify prevalent quality issues. To compile this dataset, we undertake a rigorous examination of current data sources for thirteen languages within one of the most extensive multilingual web crawls, mC4, and extract cleaner data through meticulous auditing and improved web crawling strategies. Subsequently, we pretrain a new T5-based model on this dataset and evaluate its performance on multiple downstream tasks. Our model demonstrates better downstream effectiveness over existing pretrained models across four NLP tasks, underscoring the critical role data quality plays in pretraining language models in low-resource scenarios. Specifically, on cross-lingual QA evaluation, our new model is more than twice as effective as multilingual T5. All code, data and models are publicly available at https://github.com/castorini/AfriTeVa-keji.",
}