URL: https://huggingface.co/datasets/cfilt/IITB-IndicMonoDoc

cfilt/IITB-IndicMonoDoc


IITB document-level monolingual corpora for Indian languages.

Coverage: the 22 scheduled languages of India, plus English.

(1) Assamese, (2) Bengali, (3) Gujarati, (4) Hindi, (5) Kannada, (6) Kashmiri, (7) Konkani, (8) Malayalam, (9) Manipuri, (10) Marathi, (11) Nepali, (12) Oriya, (13) Punjabi, (14) Sanskrit, (15) Sindhi, (16) Tamil, (17) Telugu, (18) Urdu, (19) Bodo, (20) Santhali, (21) Maithili and (22) Dogri.

Language Total (million tokens)
bn 5258.47
en 11986.53
gu 887.18
hi 11268.33
kn 567.16
ml 845.32
mr 1066.76
ne 1542.39
pa 449.61
ta 2171.92
te 767.18
ur 2391.79
as 57.64
brx 2.25
doi 0.37
gom 2.91
kas 1.27
mai 1.51
mni 0.99
or 81.96
sa 80.09
sat 3.05
sd 83.81
Total= 39518.51
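
To see how the tokens are distributed across languages, the per-language counts can be turned into percentage shares. The snippet below is a minimal sketch, not part of the dataset's tooling: the dictionary is transcribed from the table, and the names `TOKENS_MILLIONS` and `token_shares` are illustrative. Summing the rows as listed gives 39518.49, which is 0.02 below the stated total, presumably because each row was rounded individually.

```python
# Per-language token counts in millions, transcribed from the table above.
# The variable and function names are illustrative, not part of the dataset.
TOKENS_MILLIONS = {
    "bn": 5258.47, "en": 11986.53, "gu": 887.18, "hi": 11268.33,
    "kn": 567.16, "ml": 845.32, "mr": 1066.76, "ne": 1542.39,
    "pa": 449.61, "ta": 2171.92, "te": 767.18, "ur": 2391.79,
    "as": 57.64, "brx": 2.25, "doi": 0.37, "gom": 2.91,
    "kas": 1.27, "mai": 1.51, "mni": 0.99, "or": 81.96,
    "sa": 80.09, "sat": 3.05, "sd": 83.81,
}


def token_shares(counts):
    """Return each language's percentage share of the summed token count."""
    total = sum(counts.values())
    return {lang: 100.0 * n / total for lang, n in counts.items()}


total = sum(TOKENS_MILLIONS.values())
shares = token_shares(TOKENS_MILLIONS)

# Print languages from the largest share to the smallest.
for lang, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:>4} {TOKENS_MILLIONS[lang]:>10.2f} {pct:6.2f}%")
print(f" sum {total:>10.2f}")
```

By these figures, English and Hindi together account for a little under 60% of all tokens, while several languages (for example Dogri and Manipuri) contribute less than one million tokens each. Anyone sampling from the corpus for multilingual pretraining may want to take that imbalance into account.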

To cite this dataset:

@inproceedings{doshi-etal-2024-pretraining,
 title = "Pretraining Language Models Using Translationese",
 author = "Doshi, Meet and
 Dabre, Raj and
 Bhattacharyya, Pushpak",
 editor = "Al-Onaizan, Yaser and
 Bansal, Mohit and
 Chen, Yun-Nung",
 booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
 month = nov,
 year = "2024",
 address = "Miami, Florida, USA",
 publisher = "Association for Computational Linguistics",
 url = "https://aclanthology.org/2024.emnlp-main.334/",
 doi = "10.18653/v1/2024.emnlp-main.334",
 pages = "5843--5862",
}