IITB Document-level Monolingual Corpora for Indian Languages.

The corpus covers the 22 scheduled languages of India, plus English:

(1) Assamese, (2) Bengali, (3) Gujarati, (4) Hindi, (5) Kannada, (6) Kashmiri, (7) Konkani, (8) Malayalam, (9) Manipuri, (10) Marathi, (11) Nepali, (12) Oriya, (13) Punjabi, (14) Sanskrit, (15) Sindhi, (16) Tamil, (17) Telugu, (18) Urdu, (19) Bodo, (20) Santhali, (21) Maithili, and (22) Dogri.

Language	Total (million tokens)
bn	5258.47
en	11986.53
gu	887.18
hi	11268.33
kn	567.16
ml	845.32
mr	1066.76
ne	1542.39
pa	449.61
ta	2171.92
te	767.18
ur	2391.79
as	57.64
brx	2.25
doi	0.37
gom	2.91
kas	1.27
mai	1.51
mni	0.99
or	81.96
sa	80.09
sat	3.05
sd	83.81
Total	39518.51
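
The per-language counts can be checked against the listed total, and each language's share of the corpus follows from the same numbers. The snippet below is a minimal sketch: the dictionary copies the table values by hand, and the names `TOKENS_MIL`, `total_tokens`, `share_percent`, and `top_languages` are illustrative, not part of the dataset or any library.

```python
# Token counts in millions, copied by hand from the table above.
TOKENS_MIL = {
    "bn": 5258.47, "en": 11986.53, "gu": 887.18, "hi": 11268.33,
    "kn": 567.16, "ml": 845.32, "mr": 1066.76, "ne": 1542.39,
    "pa": 449.61, "ta": 2171.92, "te": 767.18, "ur": 2391.79,
    "as": 57.64, "brx": 2.25, "doi": 0.37, "gom": 2.91,
    "kas": 1.27, "mai": 1.51, "mni": 0.99, "or": 81.96,
    "sa": 80.09, "sat": 3.05, "sd": 83.81,
}

LISTED_TOTAL = 39518.51  # the "Total" row of the table


def total_tokens(counts):
    """Sum of all per-language counts, in millions of tokens."""
    return sum(counts.values())


def share_percent(counts, lang):
    """Percentage of the whole corpus contributed by one language code."""
    return 100.0 * counts[lang] / total_tokens(counts)


def top_languages(counts, n=3):
    """Language codes ordered by token count, largest first."""
    return sorted(counts, key=counts.get, reverse=True)[:n]


if __name__ == "__main__":
    print(f"summed rows : {total_tokens(TOKENS_MIL):.2f}M tokens")
    print(f"listed total: {LISTED_TOTAL:.2f}M tokens")
    for code in top_languages(TOKENS_MIL):
        print(f"{code}: {share_percent(TOKENS_MIL, code):.1f}% of the corpus")
```

The summed rows agree with the listed total to within rounding of the two-decimal entries. English, Hindi, and Bengali are the three largest languages, so any sampling scheme built on these counts would need to upweight the low-resource languages (for example Dogri at 0.37 million tokens) if balanced coverage matters.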

To cite this dataset:

@inproceedings{doshi-etal-2024-pretraining,
 title = "Pretraining Language Models Using Translationese",
 author = "Doshi, Meet and
 Dabre, Raj and
 Bhattacharyya, Pushpak",
 editor = "Al-Onaizan, Yaser and
 Bansal, Mohit and
 Chen, Yun-Nung",
 booktitle = "Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing",
 month = nov,
 year = "2024",
 address = "Miami, Florida, USA",
 publisher = "Association for Computational Linguistics",
 url = "https://aclanthology.org/2024.emnlp-main.334/",
 doi = "10.18653/v1/2024.emnlp-main.334",
 pages = "5843--5862",
}

URL: https://huggingface.co/datasets/cfilt/IITB-IndicMonoDoc
