URL: https://huggingface.co/datasets/ai4bharat/Shrutilipi

Access to this dataset is gated. The repository is publicly accessible, but you must log in to Hugging Face, agree to share your contact information, and accept the conditions before you can access its files and content.

Shrutilipi

Overview

Shrutilipi is a labelled ASR corpus obtained by mining parallel audio and text pairs at the document scale from All India Radio news bulletins for 12 Indian languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Sanskrit, Tamil, Telugu, and Urdu. The corpus contains over 6,400 hours of data across all languages.

This work is funded by Bhashini, MeitY, and Nilekani Philanthropies.

Usage

The datasets library lets you load and preprocess the dataset directly in Python. Before proceeding, make sure you have an active Hugging Face access token (obtainable from your Hugging Face settings).

To load the dataset, run:

from datasets import load_dataset

# Load the Bengali training split from the Hugging Face Hub
dataset = load_dataset("ai4bharat/Shrutilipi", "bengali", split="train")

# Check the dataset structure
print(dataset)
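
If the access token is not already cached by an earlier login, one option is to pass it explicitly. The sketch below reads it from an environment variable and builds the keyword arguments for the load call. The helper name hub_auth_kwargs is hypothetical, and the sketch assumes a recent datasets release in which load_dataset accepts a token argument (older releases used use_auth_token instead).

```python
import os


def hub_auth_kwargs(env_var="HF_TOKEN"):
    """Return keyword arguments carrying a Hub token, if one is set.

    `hub_auth_kwargs` and its `env_var` parameter are hypothetical names
    used only for this sketch; they are not part of the datasets library.
    """
    token = os.environ.get(env_var, "").strip()
    # An empty dict lets the library fall back to any cached login.
    return {"token": token} if token else {}
```

The result can then be unpacked into the call shown above, for example load_dataset("ai4bharat/Shrutilipi", "bengali", split="train", **hub_auth_kwargs()). Keeping the token in the environment avoids hard-coding it in scripts or notebooks.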

You can also stream the dataset, without downloading it in full first, by passing streaming=True:

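from datasets import load_dataset

# Stream the Bengali training split instead of downloading it up front
dataset = load_dataset("ai4bharat/Shrutilipi", "bengali", split="train", streaming=True)

# Fetch and print the first example
print(next(iter(dataset)))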
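
To inspect more than one example from a stream, it helps to take a bounded slice rather than iterate over the whole corpus. The following is a minimal sketch using only the standard library. The helper peek is a hypothetical name, and the stand-in records use made-up fields ("text", "duration") that are assumptions for illustration, not the confirmed Shrutilipi schema.

```python
from itertools import islice


def peek(examples, n=3):
    """Return the first n items of any iterable, reading no more than n."""
    return list(islice(examples, n))


# Stand-in records so the sketch runs offline; the real column names
# in Shrutilipi may differ from these hypothetical ones.
fake_stream = ({"text": f"sample {i}", "duration": 1.0 + i} for i in range(1000))

for example in peek(fake_stream, n=2):
    # Show which fields each record carries, plus one value.
    print(sorted(example.keys()), example["text"])
```

Because a streamed dataset is itself iterable, the same helper should work on it directly, for example peek(dataset, n=2).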

Citation

If you use Shrutilipi in your work, please cite us:

@inproceedings{DBLP:conf/icassp/BhogaleRJDKKK23,
 author = {Kaushal Santosh Bhogale and
 Abhigyan Raman and
 Tahir Javed and
 Sumanth Doddapaneni and
 Anoop Kunchukuttan and
 Pratyush Kumar and
 Mitesh M. Khapra},
 title = {Effectiveness of Mining Audio and Text Pairs from Public Data for
 Improving {ASR} Systems for Low-Resource Languages},
 booktitle = {{ICASSP}},
 pages = {1--5},
 publisher = {{IEEE}},
 year = {2023}
}

License

This dataset is released under the CC BY 4.0 license.

Contact

For any questions or feedback, please contact:
