IndicVoices: Towards building an Inclusive Multilingual Speech Dataset for Indian Languages

Updates

[23 December 2025] We now have 11,200 hours of transcribed data! 🎉

Overview

INDICVOICES is a dataset of natural and spontaneous speech containing a total of 23.7K hours of read (8%), extempore (76%) and conversational (15%) audio from 51K speakers covering 400+ Indian districts and 22 languages. Of these 23.7K hours, 11.2K hours have already been transcribed. Through this paper, we share our journey of capturing the cultural, linguistic and demographic diversity of India to create a one-of-its-kind inclusive and representative dataset. More specifically, we share an open-source blueprint for data collection at scale, comprising standardised protocols, centralised tools, a repository of engaging questions, prompts and conversation scenarios spanning multiple domains and topics of interest, quality control mechanisms, comprehensive transcription guidelines and transcription tools. We hope that this open-source blueprint will serve as a comprehensive starter kit for data collection efforts in other multilingual regions of the world. Using INDICVOICES, we build IndicASR, the first ASR model to support all 22 languages listed in the 8th Schedule of the Constitution of India.

This work is funded by Bhashini, MeitY and Nilekani Philanthropies.

Usage

The datasets library lets you load and preprocess the dataset directly in Python. Before proceeding, make sure you have accepted the dataset's access conditions on the Hub and have an active Hugging Face access token (obtainable from your Hugging Face settings).

To load the dataset, run:

from datasets import load_dataset

# Load the Assamese validation split from the Hugging Face Hub
dataset = load_dataset("ai4bharat/IndicVoices", "assamese", split="valid")

# Check the dataset structure
print(dataset)

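You can also stream the dataset, without downloading a full split first, by passing streaming=True:

from datasets import load_dataset

# Stream the Assamese validation split instead of downloading it
dataset = load_dataset("ai4bharat/IndicVoices", "assamese", split="valid", streaming=True)

# Fetch and print the first example
print(next(iter(dataset)))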
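
When streaming, it can be handy to inspect a handful of examples rather than just the first one. The following is a minimal sketch; the helper name `peek` is hypothetical and not part of the datasets library, and it works on any iterable, including a streamed split.

```python
from itertools import islice


def peek(examples, n=3):
    """Return the first n items of any iterable, e.g. a streamed split."""
    return list(islice(examples, n))


# Usage with a streamed split (needs network access and an accepted
# access request on the Hub, so it is left commented out here):
# from datasets import load_dataset
# stream = load_dataset("ai4bharat/IndicVoices", "assamese",
#                       split="valid", streaming=True)
# for example in peek(stream, 2):
#     print(sorted(example.keys()))
```

Because `islice` stops after `n` items, only the data needed for those examples is fetched.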

Accessing Full-Length Audio Files

While the Hugging Face repository hosts the curated dataset and annotations, we also provide full-length audio recordings separately for users who require the original audio files.

The full audio data is distributed as multiple incremental, non-overlapping archives. Each archive contains a different subset of the recordings. To obtain the complete set of full-length audio for a language, you must download all versions (v1–v5).

The download URLs follow the pattern below:

https://iv-release.objectstore.e2enetworks.net/dmu_release/v1_<LANGUAGE>_train.tgz
https://iv-release.objectstore.e2enetworks.net/dmu_release/v2_<LANGUAGE>_train.tgz
https://iv-release.objectstore.e2enetworks.net/dmu_release/v3_<LANGUAGE>_train.tgz
https://iv-release.objectstore.e2enetworks.net/dmu_release/v4_<LANGUAGE>_train.tgz
https://iv-release.objectstore.e2enetworks.net/dmu_release/v5_<LANGUAGE>_train.tgz

Replace <LANGUAGE> with the desired language name (e.g., Hindi, Bengali, Tamil, Kannada or Malayalam).

Example

https://iv-release.objectstore.e2enetworks.net/dmu_release/v1_Hindi_train.tgz
https://iv-release.objectstore.e2enetworks.net/dmu_release/v2_Hindi_train.tgz
https://iv-release.objectstore.e2enetworks.net/dmu_release/v3_Hindi_train.tgz
https://iv-release.objectstore.e2enetworks.net/dmu_release/v4_Hindi_train.tgz
https://iv-release.objectstore.e2enetworks.net/dmu_release/v5_Hindi_train.tgz
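
The URL pattern above can also be expanded programmatically. This is a small sketch that only builds the strings; the function name `archive_urls` is an illustrative choice, and the code does not check that a given archive actually exists for a given language.

```python
BASE_URL = "https://iv-release.objectstore.e2enetworks.net/dmu_release"
VERSIONS = ("v1", "v2", "v3", "v4", "v5")


def archive_urls(language: str) -> list[str]:
    """Return the full-length audio archive URLs for one language."""
    return [f"{BASE_URL}/{version}_{language}_train.tgz" for version in VERSIONS]


# Print the five Hindi archive URLs listed in the example above
for url in archive_urls("Hindi"):
    print(url)
```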

Download and Extraction

You can download and extract the archives using:

# Download
wget <URL>

# Extract
tar -xvf <FILENAME>.tgz

Alternatively, the links can be opened directly in a web browser.
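
If you prefer to stay in Python, the same two steps can be sketched with the standard library alone. The helper names `download` and `extract` and the default output directory are assumptions made for illustration. The archives may be large, so make sure enough disk space is available before running this.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url: str, dest_dir: str = ".") -> Path:
    """Download one archive into dest_dir and return the local file path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / url.rsplit("/", 1)[-1]  # keep the archive's own file name
    urllib.request.urlretrieve(url, target)
    return target


def extract(archive: Path, out_dir: str = "extracted") -> Path:
    """Extract a .tgz archive into out_dir and return that directory."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression automatically
    with tarfile.open(archive, "r:*") as tar:
        tar.extractall(out)
    return out


# Usage (needs network access, so it is left commented out here):
# url = "https://iv-release.objectstore.e2enetworks.net/dmu_release/v1_Hindi_train.tgz"
# extract(download(url, "archives"), "extracted/Hindi")
```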

File Structure

After extraction, each audio file is accompanied by a corresponding .json file containing metadata and annotation details for that recording.
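
To walk an extracted archive, each recording can be paired with its metadata. The sketch below assumes, without confirmation from this card, that an audio file and its .json file share a file stem and sit in the same directory. The audio extension and the JSON fields are not specified here, so the code treats any non-.json sibling as the audio and returns the metadata as a plain dictionary. The function name `pair_recordings` is hypothetical.

```python
import json
from pathlib import Path


def pair_recordings(root: str):
    """Yield (audio_path, metadata_dict) pairs found under root.

    Assumption: a recording and its metadata share a file stem and live
    in the same directory; files without a partner are skipped.
    """
    by_stem = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_stem.setdefault((path.parent, path.stem), []).append(path)

    for _, files in sorted(by_stem.items()):
        meta = [f for f in files if f.suffix.lower() == ".json"]
        audio = [f for f in files if f.suffix.lower() != ".json"]
        if meta and audio:
            with open(meta[0], encoding="utf-8") as fh:
                yield audio[0], json.load(fh)


# Usage: inspect which metadata fields each recording provides
# for audio_path, metadata in pair_recordings("extracted/Hindi"):
#     print(audio_path.name, sorted(metadata.keys()))
```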

Citation

If you use IndicVoices in your work, please cite us:

@inproceedings{DBLP:conf/acl/JavedNGJBMSAFPR24,
 author = {Tahir Javed and
 Janki Nawale and
 Eldho Ittan George and
 Sakshi Joshi and
 Kaushal Santosh Bhogale and
 Deovrat Mehendale and
 Ishvinder Virender Sethi and
 Aparna Ananthanarayanan and
 Hafsah Faquih and
 Pratiti Palit and
 Sneha Ravishankar and
 Saranya Sukumaran and
 Tripura Panchagnula and
 Sunjay Murali and
 Kunal Sharad Gandhi and
 Ambujavalli R and
 Manickam K. M and
 C. Venkata Vaijayanthi and
 Krishnan Srinivasa Raghavan Karunganni and
 Pratyush Kumar and
 Mitesh M. Khapra},
 title = {IndicVoices: Towards building an Inclusive Multilingual Speech Dataset
 for Indian Languages},
 booktitle = {{ACL} (Findings)},
 pages = {10740--10782},
 publisher = {Association for Computational Linguistics},
 year = {2024}
}