Sarvam AI, alongside AI4Bharat and IIT Madras, has unveiled IndicVoices, a comprehensive speech dataset built to an inclusive diversity wishlist, with fair representation of demographics, domains, languages, and applications. The IndicVoices dataset comprises 7,348 hours of natural and spontaneous speech from 16,237 speakers across 145 Indian districts and 22 languages.
The dataset, the result of a colossal nationwide effort involving 1,893 personnel in various roles, includes read (9%), extempore (74%), and conversational (17%) audio segments. A substantial 1,639 hours have already been transcribed, with a median of 73 hours per language.
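For a rough sense of scale, the reported split translates into approximate hours per speech type. The rounded figures below are a back-of-the-envelope sketch derived from the announced totals, not numbers stated in the release:

```python
# Approximate hours per speech type in IndicVoices, derived from the
# reported totals: 7348 hours, split 9% read / 74% extempore / 17% conversational.
TOTAL_HOURS = 7348
splits = {"read": 0.09, "extempore": 0.74, "conversational": 0.17}

hours = {kind: round(TOTAL_HOURS * frac) for kind, frac in splits.items()}
print(hours)  # {'read': 661, 'extempore': 5438, 'conversational': 1249}
```

Extempore speech dominates the collection by a wide margin, at roughly 5,400 hours.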
Using IndicVoices, the team has built IndicASR, the first ASR model to support all 22 languages listed in the 8th Schedule of the Constitution of India. All the data, tools, guidelines, models, and other materials developed as part of this work will be made publicly available.
The project, funded by BHASHINI, Ministry of Electronics and Information Technology, Government of India, and supported by grants from Nilekani Philanthropies and the EkStep Foundation, aims to serve as a comprehensive resource for data collection in multilingual regions worldwide.
Materials developed as part of this initiative include prompts for digital interactions, questions from various domains, conversational role-play scenarios, elaborate transcription guidelines, an Android application for on-field data collection and verification, and a web-based platform for transcription workflow management.
The tools and the dataset will be released under an MIT license and a CC-BY-4.0 license, respectively, allowing widespread commercial usage. You can check out IndicVoices here.