
Sarvam AI, AI4Bharat, and IIT Madras Unveil IndicVoices, India’s First Comprehensive Speech Dataset 

A substantial 1639 hours have already been transcribed, with a median of 73 hours per language.

Sarvam AI, together with AI4Bharat and IIT Madras, has unveiled IndicVoices, a comprehensive speech dataset built around an inclusive diversity wishlist, with fair representation of demographics, domains, languages and applications. The IndicVoices dataset comprises 7348 hours of natural and spontaneous speech from 16237 speakers across 145 Indian districts and 22 languages.

The dataset, a result of a colossal nationwide effort involving 1893 personnel in various roles, includes read (9%), extempore (74%), and conversational (17%) audio segments. A substantial 1639 hours have already been transcribed, with a median of 73 hours per language.

Using IndicVoices, the team has built IndicASR, the first automatic speech recognition (ASR) model to support all 22 languages listed in the Eighth Schedule of the Constitution of India. All the data, tools, guidelines, models and other materials developed as part of this work will be made publicly available.

The project, generously funded by BHASHINI, Ministry of Electronics and Information Technology, Government of India, and supported by grants from Nilekani Philanthropies and EkStep Foundation, aims to serve as a comprehensive resource for data collection in multilingual regions globally. 

Materials developed as part of this initiative include prompts for digital interactions, questions from various domains, conversational role-play scenarios, elaborate transcription guidelines, an Android application for on-field data collection and verification, and a web-based platform for transcription workflow management. 

The tools and the dataset will be released under the MIT license and the CC-BY-4.0 license, respectively, both of which permit commercial use. You can check out IndicVoices here.

Siddharth Jindal

Siddharth is a media graduate who loves exploring tech through journalism and putting forward ideas worth pondering in the era of artificial intelligence.