The Indian Institute of Science (IISc) AI and Robotics Technology Park (ARTPARK) is set to open-source 16,000 hours of spontaneous speech data from 80 districts as part of Project Vaani, under the Ministry of Electronics and Information Technology’s flagship AI initiative, BHASHINI.
This effort is in collaboration with Google.
The ambitious project aims to curate datasets of 150,000 hours of natural speech and text from approximately one million people across 773 districts in India. The first phase of the project, launched at the end of 2022, is nearing completion.
In its second phase, Project Vaani will target 160 districts, collecting 200 hours of speech data from about 1,000 people per district. So far, voice data in 58 different language variants or dialects has been gathered from 80 districts and will soon be made publicly available.
Prasanta Ghosh, Assistant Professor in the Department of Electrical Engineering at IISc, who leads Project Vaani, said, “Our journey in the Indic language data sets started in 2020 with RESPIN (recognising speech in Indian languages).”
Speaking with AIM, Ghosh said, “We record people with impairments and are building technology that can understand them. Maybe humans can’t, but AI would be able to.” He said that while collecting the data, he realised that there was no corpus even to build technologies for healthy people. “And then the first thing came to my mind was the idea to build a good foundational model for them.”
Apart from healthcare, Ghosh is also focused on using AI for education in India through his project, English Gyani. Funded by the Ministry of Science and Technology, it is also called the IMPRINT project. “We’re building tools to help young jobseekers learn spoken English and comprehension skills right from their homes,” said Ghosh.
“Bhashini, for the first time, helped create the digital data for low resource languages in its effort to build AI models for low resource languages,” said Amitabh Nag, Chief Executive of Bhashini.
The primary goal is to use this dataset as training data for speech-to-text AI models, particularly benefiting conversational AI platforms and chatbots requiring diverse voice datasets. Organisations can leverage this data to develop speech models for various Indian languages and dialects, aligning with their business objectives.
A Google spokesperson highlighted the project’s significance, stating, “Project Vaani was born out of the deep need to make high-quality, diverse & anonymised datasets available for people to build technology solutions that reflect the way local languages are spoken. We are grateful for the tireless work of our partners to realise this commitment, & are very pleased with the ongoing momentum. We look forward to sharing more updates soon.”