IISc to Open Source Project Vaani Under Bhashini with 16,000 Hours of Speech Data

The ambitious project aims to curate datasets of 150,000 hours of natural speech and text from approximately one million people across 773 districts in India.

The Indian Institute of Science (IISc) AI and Robotics Technology Park (ARTPARK) is set to open-source 16,000 hours of spontaneous speech data from 80 districts as part of Project Vaani, under the Ministry of Electronics and Information Technology’s flagship AI initiative, BHASHINI.

The effort is a collaboration with Google.

The ambitious project aims to curate datasets of 150,000 hours of natural speech and text from approximately one million people across 773 districts in India. The first phase of the project, launched at the end of 2022, is nearing completion.

In its second phase, Project Vaani will target 160 districts, collecting 200 hours of speech data from about 1,000 people per district. So far, voice data in 58 different language variants or dialects has been gathered from 80 districts and will soon be made publicly available.

Prasanta Ghosh, Assistant Professor in the Department of Electrical Engineering at IISc, who leads Project Vaani, said, “Our journey in the Indic language data sets started in 2020 with RESPIN (recognising speech in Indian languages).”

Speaking with AIM, Ghosh said, “We record people with impairments and are building technology that can understand them. Maybe humans can’t, but AI would be able to.” He said that while collecting the data, he realised that there was no corpus even to build technologies for healthy people. “And then the first thing came to my mind was the idea to build a good foundational model for them.”

Apart from healthcare, Ghosh is also focused on using AI for education in India through his project, English Gyani. Funded by the Ministry of Science and Technology, it is also called the IMPRINT project. “We’re building tools to help young jobseekers learn spoken English and comprehension skills right from their homes,” said Ghosh.

“Bhashini, for the first time, helped create the digital data for low resource languages in its effort to build AI models for low resource languages,” said Amitabh Nag, Chief Executive of Bhashini. 

The primary goal is to use this dataset as training data for speech-to-text AI models, particularly benefiting conversational AI platforms and chatbots requiring diverse voice datasets. Organisations can leverage this data to develop speech models for various Indian languages and dialects, aligning with their business objectives.

A Google spokesperson highlighted the project’s significance, stating, “Project Vaani was born out of the deep need to make high-quality, diverse & anonymised datasets available for people to build technology solutions that reflect the way local languages are spoken. We are grateful for the tireless work of our partners to realise this commitment, & are very pleased with the ongoing momentum. We look forward to sharing more updates soon.”


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words.