The Indic data problem has been on every Indian AI researcher’s mind. One of the primary challenges in developing AI for Indian languages is the scarcity of high-quality data. Unlike English, which has a vast amount of digital content available, Indian languages lack sufficient natural data to train AI models.
“For Indian languages, or any language other than English, this has not been true [availability of data online], because of which we do not have a large amount of natural data,” Vivekananda Pani, the CTO and co-founder of Reverie Language Technologies, told AIM.
Various approaches are being considered to address this data problem. One involves using artificial methods to augment existing data, such as converting English data into Indian languages using machine translation.
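To make the idea concrete, here is a minimal sketch of translation-based augmentation. It assumes the open-source Hugging Face transformers library and the publicly available Helsinki-NLP/opus-mt-en-hi English-to-Hindi model; the sample sentences are illustrative, not drawn from any real training corpus.

```python
# Hedged sketch of translation-based data augmentation: machine-translating
# English text into Hindi to create synthetic Indic training examples.
# Assumes the `transformers` library and the Helsinki-NLP/opus-mt-en-hi model.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")

english_corpus = [
    "The weather is pleasant today.",
    "Access to education changes lives.",
]

# Each translated sentence becomes one synthetic Hindi training example.
augmented = [out["translation_text"] for out in translator(english_corpus)]
print(augmented)
```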
However, Pani cautioned against relying solely on this approach. “The best way is not, let’s say, if we try to augment and continue to augment forever… because in the English language that’s not likely to happen,” he said.
Building a Rocket for Indic Languages
Pani advocates for a more sustainable solution: enabling people to create a large amount of natural data that can be fed back into AI systems. This includes leveraging content from various media formats, such as audio and video, which can be transcribed and used to train AI models.
Speaking about the need to build something like ChatGPT in India, Pani offered an apt analogy. “Scaling the moon is actually a lot of distance. If you walk, it will take 25 lives, versus if you build a rocket today, you will reach it in a month.”
One problem Pani pointed out is the lack of standardisation of Indian languages in the digital world. Giving the example of the word “sahayogi”, Pani and his team explained that there are several ways in which people type the word on their phones.
This is partly because there is no uniformity for Indian languages across different keyboards, whereas English input is uniform across the board.
“Even if somebody doesn’t know English, but wants to send a message on WhatsApp, they tend to write the message in their native language using English letters,” Pani explained. He added that this Romanisation is confined to the digital world; when people write in books and letters, they usually use native scripts.
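As a rough illustration of this standardisation problem (not Reverie’s actual system), the sketch below maps a handful of assumed Roman-script spellings of “sahayogi” onto a single canonical Devanagari form; the variant list and canonical form are hypothetical.

```python
# Illustrative sketch only: normalising the many Roman-script spellings of a
# single word to one canonical native-script form. The variants and the
# canonical form below are assumptions for illustration.
ROMAN_VARIANTS = {
    "sahayogi": "सहयोगी",
    "sahyogi": "सहयोगी",
    "sahayogee": "सहयोगी",
    "sahyogee": "सहयोगी",
}

def normalise(token: str) -> str:
    """Map a Romanised spelling to its canonical form, if it is known."""
    return ROMAN_VARIANTS.get(token.lower(), token)

print(normalise("Sahyogi"))  # -> सहयोगी
```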
Earlier, while speaking to AIM, Raj Dabre, a prominent researcher at NICT in Kyoto, discussed the complexities of developing chatbots for Indian languages. He noted that if you type something in Hindi but in the Roman script, current AI models such as Llama are able to process it to some degree.
Dabre and his team are working on training models on Romanised versions of Indic data, leveraging knowledge from English and transferring it to Indic languages.
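A hedged sketch of that preprocessing idea: Romanising Devanagari text so a model trained mostly on Latin-script data can reuse its vocabulary. It assumes the open-source indic_transliteration package; the sample sentence is illustrative.

```python
# Sketch of Romanising Devanagari text, assuming the open-source
# `indic_transliteration` package; the sentence is an illustrative example.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

hindi_sentence = "मुझे हिंदी में जवाब दो"  # "Answer me in Hindi"
romanised = transliterate(hindi_sentence, sanscript.DEVANAGARI, sanscript.IAST)
print(romanised)  # roughly: "mujhe hiṃdī meṃ javāba do"
```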
“Technology has actually been a barrier in our case,” said Pani. He explained that this is something that needs to be addressed very quickly.
Working on the same issue is Pushpak Bhattacharyya, who was recently appointed chairman of the committee for standardisation in artificial intelligence, set up by the Bureau of Indian Standards (BIS).
Bhattacharyya argues that while existing LLMs can be adapted for Indian languages, creating specialised foundational models could lead to significant efficiency gains. Likewise, he argues, smaller models trained on less data would not be feasible in the longer run.
Native-Language AI is a Necessity
Pani explained that the team at Reverie Language Technologies has been working on this problem for a long time, building that ‘AI rocket’.
“We started in an era when there was absolutely zero Indian language data in the digital media,” he recalled, highlighting the progress made from their first speech model using only 100 hours of data to more recent models utilising at least 10,000 hours.
“In India, we still have less than 7% people who are fluent in English,” said Pani, adding that there is a clear need to build AI models in Indian languages rather than relying solely on models from the West.
Pani also addressed the Indian startup scene for fundamental AI research. He noted that while there has historically been a lack of investment in long-term research projects, the landscape is changing. “Now that OpenAI showed the world what is possible, people have a belief that this can be achieved and therefore let’s go and invest,” he observed.
Another issue is building hardware in India, which Pani explained is a lot harder. “On that front, we are far behind right now. We don’t have enough skills. We will have to import skills and it’s a very, very expensive process,” he said.
Agreeing with Bhavish Aggarwal’s recent comment comparing OpenAI to the East India Company, Pani said that much of the data Indian users “donate” to companies such as Google through their phones stays with those companies and is accessible to Indian researchers only if they pay for it.
“When you look at countries such as China and Japan, they have their computing world in their native language,” Pani added, saying the government needs to push harder to set standards and baseline fundamentals.
“These fundamentals today are actually governed by a few American companies and formed by them. So they would not really think of what is right for our country,” Pani concluded.