
The Missing Link for Indian Language Chatbots: Indic Data 

“You will see many claiming that they can make a chatbot or LLM for Indian languages; 99% of those are transient,” said Raj Dabre.


In recent times, there has been a noticeable upswing in efforts to build Indic language models. Yet even though some of these models are adequate for various tasks, their adoption remains abysmally low compared to that of their ‘superior’ English counterparts. A major hurdle is the scarcity of Indic language datasets.

In a conversation with AIM, Raj Dabre, a prominent researcher at NICT in Kyoto, adjunct faculty at IIT Madras and a visiting professor at IIT Bombay, discussed the complexities of developing chatbots for Indian languages.

“These models [GPT-3] have seen close to tens of trillions of tokens or words in English. Unless you have seen the entirety of the web, or more or less all of it, none of these models will be able to actually solve the generative AI problem for that [Indian] language,” said Dabre. 

The crux of the issue lies in the sheer lack of digitised data in Indian languages. Since English dominates the internet, the digitised content available in Indian languages remains vastly insufficient.

Bridging the Gap with RomanSetu

Dabre rued that chatbots for Indian languages are still a dream. “You will see a lot of people claiming that they can make a chatbot or LLM for Indian languages, but 99% of those things are transient. They are not going to be too useful in production, because nobody has solved the data problem yet,” said Dabre.

He explained that unless a monolingual dataset is created in Indian languages that matches the scale of English datasets, we won’t be able to build chatbots that answer in Indic languages without faltering in some way. “However, there are other strategies to transfer the capabilities of English to Indian languages. If someone figured that out, the data problem might be half solved,” said Dabre.

To tackle the issue, Dabre, with researchers from AI4Bharat, IIITDM Kancheepuram, A*STAR Singapore, Flipkart, Microsoft India, and IIT Madras, introduced the RomanSetu paper, which describes a technique for unlocking the multilingual capabilities of LLMs via romanisation.

If you type something in Hindi but in the Roman script, current AI models like Llama can already process it to some degree. That is what Dabre and his team are building on: training models on romanised versions of Indic data so that the knowledge encoded in English can be transferred to Indic languages.
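As a rough illustration of that romanisation step, the sketch below maps Devanagari text into an ASCII Roman scheme so an English-centric LLM can process it with its existing subword vocabulary. The open-source indic-transliteration package and the ITRANS scheme are illustrative assumptions; RomanSetu’s own transliteration pipeline may differ.

```python
# A minimal sketch of the romanisation idea behind RomanSetu: convert
# native-script (here, Devanagari) text into Roman script so that an
# English-centric LLM can reuse its existing vocabulary.
# The indic-transliteration package and the ITRANS scheme are
# illustrative assumptions, not necessarily what RomanSetu uses.
from indic_transliteration import sanscript
from indic_transliteration.sanscript import transliterate

hindi = "भारत में कई भाषाएँ बोली जाती हैं"  # "Many languages are spoken in India"

romanised = transliterate(hindi, sanscript.DEVANAGARI, sanscript.ITRANS)
print(romanised)  # prints an ASCII romanisation of the sentence
```

Pretraining or fine-tuning on text romanised this way is what lets a model piggyback on its English competence.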

Through the RomanSetu experiments, the team found that this approach actually works better than training models on native scripts like Devanagari. “It is like a shortcut,” said Dabre. “We cannot properly pursue the goal of building the next big LLM for Indic languages unless we solve the data problem.”

Additionally, Dabre is currently working on speech translation models, which are in huge demand in India. He is also one of the creators of the IndicBART and IndicTrans2 models at AI4Bharat, which was founded by his seniors at IIT Bombay.

The team created the IndicNLG Benchmark around the time GPT-3 was launched, but there was not much conversation around Indic language generation back then. Now that ChatGPT is here, everyone is into building chatbots.

Dabre’s journey with NLP began over a decade ago at IIT Bombay under the guidance of Pushpak Bhattacharyya, the former president of the Association for Computational Linguistics and a leading figure in the field of Indic NLP.

He later completed his PhD at Kyoto University under the guidance of Sadao Kurohashi, director of the National Institute of Informatics, Japan, and a leading figure in Japanese NLP.

Training Big Indic Models Can Be a Waste of Compute

Giving the example of AI4Bharat’s Sangraha dataset, which has 251 billion tokens across 22 languages, and citing Microsoft’s Phi models, Dabre said that, given the scale of available data, it is more appropriate for India to build a 1-billion-parameter model than a larger one.

“A 1 billion parameter model might just do a decent job,” he added, noting that there is not enough data to train a much bigger model and calling that an injudicious use of compute. However, once data scales improve, models should scale too.
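A back-of-the-envelope calculation makes the point. Assuming the widely cited Chinchilla heuristic of roughly 20 training tokens per parameter (an assumption for illustration, not a figure from Dabre), Sangraha’s 251 billion tokens support only a modest compute-optimal model, while a 1-billion-parameter model can be heavily over-trained on the full corpus, Phi-style:

```python
# Illustrative sizing arithmetic, assuming the Chinchilla heuristic of
# ~20 training tokens per parameter; the 251B-token figure is Sangraha's.

SANGRAHA_TOKENS = 251e9          # tokens across 22 languages
CHINCHILLA_TOKENS_PER_PARAM = 20

optimal = SANGRAHA_TOKENS / CHINCHILLA_TOKENS_PER_PARAM
print(f"Compute-optimal size: {optimal / 1e9:.1f}B parameters")
# -> ~12.6B in total, but spread across 22 languages that is only
#    ~11B tokens per language on average, far too little for a giant.

one_b_model = 1e9                # the ~1B model Dabre suggests
print(f"Tokens per parameter at 1B: {SANGRAHA_TOKENS / one_b_model:.0f}")
# -> ~251 tokens/parameter: a heavily over-trained small model in the
#    inference-friendly regime, rather than an under-trained large one
#    that wastes compute.
```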

Another of Dabre’s papers helps researchers train models with synthetic data. The paper, Do Not Worry if You Do Not Have Data, shows that translating English documents into Indian languages causes no real harm: the translated corpus carries roughly the same knowledge and yields similar performance. “You can get by, but it is not an ideal solution; the data needs to be cleaned,” said Dabre.
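That synthetic-data recipe can be sketched with any strong English-to-Indic machine translation model. The snippet below uses Meta’s NLLB-200 checkpoint purely because its Hugging Face API is compact; IndicTrans2, which Dabre co-created, would be the natural choice in practice, and the model name and cleaning steps here are illustrative assumptions.

```python
# A minimal sketch of creating synthetic Indic training data by
# machine-translating English documents, per "Do Not Worry if You Do
# Not Have Data". NLLB-200 is an illustrative stand-in for IndicTrans2.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(MODEL, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def to_hindi(texts):
    """Translate a batch of English sentences into Hindi (Devanagari)."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    out = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("hin_Deva"),
        max_length=256,
    )
    return tokenizer.batch_decode(out, skip_special_tokens=True)

docs = ["Machine translation can bootstrap training data for low-resource languages."]
print(to_hindi(docs))
# As Dabre cautions, the output still needs cleaning (e.g. language-ID
# and length-ratio filters) before it goes into a pretraining corpus.
```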

He concluded that unless these models have an Indian context, they can’t work very well with Indian languages. 


Mohit Pandey

Mohit dives deep into the AI world to bring out information in simple, explainable, and sometimes funny words.