The assertion that “AI models are only as good as the data they’re trained on” has been repeated by experts countless times, and it remains undeniably true.
ChatGPT, the popular chatbot by OpenAI, is so impressive largely because of the scale of its training data: the GPT-3 family of models behind it was trained on datasets such as WebText2 and a filtered Common Crawl corpus drawn from roughly 45 terabytes of raw web text, spanning articles, blogs, forums, and other publicly available written content.
However, for AI to flourish in India, the country needs models that understand its linguistic nuances and cultural complexities.
We have already seen a host of Indic large language models (LLMs) emerge within a short span, including enterprise-grade efforts from Tech Mahindra and Sarvam AI.
In addition, various Indic adaptations of Meta’s widely used open-source LLM, Llama, have been created for languages like Tamil, Kannada, and Telugu, among others. The critical challenge, however, lies in the lack of robust data for the diverse set of languages spoken across the country, a resource essential for building effective Indic LLMs.
The necessity to develop good Indic datasets
India boasts immense linguistic diversity, encompassing hundreds of languages and thousands of dialects, and generating datasets for such a vast array of languages is a challenging endeavour. While some datasets exist for the 22 official languages, hundreds of other languages actively spoken in the country have next to no digital data.
“Large language models require very large amounts of high-quality data. For many Indian languages, we do not have this right now,” Pratyush Kumar, co-founder at Sarvam AI and AI4Bharat, an open-source AI research initiative focused on Indian language AI, told AIM.
However, several initiatives are in progress to collect datasets for Indic languages. One such effort is Project Vaani, a collaboration between the Indian Institute of Science (IISc) and Google, which focuses on gathering speech datasets to facilitate the training of Indic LLMs.
Additionally, AI4Bharat is building Indic language datasets like IndicCorp v2, the most extensive collection of texts for Indic languages, comprising 20.9 billion tokens, of which 14.4 billion cover 23 Indic languages. And then there is Bhashini, the Indian government’s initiative to create open-source Indic language datasets.
However, even though these are good datasets, far more data is needed. “The current volume of this type of data is relatively small; we need to collect even more,” Vivek Raghavan, co-founder of Sarvam AI, told AIM.
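Corpus sizes like those above are typically reported in token counts rather than raw file sizes. As a rough illustration (not the actual tooling behind AI4Bharat or Bhashini), here is a minimal sketch that tallies whitespace-separated tokens per language from a directory of plain-text files; the one-file-per-language layout is an assumption:

```python
# Minimal sketch: tally per-language token counts for a text corpus.
# Assumes a hypothetical layout of one plain-text file per language
# (hi.txt, ta.txt, ...); real corpora use proper subword tokenisers,
# sharded files, and deduplication, so treat this only as an illustration.
from pathlib import Path


def count_tokens(corpus_dir: str) -> dict[str, int]:
    counts: dict[str, int] = {}
    for path in Path(corpus_dir).glob("*.txt"):
        lang = path.stem  # e.g. "hi" for Hindi, "ta" for Tamil
        with path.open(encoding="utf-8") as f:
            # Whitespace splitting is a crude stand-in for a real tokeniser.
            counts[lang] = sum(len(line.split()) for line in f)
    return counts


if __name__ == "__main__":
    for lang, n in sorted(count_tokens("indic_corpus").items()):
        print(f"{lang}: {n:,} tokens")
```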
Collecting data is not easy
Building datasets remains the most challenging task for those wanting to build Indic LLMs. It involves digitising books, collaborating with linguists, engaging communities, organising content-creation workshops, and partnering with local institutions. Surveys, interviews, and transcription services will also play a crucial role.
Tech Mahindra, which recently finished developing its 539-million-parameter Hindi LLM trained on 10 billion tokens of Hindi and its dialects, sent a team to northern India to collect data.
“We went to Madhya Pradesh, Rajasthan, and parts of Bihar. The team’s task was to collect Hindi and dialect data by interacting with professors and leveraging the Bhasha-dan portal available on ProjectIndus.in,” Nikhil Malhotra, global head of Makers Lab at Tech Mahindra and the brain behind Project Indus, told AIM.
He also highlighted an initiative to have Tech Mahindra employees contribute everyday sentences like ‘Main ghar se bahar jata hoon’ (I am going out of the house) to the portal, helping gather diverse linguistic prompts.
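Aggregating such crowd-sourced contributions typically involves normalisation and de-duplication before the sentences can enter a training corpus. The sketch below illustrates that step under assumed file names and rules; it is not the actual Bhasha-dan pipeline:

```python
# Minimal sketch: clean crowd-contributed sentences before adding them
# to a training corpus. File names and rules are illustrative assumptions,
# not the actual Bhasha-dan pipeline.
import unicodedata


def normalise(sentence: str) -> str:
    # Unicode normalisation matters for Indic scripts, where the same
    # syllable can be encoded as different codepoint sequences.
    s = unicodedata.normalize("NFC", sentence.strip())
    return " ".join(s.split())  # collapse repeated whitespace


def deduplicate(path_in: str, path_out: str) -> int:
    seen: set[str] = set()
    kept = 0
    with open(path_in, encoding="utf-8") as fin, \
         open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            s = normalise(line)
            if s and s not in seen:
                seen.add(s)
                fout.write(s + "\n")
                kept += 1
    return kept


if __name__ == "__main__":
    n = deduplicate("contributions_raw.txt", "contributions_clean.txt")
    print(n, "unique sentences kept")
```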
Similarly, Swecha Telangana, an open-source advocacy group, recently collaborated with 25-30 colleges and over 10,000 students on the translation, correction, and digitisation of 40,000-45,000 pages of Telugu folk tales. The dataset was then used to train a 7-billion-parameter Telugu small language model (SLM) called ‘AI Chandamama Kathalu’.
Bhashini, too, is making similar efforts to build Indic language datasets. “We lack digital data on low-resource languages, such as Bodo or Sindhi. For these languages, we approach individuals proficient in both Bodo and English,” Amitabh Nag, CEO at Bhashini, told AIM. He explained that these bilingual contributors help create training datasets by providing English text parallel to the content written in Bodo.
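One common way to store such parallel data is as one JSON record per aligned sentence pair, which downstream translation or language-model training scripts can then stream. The sketch below assumes sentence-aligned input files and a made-up schema, not Bhashini’s published format:

```python
# Minimal sketch: assemble aligned Bodo-English sentence pairs into JSONL.
# Input file names and the record schema are illustrative assumptions,
# not Bhashini's actual format.
import json


def build_parallel_corpus(bodo_path: str, english_path: str, out_path: str) -> None:
    with open(bodo_path, encoding="utf-8") as f_brx, \
         open(english_path, encoding="utf-8") as f_en, \
         open(out_path, "w", encoding="utf-8") as f_out:
        # Assumes the two files are already sentence-aligned line by line.
        for brx, en in zip(f_brx, f_en):
            record = {"brx": brx.strip(), "en": en.strip()}  # "brx" is the ISO 639-3 code for Bodo
            f_out.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    build_parallel_corpus("bodo.txt", "english.txt", "bodo_en_parallel.jsonl")
```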
These efforts stand as a testament to the arduous task ahead for those building good Indic LLMs. Nonetheless, Krutrim, the AI startup from Ola founder Bhavish Aggarwal, claims to have built a model from scratch that understands India’s 22 official languages.
Similarly, earlier this month, Qx Lab AI launched Ask Qx, which also understands 22 official languages. However, neither company has revealed the details of the datasets their models have been trained on.
It’s worth highlighting that ChatGPT, leveraging the capabilities of both GPT-3.5 and GPT-4, exhibits multilingual proficiency, including an understanding of Indic languages like Hindi, Bengali, and Assamese.
However, the chatbot occasionally struggles to distinguish Bengali from Assamese, an issue that stems from the quality of the training data available for these specific languages.
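Part of the difficulty is that Bengali and Assamese are written in essentially the same script: their characters sit in the same Unicode block, with only a few letters (such as the Assamese ৰ and ৱ) setting the two apart, so a model cannot lean on the script alone and needs enough high-quality text in each language. A minimal sketch illustrating the overlap:

```python
# Minimal sketch: show that Bengali and Assamese text share one Unicode
# block, so script alone cannot tell the two languages apart.
import unicodedata

bengali_sample = "আমি ভাত খাই"    # Bengali: "I eat rice"
assamese_sample = "মই ভাত খাওঁ"   # Assamese: "I eat rice"

for label, text in [("Bengali", bengali_sample), ("Assamese", assamese_sample)]:
    # Unicode names for every letter in both samples start with "BENGALI",
    # because Assamese shares the Bengali-Assamese script block.
    blocks = {unicodedata.name(ch).split()[0] for ch in text if not ch.isspace()}
    print(label, "->", blocks)  # both print {'BENGALI'}
```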
An ecosystem needs to develop
Kumar believes building good datasets for the 22 official languages will take a while. “We should as an ecosystem continue to focus on that because we want to support these languages,” he said.
What India truly needs is a unified, collaborative effort, with individuals, organisations, and enterprises joining forces and supporting one another wherever necessary. Encouragingly, a significant portion of the ongoing initiatives is taking an open-source approach.
This not only promotes transparency and accessibility but also facilitates a collective and inclusive effort towards the common goal of advancing linguistic technologies in the country.
“I believe you will observe a notable shift, particularly in languages spoken more frequently, gaining momentum sooner. However, with time, I am confident that we have the capability to encompass all languages, including their various dialects and nuances,” Raghavan said.