Lately, there has been a debate in India’s AI ecosystem about whether the country should build its own foundational models. Some argue that India can address real-world problems by leveraging existing state-of-the-art models without spending millions on new ones.
Others, however, believe that it is essential to develop models that deeply understand the nuances and diversity of India’s many cultures and languages.
Amidst all this, an Indian state government has undertaken the task of developing an LLM that operates in the state’s official language.
In July, the Information Technology, Electronics & Communications (ITE&C) department in Telangana hosted a datathon aimed at creating a Telugu LLM.
Carried out in partnership with Swecha, a non-profit free and open-source software movement in Telangana, the datathon was organised to build datasets which, in turn, will be used to train the Telugu LLM.
Building Telugu Datasets
Building effective LLMs for Indian languages remains a challenging task due to the scarcity of high-quality data. ChatGPT is impressive in large part because its underlying models were trained on multiple terabytes of data, and datasets of that size are not available for Indian languages.
To develop datasets in Telugu, the Telangana government is tapping into its rich education system. Around 1 lakh undergraduate students across all engineering colleges in Telangana took part in the datathon and collected data from ordinary citizens who use Telugu as their mother tongue.
The team collected data from oral sources such as folk tales, songs, local histories, and information about food and cuisine. Additionally, they plan to dispatch volunteers to approximately 7,000 villages across the state to gather audio and video samples of people discussing various topics, which will then be converted into content.
Interestingly, this is not the first such exercise. Last year, the same Swecha team developed a Telugu small language model (SLM), named ‘AI Chandamama Kathalu’, from scratch.
To collect data for the model, a similar datathon was organised with volunteers from Swecha, in collaboration with some 25-30 colleges. Over 10,000 students participated in translating, correcting, and digitising 40,000-45,000 pages of Telugu folk tales.
Building LLMs
Ozonetel, which is an industry partner for the project along with DigiQuanta and TechVedika, supported it by training the model and providing the necessary compute.
The team tried fine-tuning Google’s open-source mT5 model, Meta’s Llama, and Mistral. However, they finally settled on building a model similar to GPT-2 from scratch. Training the model on a cluster of NVIDIA’s A100 GPUs took nearly a week.
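To give a sense of what "similar to GPT-2" can mean in terms of scale, here is a rough sketch that counts the trainable parameters of a GPT-2-style model. The article does not state the size of the team's model, so the numbers plugged in below are the published configuration of the smallest GPT-2 release, used purely as an illustration. The function name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(n_layer, d_model, vocab_size, n_ctx):
    """Count trainable parameters in a GPT-2-style decoder-only transformer."""
    # Token embeddings plus learned position embeddings.
    # The output head reuses the token embedding matrix, so it adds nothing.
    embeddings = vocab_size * d_model + n_ctx * d_model

    # Attention: fused query/key/value projection and output projection,
    # each with a bias vector.
    attention = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)

    # Feed-forward: expand to 4 * d_model, then project back, each with bias.
    d_ff = 4 * d_model
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    # Two layer norms per block, each with a scale and a shift vector.
    layer_norms = 2 * (2 * d_model)

    per_block = attention + mlp + layer_norms
    final_layer_norm = 2 * d_model
    return embeddings + n_layer * per_block + final_layer_norm


# Illustrative only: the smallest published GPT-2 configuration,
# NOT the configuration of the Telugu model described in the article.
small = gpt2_param_count(n_layer=12, d_model=768, vocab_size=50257, n_ctx=1024)
print(f"{small:,}")  # 124,439,808
```

A model of this size is several orders of magnitude smaller than the largest commercial LLMs, which helps explain why training one from scratch on a modest GPU cluster is feasible.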
Now, the aim is to develop a larger model and have it ready to be showcased at the Telangana government’s Global AI Summit, scheduled to take place in September this year.
Developing a large model can cost millions of dollars, and building something on the scale of ChatGPT could run into the billions. The team, however, aims to develop the Telugu LLM at a cost of around INR 5-10 lakh.
India’s Efforts to Build AI Models in Regional Languages
Over time, we have seen efforts to build AI models in regional languages. For instance, Abhinand Balachandran, assistant manager at EXL, released a Telugu version of Meta’s open-source Llama 2 model.
Similarly, in April this year, a freelance data scientist released Nandi. Built on top of Zephyr-7b-Gemma, the model has 7 billion parameters and is trained on Telugu Q&A datasets curated by Telugu LLM Labs.
Such models have not been limited to Telugu. Others built on top of open-source models, such as Tamil Llama and Marathi Llama, have also appeared.
While these models could be seen as mere experiments, the Telangana government’s effort to develop an AI model in Telugu has the potential to advance regional language technology and help preserve cultural heritage.
Officials involved in the project have told the media that voice commands on devices such as Alexa are not available in Telugu, and that this platform will pave the way for such innovations.