AI4Bharat Rolls Out IndicLLMSuite for Building LLMs in Indian Languages

It covers 22 languages with 251 billion tokens and 74.8 million instruction-response pairs. 

Even though India’s contribution to Indic LLMs has skyrocketed in the last year, the lack of open-source pipelines for low- and mid-resource languages hinders the representation of those languages in LLM training datasets.

To address this, AI4Bharat has created IndicLLMSuite, a collection of resources for Indic LLMs covering 22 languages with 251 billion tokens and 74.8 million instruction-response pairs. Let’s take a look at some of the kit’s key resources. 

Sangraha

This is the pre-training dataset. It contains 251B tokens in total across 22 languages, extracted from curated URLs, existing multilingual corpora, and large-scale translations.

Setu

This is a Spark-based distributed pipeline customized for Indian languages for extracting content from websites, PDFs, and videos. It has in-built stages for cleaning, filtering, toxicity removal, and deduplication.
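The four stages named above can be illustrated with a minimal single-machine sketch. This is not Setu's code and does not use Spark; the function names, the `BLOCKLIST` word list, and the `MIN_WORDS` threshold are all hypothetical, and exact-hash matching stands in for whatever deduplication method the real pipeline uses.

```python
import hashlib
import re
import unicodedata

# Hypothetical blocklist; a real pipeline would need per-language resources.
BLOCKLIST = {"badword"}
# Hypothetical minimum document length, in whitespace-separated words.
MIN_WORDS = 5


def clean(text: str) -> str:
    """Cleaning: normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def passes_filter(text: str) -> bool:
    """Filtering: keep documents with at least MIN_WORDS words."""
    return len(text.split()) >= MIN_WORDS


def is_toxic(text: str) -> bool:
    """Toxicity check: flag a document if any word is in the blocklist."""
    return any(word.lower() in BLOCKLIST for word in text.split())


def run_pipeline(docs):
    """Run cleaning, filtering, toxicity removal, and exact deduplication."""
    seen = set()
    kept = []
    for raw in docs:
        text = clean(raw)
        if not passes_filter(text) or is_toxic(text):
            continue
        # Deduplication: skip documents whose cleaned text was already seen.
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept
```

Calling `run_pipeline` on a list of raw strings returns the cleaned documents that survive every stage, in their original order. In a distributed setting each stage would typically become a transformation over a partitioned dataset rather than a Python loop.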

IndicAlign-Instruct

It offers a varied set of 74.7 million prompt-response pairs in 20 languages. 

These pairs are gathered using four methods: compiling existing Instruction Fine-Tuning (IFT) datasets, translating English datasets into 14 Indian languages with an open-source translation model, generating discussions from India-centric Wikipedia articles using open-source LLMs, and collecting prompts through a crowdsourcing platform named Anudesh. The team has also introduced a new IFT dataset to teach the model language and grammar, drawing on IndoWordNet, a rich lexical resource.

IndicAlign – Toxic

Finally, we have IndicAlign – Toxic, which consists of 123K pairs of toxic prompts and non-toxic responses generated using open-source English LLMs and translated to 14 Indian languages for safety alignment of Indic LLMs.
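The 74.8 million headline figure and the 74.7 million IndicAlign-Instruct figure look inconsistent at first glance. A quick check, assuming the headline number is the sum of the two IndicAlign subsets (the article does not state this explicitly), shows that they agree after rounding:

```python
# Figures as reported in this article.
instruct_pairs = 74_700_000  # IndicAlign-Instruct
toxic_pairs = 123_000        # IndicAlign-Toxic

# Assumption: the headline count is the sum of the two subsets.
total_pairs = instruct_pairs + toxic_pairs
total_millions = round(total_pairs / 1_000_000, 1)

print(f"{total_pairs:,} pairs, about {total_millions} million")
# prints: 74,823,000 pairs, about 74.8 million
```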

You can access the data and code here.

Earlier this month, Sarvam AI, along with AI4Bharat and IIT Madras, unveiled IndicVoices, a comprehensive speech dataset adhering to an inclusive diversity wishlist with fair representation of demographics, domains, languages, and applications. The IndicVoices dataset comprises 7,348 hours of natural and spontaneous speech from 16,237 speakers across 145 Indian districts and 22 languages.

Shritama Saha

Shritama (she/her) is a technology journalist at AIM who is passionate about exploring generative AI, with a special focus on big tech, databases, healthcare, DE&I, hiring in tech and more.