
The Brain Behind the Much-Needed Multilingual Text Line Detection Model

A self-taught AI enthusiast and developer, Vik Paruchuri, believes his OCR model, Surya, will help create low-resource Indic language datasets and models.


A multilingual optical character recognition (OCR) model that goes by the name Surya (meaning ‘Sun’) has been making a lot of noise on X (formerly Twitter), as well as on GitHub and Hugging Face, for all the right reasons. “Nice. This used to be really, really hard,” said Meta AI chief Yann LeCun.

Surya, a multilingual text line detection model designed for document OCR, has been trained on diverse documents, including scientific papers. This training ensures that Surya excels at detecting text lines within documents, delivering accurate line-level bounding boxes and clearly identifying column breaks in PDFs and images.

AIM got in touch with Vik Paruchuri, the creator of Surya, who highlighted the model’s versatility at length, asserting that it works seamlessly with every language tested. In India, its scope is huge. 

“I think Surya has the potential to vastly increase how much of that data is available,” he said, referring to the dearth of Indian language datasets, especially low-resource Indic languages for which the majority of content is not on the web.

“So this model can protect texts in all Indian languages,” he added, saying that the impact on Indian LLMs would be substantial. That makes sense, as Surya promises to be a valuable asset for detecting and recognising text in regional languages.

On the feedback and appreciation, particularly from the likes of LeCun, Paruchuri said with a smile: “It’s just a huge honour, and I can’t get over the fact that he took a look at it.”

Inception of Surya

Paruchuri, a former US diplomat, shared his journey with AIM. It is truly inspirational.

“I’m a self-taught data scientist, and started a company called Dataquest that teaches data science and AI, which I ran for a long time.” Reflecting on his shift in focus, he added, “I stepped back last year to focus more on building, to work on AI. I spent a lot of time learning about AI and started just training models and experimenting.”

Paruchuri reiterated the significance of data quality in model training, stating, “I realised that data is the biggest limiting factor for many of these models. The better your data, the better your model is going to be. But not enough people are working on the data quality issue.” 

Committed to addressing this challenge, he decided to concentrate on training models in this domain, emphasising the importance of open-source data and tools. He said, “You need open-source data and tools to be able to train. I really, really don’t want that to be restricted to companies like OpenAI.”

Paruchuri, a passionate proponent of open source, left his company, which has reached a million people and helped them learn data science, to build something worthwhile.

“I didn’t know if I’d be able to build something useful or even if I could learn deep learning, so it’s been amazing actually to see people using my libraries and finding them useful,” said Paruchuri, who also hailed Hugging Face as a great company.

This is significant because Google’s Tesseract has long dominated OCR; Surya could be a better alternative.

Tech Stack 

Paruchuri told AIM that he employed a modified SegFormer architecture from NVIDIA for text line detection, a transformer-based approach in which an encoder and a decoder work together to turn an image into text locations.

“The text line detection model uses an architecture called SegFormer, which uses an encoder to encode the image and then a decoder to give you a mask that tells you where the text is,” explained Paruchuri, adding, “They all use PyTorch and Hugging Face transformers.”
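
To make that description concrete, here is a minimal, hypothetical sketch of the encoder-decoder segmentation idea using Hugging Face’s stock SegFormer with a two-class head (text vs. background). It is an illustration only: Surya’s actual model is a modified SegFormer trained from scratch on documents, whereas the “nvidia/mit-b0” checkpoint below merely provides a generic pretrained encoder.

```python
# A minimal sketch of the encoder-decoder segmentation idea described above,
# assuming a two-class head (text vs. background). Surya itself is a modified
# SegFormer trained from scratch on documents; the "nvidia/mit-b0" checkpoint
# used here only supplies a generic pretrained encoder, so the decode head
# below is untrained and the output is illustrative, not a working detector.
import torch
from PIL import Image
from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor

processor = SegformerImageProcessor()
model = SegformerForSemanticSegmentation.from_pretrained("nvidia/mit-b0", num_labels=2)
model.eval()

image = Image.open("page.png").convert("RGB")  # hypothetical scanned document page
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, 2, H/4, W/4): per-pixel class scores

# Upsample to the original page size and take the per-pixel class:
# 1 = "text", 0 = "background".
mask = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)[0]
print(mask.shape)
```

In a real pipeline, the line-level bounding boxes and column breaks mentioned earlier would then be derived from connected regions of such a text mask.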

Paruchuri further explained that the model underwent training on a mix of real and synthetic documents, incorporating about 2 million images with associated labels. The model was trained from scratch and has approximately 100 million parameters.

Paruchuri also mentioned that Surya’s performance was validated across various languages using the well-known DocLayNet dataset, demonstrating its effectiveness in diverse linguistic contexts.
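
As a rough picture of what such validation involves, the sketch below scores predicted line boxes against ground-truth boxes by intersection-over-union (IoU) and reports precision and recall. It is a generic, assumed evaluation recipe with made-up boxes and threshold, not the benchmark code that ships with Surya.

```python
# A minimal, generic sketch of IoU-based scoring for line detection, of the
# kind used with labelled sets such as DocLayNet. This is an assumed recipe
# for illustration, not the benchmark script that ships with Surya.
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(pred: List[Box], gt: List[Box], thresh: float = 0.5):
    """Greedy one-to-one matching: a prediction counts as a true positive if
    its best-overlapping, not-yet-matched ground-truth box reaches `thresh`."""
    matched: set = set()
    tp = 0
    for p in pred:
        best: Optional[int] = None
        best_iou = 0.0
        for i, g in enumerate(gt):
            score = iou(p, g)
            if i not in matched and score > best_iou:
                best, best_iou = i, score
        if best is not None and best_iou >= thresh:
            matched.add(best)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    return precision, recall


# Made-up boxes for a single page, purely to show the call:
print(precision_recall(
    [(10, 10, 200, 30), (10, 40, 200, 60)],
    [(12, 9, 198, 31), (10, 40, 200, 62), (10, 70, 200, 90)],
))
```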

What’s next? 

Paruchuri trained his model on an NVIDIA RTX A6000 in the middle of the GPU crunch, and is of the opinion that “there are a lot of spaces where you can train state-of-the-art 100-million-parameter models.”

Surya could be applied across document-heavy industries, either to help design better AI models or to extract data that feeds into services.

He also mentioned that he would like to release his datasets eventually to aid this. 

This, in turn, could aid the efforts of several Indian initiatives, such as Bhashini, AI4Bharat at IIT Madras, Project Vaani, Sarvam AI’s OpenHathi series, and CoRover.ai’s BharatGPT, in building LLMs trained specifically on Indian languages.

India boasts over 400 languages and a rich linguistic tapestry but faces the challenge of bridging the digital divide exacerbated by the dominance of English in LLMs. Amitabh Nag, CEO of the Indian government’s Bhashini, has also said that despite the progress made by Bhashini, acquiring local datasets remains a formidable challenge. 

Nag revealed that many of the 22 official Indian languages lack digital data, hindering AI model development. Surya can help bridge the gap, help India lead in the AI race, and potentially create millions of jobs in data labelling and annotation.  



Shyam Nandan Upadhyay

Shyam is a tech journalist with expertise in policy and politics, and exhibits a fervent interest in scrutinising the convergence of AI and analytics in society. In his leisure time, he indulges in anime binges and mountain hikes.