A multilingual optical character recognition (OCR) model named Surya (meaning “Sun”) has been making a lot of noise on X (formerly Twitter), as well as on GitHub and Hugging Face, for all the right reasons. “Nice. This used to be really, really hard,” said Meta’s chief AI scientist Yann LeCun.
Surya, a multilingual text line detection model designed for document OCR, has been trained on a diverse set of documents, including scientific papers. This training allows Surya to excel at detecting text lines within documents, delivering accurate line-level bounding boxes and cleanly identifying column breaks in PDFs and images.
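For readers who want to try it, the project ships on PyPI as surya-ocr. The sketch below follows the usage pattern shown in the project’s README at the time of writing; the module and function names should be treated as assumptions and verified against the repo, since the API may have changed:

```python
from PIL import Image

# Install first: pip install surya-ocr
# NOTE: the imports below follow the README at the time of writing and are
# assumptions; verify against https://github.com/VikParuchuri/surya
from surya.detection import batch_detection
from surya.model.segformer import load_model, load_processor

image = Image.open("document_page.png")  # a scanned page or rendered PDF page
model, processor = load_model(), load_processor()

# One result per input image: line-level bounding boxes plus detected
# vertical lines, which is how column breaks are identified
predictions = batch_detection([image], model, processor)
print(predictions[0])  # inspect the result fields for the first page
```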
AIM got in touch with Vik Paruchuri, the creator of Surya, who highlighted the model’s versatility at length, asserting that it works seamlessly with every language tested. In India, its scope is huge.
“I think Surya has the potential to vastly increase how much of that data is available,” he said, referring to the dearth of Indian language datasets, especially for low-resource Indic languages whose content has largely never made it to the web.
“So this model can process texts in all Indian languages,” he added, noting that the impact on Indian LLMs would be substantial. That makes sense: Surya promises to be a valuable asset for detecting and recognising text in regional languages.
On the feedback and appreciation the model has gained, particularly from the likes of LeCun, Paruchuri said with a smile: “It’s just a huge honour, and I can’t get over the fact that he took a look at it.”
Inception of Surya
Paruchuri, a former US diplomat, shared his journey with AIM, and it is truly inspirational.
“I’m a self-taught data scientist, and started a company called Dataquest that teaches data science and AI, which I ran for a long time.” Reflecting on his shift in focus, he added, “I stepped back last year to focus more on building, to work on AI. I spent a lot of time learning about AI and started just training models and experimenting.”
Paruchuri reiterated the significance of data quality in model training, stating, “I realised that data is the biggest limiting factor for many of these models. The better your data, the better your model is going to be. But not enough people are working on the data quality issue.”
Committed to addressing this challenge, he decided to concentrate on training models in this domain, emphasising the importance of open-source data and tools. He said, “You need open-source data and tools to be able to train. I really, really don’t want that to be restricted to companies like OpenAI.”
Paruchuri, a passionate proponent of open source, left his company, which has reached a million people and helped them learn data science, to build something worthwhile.
“I didn’t know if I’d be able to build something useful or even if I could learn deep learning, so it’s been amazing actually to see people using my libraries and finding them useful,” said Paruchuri, who also hailed Hugging Face as a great company.
This is significant because Google’s Tesseract has long dominated the OCR space, and Surya could prove a better alternative.
Tech Stack
Paruchuri told AIM that he employed a modified version of NVIDIA’s SegFormer architecture, a transformer-based encoder-decoder, for text line detection.
“The text line detection model uses an architecture called SegFormer, which uses an encoder to encode the image and then a decoder to give you a mask that tells you where the text is,” explained Paruchuri, adding, “They all use PyTorch and Hugging Face transformers.”
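Surya’s own code isn’t reproduced here, but the encode-then-decode pattern Paruchuri describes maps directly onto the SegFormer classes in Hugging Face transformers. The sketch below is a generic illustration rather than Surya’s implementation; the nvidia/segformer-b0-finetuned-ade-512-512 checkpoint is a public stand-in, since Surya trains its own weights:

```python
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

# Illustrative public checkpoint; Surya trains its own weights from scratch
ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt)

image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    # Encoder embeds the image; the decoder head emits per-pixel class logits
    logits = model(**inputs).logits  # shape: (batch, num_labels, H/4, W/4)

# Upsample to the original resolution and argmax to get the segmentation mask
mask = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)[0]
```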
Paruchuri further explained that the model was trained on a mix of real and synthetic documents, incorporating about 2 million images with associated labels. The model was trained from scratch and has approximately 100 million parameters.
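As a side note, a figure like “approximately 100 million parameters” is typically measured directly from the model object. A minimal, generic PyTorch snippet (not Surya-specific) looks like this:

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    # Sum the element counts of all trainable weight tensors
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# A "~100M parameter" model would print roughly 100.0 here:
# print(f"{count_parameters(model) / 1e6:.1f}M")
```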
Paruchuri also mentioned that Surya’s performance was validated across various languages using the well-known DocLayNet dataset, demonstrating its effectiveness in diverse linguistic contexts.
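DocLayNet is a public benchmark of roughly 80,000 human-annotated document pages from IBM Research, and it is available on the Hugging Face Hub. Loading it for an evaluation run might look like this (a sketch; consult the dataset card for the exact fields and splits):

```python
from datasets import load_dataset

# DocLayNet: ~80k human-annotated document pages with layout annotations.
# Recent versions of the datasets library may require trust_remote_code=True.
dataset = load_dataset("ds4sd/DocLayNet")

print(dataset)                      # inspect the available splits
print(dataset["train"][0].keys())   # inspect fields (image, boxes, labels, ...)
```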
What’s next?
Paruchuri trained his model on NVIDIA’s RTX A6000 in the middle of the GPU crunch, and is of the opinion that “there’s a lot of spaces where you can train state-of-the-art 100-million-parameter models.”
Surya could be applied across document-heavy industries, whether to build better AI models or to extract data for downstream services.
He also mentioned that he would like to release his datasets eventually to aid this.
This, in turn, could aid the efforts of several Indian initiatives, such as Bhashini, AI4Bharat at IIT-Madras, Project Vaani, Sarvam AI’s OpenHathi series, and CoRover.ai’s BharatGPT, in building LLMs trained specifically on Indian languages.
India boasts over 400 languages and a rich linguistic tapestry but faces the challenge of bridging the digital divide exacerbated by the dominance of English in LLMs. Amitabh Nag, CEO of the Indian government’s Bhashini, has also said that despite the progress made by Bhashini, acquiring local datasets remains a formidable challenge.
Nag revealed that many of the 22 official Indian languages lack digital data, hindering AI model development. Surya can help bridge the gap, help India lead in the AI race, and potentially create millions of jobs in data labelling and annotation.