India has over 400 languages and a rich linguistic tapestry, but it faces the challenge of bridging a digital divide that is exacerbated by the dominance of English in LLMs. Perpetually hungry for data, large language models are trained extensively on online information. Non-English data is scarce online, however, while vast amounts of it sit offline, and OCR can make that offline material usable.
Optical Character Recognition (OCR), the process of transforming an image containing text into a machine-readable text format, digitises content into data that can be used for analytics, automation, training AI models and other processes. By extracting the text, OCR lets LLMs analyse and process it.
Here are a few OCR tools that can help developers train AI/ML models.
Best OCR Software with Machine Learning in 2024
Surya
Surya, a multilingual text line detection model designed for document OCR, has been trained on diverse documents, including scientific papers. The training ensures that Surya excels in detecting text lines within documents, delivering pinpoint accuracy in line-level bounding boxes and clear identification of column breaks in PDFs and images.
Bhashini
Bhashini, an app developed to help people translate content across different Indian languages, recently introduced an OCR feature called SCENE. The feature allows users to extract text by simply scanning an image with the camera. Bhashini was recently used by Prime Minister Narendra Modi to address students during ‘Pariksha Pe Charcha’.
Tesseract OCR
Tesseract OCR is an open-source OCR engine. It was first developed by Hewlett-Packard, and its development was later sponsored by Google. Tesseract supports Unicode (UTF-8), recognises more than 100 languages and can be integrated with LLMs to extract text from images. It also supports various image formats such as PNG, JPEG and TIFF.
PyTesseract
Python-tesseract is an optical character recognition (OCR) tool for Python: it recognises and reads the text embedded in images. It acts as a wrapper for Google’s Tesseract-OCR Engine.
It proves handy as a standalone execution script for Tesseract, capable of interpreting all image formats supported by the Pillow and Leptonica imaging libraries, such as jpeg, png, gif, bmp, tiff, among others. Furthermore, when employed as a script, Python-tesseract outputs the recognized text directly rather than storing it in a file.
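As a rough illustration of the wrapper in use, the sketch below calls `pytesseract.image_to_string` on a single image. The helper names `build_lang_arg` and `extract_text` are hypothetical, not part of the library. It assumes the Tesseract binary and the relevant language data are installed. Tesseract accepts several language codes joined with `+`, which is handy for mixed Hindi and English pages.

```python
from typing import Iterable


def build_lang_arg(codes: Iterable[str]) -> str:
    """Join Tesseract language codes with '+', e.g. ['hin', 'eng'] -> 'hin+eng'."""
    cleaned = [c.strip() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def extract_text(image_path: str, codes: Iterable[str] = ("eng",)) -> str:
    """Run Tesseract on one image and return the recognised text."""
    # Imported lazily so build_lang_arg works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=build_lang_arg(codes))
```

A call such as `extract_text("scan.png", ["hin", "eng"])` would return the page text as a string; `scan.png` is a placeholder file name.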
EasyOCR
EasyOCR is a Python package that provides a straightforward interface for performing OCR tasks. It is an open-source OCR engine that supports multiple languages and can be used with LLMs for text recognition and data extraction. It also offers pre-trained models for various use cases.
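A minimal sketch of that interface follows, assuming the `easyocr` package is installed. `Reader.readtext` returns a list of (bounding box, text, confidence) triples. The helpers `keep_confident` and `read_image`, and the 0.5 cut-off, are illustrative choices rather than anything EasyOCR prescribes.

```python
from typing import List, Sequence, Tuple

# EasyOCR's readtext() yields (bounding_box, text, confidence) triples.
Detection = Tuple[Sequence, str, float]


def keep_confident(detections: Sequence[Detection], min_conf: float = 0.5) -> List[str]:
    """Return the text of detections whose confidence is at least min_conf."""
    return [text for _bbox, text, conf in detections if conf >= min_conf]


def read_image(image_path: str, languages: Sequence[str] = ("hi", "en")) -> List[str]:
    """OCR one image with EasyOCR and keep only the confident detections."""
    # Lazy import: creating a Reader may download model weights on first use.
    import easyocr

    reader = easyocr.Reader(list(languages))
    return keep_confident(reader.readtext(image_path))
```

Dropping low-confidence detections is one simple way to keep noisy strings out of a training corpus; the right threshold depends on the documents.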
OpenCV
OpenCV (Open Source Computer Vision) is a collection of programming functions primarily focused on real-time computer vision tasks. While it may require more customisation, it can be used in conjunction with LLMs for OCR tasks.
In Python, OpenCV facilitates image processing by providing functions for tasks such as image resizing, pixel manipulation, object detection, and more.
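One common way to pair OpenCV with an OCR engine is to clean up the image first. The sketch below, under the assumption that `opencv-python` is installed, upscales a scan, converts it to grayscale and binarises it with Otsu's method. The function name `preprocess_for_ocr` and the default scale factor are hypothetical; which steps actually help depends on the source images.

```python
def preprocess_for_ocr(image_path: str, scale: float = 2.0):
    """Load an image, upscale it, convert to grayscale and binarise with Otsu."""
    import cv2  # lazy import so the module loads without OpenCV installed

    img = cv2.imread(image_path)
    if img is None:  # imread returns None instead of raising on failure
        raise FileNotFoundError(f"could not read image: {image_path}")

    # Enlarge small text; cubic interpolation keeps character edges smooth.
    img = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_CUBIC)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # With THRESH_OTSU the threshold is chosen automatically; the 0 is ignored.
    _thresh, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

The returned array is a single-channel black-and-white image that can be passed on to an OCR engine of choice.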
OCRopus
OCRopus is another open-source OCR engine that is designed for high accuracy and efficiency. It includes various preprocessing and post-processing techniques suitable for AI and ML applications. OCRopus commands typically display a stack trace alongside an error message, but this does not necessarily indicate a problem.
Kraken
Kraken is an OCR engine implemented in Python and optimised for historical and degraded document recognition. It can be used in AI and ML models for tasks involving challenging document images. Kraken can be run on Linux or Mac OS X (both x64 and ARM).
Resources