Is Data Annotation Dying?

AI is coming for data annotation jobs. How ironic.


Published on June 25, 2024

by Mohit Pandey

Voxel51 co-founder and University of Michigan professor of robotics Jason Corso recently published an article arguing that data annotation is dead, which sparked discussions on LinkedIn and X alike.

One might have assumed that the wave of generative AI would make data annotation jobs even more abundant. Instead, that same wave is the reason these jobs are slowly becoming obsolete.

Industry leaders specialising in computer vision solutions stress that despite advancements in AI, meticulously curated, high-quality image-annotated datasets remain essential. These datasets are critical as operations scale and diversify, and the notion that untested technologies could disrupt established workflows is not only impractical but potentially harmful.

Moreover, human-created datasets are proving even more relevant in fields beyond computer vision, extending to generative AI and multimodal workflows. There have been several reports about companies such as OpenAI, Amazon, and Google hiring low-cost labour in countries such as India and Kenya to label and annotate data for training AI models.

In India, companies such as NextWealth, Karya, Appen, Scale AI, and Labelbox are creating jobs within the country, specifically in rural areas, for data annotation. When speaking with AIM, NextWealth MD and founder Sridhar Mitta said, “The beauty of GenAI is that it allows people from remote areas to do these tasks.”

So, are these companies about to slowly die?

Not So Dead After All

Human annotation has played a pivotal role in the AI boom, providing the foundation for supervised machine learning. The process involves manually labelling raw data to train machine learning algorithms in pattern recognition and predictive tasks.

While labour-intensive, this approach ensures the creation of reliable and accurate datasets. It turns out that the need for human-generated datasets is more crucial now than ever.

The only possible disruption comes from weak supervision and self-supervised learning. Tilmann Bruckhaus, an engineering leader in AI, said, “These techniques reduce the need for manual labelling by using noisy or automatically-generated labels (weak supervision) or enabling models to learn from unlabeled data (self-supervision).”
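To make the idea of weak supervision concrete, here is a minimal Python sketch, not a description of any tool mentioned in this article. It assumes a toy text-labelling task in which a few hand-written heuristics (hypothetical "labelling functions") each cast a noisy vote, and a simple majority vote produces the label. The function names, keywords, and label names are all invented for illustration.

```python
from collections import Counter

ABSTAIN = None  # a labelling function returns this when it has no opinion


# Hypothetical heuristics: each one is cheap to write and individually noisy.
def lf_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_broken(text):
    return "complaint" if "broken" in text.lower() else ABSTAIN


def lf_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN


LABELLING_FUNCTIONS = [lf_refund, lf_broken, lf_thanks]


def weak_label(text, labelling_functions=LABELLING_FUNCTIONS):
    """Combine noisy votes by majority; abstain on ties or when nothing fires."""
    votes = [lf(text) for lf in labelling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence, leave unlabelled
    return ranked[0][0]


if __name__ == "__main__":
    for sample in ["I want a refund", "Thank you so much", "Hello there"]:
        print(sample, "->", weak_label(sample))
```

Items on which the heuristics abstain or disagree stay unlabelled, and in a real pipeline those are the natural candidates to hand to human annotators.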

Corso believes that human annotation will be needed for gold-standard evaluation datasets, which will also be combined with auto-annotation in the future.

Auto-annotation uses AI models to label data automatically. While this approach shows promise, its applicability is limited. It is most useful when a model’s performance needs to be adapted to new environments or specific tasks. For general applications, however, relying on auto-annotation remains impractical.
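One common way to combine auto-annotation with human review can be sketched as follows. This is an assumption for illustration, not a workflow described by Corso or anyone else quoted here: model predictions above a confidence threshold are accepted as labels, and the rest are queued for human annotators. The `Prediction` class, the `route_predictions` function, and the 0.9 threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's score in [0, 1]


def route_predictions(predictions, threshold=0.9):
    """Split model output into auto-accepted labels and a human review queue."""
    auto_labelled, needs_review = [], []
    for pred in predictions:
        if pred.confidence >= threshold:
            auto_labelled.append(pred)
        else:
            needs_review.append(pred)
    return auto_labelled, needs_review


if __name__ == "__main__":
    batch = [
        Prediction("img_001", "car", 0.97),
        Prediction("img_002", "truck", 0.62),
        Prediction("img_003", "car", 0.91),
    ]
    accepted, review = route_predictions(batch)
    print("auto-labelled:", [p.item_id for p in accepted])
    print("sent to humans:", [p.item_id for p in review])
```

Lowering the threshold shifts more work to the model and raising it shifts more to people, which is one way to picture the trade-off the article describes.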

Adding to this, current AI models increasingly rely on synthetic data. SkyEngine AI CEO Bartek Włodarczyk said that with synthetic data “one does not have to worry about data labelling as any masks can be instantly created along with data.”
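The point about masks arriving with the data can be illustrated with a toy Python sketch. It is not SkyEngine AI's method; it only assumes a generator that draws a bright rectangle on a blank grid. Because the generator decides where the object goes, the segmentation mask and bounding box come for free. The function name and parameters are hypothetical.

```python
import random


def make_synthetic_sample(width=8, height=8, seed=0):
    """Draw one random rectangle and return (image, mask, bounding_box)."""
    rng = random.Random(seed)
    box_w = rng.randint(2, width // 2)
    box_h = rng.randint(2, height // 2)
    x0 = rng.randint(0, width - box_w)
    y0 = rng.randint(0, height - box_h)

    image = [[0] * width for _ in range(height)]  # background pixels are 0
    mask = [[0] * width for _ in range(height)]   # 1 marks object pixels

    for y in range(y0, y0 + box_h):
        for x in range(x0, x0 + box_w):
            image[y][x] = rng.randint(128, 255)  # object pixel intensity
            mask[y][x] = 1                       # label written at the same time

    return image, mask, (x0, y0, box_w, box_h)


if __name__ == "__main__":
    image, mask, box = make_synthetic_sample(seed=42)
    print("bounding box (x, y, w, h):", box)
    for row in mask:
        print("".join(str(v) for v in row))
```

No annotator ever looks at the image here: the label is a by-product of generation, which is the property Włodarczyk is pointing to.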

Dangerous Times Ahead?

Although human annotation is likely to remain the gold standard, companies that fail to adapt to the current boom will face dangerous times ahead. People For AI founder and data labelling director Matthieu Warnier said, “As labelling tasks become more automated, the ones that remain are notably more complex. Selecting the right labelling partner has become even more crucial.”

Hugging Face co-founder and CSO Thomas Wolf echoed this view. “It’s much easier to quickly spin and iterate on a pay-by-usage API than to hire and manage annotators. With model performance strongly improving and the privacy guarantee of open models, it will be harder and harder to justify making complex annotation contracts,” said Wolf, adding that these will be dangerous times for data annotation companies.

Manual data annotation may take a backseat in labelling data for AI training, given that models such as YOLOv8 or the Segment Anything Model (SAM) offered through Unitlab can annotate almost anything without human intervention.

Manual data annotation will remain a premium service, but its volume is expected to drop. Companies that rely on workers in different parts of the world to create high-quality datasets will soon have to cut costs.

So, the data annotation market might see major shifts as it adapts to the changing landscape. While the market is set to shrink, manual data annotation companies will be the ones that set the gold standard, making themselves the benchmark for the automated data annotation market.
