Garbage In, Garbage Out: The Problem Of Data Labeling

In 2018, Amazon scrapped a machine learning (ML) algorithm it had built to screen job candidates, after the tool showed substantial gender bias. The fault lay not in the algorithm but in the data it was trained on.

The ML algorithm was trained on Amazon’s historical hiring data. Because the tech giant’s past hires were predominantly men, the algorithm learned to favour male candidates: systematic bias was rationalised and reinforced through ML.

Researchers from UC San Diego, UC Berkeley, and Webster Pacific investigated how such algorithms are trained. Their examination of human-labelled data exposed various faults in the data labelling process; the researchers analysed prevailing data labelling practices and provided recommendations to improve human labelling.

Supervised learning allows an ML model to learn from labelled examples and produce outputs for new inputs based on that previous experience. Models trained on labelled data are only as good as the quality of that data. Most ML research, education, and practice focuses on what happens once a good standard of data has been achieved; the reliability of the data in the first place deserves just as much discussion.
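To make the point concrete, here is a minimal supervised-learning sketch (illustrative only, not code from the study, with hypothetical toy data): the classifier treats its human-assigned labels as ground truth, so whatever errors those labels contain are learned and reproduced.

```python
# Minimal supervised-learning sketch: a classifier is fit on human-labelled
# examples, so any error in those labels is learned and reproduced at
# prediction time. Data below is hypothetical.
from sklearn.linear_model import LogisticRegression

X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]  # toy features
y_train = [1, 1, 0, 0]  # human-assigned labels, taken as ground truth

model = LogisticRegression().fit(X_train, y_train)
print(model.predict([[0.85, 0.15]]))  # output quality is bounded by label quality
```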

Scope of the study

The researchers examined 141 papers from fields including medicine, biology, and the humanities. Of this sample, 27 percent used machine-labelled data, 41 percent used previously existing human-labelled datasets, 27 percent used novel human-labelled datasets, and 5 percent failed to provide any information.

Findings of the study 

Only half of the studies that used human labelling reported giving the labellers supporting documents or videos to aid the labelling work. Additionally, the metrics used to measure whether annotators agreed or disagreed on particular labels varied widely from paper to paper.
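One widely used agreement metric is Cohen’s kappa, which corrects raw percent agreement for agreement expected by chance. The sketch below, with hypothetical labels from two annotators, shows how it is computed; the surveyed papers did not standardise on this or any single metric.

```python
# Sketch of one common inter-annotator agreement metric, Cohen's kappa.
# Labels here are hypothetical.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["spam", "ham", "spam", "spam", "ham", "ham"]
annotator_b = ["spam", "ham", "ham",  "spam", "ham", "spam"]

# kappa corrects raw agreement for chance: 1.0 = perfect agreement,
# 0.0 = chance-level, negative = worse than chance.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```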

The study empirically investigates and discusses a wide range of issues and concerns around the production, labelling, and use of training data. Curating high-quality datasets for ML models demands skill, expertise, and care, especially when humans individually label items.

Results can be false when datasets assumed to be gold standard are not. Supervised ML models are typically evaluated on held-out data from the original dataset, so any flaw in that dataset contaminates both training and evaluation: the model looks accurate against test labels that share its errors. The consequences are grave when such algorithms drive subjective decision-making in hiring, justice, and loan processing.
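The sketch below illustrates this failure mode on synthetic data with a hypothetical systematic labelling error: the model scores well against the flawed held-out labels while performing worse against the ground truth.

```python
# Sketch of how a held-out split inherits labelling flaws (synthetic data,
# hypothetical flaw): test accuracy looks fine because the test labels
# share the same systematic error as the training labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y_true = (X[:, 0] > 0).astype(int)      # ground truth
y_flawed = y_true.copy()
y_flawed[X[:, 1] > 1.0] = 0             # systematic labelling error on a subgroup

X_tr, X_te, y_tr, y_te, yt_tr, yt_te = train_test_split(
    X, y_flawed, y_true, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
print("accuracy vs flawed test labels:", model.score(X_te, y_te))   # looks good
print("accuracy vs true labels:      ", model.score(X_te, yt_te))  # reveals the flaw
```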

Best practices

Much of the data in the social sciences is produced through structured content analysis, a methodology that converts qualitative, unstructured data into categorical or quantitative data. The method requires human annotators and labellers.

Human labellers should be paid well and given aids and training to help them navigate the monotonous work. Structured content analysis is a domain-specific task: it requires labellers with domain knowledge, as well as domain-independent expertise to manage a team of labellers.

The rise of crowdsourcing platforms such as Amazon Mechanical Turk has led to a decrease in the accuracy of data labelling. Accuracy can be improved, however, by cross-checking against machine-labelled data or by training and screening human labellers before hiring them for the task.
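One common screening approach, sketched below with hypothetical worker names and a hypothetical pass threshold, is to test candidate labellers on items with known gold labels and hire only those who clear the bar.

```python
# Sketch of a crowdsourcing quality-control step (hypothetical names and
# threshold): candidates are screened on items with known gold labels
# before being hired for the full labelling task.
GOLD = {"item1": "cat", "item2": "dog", "item3": "cat", "item4": "dog"}

def screening_accuracy(worker_answers: dict[str, str]) -> float:
    """Fraction of gold-labelled screening items the worker got right."""
    correct = sum(worker_answers.get(item) == label for item, label in GOLD.items())
    return correct / len(GOLD)

candidates = {
    "worker_a": {"item1": "cat", "item2": "dog", "item3": "cat", "item4": "cat"},
    "worker_b": {"item1": "dog", "item2": "dog", "item3": "dog", "item4": "cat"},
}

THRESHOLD = 0.75  # hypothetical pass mark
hired = [w for w, answers in candidates.items()
         if screening_accuracy(answers) >= THRESHOLD]
print(hired)  # ['worker_a']
```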

Limitations

Even with the best of practices, challenges remain in data labelling. To begin with, labellers who lack domain-specific expertise might misinterpret or miss critical details when examining datasets. The reliability and reproducibility of the data are also concerns. Furthermore, it is difficult to get a medium-sized team to build consensus around reducing complex objects to quantifiable data. Finally, one can hypothesise an inverse relationship between a field’s overall adherence to methodological best practices and the rate at which researchers report those practices: once such practices become routine and mundane, they go unreported, leaving an implicit bias in publications.
