In 2018, Amazon built a machine learning (ML) algorithm for hiring new candidates. However, it was soon scrapped because of substantial gender bias. The fault lay not in the algorithm but in the data it was trained on.
The ML algorithm was trained on Amazon’s previous hiring data. Because the tech giant did not have a balanced ratio of men and women in its workforce, the algorithm became biased towards men. Systemic bias was rationalised and reinforced through ML.
Researchers from UC San Diego, UC Berkeley, and Webster Pacific investigated how the training data behind such algorithms is produced. Their review of human-labelled data exposed various faults in the data labelling process. The researchers analysed data labelling practices and provided recommendations to improve human labelling.
Supervised learning in ML trains a model on labelled examples so that it can produce outputs for new inputs based on that previous experience. Models trained on labelled data are only as good as the quality of that data. Most ML experts focus on the learning, research, and education that take place once a good standard of data has been achieved. Still, it is imperative to discuss the reliability of the data in the first place.
Scope of the study
The researchers examined 141 papers from the fields of medicine, biology, and the humanities. Of the papers sampled, 27 percent used machine-labelled data; 41 percent used human-labelled data; 27 percent used novel human-labelled datasets; and 5 percent failed to provide any information.
Findings of the study
Only half of the studies that used human labelling reported providing the labellers with documents and videos to aid them during data labelling. Additionally, the metrics used to measure whether annotators agreed or disagreed on particular labels varied widely.
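To see why the choice of agreement metric matters, here is a minimal sketch comparing two common ones: raw percent agreement and Cohen's kappa, which corrects for agreement expected by chance. The study is not quoted here as using either metric specifically; the function names and the annotator labels below are hypothetical and chosen only for illustration.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which the two annotators chose the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    observed = percent_agreement(labels_a, labels_b)
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement: probability that both annotators pick the same label
    # if each labelled independently according to their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators on the same ten items.
annotator_1 = ["spam", "spam", "ham", "ham", "spam", "ham", "ham", "ham", "spam", "ham"]
annotator_2 = ["spam", "ham", "ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

print(f"percent agreement: {percent_agreement(annotator_1, annotator_2):.2f}")
print(f"Cohen's kappa:     {cohen_kappa(annotator_1, annotator_2):.2f}")
```

On this toy data the annotators agree on 8 of 10 items, yet kappa comes out at about 0.58, because a good share of that agreement would be expected by chance. Two papers reporting "agreement" with different metrics are therefore not directly comparable.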
The study empirically investigates and discusses a wide range of issues and concerns around the production, labelling, and use of training data. Curating high-quality datasets for ML models involves skill, expertise, and care, especially when humans label items individually.
Results can be misleading when datasets are assumed to be gold standard but are not. Supervised ML models are typically evaluated using held-out data from the original dataset. Because the held-out data shares any flaw in the original labels, such an evaluation cannot reveal that flaw, and the result can be disastrous. The consequences are grave when such algorithms are used for subjective decision making in areas like hiring, justice, and loan processing.
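The point about held-out evaluation can be illustrated with a small simulation. This is a sketch under stated assumptions, not the study's method: the one-dimensional task, the 0.3 cut-off that stands in for a systematic labelling error, and the simple threshold classifier are all hypothetical.

```python
import random

random.seed(0)

# Hypothetical 1-D task: the true class is 1 when x > 0, else 0.
xs = [random.uniform(-1, 1) for _ in range(1000)]
true_labels = [int(x > 0) for x in xs]

# Assumed flaw: labellers systematically apply a stricter rule (x > 0.3),
# so every item with 0 < x <= 0.3 is mislabelled as 0.
human_labels = [int(x > 0.3) for x in xs]

# Training and held-out sets both carry the same flawed labels.
train_x, test_x = xs[:700], xs[700:]
train_y, test_y = human_labels[:700], human_labels[700:]
test_true = true_labels[700:]


def accuracy(threshold, x_values, y_values):
    """Accuracy of the rule 'predict 1 when x > threshold'."""
    hits = sum(int(x > threshold) == y for x, y in zip(x_values, y_values))
    return hits / len(x_values)


def fit_threshold(x_values, y_values):
    """Pick the training point that works best as a cut-off."""
    return max(sorted(x_values), key=lambda t: accuracy(t, x_values, y_values))


threshold = fit_threshold(train_x, train_y)
held_out_acc = accuracy(threshold, test_x, test_y)   # scored on flawed labels
true_acc = accuracy(threshold, test_x, test_true)    # scored on correct labels

print(f"learned threshold:                {threshold:.3f}")
print(f"accuracy on held-out human labels: {held_out_acc:.3f}")
print(f"accuracy against true labels:      {true_acc:.3f}")
```

In this setup the held-out score should come out close to perfect, while the score against the true labels is noticeably lower, roughly in line with the share of items that fall in the mislabelled band. The model has faithfully learned the labellers' mistake, and the standard evaluation rewards it for doing so.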
Best practices
In the social sciences, most such data is produced through structured content analysis, a methodology that converts qualitative, unstructured data into categorical or quantitative data. The method requires human annotators and labellers.
Human labellers should be well paid and given aids and training to help them navigate the monotonous work. Structured content analysis is a domain-specific task. It requires labellers to have domain knowledge, and it requires domain-independent expertise in managing a team of labellers.
The rise of crowdsourcing culture, with platforms like Amazon Mechanical Turk, has led to a decrease in the accuracy of data labelling. However, accuracy can be improved by using machine-labelled data or by providing training before human labellers are hired to perform the task.
Limitations
Even with the best of practices, some challenges in data labelling remain. To begin with, labellers who lack domain-specific expertise might misinterpret or miss critical details when examining datasets. Additionally, the reliability and reproducibility of the data are a concern. Furthermore, it is difficult to get a medium-sized team to build a consensus around reducing complex objects to quantifiable data. Moreover, one can hypothesise an inverse relationship between a field’s overall adherence to methodological best practices and the rate at which its researchers report their commitment to those practices. This can happen when such practices become so routine and mundane that they go unmentioned, leaving an implicit bias in publications.