A great example of self-supervised learning is the way humans learn. We “learn from experience” by observing the world around us, whether through experimentation, observation or testing. In a recent interaction with Analytics India Magazine, Yann LeCun, widely regarded as the guru of self-supervised learning, explained why such methods are key to the future of artificial intelligence.
Giving us the context from a human equivalence, LeCun said that the average human processes an image in roughly 100 milliseconds, or about ten images per second. By the time humans are five years old, they have already seen about a billion frames. Interestingly, platforms like Google, Instagram and YouTube generate that many images in a matter of hours. “We have more data than we can use, but we don’t know how to use it,” said LeCun, in the backdrop of some of the challenges faced by self-supervised learning models.
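As a rough sanity check of that figure, the back-of-the-envelope arithmetic below lands in the same ballpark; the 16 waking hours a day is an assumption made for illustration, not a number from LeCun.

```python
# Back-of-the-envelope check of the "billion frames by age five" figure.
# The 16 waking hours a day is an illustrative assumption.
frames_per_second = 10
waking_seconds_per_day = 16 * 60 * 60
frames_by_age_five = frames_per_second * waking_seconds_per_day * 365 * 5
print(f"{frames_by_age_five:,}")  # ~1,051,200,000, i.e., about a billion frames
```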
Limitations of self-supervised learning
If you look at the evolution of self-supervised learning models, they initially came into the picture in response to the challenges posed by supervised learning, chief among them the need for labelled data, which is expensive and sometimes practically impossible to obtain at scale. As a result, from a purely pragmatic view of short-term applications, there is a huge push to deploy more powerful self-supervised learning models.
On the flip side, however, self-supervised models face a much bigger issue: they are often loaded with poor-quality data, and as the models scale, the chances of training on mislabelled data grow, leading to more bias and false outputs.
But LeCun says otherwise. He believes that the main issue is not the unavailability of data but how learning systems can take advantage of the data that is available. For example, the amount of exposure to language an infant needs in order to learn it is tiny compared to the billions of words or pieces of text that language models have to be exposed to in order to perform well.
Similarly, when it comes to games like chess or Go, which are designed to be difficult for humans, machines using reinforcement learning can do well. But achieving such a feat requires enormous amounts of data, equivalent to several human lifetimes of full-time play. In a nutshell, machines are not very efficient at using data. A good way to make progress here, according to LeCun, will be to discover new learning paradigms that allow machines to learn from less data.
LeCun, in a recent tweet, said that the impact of self-supervised models has been much larger than he had predicted. The success of models like ChatGPT and text-to-anything generation, along with the advances made in protein-folding models, attests sufficiently to that.
Problems galore
Self-supervised learning as an ideal only works for large corporations like Meta, which possess terabytes of data to train state-of-the-art models. Additionally, there are several challenges when it comes to self-supervised learning. First, as opposed to a supervised learning model, a self-supervised model minimises the human’s role in the process, which means there is a high chance that it will mislabel data, leading to errors in the output. Moreover, the costs of bad data have been hefty for businesses, with Gartner estimating that, on average, businesses lose nearly $9.7 million each year.
Contrary to LeCun’s claims, several researchers believe that we may run out of data. For instance, in analysing the growth of dataset sizes in machine learning, Villalobos et al. estimated that the total stock of unlabelled data could be exhausted within the coming decades. Their projections suggest that by 2026, we will be nearing the end of high-quality data, whereas low-quality data will run out sometime between 2030 and 2050. Thus, the growth of ML models might slow down unless data efficiency becomes a focus or new data sources are made available.
Likewise, Manu Joseph, creator of PyTorch Tabular, told AIM, “Collecting more data to train LLMs is a challenge since there is a shortage of good-quality text data, and most of the text on the internet is duplicated.”
However, the mountainous task of improving data efficiency is not yet a lost cause.
Solving one model at a time
Take, for instance, a recent study which showed that large language models (LLMs) can self-improve even with unlabelled datasets. Prior to this study, research indicated that fundamentally improving model performance beyond few-shot baselines still required fine-tuning on sizeable amounts of high-quality supervised data. According to the study, however, the models could enhance their performance on reasoning datasets by training on their own generated labels, given input questions only. The research also shows that an LLM can self-improve even on its own generated questions and few-shot Chain-of-Thought prompts.
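In rough pseudocode, such a self-improvement loop might look like the sketch below. The `model.sample` and `model.fine_tune` methods, the answer-extraction helper and the agreement threshold are hypothetical stand-ins for illustration, not the paper’s exact recipe, which builds its pseudo-labels through few-shot Chain-of-Thought prompting with self-consistency (majority voting).

```python
from collections import Counter

def self_improve(model, unlabeled_questions, cot_prompt, n_samples=32, min_agreement=0.6):
    """Fine-tune a model on its own high-confidence, self-generated answers."""
    pseudo_labeled = []
    for question in unlabeled_questions:
        # Sample several Chain-of-Thought reasoning paths for the same question.
        completions = model.sample(cot_prompt + question, n=n_samples)
        answers = [extract_final_answer(c) for c in completions]
        # Majority vote across the sampled paths (self-consistency).
        answer, votes = Counter(answers).most_common(1)[0]
        # Keep only questions where the paths largely agree: high-confidence pseudo-labels.
        if votes / n_samples >= min_agreement:
            rationales = [c for c, a in zip(completions, answers) if a == answer]
            pseudo_labeled.extend((question, r) for r in rationales)
    # Train on the model's own generations; no human labels involved.
    model.fine_tune(pseudo_labeled)
    return model

def extract_final_answer(completion: str) -> str:
    # Hypothetical helper: take whatever follows the last "The answer is" marker.
    return completion.rsplit("The answer is", 1)[-1].strip().rstrip(".")
```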
Further, the DeepMind research team recently released a paper showing that its Epistemic Neural Networks (ENNs) enable fine-tuning large models with 50% less data. Going beyond the traditional framework of Bayesian neural networks, the team introduced ENNs, designed around an epinet, which is “an architecture that can supplement any conventional neural network, including large pretrained models, and can be trained with modest incremental computation to estimate uncertainty.” The researchers claim that ENNs considerably improve the tradeoff between prediction quality and computation.
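A minimal PyTorch-style sketch of that idea is below. The toy base network, layer sizes and the way the features and the random index are combined are assumptions made for illustration; this is a sketch of the general epinet recipe, not DeepMind’s implementation.

```python
import torch
import torch.nn as nn

class ToyBaseNet(nn.Module):
    """Hypothetical stand-in for a conventional (possibly pretrained) network
    that exposes its penultimate features alongside its logits."""
    def __init__(self, in_dim: int, feature_dim: int, num_classes: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, feature_dim), nn.ReLU())
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        features = self.body(x)
        return features, self.head(features)

class EpinetClassifier(nn.Module):
    """Base network plus a small 'epinet' that adds an index-dependent correction."""
    def __init__(self, base_net: nn.Module, feature_dim: int, num_classes: int,
                 index_dim: int = 8, hidden: int = 64):
        super().__init__()
        self.base_net = base_net
        self.index_dim = index_dim
        self.epinet = nn.Sequential(
            nn.Linear(feature_dim + index_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        features, base_logits = self.base_net(x)
        # The epinet sees (stop-gradient) features together with a random index z;
        # different z values yield different plausible predictions.
        correction = self.epinet(torch.cat([features.detach(), z], dim=-1))
        return base_logits + correction

# Usage sketch: average over sampled indices for a prediction; the spread across
# indices is the model's estimate of what it does not know.
# model = EpinetClassifier(ToyBaseNet(32, 16, 10), feature_dim=16, num_classes=10)
# z = torch.randn(batch_size, model.index_dim)
# logits = model(x, z)
```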
An important issue with LLMs, they highlight, is that they cannot distinguish irreducible uncertainty over the next token, where the text is genuinely ambiguous, from uncertainty that stems from the model’s own lack of knowledge. So, the team takes a different approach, relying on the epinet’s uncertainty estimates to help models “know what they don’t know” and increase their data efficiency, in contrast with current approaches that generally require adding ever more training data.
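Continuing the sketch above, one way such uncertainty estimates could translate into data efficiency is to score candidate training examples by how much predictions disagree across sampled indices, and fine-tune on the most uncertain examples first. The variance-based scoring rule here is an illustrative choice, not the procedure from the paper.

```python
def epistemic_score(model: EpinetClassifier, x: torch.Tensor, n_indices: int = 10) -> torch.Tensor:
    """Score each example by how much predictions disagree across sampled indices."""
    with torch.no_grad():
        zs = torch.randn(n_indices, x.shape[0], model.index_dim)
        logits = torch.stack([model(x, z) for z in zs])  # (n_indices, batch, classes)
    # High variance across indices means the model "doesn't know"; low variance means it does.
    return logits.var(dim=0).mean(dim=-1)  # one score per example

# Usage sketch: fine-tune first on the examples the model is most uncertain about.
# scores = epistemic_score(model, candidate_batch)
# selected = candidate_batch[scores.topk(k=32).indices]
```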