LLMs have demonstrated remarkable capabilities, from generating coherent and contextually relevant text to solving complex mathematical problems. However, these models also exhibit puzzling inconsistencies, such as struggling with seemingly simple tasks.
This phenomenon has led to the concept of what Andrej Karpathy calls “Jagged Intelligence,” a term that captures the uneven performance of LLMs across different types of tasks.
The Paradox of LLM Performance
To illustrate this, Karpathy recently gave an example in which he asked an LLM-based chatbot to compare two numbers, 9.11 and 9.9, and determine which was larger. Despite the simplicity of the task, the model provided an incorrect answer.
Further illustrating these challenges, Karpathy said that even OpenAI’s GPT-4o failed to provide the correct answer to such simple comparisons approximately one-third of the time.
Similarly, Anthropic’s Claude struggled with this task in all three attempts, highlighting a notable inconsistency in its performance.
Such failures are not isolated. Noam Brown, a research engineer at OpenAI, experimented with LLMs by having them play basic games like tic-tac-toe, and the results were dismal: the models performed poorly even at this simple game.
Another developer tasked GPT-4 with solving tiny Sudoku puzzles, and here too the model struggled, often failing to solve these seemingly simple grids.
This pattern of inconsistent performance across different tasks underscores the current limitations of LLMs, despite their impressive achievements in more complex areas.
Intelligently Dumb
It’s concerning that while our most advanced models can win a silver medal in a Math Olympiad, they can also fail to answer a simple question like “which number is bigger, 9.11 or 9.9?”
This inconsistency might seem baffling, but NVIDIA’s Senior Research Manager Jim Fan has an explanation: training data distribution.
Models like AlphaProof and AlphaGeometry-2 are specifically trained on formal proofs and domain-specific symbolic data. This specialised training makes them experts at solving Olympiad problems, even though they are based on general-purpose LLMs.
GPT-4o, by contrast, is trained on a broad mix of data, including large amounts of code from GitHub, and that mix likely contains far more code than maths. In software versioning, a higher version number like “v9.11” is considered greater than “v9.9”, and this convention may have confused the model when it compared the plain decimal numbers, causing it to make a mistake.
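The difference is easy to see in plain Python. The short sketch below (no LLM involved) contrasts ordinary numeric comparison with the part-by-part comparison used for version strings, the convention Fan suggests the model may have absorbed; the `version_key` helper is just an illustration.

```python
# Numeric comparison: 9.11 is smaller than 9.9
print(9.11 > 9.9)  # False

# Version-style comparison: split on "." and compare the parts as integers,
# so "v9.11" (minor version 11) outranks "v9.9" (minor version 9).
def version_key(v: str) -> tuple:
    return tuple(int(part) for part in v.lstrip("v").split("."))

print(version_key("v9.11") > version_key("v9.9"))  # True
```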
Can’t Blame the LLMs Alone
A study by Google DeepMind found that LLMs lack genuine understanding and, as a result, cannot reliably self-correct or adjust their responses on command.
A recent example from a developer illustrates this limitation: he asked an LLM for a piece of code using a specific framework (Framework X) to accomplish a particular task (Task Y).
Unaware that the requested task was not feasible with the given framework, the LLM initially provided code that functioned correctly but did not achieve the desired outcome.
When the user pointed out the discrepancy, the LLM generated code that was syntactically correct but inherently non-functional. This back-and-forth repeated twice, with the LLM failing to recognise that its solutions were incorrect.
AI-Clever Charlatans
A very important point to ponder is that LLMs are created by humans. As Karpathy himself pointed out, all of the training data in the final, post-training stage is of the form [question -> authoritative-sounding solution], where the solutions are written by humans. The LLMs simply imitate the form and style of that training data.
Since an LLM relies on imitating the style of its training data, it can sometimes generate answers that sound authoritative but are factually incorrect or misleading.
When Karpathy asked the LLM to compare two numbers, the model provided an incorrect answer. However, when the same prompt was modified with the instruction to “think step by step before answering,” the LLM was able to arrive at the correct answer.
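To make the contrast concrete, here is a minimal sketch of the two prompts, assuming the openai Python SDK (v1+) and an OpenAI-compatible endpoint; the model name and prompt wording are illustrative, not a reproduction of Karpathy’s exact session.

```python
# A minimal sketch of the two prompts, assuming the openai Python SDK (v1+).
# The model name and prompt wording are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Direct question, as in Karpathy's example
print(ask("Which number is bigger, 9.11 or 9.9?"))

# The same question with an explicit instruction to reason first
print(ask("Think step by step before answering: which number is bigger, 9.11 or 9.9?"))
```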
What’s the Solution?
Recently, OpenAI released a research paper on Prover-Verifier Games (PVG) for LLMs, which could help address this problem. The setup involves two distinct roles: the Prover, which generates solutions, and the Verifier, which checks the validity of those solutions.
Implementing such a system could significantly improve the consistency, accuracy, and trustworthiness of LLMs.
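As a rough illustration of the idea (not OpenAI’s implementation), the sketch below wires a hypothetical prover and verifier into a simple propose-and-check loop; in the paper both roles are trained models, whereas here they are stand-in functions.

```python
# A toy propose-and-check loop in the spirit of Prover-Verifier Games.
# The prover and verifier here are hypothetical stand-ins, not OpenAI's
# trained models: the prover proposes an answer, the verifier checks it,
# and only verified answers are returned.
def prover(question: str) -> str:
    # In the real setup this is an LLM generating a candidate solution.
    return "9.9"

def verifier(question: str, candidate: str) -> bool:
    # In the real setup this is a smaller model judging the solution;
    # here we simply check the arithmetic directly.
    return candidate == str(max(9.11, 9.9))

def answer(question: str, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        candidate = prover(question)
        if verifier(question, candidate):
            return candidate
    return "no verified answer"

print(answer("Which number is bigger, 9.11 or 9.9?"))  # 9.9
```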
Interestingly, the PVG setup draws on a form of reinforcement learning, an approach OpenAI co-founder and former chief scientist Ilya Sutskever strongly advocated.
Causality also offers a framework for improving the intelligence and reliability of LLMs. Causal reasoning would enable AI to understand cause-and-effect relationships, similar to human reasoning.
In an exclusive interview with AIM, Rohit Bhattacharya, assistant professor of computer science at Williams College, said, “We don’t require millions of data points in order to do the things that we do. Reducing the amount of data needed to make these machines function the way they do is one big aspect where causal reasoning can play a role.”
This shift could help overcome limitations in data dependency and improve generalisation to new, unfamiliar situations.
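As a self-contained illustration (not drawn from the interview), the toy model below shows the distinction causal reasoning makes explicit: two variables can be strongly correlated in observational data yet entirely unrelated once one of them is set by intervention.

```python
# A toy structural model: a confounder Z drives both X and Y, so X and Y
# are correlated in observational data even though X has no effect on Y.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observational world: Z -> X and Z -> Y, but no arrow from X to Y
z = rng.normal(size=n)
x_obs = z + rng.normal(scale=0.1, size=n)
y_obs = 2 * z + rng.normal(scale=0.1, size=n)
print("observational correlation:", np.corrcoef(x_obs, y_obs)[0, 1])    # ~0.99

# Interventional world: do(X) sets X by hand, severing its link to Z
x_do = rng.normal(size=n)
y_do = 2 * z + rng.normal(scale=0.1, size=n)
print("correlation under intervention:", np.corrcoef(x_do, y_do)[0, 1])  # ~0.0
```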