
Wait… Did OpenAI Just Solve ‘Jagged Intelligence’? 

“9.11 > 9.9” is now in the OpenAI docs as a problem solved by requesting JSON-structured output to separate final answers from supporting reasoning.



Today, OpenAI released its latest API update, Structured Outputs, which it says guarantees that model responses adhere precisely and consistently to developer-supplied output schemas. In OpenAI’s evals, gpt-4o-2024-08-06 achieved “100% reliability”, matching the requested schemas exactly and producing accurate, consistent structured data.

Funnily, the OpenAI docs include the “9.11 > 9.9” problem as an example resolved using JSON-structured output to distinguish between final answers and supporting reasoning. This is a nod to ‘Jagged Intelligence’, the term Andrej Karpathy coined for LLMs’ tendency to stumble on trivially easy problems.

The new feature ensures that model responses follow a specific set of rules (a JSON Schema) provided by developers. OpenAI said it took a deterministic, engineering-based approach to constraining the model’s outputs in order to achieve 100% reliability.
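A minimal sketch of what this looks like from the developer’s side, assuming the Chat Completions `response_format` parameter described in the launch docs; the schema name and the `reasoning`/`final_answer` fields below are illustrative, echoing the docs’ idea of separating the final answer from supporting reasoning:

```python
# A minimal sketch of a Structured Outputs request via the Chat Completions API.
# The schema name and field names below are illustrative, not copied from OpenAI's docs.
from openai import OpenAI

client = OpenAI()

schema = {
    "name": "math_comparison",
    "strict": True,  # ask the API to enforce exact schema adherence
    "schema": {
        "type": "object",
        "properties": {
            "reasoning": {"type": "string"},     # supporting reasoning, kept separate
            "final_answer": {"type": "string"},  # the answer on its own
        },
        "required": ["reasoning", "final_answer"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Which is larger: 9.11 or 9.9?"}],
    response_format={"type": "json_schema", "json_schema": schema},
)

print(response.choices[0].message.content)  # JSON that matches the schema above
```

With `strict` set to true, the API only returns a message that parses as JSON matching this schema.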

“OpenAI finally rolled out structured outputs in JSON, which means you can now enforce your model outputs to stick to predefined schemas. This is super handy for tasks like validating data formats on the fly, automating data entry, or even building UIs that dynamically adjust based on user input,” posted a user on X.

OpenAI achieves this with a technique called constrained decoding. Normally, when a model generates text, it can choose any token from its vocabulary at each step. That freedom can lead to mistakes, such as emitting characters or symbols that break the requested format.

Constrained decoding is a technique used to prevent these mistakes by limiting the model’s choices to tokens that are valid according to a specific set of rules (like a JSON Schema).
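Conceptually, the decoding loop looks something like the toy sketch below; `allowed_next_tokens`, `model.next_token_logits`, and the tokenizer interface are hypothetical stand-ins for the real machinery OpenAI compiles from the developer’s JSON Schema:

```python
import math

def constrained_decode(model, tokenizer, prompt, allowed_next_tokens, max_tokens=256):
    """Toy constrained decoding loop: greedy generation restricted to schema-valid tokens.

    `allowed_next_tokens(text_so_far)` is a hypothetical stand-in for the grammar
    machinery that, given the output so far, returns the set of token ids that keep
    the output valid under the JSON Schema.
    """
    output_ids = []
    for _ in range(max_tokens):
        logits = model.next_token_logits(prompt, output_ids)   # hypothetical model API
        valid = allowed_next_tokens(tokenizer.decode(output_ids))
        # Mask every token that would violate the schema, then pick the best survivor.
        masked = [l if i in valid else -math.inf for i, l in enumerate(logits)]
        next_id = max(range(len(masked)), key=lambda i: masked[i])
        output_ids.append(next_id)
        if next_id == tokenizer.eos_token_id:
            break
    return tokenizer.decode(output_ids)
```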

A Stop-Gap Mechanism for Reasoning?

Arizona State University professor Subbarao Kambhampati argues that while LLMs are impressive tools for creative tasks, they have fundamental limitations in logical reasoning and cannot guarantee the correctness of their outputs.

He said that GPT-3, GPT-3.5, and GPT-4 are poor at planning and reasoning, which he believes involve reasoning about time and action. These models struggle with transitive closure (for instance, inferring from “A is left of B” and “B is left of C” that “A is left of C”) and with deductive closure, the more complex task of deducing new facts from existing ones.

Kambhampati aligns with Meta AI chief Yann LeCun, who believes that LLMs won’t lead to AGI and that researchers should focus on achieving animal-level intelligence first.

“Current LLMs are trained on text data that would take 20,000 years for a human to read. And still, they haven’t learned that if A is the same as B, then B is the same as A,” LeCun said.

He has even advised young students not to work on LLMs. LeCun is bullish on self-supervised learning and envisions a world model that could learn independently.

“In 2022, while others were claiming that LLMs had strong planning capabilities, we said that they did not,” said Kambhampati, adding that their accuracy was around 0.6%, meaning they were essentially just guessing.

He further added that LLMs are heavily dependent on the data they are trained on. This dependence means their reasoning capabilities are limited to the patterns and information present in their training datasets. 

Explaining this phenomenon, Kambhampati said that when the old Google PaLM was introduced, one of its claims was its ability to explain jokes. He said, “While explaining jokes may seem like an impressive AI task, it’s not as surprising as it might appear.”

“There are humour-challenged people in the world, and there are websites that explain jokes. These websites are part of the web crawl data that the system has been trained on, so it’s not that surprising that the model could explain jokes,” he explained. 

He added that LLMs like GPT-4, Claude, and Gemini are ‘stuck close to zero’ in their reasoning abilities. They are essentially guessing plans for ‘Blocks World’ problems, which involve stacking and unstacking blocks, he said.

This is consistent with a recent study by DeepMind, which found that LLMs often fail to recognise and correct their mistakes in reasoning tasks. 

The study concluded that “LLMs struggle to self-correct their reasoning without external feedback. This implies that expecting these models to inherently recognise and rectify their reasoning mistakes is overly optimistic so far”.

Meanwhile, OpenAI reasoning researcher Noam Brown agrees. “Frontier models like GPT-4o (and now Claude 3.5 Sonnet) may be at the level of a ‘smart high schooler’ in some respects, but they still struggle on basic tasks like tic-tac-toe,” he said.

Interestingly, Apple recently relied on standard prompt engineering for several of its Apple Intelligence features, and someone on Reddit uncovered the prompts.

The Need for a Universal Verifier 

To tackle the problem of accuracy, OpenAI has introduced a prover-verifier model to enhance the clarity and accuracy of language model outputs.

In Prover-Verifier Games, two models are used: the Prover, a strong language model that generates solutions, and the Verifier, a weaker model that checks those solutions for accuracy. The Verifier determines whether the Prover’s outputs are correct (helpful) or intentionally misleading (sneaky).
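A rough, purely illustrative sketch of one round of such a game, with toy arithmetic standing in for real model calls; the roles and reward scheme here are assumptions for illustration, not OpenAI’s actual training setup:

```python
import random

def prover(problem, role):
    """Stand-in for the strong model. The 'helpful' prover answers correctly;
    the 'sneaky' prover returns a plausible but subtly wrong answer."""
    answer = eval(problem)  # toy arithmetic problems stand in for real tasks
    return str(answer) if role == "helpful" else str(answer + random.choice([-1, 1]))

def verifier(problem, solution):
    """Stand-in for the weaker model: 1.0 if it accepts the solution, else 0.0.
    Here it simply re-checks the arithmetic; in the real game it is a learned model."""
    return 1.0 if solution == str(eval(problem)) else 0.0

def play_round(problem):
    role = random.choice(["helpful", "sneaky"])
    solution = prover(problem, role)
    accepted = verifier(problem, solution)
    # Illustrative scoring: the verifier is rewarded for accepting helpful solutions
    # and rejecting sneaky ones; both prover roles want to be accepted.
    verifier_reward = accepted if role == "helpful" else 1.0 - accepted
    return role, solution, accepted, verifier_reward

print(play_round("9 + 2"))
```

In the actual games, both roles are played by language models and the verifier’s judgement is learned rather than hard-coded.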

Kambhampati said, “You can use the world itself as a verifier, but this idea only works in ergodic domains where the agent doesn’t die when it’s actually trying a bad idea.”

He further said that even with end-to-end verification, a clear signal is needed to confirm whether the output is correct. “Where is this signal coming from? That’s the first question. The second question is, how costly will this be?”

Also, OpenAI is currently developing a model with advanced reasoning capabilities, known as Q* or Project Strawberry. Rumours suggest that, for the first time, this model has succeeded in learning autonomously using a new algorithm, acquiring logical and mathematical skills without external influence.

Kambhampati is somewhat sceptical about this development as well. He said, “Obviously, nobody knows whether anything was actually done, but some of the ideas being discussed involve using a closed system with a verifier to generate synthetic data and then fine-tune the model. However, there is no universal verifier.”

Chain of Thought Falls Short

Regarding Chain of Thought, Kambhampati said that it essentially gives the LLM advice on how to solve a particular problem.

Drawing an analogy with Blocks World, he explained that if you train an LLM on three- or four-block stacking problems, it can improve its performance on those specific problems. However, if you increase the number of blocks, its performance degrades sharply.

Kambhampati quipped that Chain of Thought and LLMs remind him of the old proverb, “Give a man a fish, and you feed him for a day; teach a man to fish, and you feed him for a lifetime.”

“Chain of Thought is actually a strange version of this,” he said. “You have to teach an LLM how to catch one fish, then how to catch two fish, then three fish, and so on. Eventually, you’ll lose patience because it’s never learning the actual underlying procedure,” he joked.

Moreover, he said that this doesn’t mean AI can’t perform reasoning. “AI systems that do reasoning do exist. For example, AlphaGo performs reasoning, as do reinforcement learning systems and planning systems. However, LLMs are broad but shallow AI systems. They are much better suited for creativity than for reasoning tasks.” 

Google DeepMind’s AlphaProof and AlphaGeometry, based on a neuro-symbolic approach, recently won a silver medal at the International Mathematical Olympiad. Many feel, to an extent, that neuro-symbolic AI will keep the generative AI bubble from bursting.

Last year, AIM discussed the various approaches taken by big-tech companies, namely OpenAI, Meta, Google DeepMind, and Tesla, in the pursuit of AGI. Since then, tremendous progress has been made. 

Lately, it appears likely that OpenAI is working on causal AI, as its job postings, such as those for data scientists, emphasise expertise in causal inference.

LLM-based AI Agents will NOT Lead to AGI 

Recently, OpenAI developed a structured framework to track the progress of its AI models toward artificial general intelligence (AGI). OpenAI CTO Mira Murati claimed that GPT-5 will reach PhD-level capability, while Google’s Logan Kilpatrick anticipates AGI will emerge by 2025.

Commenting on the hype around AI agents, Kambhampati said, “I am kind of bewildered by the whole agentic hype because people confuse acting with planning.”

He further explained, “Being able to make function calls doesn’t guarantee that those calls will lead to desirable outcomes. Many people believe that if you can call a function, everything will work out. This is only true in highly ergodic worlds, where almost any sequence will succeed and none will fail.”



Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and putting forward ideas worth pondering about in the era of artificial intelligence.