Amazon’s four-billion-dollar baby Anthropic recently released Claude 3, a family of generative AI models called Haiku, Sonnet and Opus, which surpasses GPT-4 on prominent benchmarks while offering near-instant results and strong reasoning capabilities. It also outperforms Gemini 1.0 Pro and is on par with, or competitive against, Gemini 1.0 Ultra.
Longer Context Length
The Claude 3 model series debuts with a 200,000-token context window, a jump from the 100,000 tokens of Claude 2. The models can, however, accept inputs exceeding one million tokens for select customers.
Meanwhile, Gemini 1.5 marks a substantial leap in performance, leveraging advancements in research and engineering across foundation-model development and infrastructure. Notably, Gemini 1.5 Pro, the first model released for early testing, is a mid-size multimodal model optimised for diverse tasks. Positioned at a performance level akin to 1.0 Ultra, it also introduces an experimental breakthrough in long-context understanding.
Gemini 1.5 Pro ships with a standard 128,000-token context window. Still, like Anthropic, Google allows a select group of developers and enterprise customers to try an extended context window of up to one million tokens via AI Studio and Vertex AI in private preview.
Unfortunately, the weakest in this space is OpenAI’s GPT-4, which caps context length at 32,000 tokens, although GPT-4 Turbo can process up to 128,000 tokens.
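For a rough sense of what these figures mean in practice, here is a minimal sketch of a pre-flight check that a developer might run before sending a long prompt. The model names and the four-characters-per-token heuristic are illustrative assumptions, not official tokeniser behaviour; the limits are the standard figures quoted above.

```python
# Back-of-the-envelope check of whether a prompt fits a model's advertised
# context window. The ~4-characters-per-token rule of thumb is an assumption
# for illustration only, not an exact tokeniser.
CONTEXT_WINDOWS = {
    "claude-3-opus": 200_000,    # up to 1M tokens for selected customers
    "gemini-1.5-pro": 128_000,   # up to 1M tokens in private preview
    "gpt-4": 32_000,
    "gpt-4-turbo": 128_000,
}

def fits_in_context(prompt: str, model: str, reserve_for_output: int = 1_000) -> bool:
    """Return True if the estimated prompt size plus output budget fits the window."""
    estimated_tokens = len(prompt) // 4  # rough heuristic: ~4 characters per token
    return estimated_tokens + reserve_for_output <= CONTEXT_WINDOWS[model]

# Example: a very long document easily blows past GPT-4's 32K window.
print(fits_in_context("some very long document ... " * 10_000, "gpt-4"))
```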
Improved Reasoning and Understanding
Another aspect that has caught everyone’s attention is the ‘Needle In A Haystack’ (NIAH) evaluation approach taken by Anthropic, which gauges a model’s accuracy in recalling a specific piece of information from a vast amount of context.
Effective processing of lengthy context prompts demands models with strong recall abilities. Claude 3 Opus not only achieved nearly perfect recall, surpassing 99% accuracy, but also demonstrated an awareness of evaluation limitations, identifying instances where the ‘needle’ sentence seemed artificially inserted into the original text by a human.
During an NIAH evaluation, which assesses a model’s recall ability by embedding a target sentence (the “needle”) into a collection of random documents (the “haystack”), Opus exhibited an unexpected behaviour. To make the benchmark more robust, Anthropic used 30 random needle/question pairs per prompt and tested on a diverse corpus of crowdsourced documents.
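To make the setup concrete, here is a minimal sketch of how such a needle-in-a-haystack check could be wired up, assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY in the environment. The filler corpus, the pizza-topping needle wording and the pass/fail rule are illustrative paraphrases, not Anthropic’s actual test data or harness.

```python
# Minimal NIAH-style harness: insert a "needle" sentence into a haystack of
# filler documents, ask the model about it, and check whether it recalls it.
import random
import anthropic

NEEDLE = ("The most delicious pizza topping combination is figs, prosciutto "
          "and goat cheese.")
QUESTION = "What is the most delicious pizza topping combination?"

def build_haystack(documents: list[str], needle: str, depth: float) -> str:
    """Join filler documents and insert the needle at a relative depth (0-1)."""
    cut = int(len(documents) * depth)
    return "\n\n".join(documents[:cut] + [needle] + documents[cut:])

def run_niah(documents: list[str]) -> str:
    """Send the haystack plus question to Claude 3 Opus and return its answer."""
    context = build_haystack(documents, NEEDLE, depth=random.random())
    prompt = f"{context}\n\n{QUESTION} Answer using only the documents above."
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Naive pass/fail: did the answer recall the needle's key phrase?
# answer = run_niah(crowdsourced_docs)   # crowdsourced_docs: your own filler corpus
# print("recalled" if "figs" in answer.lower() else "missed")
```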
Recounting internal testing of Claude 3 Opus, Alex Albert, a prompt engineer at Anthropic, shared that during an NIAH evaluation the model seemed to suspect that the team was running an evaluation on it. When asked a question about pizza toppings, Opus returned the needle sentence but flagged it as seemingly unrelated to the rest of the documents.
The sentence appeared out of place against the overall document content, which primarily focused on programming languages, startups and career-related topics. Opus suggested that the pizza-topping information might have been inserted as a joke or as a test of its attention, since it did not align with the broader themes and the documents contained no other information about pizza toppings.
So, Opus not only successfully identified the inserted needle but also demonstrated meta-awareness by recognising the needle’s incongruity within the haystack. This prompted reflection on the need for the industry to move beyond artificial tests.
Several users who have tried Claude 3 Opus are so impressed by its reasoning and understanding skills that they feel the model has reached AGI. For example, some appreciate its apparent intrinsic worldview, shaped by the Integral Causality framework and characterised by holism, development, embodiment, contextuality, perspectivism, and practical engagement.
Other reactions from the community discussing Claude 3’s potential status as AGI point to its ability to reinvent quantum algorithms and even its comprehension of a complex quantum physics paper.
Another aspect highlighted by NVIDIA’s Jim Fan is the inclusion of domain-expert benchmarks in finance, medicine and philosophy, which sets Claude apart from models evaluated solely on saturated metrics like MMLU and HumanEval. This approach gives a more targeted picture of performance in specific expert domains, offering valuable insight for downstream applications.
Secondly, Anthropic addresses the issue of overly cautious answers from LLMs with a refusal-rate analysis, emphasising its efforts to mitigate overly safe responses to non-controversial questions.
However, Fan also cautions against overinterpreting Claude 3’s perceived “awareness”. A simpler explanation, he believes, is that instances of apparent self-awareness are the outcome of pattern-matching against alignment data crafted by humans. The process is similar to asking GPT-4 about its self-consciousness, where a sophisticated-sounding response is likely shaped by human annotators adhering to their preferences.
Even though AGI has been the talk of the town since OpenAI released GPT-4 in March 2023, Anthropic’s Claude 3 still falls short of it. This raises an important question: how close are we to AGI? And, most importantly, who is leading that race?