Mark Zuckerberg seems to be on the right track as Meta prepares to unveil Llama 3.1, the next iteration of its Llama family, expected to be released today. The new version will come in three sizes: 8B, 70B, and 405B, each with a context length of 128K.
However, even before Meta could officially release the model, its benchmark card was leaked and is now making the rounds on social media.
According to the leaked information, Llama 3.1 has been trained on over 15 trillion tokens sourced from publicly available datasets. The fine-tuning data comprises publicly available instruction-tuning datasets, along with an additional 15 million synthetic samples.
The models are explicitly advertised as multilingual, offering support for French, German, Hindi, Italian, Portuguese, Spanish, and Thai.
According to benchmarks, Llama 3.1 outperforms OpenAI’s GPT-4o in categories such as general knowledge, reasoning, reading comprehension, code generation, and multilingual capabilities. “Open-source is about to be SOTA — even the 70B is > gpt-4o, and this is before instruct tuning, which should make it even better,” posted a user on X.
Llama 3.1 405B achieves a macro average accuracy of 85.2% on the MMLU benchmark, whereas GPT-4o scores 87.5%. In other words, Llama 3.1 trails GPT-4o by just over two points, making it highly competitive.
“The 70b is really encroaching on the 405b’s territory. I can’t imagine it being worthwhile to host the 405B. This feels like a confirmation that the only utility of big models right now is to distil from it,” posted another user.
Llama 3.1 405B is expected to be highly effective in generating datasets for smaller models. One user on Reddit pointed out that this could be a major advancement for “distillation”, likening it to the relationship between GPT-4 and GPT-4o.
They suggested using Llama 3.1 70B for “fast inference” and Llama 3.1 405B for dataset creation and critical flows. “Who will use Llama-3.1-405B to create the best training datasets for smaller models?” asked Jiquan Ngiam, founder of Lutra AI.
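A minimal sketch of what that distillation workflow could look like, assuming a hosted Llama 3.1 405B endpoint that speaks the OpenAI-compatible chat API. The base URL, API key, model identifier, and seed prompts below are placeholders, not confirmed details of any provider’s setup:

```python
# Hypothetical sketch: use a large "teacher" model to generate synthetic
# instruction/response pairs that a smaller model could later be fine-tuned on.
import json
from openai import OpenAI

# Assumed OpenAI-compatible endpoint serving Llama 3.1 405B (placeholder URL and key).
client = OpenAI(base_url="https://example-host/v1", api_key="YOUR_KEY")

seed_prompts = [
    "Explain gradient checkpointing in two sentences.",
    "Write a Python function that reverses a linked list.",
]

with open("synthetic_dataset.jsonl", "w") as f:
    for prompt in seed_prompts:
        response = client.chat.completions.create(
            model="llama-3.1-405b-instruct",  # placeholder model identifier
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,
        )
        answer = response.choices[0].message.content
        # Store prompt/response pairs in a simple instruction-tuning format
        # that a smaller model (e.g. an 8B) could be fine-tuned on.
        f.write(json.dumps({"instruction": prompt, "output": answer}) + "\n")
```

In practice, such pipelines also filter or score the generated samples before fine-tuning, but the core idea is simply the large model writing the training data for the small one.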
“Honestly might be more excited for 3.1 70b and 8b. Those look absolutely cracked, must be distillations of 405b,” posted another user on Reddit, who goes by the name thatrunningguy.
OpenAI co-founder Andrej Karpathy also explained that in the future, as larger models help refine and optimise the training process, smaller models will emerge. “The models have to first get larger before they can get smaller because we need their (automated) help to refactor and mould the training data into ideal, synthetic formats.”
Last week, we saw the release of several small models that can be run locally without relying on the cloud. Small language models, or SLMs, are expected to play a growing role alongside generalised models like GPT-4 and Claude 3.5 Sonnet.
“For everyday use, an 8B or even a 70B LLM will suffice. If you don’t need to push a model to its limits, a SOTA model isn’t necessary for routine questions,” noted another user.
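Running such a model locally can indeed be a few lines of code. The sketch below is an assumption-laden example using Hugging Face Transformers: it presumes the weights are published under the gated meta-llama/Meta-Llama-3.1-8B-Instruct repository, that access has been granted, and that transformers, accelerate, and a GPU with enough memory are available:

```python
# Hypothetical sketch: local inference with an 8B instruct model via Transformers.
import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # assumed gated repo id
    torch_dtype=torch.bfloat16,   # halve memory use versus fp32
    device_map="auto",            # requires the accelerate package
)

messages = [
    {"role": "user", "content": "Summarise the attention mechanism in one paragraph."}
]

# Recent Transformers versions apply the model's chat template automatically
# when the input is a list of chat messages.
output = generator(messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])  # the assistant's reply
```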
OpenAI has just caught its breath
OpenAI’s recent compact and cost-effective model, GPT-4o mini, has excelled on benchmarks, achieving 82% on MMLU, 87% on MGSM for maths reasoning, and 87.2% on HumanEval for coding tasks. However, Meta’s Llama 3.1 70B Instruct is closely competitive, matching these impressive scores.
“GPT-4o mini, launched just 4 days ago, is already processing over 200 billion tokens per day! I’m very happy to hear how much people are enjoying the new model,” posted OpenAI chief Sam Altman on X.
OpenAI’s ongoing concern has been the computational resources required to train and serve its models, which has delayed the development of its next frontier model. Notably, GPT-4o’s voice capabilities have not yet been rolled out, and Sora remains unavailable for general use.
Meanwhile, OpenAI has been holding talks with chip designers, including Broadcom, about developing its own chip to reduce its dependency on NVIDIA. Notably, CEO Jensen Huang personally hand-delivered the first NVIDIA DGX H200 to OpenAI.
OpenAI has recently begun training its next frontier model, most likely GPT-5, and the company anticipates that the resulting systems will bring it to the next level of capabilities on the path to AGI.
At Microsoft Build, CTO Kevin Scott said that if the system that trained GPT-3 was a shark and GPT-4 an orca, the model being trained now is the size of a whale. “This whale-sized supercomputer is hard at work right now,” he added.
“We’re bringing in the latest H200s to Azure later this year and will be among the first cloud providers to offer NVIDIA’s Blackwell GPUs in B100 as well as GB200 configurations,” said Microsoft chief Satya Nadella.
Meta, on the other hand, announced earlier this year that it is building massive compute infrastructure to support its future roadmap, including 350,000 H100s by the end of this year and nearly 600,000 H100 equivalents of compute overall.
With Llama 3.1, Meta has made it clear that their focus spans the entire LLM market, regardless of size. Rumours suggest that Meta has already begun training Llama 4, which is expected to be multimodal with audio features and integrated into the Meta Ray-Ban glasses.