Researchers at Meta AI have created a new benchmark called CRAG (Comprehensive Retrieval-Augmented Generation Benchmark) to spur advancements in retrieval-augmented question answering systems that combine large language models with external knowledge sources.
The goal is to develop more reliable and trustworthy question answering capabilities that overcome hallucinations and knowledge gaps in today’s language models.
The CRAG benchmark consists of 4,409 question-answer pairs spanning finance, sports, music, movies, and general topics.
It includes diverse question types, such as comparisons, aggregations, multi-hop queries, and questions built on false premises. The facts it covers range from real-time to static, and the entities involved range from popular ("head") to obscure ("long-tail").
Crucially, CRAG provides mock web search results and APIs to simulate retrieving information from the internet and knowledge graphs. This allows benchmarking the full pipeline of retrieval, synthesis, and generation required for knowledge-grounded question answering.
Evaluations highlighted major gaps in current systems. On their own, the most advanced language models achieved at most 34% accuracy on CRAG, and adding retrieval in a straightforward way raised this to only 44%.
Even industry-leading retrieval-augmented systems answered only 63% of questions without hallucinations, struggling especially with dynamic, long-tail, and complex queries.
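Percentages like these are the share of questions that fall into each outcome category. As a rough illustration only, the sketch below assumes every system answer has already been labeled with one of three hypothetical categories (`correct`, `missing`, `hallucinated`) and computes the share of each. The label names and the `summarize_outcomes` function are assumptions made for this example, not CRAG's official scoring code.

```python
from collections import Counter

# Hypothetical outcome labels, assumed for illustration only.
OUTCOMES = ("correct", "missing", "hallucinated")


def summarize_outcomes(labels):
    """Return the fraction of answers in each outcome category.

    `labels` holds one label per question, each taken from OUTCOMES.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    unknown = set(labels) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")

    counts = Counter(labels)
    total = len(labels)
    # Counter returns 0 for categories that never occur.
    return {outcome: counts[outcome] / total for outcome in OUTCOMES}


if __name__ == "__main__":
    # Toy data: four answers, not real benchmark results.
    toy_labels = ["correct", "correct", "missing", "hallucinated"]
    print(summarize_outcomes(toy_labels))
```

Reporting the hallucinated share separately from the missing share matters because a system that declines to answer is wrong in a different way than one that confidently gives a false answer.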
“CRAG reveals the challenges in building fully trustworthy question answering systems that can reliably incorporate information from the real world,” said Xiao Yang, a research scientist at Meta AI and co-lead of the project. “We hope this benchmark spurs innovation and allows tracking progress toward this critical goal.”
The CRAG dataset formed the basis for the KDD Cup 2024 challenge hosted by Meta AI, attracting thousands of participants working to advance retrieval-augmented generation capabilities. The researchers plan to continue expanding and improving CRAG to push forward research in this area.