Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

Paper · arXiv 2409.12941 · Published September 19, 2024

Large Language Models (LLMs) have demonstrated significant performance improvements across various cognitive tasks. An emerging application is using LLMs to enhance retrieval-augmented generation (RAG) capabilities. These systems require LLMs to understand user queries, retrieve relevant information, and synthesize coherent and accurate responses. Given the increasing real-world deployment of such systems, comprehensive evaluation becomes crucial. To this end, we propose FRAMES (Factuality, Retrieval, And reasoning MEasurement Set), a high-quality evaluation dataset designed to test LLMs’ ability to provide factual responses, assess retrieval capabilities, and evaluate the reasoning required to generate final answers. While previous work has provided datasets and benchmarks to evaluate these abilities in isolation, FRAMES offers a unified framework that gives a clearer picture of LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require integrating information from multiple sources. We present baseline results demonstrating that even state-of-the-art LLMs struggle with this task, reaching an accuracy of only 0.40 without retrieval. Accuracy improves substantially with our proposed multi-step retrieval pipeline, reaching 0.66 (a relative improvement of more than 50%). We hope our work will help bridge evaluation gaps and assist in developing more robust and capable RAG systems.
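To make the evaluation setting concrete, the sketch below scores a model over FRAMES-style question-answer pairs. The record fields, the model_answer stub, and the exact-match check are illustrative assumptions for this sketch, not the dataset's actual schema or the paper's grading procedure.

# Illustrative scoring loop over FRAMES-style samples. Field names, the
# model_answer stub, and the exact-match check are assumptions for this
# sketch, not the dataset's actual schema or the paper's evaluation harness.

samples = [
    # Each sample pairs a multi-hop question with a single gold answer.
    {"question": "In which city was the founder of the studio behind film X born?",
     "answer": "Chicago"},
]

def model_answer(question: str) -> str:
    """Stand-in for an LLM call, with or without a retrieval step."""
    return "Chicago"  # canned response so the sketch runs end to end

def evaluate(samples) -> float:
    """Fraction of questions answered correctly (toy exact match)."""
    correct = sum(
        model_answer(s["question"]).strip().lower() == s["answer"].strip().lower()
        for s in samples
    )
    return correct / len(samples)

print(f"accuracy: {evaluate(samples):.2f}")  # prints 1.00 on the stub above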

Our work addresses a critical void in the current landscape by offering a challenging evaluation benchmark that not only tests the individual components of LLMs but also evaluates their performance in an end-to-end context. Through our dataset, we simulate realistic, multi-document queries to assess the ability of LLMs to retrieve relevant facts, reason accurately, and synthesize information into coherent responses. Additionally, we present empirical results on the performance of state-of-the-art models, highlighting both their strengths and the limitations of their reasoning capabilities. These findings pave the way for further research and development of more robust and efficient retrieval-augmented generation systems. Our key contributions are as follows:

• We introduce FRAMES, a novel dataset of 824 test samples designed to evaluate LLMs’ ability to retrieve and reason across multiple documents in a unified framework.

• We provide a comprehensive evaluation of state-of-the-art LLMs, highlighting their performance on factuality, retrieval, and reasoning tasks across diverse domains.

• We present new empirical insights into the limitations of existing LLMs in handling multi-hop and temporal reasoning tasks, offering avenues for future research to improve these systems.

• We propose a multi-step retrieval and reasoning framework that compels models to iteratively retrieve and reason, significantly improving their performance on complex queries; a minimal sketch of such a loop follows this list.
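As an illustration of the last contribution, here is a minimal sketch of an iterative retrieve-and-reason loop in the spirit of the proposed framework. The generate and search functions are hypothetical stand-ins for an LLM call and a retrieval backend, and the SEARCH/ANSWER protocol is an assumption of this sketch, not the paper's actual prompt format.

# Minimal sketch of a multi-step retrieval-and-reasoning loop. generate and
# search are hypothetical stand-ins for an LLM call and a retrieval backend;
# the SEARCH/ANSWER protocol is an assumption, not the paper's prompt format.

def generate(prompt: str) -> str:
    """Stand-in LLM call; canned replies so the sketch runs end to end."""
    if "doc snippet" in prompt:  # evidence has already been gathered
        return "ANSWER: example final answer"
    return "SEARCH: example follow-up query"

def search(query: str, k: int = 4) -> list[str]:
    """Stand-in retriever returning top-k document snippets."""
    return [f"doc snippet {i} for '{query}'" for i in range(k)]

def multi_step_rag(question: str, max_steps: int = 5) -> str:
    """Let the model alternate between issuing queries and answering."""
    evidence: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Question: {question}\n"
            f"Evidence so far: {evidence}\n"
            "Reply 'SEARCH: <query>' to retrieve more documents, "
            "or 'ANSWER: <final answer>' once the evidence suffices."
        )
        reply = generate(prompt).strip()
        if reply.startswith("ANSWER:"):
            return reply.removeprefix("ANSWER:").strip()
        # Otherwise run the model's follow-up query and accumulate evidence.
        evidence.extend(search(reply.removeprefix("SEARCH:").strip()))
    return "no answer within the step budget"

print(multi_step_rag("Who directed the highest-grossing film of the year X was born?"))

The point the sketch illustrates is that retrieval is interleaved with reasoning rather than performed once up front, so each retrieval step can be conditioned on what the model has already inferred from earlier evidence.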