In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss

Paper · arXiv 2402.10790 · Published February 16, 2024

Abstract. This paper addresses the challenge of processing long documents using generative transformer models. To evaluate different approaches, we introduce BABILong, a new benchmark designed to assess model capabilities in extracting and processing distributed facts within extensive texts. Our evaluation, which includes benchmarks for GPT-4 and RAG, reveals that common methods are effective only for sequences up to 10^4 elements. In contrast, fine-tuning GPT-2 with recurrent memory augmentations enables it to handle tasks involving up to 11 × 10^6 elements. This achievement marks a substantial leap, as it is by far the longest input processed by any neural network model to date, demonstrating a significant improvement in the processing capabilities for long sequences.

Introduction. Memory is an essential component of both natural and artificial cognitive systems. Different types of memory represent general knowledge about some domain, facts, and episodic or specific information. Training of neural models encodes representations for general knowledge and repeating facts in network parameters. On the other hand, task-specific information is provided as a part of the model’s input. This type of information is particularly important because it conditions the generation of a solution. Recent progress in machine learning has resulted in the extension of input size for commonly used models by three orders of magnitude, from hundreds to hundreds of thousands of elements. However, further increase in input sequence length is limited by the quadratic scaling of compute required for the calculation of self-attention in transformers. One of the promising approaches to extend the context window of transformers is augmentation with recurrent memory (Bulatov et al., 2022).
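
To make the scaling argument concrete, the sketch below counts pairwise token interactions for one attention layer in two settings: full self-attention over the whole input, and segment-wise processing in which each segment is attended together with a small set of memory tokens. The function names, the segment length, and the memory size are illustrative assumptions, not values taken from the paper, and the count ignores constant factors such as heads and layers.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Pairwise token interactions for full self-attention: n^2."""
    return n_tokens * n_tokens


def segmented_attention_pairs(n_tokens: int, segment_len: int, memory_len: int) -> int:
    """Pairwise interactions when the input is split into fixed-size segments.

    Each segment is processed together with `memory_len` memory tokens,
    so the cost per segment is (segment_len + memory_len)^2 and the total
    grows linearly with the number of segments.
    """
    n_segments = -(-n_tokens // segment_len)  # ceiling division
    window = segment_len + memory_len
    return n_segments * window * window


if __name__ == "__main__":
    n = 11_000_000  # input length discussed in the paper
    # Assumed, purely illustrative segment and memory sizes.
    seg, mem = 512, 10
    print("full attention:", full_attention_pairs(n))
    print("segment-wise  :", segmented_attention_pairs(n, seg, mem))
```

Under these assumptions, doubling the input length quadruples the first count but only doubles the second, which is the reason segment-level recurrence is attractive for very long inputs.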

Discussion / Conclusion. This work explores the capability of current transformer-based solutions in extremely long context processing. To that end, we propose BABILong, a novel benchmark designed to assess the performance of NLP models on long documents with distributed facts. The benchmark offers algorithmically adaptable document length and content placement, making it leak-proof for future large language models (LLMs). Our findings reveal limitations in popular approaches such as GPT-4 and RAG regarding effective long context utilization. Their performance relies heavily on the first 25% of the input, highlighting the need for improved context processing mechanisms. We demonstrate the effectiveness of recurrent memory augmentation of transformer models. This approach achieves superior performance on documents of up to 11 million tokens. Notably, the recurrent memory enables multi-hop reasoning, a crucial ability for tackling complex tasks. Attention map analysis provides valuable insights into how the model leverages the stored memory to identify relevant facts. This work not only sets a new record by processing 11 million tokens but also strengthens the hypothesis that recurrent memory excels at filtering out irrelevant information compared to relying solely on attention mechanisms.
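
The idea of algorithmically adaptable document length and content placement can be illustrated with a minimal sketch. It scatters a few task facts, in order, among background sentences until a target length is reached. This is not the benchmark's actual generation code: the function `build_sample`, the toy facts, the filler sentences, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
import random


def build_sample(facts, question, answer, filler_sentences, target_len, seed=0):
    """Hide `facts` (order preserved) inside filler text of about `target_len` words."""
    rng = random.Random(seed)

    # Draw filler sentences until the word budget is reached.
    background = []
    n_words = sum(len(fact.split()) for fact in facts)
    while n_words < target_len:
        sentence = rng.choice(filler_sentences)
        background.append(sentence)
        n_words += len(sentence.split())

    # Sorted insertion points keep the facts in their original order.
    positions = sorted(rng.randint(0, len(background)) for _ in facts)

    document = list(background)
    # Insert from the last position backwards so earlier indices stay valid.
    for pos, fact in reversed(list(zip(positions, facts))):
        document.insert(pos, fact)

    return {"context": " ".join(document), "question": question, "answer": answer}


if __name__ == "__main__":
    # Hypothetical toy inputs, not drawn from the benchmark.
    facts = ["Mary went to the kitchen.", "Mary picked up the apple."]
    filler = ["The sky was grey that morning.", "Nothing moved along the road."]
    sample = build_sample(facts, "Where is the apple?", "kitchen", filler, target_len=200)
    print(len(sample["context"].split()), "words")
```

Because both the length and the fact positions are controlled by `target_len` and `seed`, the same task can be regenerated at any scale, which is what makes this kind of construction hard to leak into training data.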

