Long-context LLMs Struggle with Long In-context Learning

Paper · arXiv 2404.02060 · Published April 2, 2024

We introduce LongICLBench, a benchmark for long in-context learning in extreme-label classification, built from six datasets spanning 28 to 174 classes and input lengths from 2K to 50K tokens. The benchmark requires LLMs to comprehend the entire input and recognize the massive label space in order to make correct predictions. We evaluate 15 long-context LLMs and find that they perform well on the less challenging classification tasks with smaller label spaces and shorter demonstrations. However, they struggle on more challenging tasks like Discovery, with 174 labels, suggesting a gap in their ability to process long, context-rich sequences. Further analysis reveals a bias toward labels presented later in the sequence and a need for improved reasoning over multiple pieces of information. Our study shows that long-context understanding and reasoning remain challenging for existing LLMs.

In this paper, we propose to adopt in-context learning (ICL) on extreme-label classification tasks (Anil et al., 2022; Milios et al., 2023) to evaluate long-context LLMs. Unlike prior tasks, in-context learning requires an LLM to scan the entire input to recognize the task and understand the label space, so comprehending the full context is necessary for making correct predictions. Due to the massive label space, the task demonstration can easily become a long sequence. For example, Discovery (Sileo et al., 2019) encompasses 174 classes, with each example averaging 61 tokens, so the minimum total demonstration length (one shot per class) already exceeds 10K tokens. In practice, LLMs need more than one shot per class to grasp the nuances of fine-grained labels, and multiple shots can extend the total demonstration length beyond 32K tokens. This task therefore becomes a natural testbed for long-context understanding.
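To make the scaling concrete, here is a minimal sketch of how demonstration length grows with shots per class. The constants mirror the Discovery statistics quoted above (174 classes, ~61 tokens per example); the helper function is illustrative, not the authors' exact prompt-construction pipeline, and real lengths depend on the tokenizer used.

```python
# Back-of-the-envelope estimate of long ICL prompt length for an
# extreme-label classification task, assuming one round-robin pass
# over all classes per "shot".

NUM_CLASSES = 174            # label-space size of Discovery
AVG_TOKENS_PER_EXAMPLE = 61  # average demonstration length in tokens

def demo_length(shots_per_class: int) -> int:
    """Approximate total demonstration length in tokens."""
    return NUM_CLASSES * shots_per_class * AVG_TOKENS_PER_EXAMPLE

for shots in (1, 2, 3, 5):
    print(f"{shots} shot(s)/class -> ~{demo_length(shots):,} tokens")
# 1 shot(s)/class -> ~10,614 tokens
# 2 shot(s)/class -> ~21,228 tokens
# 3 shot(s)/class -> ~31,842 tokens
# 5 shot(s)/class -> ~53,070 tokens
```

Even at one shot per class the prompt already exceeds the 8K context of many earlier models, and three or more shots per class pushes it past the 32K mark cited above.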