Longer Context, Deeper Thinking: Uncovering the Role of Long-Context Ability in Reasoning

Paper · arXiv 2505.17315 · Published May 22, 2025
Evaluations · Reasoning Architectures

Recent language models exhibit strong reasoning capabilities, yet the influence of long-context capacity on reasoning remains underexplored. In this work, we hypothesize that current limitations in reasoning stem, in part, from insufficient long-context capacity, motivated by empirical observations such as i) a longer context window often leads to stronger reasoning performance, and ii) failed reasoning cases resemble failed long-context cases. To test this hypothesis, we examine whether enhancing a model’s long-context ability before Supervised Fine-Tuning (SFT) leads to improved reasoning performance. Specifically, we compare models with identical architectures and fine-tuning data but varying levels of long-context capacity. Our results reveal a consistent trend: models with stronger long-context capacity achieve significantly higher accuracy on reasoning benchmarks after SFT. Notably, these gains persist even on tasks with short input lengths, indicating that long-context training offers generalizable benefits for reasoning performance. These findings suggest that long-context modeling is not just essential for processing lengthy inputs, but also serves as a critical foundation for reasoning. We advocate for treating long-context capacity as a first-class objective in the design of future language models. Our code is anonymously available at https://github.com/uservan/LCTMerge.

Recent advances in long-context modeling have enabled large language models (LLMs) to process substantially longer sequences. However, it remains unclear whether such long-context capacity yields tangible benefits for reasoning tasks. In this section, we empirically examine the relationship between long-context ability and reasoning performance. We collect a set of well-known open-source reasoning models fine-tuned from Qwen/Qwen2.5-7B-Instruct. These models are categorized into two groups based on their long-context capacity: 32k and 128k tokens. We then evaluate and compare their reasoning performance on two math reasoning benchmarks: MATH500 and AIME. The detailed results are reported in Table 1. Figure 1 shows that models with longer context windows (128k vs. 32k) consistently achieve higher accuracy on mathematical reasoning benchmarks such as MATH500 and AIME. This suggests that the ability to encode and maintain longer contextual dependencies can directly translate into better reasoning capabilities. These results collectively highlight the importance of effective long-context training—not only for tasks involving long inputs, but also for general reasoning even when test-time inputs are relatively short.
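The grouping-and-comparison step above can be sketched as a small script: bucket each model by its context window, then average benchmark accuracy per bucket. The model names and accuracy numbers below are placeholders for illustration only, not the actual values from Table 1:

```python
from statistics import mean

# Placeholder entries: (model name, context window in tokens, MATH500 acc, AIME acc).
# Real model names and scores are reported in Table 1 of the paper.
RESULTS = [
    ("model-a-32k",  32_000,  72.0, 10.0),
    ("model-b-32k",  32_000,  74.5, 13.3),
    ("model-c-128k", 128_000, 81.2, 23.3),
    ("model-d-128k", 128_000, 83.0, 26.7),
]

def group_means(results):
    """Average benchmark accuracy per context-window group."""
    groups = {}
    for _, ctx, math500, aime in results:
        groups.setdefault(ctx, []).append((math500, aime))
    return {
        ctx: {
            "MATH500": mean(m for m, _ in scores),
            "AIME": mean(a for _, a in scores),
        }
        for ctx, scores in groups.items()
    }

if __name__ == "__main__":
    for ctx, accs in sorted(group_means(RESULTS).items()):
        print(f"{ctx // 1000}k context: "
              f"MATH500={accs['MATH500']:.1f}, AIME={accs['AIME']:.1f}")
```

With the placeholder numbers, the 128k group's averages exceed the 32k group's on both benchmarks, mirroring the trend the section describes.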