Does reasoning ability actually degrade with longer inputs?
Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.
The FLenQA benchmark exposes a critical gap between a model's technical context window and its effective reasoning capacity over long inputs. By embedding simple reasoning tasks (True/False questions requiring the integration of two pieces of information) within irrelevant padding text of varying lengths, the paper shows that reasoning accuracy drops from 0.92 to 0.68 at just 3000 tokens, far below any modern model's context window.
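To make the setup concrete, here is a minimal sketch of the FLenQA-style construction under stated assumptions: two facts that must be integrated to answer a True/False question, buried in irrelevant padding grown to a target length. This is not the paper's code; the `build_padded_prompt` helper, the whitespace-split token proxy, and the sample sentences are all hypothetical.

```python
import random

def build_padded_prompt(fact_a: str, fact_b: str, question: str,
                        padding_sentences: list[str], target_tokens: int) -> str:
    """Embed two key facts inside irrelevant padding, grown to roughly
    target_tokens whitespace tokens (a crude proxy for real tokenization)."""
    body = list(padding_sentences)
    while len(" ".join(body).split()) < target_tokens:
        body.append(random.choice(padding_sentences))
    # FLenQA varies where the key facts sit, so place them at random positions.
    for fact in (fact_a, fact_b):
        body.insert(random.randrange(len(body) + 1), fact)
    return " ".join(body) + f"\n\nQuestion: {question} Answer True or False."

# Hypothetical instance: answering requires integrating both facts.
prompt = build_padded_prompt(
    fact_a="Alice is taller than Bob.",
    fact_b="Bob is taller than Carol.",
    question="Is Alice taller than Carol?",
    padding_sentences=["The museum opens at nine.", "Rain is forecast for Tuesday."],
    target_tokens=3000,
)
```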
Three findings make this particularly concerning:
1. The degradation is task-agnostic. Whether the padding text is similar or dissimilar to the reasoning content, and wherever the two pieces of information are embedded within the context, the same degradation trend appears. The failure is not about content interference but about attention dilution over length.
2. Next-word prediction performance is uncorrelated with reasoning performance. Models that maintain strong perplexity on long inputs still fail at reasoning over those inputs. This means language modeling benchmarks on long contexts are misleading indicators of actual long-context utility — a model can "understand" the text (predict tokens well) while failing to reason over it.
3. CoT does not mitigate proportionally. Chain-of-thought prompting lifts accuracy roughly uniformly across context lengths but does not close the length-induced gap (an evaluation sketch follows this list). The degradation persists under CoT because the bottleneck is retrieving information from the context, not reasoning over the retrieved information.
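The findings above imply a simple evaluation loop, sketched here using the `build_padded_prompt` helper from the earlier snippet: measure True/False accuracy at several padded lengths, with and without a chain-of-thought suffix. `query_model` stands in for whatever LLM API is used, and the answer-parsing heuristic is an assumption, not the paper's scoring method.

```python
from statistics import mean

def evaluate(samples, lengths, padding_sentences, query_model, cot=False):
    """Accuracy at each padded length. `query_model(prompt) -> str` is a
    placeholder for an LLM call; `samples` is a list of
    (fact_a, fact_b, question, label) tuples with label "True" or "False"."""
    suffix = "\nThink step by step before giving the final answer." if cot else ""
    results = {}
    for n in lengths:
        hits = []
        for fact_a, fact_b, question, label in samples:
            prompt = build_padded_prompt(fact_a, fact_b, question,
                                         padding_sentences, target_tokens=n)
            answer = query_model(prompt + suffix)
            # Crude check: does the expected label appear near the end?
            hits.append(label.lower() in answer.lower().split()[-5:])
        results[n] = mean(hits)
    return results
```

Per the paper's pattern, running this with cot=True should lift accuracy at every length while leaving the short-versus-long gap in place.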
This is a complementary mechanism to "Why do language models fail at temporal reasoning in complex tasks?". That failure is about task complexity; this one is about input noise. Together they define a two-dimensional reliability surface: reasoning degrades with both task complexity and input length, and the two dimensions are independent.
The implication for RAG systems is direct: retrieved documents add to input length, and if that length includes irrelevant passages (as it typically does), reasoning over the retrieved content degrades even when the relevant information is present. As "Why does vanilla RAG produce shallow and redundant results?" argues, static retrieval already limits depth; length degradation explains another part of the failure: more retrieved documents means more padding, which means worse reasoning.
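One mitigation this suggests is to filter retrieved passages before assembling the prompt, so that associated-but-irrelevant text does not become padding. A minimal sketch, assuming a generic `embed(text) -> vector` function and a hand-tuned similarity threshold; both are assumptions rather than anything FLenQA prescribes, and (as the last related note below cautions) similarity is itself an imperfect proxy for relevance.

```python
def assemble_context(query_vec, passages, embed, sim_threshold=0.35, max_passages=5):
    """Keep only the passages most similar to the query so that
    retrieved-but-irrelevant text does not pad the prompt."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(y * y for y in b) ** 0.5
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    scored = [(cosine(query_vec, embed(p)), p) for p in passages]
    scored.sort(key=lambda sp: sp[0], reverse=True)
    kept = [p for score, p in scored[:max_passages] if score >= sim_threshold]
    return "\n\n".join(kept)
```

This trades recall for a shorter, denser context; given the degradation already visible at 3000 tokens, dropping marginal passages may be a net win.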
A complementary training-time finding complicates this picture. "Longer Context, Deeper Thinking" (2025) shows that models trained with stronger long-context capacity (128k vs. 32k) consistently achieve higher accuracy on mathematical reasoning benchmarks (MATH500 and AIME), even when test-time inputs are short. Long-context training benefits reasoning as a foundation, not just as capacity for processing long inputs. The two findings are compatible: long-context training may improve base reasoning capability, while at inference time longer inputs still introduce the noise and distraction effects that degrade it. Source: Arxiv/Evaluations.
Source: Reasoning Logic Internal Rules
Related concepts in this collection
- Why do language models fail at temporal reasoning in complex tasks? Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. This explores what task features trigger reasoning failures and whether the competence is genuinely lost or masked by surface-level patterns. (Relation: complementary failure axis, task complexity vs. input length.)
- Why does vanilla RAG produce shallow and redundant results? Standard RAG systems get stuck in a single semantic neighborhood because their initial query determines what documents are discoverable. The question asks whether fixed retrieval strategies fundamentally limit knowledge depth compared to iterative exploration. (Relation: RAG retrieval adds length; length degrades reasoning.)
- Does more thinking time actually improve LLM reasoning? The intuition that extended thinking helps LLMs reason better seems obvious, but what does the empirical data actually show when we test it directly? (Relation: another dimension where "more" (tokens) ≠ "better" (reasoning).)
- Can long-context models resolve retriever-reader imbalance? Traditional RAG systems force retrievers to find precise passages because readers had small context windows. Do modern long-context LLMs change what architecture makes sense? (Relation: challenges the long-context solution; reader burden increases with length but reasoning degrades.)
- Do vector embeddings actually measure task relevance? Vector embeddings rank semantic similarity, but RAG systems need topical relevance. When these diverge, as with king/queen versus king/ruler, does similarity-based retrieval fail in production? (Relation: compounds the length problem; semantic retrieval returns associated-but-irrelevant documents, creating exactly the irrelevant padding that FLenQA shows degrades reasoning, so imprecise retrieval directly produces the input-length degradation.)
Original note title: reasoning performance degrades with input length even far below context window limits