LLM Reasoning and Architecture · Knowledge Retrieval and RAG

Does reasoning ability actually degrade with longer inputs?

Explores whether modern language models can maintain reasoning performance when processing long contexts, and whether technical capacity translates to practical reasoning capability over extended text.

Note · 2026-02-22 · sourced from Reasoning Logic Internal Rules

The FLenQA benchmark exposes a critical gap between technical context-window capacity and actual reasoning capacity over long inputs. By embedding simple reasoning tasks (True/False questions requiring integration of two pieces of information) within irrelevant padding text of varying lengths, the paper shows that reasoning accuracy drops from 0.92 to 0.68 at just 3,000 tokens, a length far below any modern model's context window.
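The construction can be sketched in a few lines. Everything below is illustrative, not the paper's actual pipeline: the helper name, the crude tokens-per-sentence estimate, and the random fact placement are assumptions standing in for FLenQA's controlled design.

```python
import random

def build_padded_instance(fact_a, fact_b, question, padding_sentences,
                          target_tokens, tokens_per_sentence=20):
    """Embed two required facts inside irrelevant padding, FLenQA-style.

    Illustrative sketch only: the real benchmark controls the padding
    source (similar vs. dissimilar text) and the placement of the two
    facts as experimental variables rather than randomizing them.
    """
    # Rough number of padding sentences needed to hit the target length.
    n_pad = max(0, target_tokens // tokens_per_sentence - 2)
    body = random.choices(padding_sentences, k=n_pad)
    # Place the two key facts at random positions within the padding.
    for fact in (fact_a, fact_b):
        body.insert(random.randrange(len(body) + 1), fact)
    return " ".join(body) + "\n\nTrue or False: " + question
```

Answering correctly always requires only the two facts; the padding contributes nothing, which is what isolates length as the variable.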

Three findings make this particularly concerning:

1. The degradation is task-agnostic. Regardless of whether padding text is similar or dissimilar to the reasoning content, and regardless of where the information pieces are embedded within the context, similar degradation trends appear. The failure is not about content interference but about attention dilution over length.

2. Next-word prediction performance is uncorrelated with reasoning performance. Models that maintain strong perplexity on long inputs still fail at reasoning over those inputs. This means language-modeling benchmarks on long contexts are misleading indicators of actual long-context utility: a model can "understand" the text (predict tokens well) while failing to reason over it.

3. CoT does not mitigate proportionally. Chain-of-thought prompting increases accuracy roughly uniformly across context lengths but does not close the length-induced gap. The degradation persists under CoT because the bottleneck is in information retrieval from context, not in reasoning over retrieved information.
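The degradation curve behind these findings comes from a simple measurement loop. In this sketch, `model_fn` is a stand-in for any LLM call that returns a "True"/"False" answer; it is not a specific API.

```python
def accuracy_by_length(model_fn, instances_by_length):
    """Measure True/False accuracy at each padded input length.

    instances_by_length maps a token length to a list of
    (prompt, gold_label) pairs; model_fn(prompt) returns a label.
    On FLenQA-style data, this curve is what falls from ~0.92
    toward ~0.68 as length grows, even though every instance
    contains the facts needed to answer.
    """
    results = {}
    for length, instances in sorted(instances_by_length.items()):
        correct = sum(model_fn(prompt) == label
                      for prompt, label in instances)
        results[length] = correct / len(instances)
    return results
```

Running the same loop with and without a chain-of-thought prompt separates a uniform accuracy lift from an actual closing of the length-induced gap, which is the distinction finding 3 draws.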

This is a complementary mechanism to "Why do language models fail at temporal reasoning in complex tasks?". That failure is about task complexity; this one is about input noise. Together they define a two-dimensional reliability surface: reasoning degrades with both task complexity and input length, and the two dimensions are independent.

The implication for RAG systems is direct: retrieved documents add to input length, and if that length includes irrelevant passages (as it typically does), reasoning over the retrieved content degrades even when the relevant information is present. Combined with "Why does vanilla RAG produce shallow and redundant results?", the length degradation explains part of why static retrieval fails: more retrieved documents means more padding, which means worse reasoning.

A complementary training-time finding complicates this picture. "Longer Context, Deeper Thinking" (2025) shows that models with stronger long-context capacity (128k vs. 32k) consistently achieve higher accuracy on mathematical reasoning benchmarks (MATH500 and AIME), even when test-time inputs are short. Long-context training benefits reasoning as a foundation, not just as a means of processing long inputs. The two findings are compatible: long-context training improves the base reasoning capability, while at inference time longer inputs still introduce the noise and distraction effects that degrade it. Source: Arxiv/Evaluations.


Original note title: reasoning performance degrades with input length even far below context window limits