What causes reasoning quality to degrade during long research tasks?
This explores why a reasoning model's quality drops over the course of a long task. The corpus says the culprit is not a shortage of compute but several distinct failure modes that compound as length grows.
The most counterintuitive finding in the corpus is that more thinking is not more reasoning. Accuracy follows an inverted U: it climbs to a sweet spot, then falls. One study watched benchmark accuracy slide from 87.3% to 70.3% as thinking tokens grew from ~1,100 to ~16,000 (Does more thinking time always improve reasoning accuracy?). The optimal chain length grows with task difficulty but shrinks as a model gets more capable, so simplicity is something reward signals push toward, not a limitation (Why does chain of thought accuracy eventually decline with length?). So part of the answer is simply that long tasks invite overthinking past the point where extra thinking helps.
The mechanism behind that decline is sneaky. Extended thinking doesn't reason better; it samples wider. Longer traces help only by widening the output distribution so that it covers the right answer more often. Push past a critical threshold and the distribution gets too diffuse, and accuracy collapses (Does extended thinking actually improve reasoning or just increase variance?). That same variance shows up as self-revision errors and inflated output noise (When does thinking too much actually hurt reasoning?). In other words, the model isn't thinking its way to a worse answer so much as scattering.
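A toy calculation can make the coverage argument concrete. This is only an illustrative sketch, not the model used in any of the cited notes: it assumes the model's answers are normally distributed around a wrong mode, and every name and number here (`coverage`, `mode`, `correct`, `tol`, the spreads tried) is hypothetical.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def coverage(spread: float, mode: float = 0.0,
             correct: float = 2.0, tol: float = 0.5) -> float:
    """Probability that one sample from Normal(mode, spread) lands
    within `tol` of the correct answer.

    Assumption: the model's most likely answer (`mode`) is off target,
    so a wider distribution is the only way to reach `correct`.
    """
    upper = (correct + tol - mode) / spread
    lower = (correct - tol - mode) / spread
    return normal_cdf(upper) - normal_cdf(lower)


# Widen the distribution step by step and watch coverage of the right answer.
for spread in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"spread={spread:>4}: coverage={coverage(spread):.3f}")
```

Among the five spreads tried, coverage is highest at 2.0 and lower on both sides. A narrow distribution almost never reaches the correct answer, and a very wide one spreads its mass too thin. That is the same inverted-U shape, produced by sampling width alone with no change in how well the toy "reasons".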
A second family of failures is structural, not quantitative. Reasoning models 'wander like tourists, not scientists': they explore invalid branches and abandon promising paths before finishing them (Why do reasoning models abandon promising solution paths?). This premature path-switching is common enough that simply penalizing thought-transition tokens at decoding time, with no retraining, recovers accuracy on hard math (Do reasoning models switch between ideas too frequently?). The fact that a cheap intervention works tells you the better answer was reachable all along; the model just bailed too early.
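A decoding-time penalty of this kind can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation from the cited note (the sources list calls that method TIP). The function name, the penalty size, the `min_thought_steps` window, and the token ids are all hypothetical.

```python
from typing import List, Set


def penalize_switch_tokens(logits: List[float],
                           switch_token_ids: Set[int],
                           steps_in_current_thought: int,
                           penalty: float = 3.0,
                           min_thought_steps: int = 50) -> List[float]:
    """Return a copy of `logits` with thought-transition tokens made less
    likely while the current line of thought is still young.

    Assumptions (hypothetical, for illustration only):
    - `switch_token_ids` holds ids of tokens that open a new approach,
      such as the first token of "Alternatively".
    - The penalty is a fixed amount subtracted from the logit.
    - It applies only for the first `min_thought_steps` decoding steps
      of a thought, after which switching is unconstrained.
    """
    if steps_in_current_thought >= min_thought_steps:
        return list(logits)
    return [
        score - penalty if idx in switch_token_ids else score
        for idx, score in enumerate(logits)
    ]


# Token 1 stands in for a thought-transition token in a 3-token vocabulary.
logits = [1.0, 2.5, 0.3]
early = penalize_switch_tokens(logits, {1}, steps_in_current_thought=10)
late = penalize_switch_tokens(logits, {1}, steps_in_current_thought=80)
print(early)  # [1.0, -0.5, 0.3]
print(late)   # [1.0, 2.5, 0.3]
```

Early in a thought the transition token drops from the top choice to the bottom one, so the model keeps developing its current path. Later the logits pass through unchanged. No weights change at all, which is why an intervention like this needs no retraining.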
Then there's the length of the *input* itself, separate from the length of the thinking. Padding a problem with irrelevant context tanks reasoning accuracy from 92% to 68% at just 3,000 tokens, far below the context window, and chain-of-thought prompting doesn't rescue it (Does reasoning ability actually degrade with longer inputs?). Long research tasks accumulate exactly this kind of distracting material. And when a task is hard or under-specified, models fall back on semantic priors instead of logic (Do harder reasoning tasks trigger more semantic bias?), or churn out redundant reasoning because they were trained to produce steps but never taught when to stop or to flag an ill-posed question (Why do reasoning models overthink ill-posed questions?).
Two notes worth carrying away reframe the whole problem. First, breakdowns track *unfamiliarity* more than complexity: models pattern-match to instances they've seen, so a long chain succeeds or fails based on whether it resembles training data, not on its length per se (Do language models fail at reasoning due to complexity or novelty?). Second, the same thinking mechanism can help or hurt depending on training: vanilla models use extended thinking to spiral into self-doubt, while RL training redirects it into productive analysis (Does extended thinking help or hurt model reasoning?). The practical upshot is that quality on long tasks is best protected by verifying the *process* as it unfolds. Checking intermediate states rather than only the final answer lifted task success from 32% to 87%, because most failures are process violations that final-answer scoring never sees (Where do reasoning agents actually fail during long traces?).
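The difference between scoring the final answer and verifying the process can be shown with a small sketch. It is a hypothetical illustration, not the verification setup from the cited note: the `Step` record, the `verify_trace` helper, and the bank-balance rule are all invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    """One intermediate step of an agent trace (hypothetical structure)."""
    description: str
    state: Dict[str, int]


# A check inspects one intermediate state and returns a violation
# message, or None if the state is acceptable.
Check = Callable[[Dict[str, int]], Optional[str]]


def verify_trace(steps: List[Step],
                 checks: List[Check]) -> Optional[Tuple[int, str]]:
    """Return (step index, message) for the first process violation,
    or None if every intermediate state passes every check."""
    for index, step in enumerate(steps):
        for check in checks:
            message = check(step.state)
            if message is not None:
                return index, message
    return None


def balance_non_negative(state: Dict[str, int]) -> Optional[str]:
    """Example policy: the balance must never go below zero."""
    if state["balance"] < 0:
        return f"balance went negative: {state['balance']}"
    return None


trace = [
    Step("withdraw 30", {"balance": 70}),
    Step("withdraw 100", {"balance": -30}),  # breaks the policy mid-trace
    Step("deposit 60", {"balance": 30}),
]

final_answer_ok = trace[-1].state["balance"] == 30
violation = verify_trace(trace, [balance_non_negative])
print(final_answer_ok)  # True
print(violation)        # (1, 'balance went negative: -30')
```

The final state matches the expected answer, so final-answer scoring passes this trace. The intermediate check flags step 1, which is the kind of process violation that only shows up when states are inspected as they are produced.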
Sources (12 notes)
Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.
Longer thinking traces improve accuracy through variance expansion—broader output distributions cover correct answers more often—not through better reasoning. Beyond a critical threshold, the distribution becomes too diffuse and accuracy drops, revealing the mechanism is sampling coverage, not genuine reasoning improvement.
Empirical studies demonstrate non-monotonic scaling in test-time reasoning: accuracy peaks at a critical thinking-token count, then declines sharply (87.3% to 70.3% as tokens scale from 1,100 to 16,000). Extended thinking inflates output variance and introduces self-revision errors rather than improving solution quality.
Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Content effects intensify as task difficulty increases—from NLI to syllogisms to Wason selection—in both humans and language models. As working capacity is exceeded, both systems fall back on semantic priors instead of logical form.
Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.
Vanilla models use thinking mode counterproductively, inducing self-doubt that degrades performance. RL training reverses this, transforming the same mechanism into beneficial gap analysis. Training mediates reasoning quality, not just quantity.
Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.