How do prior errors in context history amplify future mistakes in long tasks?
This explores the self-conditioning trap — how a model's own earlier mistakes, once they sit in its context window, bias it toward making more mistakes as a task runs long, and what (if anything) breaks the loop.
Earlier mistakes don't sit inertly in a model's context: they bias the next step toward repeating the pattern, so errors compound rather than wash out. The clearest statement of this is the finding that models degrade *non-linearly* once prior errors contaminate their context history (see "Do models fail worse when their own errors fill the context?"). The unsettling part isn't that a long task accumulates slip-ups; it's the feedback loop, in which a wrong token becomes the conditioning signal for the next wrong token. Scaling the model up doesn't rescue you. What does help is test-time compute: 'thinking' models that work out a fresh line of reasoning instead of letting the error-soaked transcript steer them, which reduces the effect.
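To make "compound rather than wash out" concrete, here is a minimal toy sketch. The model is an assumption made purely for illustration and is not taken from the cited note: the chance of an error at each step is taken to be a base rate plus a small feedback term for every error already in context. The function name and both parameters are hypothetical.

```python
# Toy model (an illustrative assumption, not the cited note's method):
# each error already in context slightly raises the chance of the next one.

def expected_errors(steps: int, base_rate: float, feedback: float) -> float:
    """Expected number of errors after `steps` steps.

    Per-step error probability is modelled as
        p = base_rate + feedback * (errors already in context),
    so by linearity of expectation E[t+1] = E[t] + base_rate + feedback * E[t].
    The cap at p = 1 is ignored, so this only makes sense while p stays small.
    """
    errors = 0.0
    for _ in range(steps):
        errors += base_rate + feedback * errors
    return errors


if __name__ == "__main__":
    for steps in (10, 50, 100):
        independent = expected_errors(steps, base_rate=0.01, feedback=0.0)
        conditioned = expected_errors(steps, base_rate=0.01, feedback=0.02)
        print(f"{steps:>3} steps: independent={independent:.2f}  "
              f"self-conditioned={conditioned:.2f}")
```

With `feedback=0.0` the expected error count grows in a straight line with task length. With any positive feedback it grows geometrically, which is one simple way a task could show the non-linear degradation described above.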
Why would a model treat its own past output as a cue to keep going wrong? Two adjacent notes give the mechanism. First, a large share of chain-of-thought errors (up to 67% in the STIM analysis) turn out to be *local* memorization: the model predicts the next token mostly from the immediately preceding tokens rather than from the actual problem, especially as complexity rises (see "Where do memorization errors arise in chain-of-thought reasoning?"). That's exactly the substrate self-conditioning needs: if recent tokens drive the next token, then recent *wrong* tokens drive the next wrong token. Second, models often fail to integrate what's actually in their context when a strong prior pulls the other way, so context gets overridden, not absorbed (see "Why do language models ignore information in their context?"). Put those together and you see the trap from both ends: the model over-trusts its recent local output and under-trusts the corrective signal.
What's striking is how early the rot sets in. Reasoning accuracy can fall from 92% to 68% with only about 3,000 tokens of *padding*, far below any context-window limit and even with chain-of-thought prompting (see "Does reasoning ability actually degrade with longer inputs?"). So 'long task' doesn't mean 'near the memory ceiling.' Mere length is corrosive on its own, and errors arriving inside that length amplify the effect. One line of work even reframes the long-context bottleneck as a *compute* problem rather than a memory one: the model can't consolidate everything it's seen into usable internal state fast enough (see "Is long-context bottleneck really about memory or compute?"). That is the flip side of why pure scaling doesn't fix self-conditioning while extra test-time deliberation reduces it.
The more interesting turn is what the corpus says about *escaping* the loop, because the fixes are mostly about how you treat past failures rather than how big the model is. The most counterintuitive: most long-trace failures are process violations, not wrong final answers, so verifying intermediate steps as they're generated lifted task success from 32% to 87%, catching errors that final-answer scoring sails right past (see "Where do reasoning agents actually fail during long traces?"). In other words, you intercept the bad token before it becomes next turn's conditioning signal. Agent-memory approaches attack the same problem from the storage side: Reflexion has agents write a verbal self-diagnosis after a failure and keep it as episodic memory, so the *lesson* persists while the contaminating transcript doesn't (see "Can agents learn from failure without updating their weights?"). SkillRL sharpens this into an asymmetry that mirrors human experts: store successes as concrete demonstrations, but compress failures into abstracted lessons rather than replaying them verbatim. That both saves context and dodges the degradation that uniform 'keep everything' consolidation produces (see "Should successful and failed episodes be processed differently?").
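The two ideas in this paragraph, checking each step before it enters context and keeping failures only as short lessons, can be sketched together. This is a minimal illustration under stated assumptions, not the implementation used by Reflexion, SkillRL, or the verification work: `Memory`, `run_with_step_checks`, and the `toy_*` stubs are hypothetical names, the stubs stand in for a real model and verifier, and lessons are written per rejected step here for brevity rather than after a whole failed episode.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Memory:
    demonstrations: List[str] = field(default_factory=list)  # successful traces, kept verbatim
    lessons: List[str] = field(default_factory=list)         # failures, kept only as short lessons


def run_with_step_checks(
    propose: Callable[[List[str]], str],
    verify: Callable[[str], bool],
    summarize_failure: Callable[[str], str],
    memory: Memory,
    max_steps: int,
    max_retries: int = 2,
) -> List[str]:
    """Build a trace in which only verified steps ever enter the context."""
    context: List[str] = []
    for _ in range(max_steps):
        for _attempt in range(max_retries + 1):
            # The proposer sees stored lessons plus verified steps, never a rejected step.
            step = propose(memory.lessons + context)
            if verify(step):
                context.append(step)
                break
            # Rejected: keep an abstracted lesson and drop the raw failed step.
            lesson = summarize_failure(step)
            if lesson not in memory.lessons:
                memory.lessons.append(lesson)
        else:
            return context  # no verified step within the retry budget; stop early
    memory.demonstrations.append("\n".join(context))  # success stored verbatim
    return context


# Hypothetical stand-ins for a model and a verifier.
def toy_propose(visible: List[str]) -> str:
    n = sum(1 for item in visible if item.startswith("step")) + 1
    if n == 3 and "lesson: double-check step 3" not in visible:
        return "step 3: WRONG"
    return f"step {n}: ok"


def toy_verify(step: str) -> bool:
    return step.endswith("ok")


def toy_summarize(step: str) -> str:
    return "lesson: double-check step 3"


if __name__ == "__main__":
    memory = Memory()
    trace = run_with_step_checks(toy_propose, toy_verify, toy_summarize, memory, max_steps=4)
    print(trace)
    print(memory.lessons)
```

The design choice worth noticing is what the proposer is allowed to see: verified steps and compact lessons go in, while the rejected step itself is discarded, so it cannot act as the conditioning signal for the next step.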
The thread worth leaving with: the danger isn't that the model fails once; it's that a raw failure left sitting in context becomes the prompt for the next failure. So the techniques that work all do the same thing in different costumes: they keep errors from re-entering the reasoning stream as conditioning signal. Verify mid-process and stop the bad step early (see "Where do reasoning agents actually fail during long traces?"); abstract a failure into a lesson instead of replaying the wreckage (see "Should successful and failed episodes be processed differently?"); or spend test-time compute to reason fresh rather than inherit a poisoned transcript (see "Do models fail worse when their own errors fill the context?"). A bigger model won't save you; a model that doesn't let its own mistakes drive the next token will.
Sources (8 notes)
Do models fail worse when their own errors fill the context? Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
Where do memorization errors arise in chain-of-thought reasoning? The STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization, based on preceding tokens, accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Why do language models ignore information in their context? Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Does reasoning ability actually degrade with longer inputs? FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Is long-context bottleneck really about memory or compute? Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.
Where do reasoning agents actually fail during long traces? Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
Can agents learn from failure without updating their weights? Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
Should successful and failed episodes be processed differently? SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.