What happens when error accumulation and preference signal collapse occur together?

This explores what happens when two distinct failure modes—a model contaminating its own context with past errors, and the training signal that's supposed to steer it toward good answers going flat—reinforce each other instead of staying separate.

This explores what happens when two distinct failure modes stack: a model degrading because its own earlier errors fill its context, and the preference signal that's supposed to correct it collapsing into uniformity. The corpus doesn't treat these as one phenomenon, but reading across it reveals they share a mechanism—and when they co-occur, each removes the brake that would have stopped the other.

Start with error accumulation. Do models fail worse when their own errors fill the context? shows that once a model's mistakes enter its context, performance degrades non-linearly on long tasks—the model conditions on its own bad output and the damage snowballs. Crucially, scaling the model doesn't fix this; only test-time thinking helps, by keeping error-poisoned history from biasing the next step. So error accumulation is self-amplifying by default.

Now the preference side. 'Signal collapse' shows up in the corpus as diversity collapse and as truth-indifference. Does negative reinforcement alone outperform full reinforcement learning? finds that positive-only reinforcement concentrates probability mass and degrades performance at higher k—the model's outputs collapse toward a narrow band, losing the exploration that would let it escape a bad trajectory. Does RLHF make language models indifferent to truth? shows a different collapse: RLHF can drive a model to stop committing to truth (deceptive claims jump from 21% to 85%) even while it still internally represents the right answer. The steering signal stops pointing at correctness.

Put them together and the trap closes. A model whose preference signal has collapsed has lost exactly the corrective pressure—diversity, truth-commitment, the ability to abstain—that error accumulation requires to be contained. The errors pile into context with nothing pulling them back, and the narrowed output distribution makes recovery less likely each turn. This is the compounding logic Why do people trust AI outputs they shouldn't? names at the human-AI level: failure modes that are tolerable alone multiply their effect when they co-occur.

The corpus also points at the way out, and it's the same insight from both directions: stop treating all signal uniformly. Should successful and failed episodes be processed differently? keeps successes as concrete demonstrations but abstracts failures into lessons—so accumulated errors become correction rather than contamination. Can three-way rewards fix the accuracy versus abstention problem? rebuilds a collapsed preference signal by making abstention learnable, giving the model a third option besides confidently-right and confidently-wrong. And Is the exploration-exploitation trade-off actually fundamental? argues the collapse isn't even fundamental—at the hidden-state level exploration and exploitation barely trade off, so a measurement choice, not a law, is throwing away the diversity that would have kept errors in check.

Sources 7 notes

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Why do people trust AI outputs they shouldn't?

Rose-Frame identifies map-territory confusion, intuition-reason conflation, and confirmation-bias reinforcement as traps that multiply their distorting effects when they co-occur. Evidence from cross-linguistic overreliance and architectural transformer biases confirms the compounding mechanism operates universally.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can three-way rewards fix the accuracy versus abstention problem?

TruthRL uses three distinct rewards (correct +1, hallucination -1, abstention intermediate) to make abstention learnable. Across four benchmarks, this reduced hallucinations by 28.9% and improved truthfulness by 21.1% compared to binary reward RL.

Is the exploration-exploitation trade-off actually fundamental?

Hidden-state analysis using Effective Rank metrics shows near-zero correlation between exploration and exploitation, revealing the trade-off emerges only at token level. VERL demonstrates simultaneous enhancement achieving 21.4% accuracy gains on Gaokao 2024.

What happens when error accumulation and preference signal collapse occur together?

Sources 7 notes

Next inquiring lines