What mechanisms cause overly hard samples to degrade prior model performance?

This explores why training on problems that are too hard for a model can actively make it worse — what's actually breaking inside the model, not just whether performance drops.

The failure is counterintuitive: feeding a model problems beyond its reach doesn't just waste training, it can corrode capabilities the model already had. The corpus points to several distinct culprits, and they're worth separating because they suggest different fixes.

The most direct mechanism is reward gaming under reinforcement learning. When problems are nearly impossible, a model almost never solves them honestly, so the rare accidental successes get treated as gold. Group-relative advantage normalization amplifies these flukes into high-advantage training signal, teaching the model to repeat answers and skip computation rather than reason. Crucially, these degenerate shortcuts don't stay contained; they contaminate pre-existing skills (Do overly hard RLVR samples actually harm model capabilities?). A related collapse happens at the distribution level: RL tends to amplify a single pretraining format within the first epoch and suppress the alternatives, narrowing the model's range, and which format wins depends on model scale, not necessarily on which one performs best (Does RL training collapse format diversity in pretrained models?).
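To make the amplification concrete, here is a minimal sketch of group-relative advantage normalization. It assumes the common GRPO-style form (each rollout's reward minus the group mean, divided by the group standard deviation); the notes do not spell out the exact formula, so treat the function name, the `eps` term, and the group size of 16 as illustrative assumptions.

```python
import math
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumed form: (r - group mean) / (group std + eps). This is a sketch,
    not the exact normalization of any particular training setup.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Nearly impossible problem: 1 accidental success out of 16 rollouts.
hard = [1.0] + [0.0] * 15
# Medium problem: 8 successes out of 16 rollouts.
medium = [1.0] * 8 + [0.0] * 8

hard_adv = group_relative_advantages(hard)
medium_adv = group_relative_advantages(medium)

print(f"advantage of the lone success on the hard problem: {hard_adv[0]:.3f}")
print(f"advantage of a success on the medium problem:      {medium_adv[0]:.3f}")
```

Under this assumed form, a lone success in a group of G rollouts receives an advantage of about sqrt(G - 1), so the rarer the success, the harder the update pushes toward whatever produced it, whether that was sound reasoning or a lucky shortcut.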

A second mechanism is that 'too hard' isn't a fixed property of the sample; it's relative to where the model currently is. A sample's teaching value comes from the interaction between its difficulty and the model's ability, and the productive band of medium-difficulty problems drifts during training, sometimes within a few steps (How does model ability change what samples teach?). So a sample that's merely challenging at one point in training can be genuinely degrading at another, which is exactly why static difficulty filters go stale. This connects to the older data-pruning literature, where ranking examples by difficulty lets you beat power-law scaling, with the catch that the right examples to keep depend on how much data and capability you already have (Can we prune training data without hurting model performance?).
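The contrast between a static and a re-estimated difficulty filter can be sketched as follows. Everything here is hypothetical: the sample IDs, the pass rates, and the 0.2 to 0.8 band are made-up values for illustration, and using the model's current pass rate as the difficulty estimate is an assumption rather than something the notes prescribe.

```python
def select_productive(pass_rates, low=0.2, high=0.8):
    """Keep samples whose current pass rate falls inside the medium band.

    pass_rates maps a sample ID to the fraction of recent rollouts that
    solved it. The band limits are illustrative, not recommended values.
    """
    return sorted(sid for sid, p in pass_rates.items() if low <= p <= high)

# Hypothetical pass rates measured at two points in training.
early = {"a": 0.90, "b": 0.50, "c": 0.05}
later = {"a": 1.00, "b": 0.90, "c": 0.30}

static_filter = select_productive(early)   # computed once, then frozen
dynamic_filter = select_productive(later)  # re-estimated from the current model

print("frozen early filter:", static_filter)
print("re-estimated filter:", dynamic_filter)
```

In this toy case the frozen filter keeps training on a sample the model has mostly mastered and never admits the one that has just entered the productive band, which is the staleness problem in miniature.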

Third, there's a self-reinforcing contamination channel that operates at inference but compounds during multi-step work: once a model's own errors fill its context, it conditions on those errors and fails worse, non-linearly, over long horizons. Scaling the model doesn't fix it; only test-time 'thinking' compute reduces it, by keeping the bad context from biasing reasoning (Do models fail worse when their own errors fill the context?). Hard samples that produce lots of wrong intermediate steps feed this loop directly.

Underlying all of this is a quieter mechanism: fine-tuning can damage the substrate where knowledge lives. Direct weight updates corrupt knowledge storage in lower layers, whereas decoding-time proxy-tuning leaves base weights untouched and preserves far more (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). The same theme appears as KL drift: models pushed far from their base distribution lose plasticity and stall when domains change, while staying close preserves the ability to keep learning (Does staying close to the base model preserve learning ability?). The unifying picture is that overly hard samples push the model hard and in the wrong direction at once, producing large drift toward degenerate strategies, which is precisely the combination that overwrites what was already working.

One caveat worth carrying: not every difficulty-induced change is damage. Under out-of-distribution load, models sparsify their activations in a systematic way that actually stabilizes performance (Do language models sparsify their activations under difficult tasks?), and removing 'spurious' cues can hurt rather than help when the real task is integrating conflicting signals (Why does removing spurious cues sometimes hurt model performance?). The line between productive difficulty and destructive difficulty is exactly what makes this hard to manage.
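For readers who want a handle on what "KL drift" measures, the following is a minimal sketch that compares a tuned model's next-token distribution with the base model's on a single context. The three-token distributions are invented for illustration, and measuring KL from the tuned distribution to the base distribution is an assumed convention; the notes do not state which direction or estimator the cited work uses.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same tokens.

    eps guards against log(0); terms where p is zero contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

# Hypothetical next-token distributions on one context.
base = [0.50, 0.30, 0.20]        # base model
tuned_near = [0.45, 0.33, 0.22]  # tuned model that stayed close
tuned_far = [0.05, 0.05, 0.90]   # tuned model that collapsed onto one token

drift_near = kl_divergence(tuned_near, base)
drift_far = kl_divergence(tuned_far, base)

print(f"drift of the near model: {drift_near:.4f} nats")
print(f"drift of the far model:  {drift_far:.4f} nats")
```

In practice such a quantity would be averaged over many contexts and tokens; the point of the sketch is only that a policy collapsing onto a narrow strategy shows up as a much larger divergence from the base distribution.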

Sources (9 notes)

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

How does model ability change what samples teach?

A sample's learning value depends on the interaction between its difficulty and the model's current ability, not difficulty alone. The productive band of medium-difficulty problems drifts during training, making static difficulty estimates obsolete within steps.

Can we prune training data without hurting model performance?

Research shows that ranking training examples by difficulty (EL2N, forgetting, memorization) and removing easy ones beats power-law scaling laws. On CIFAR-10, 50% of data was pruned without accuracy loss, and self-supervised metrics scaled the approach to ImageNet.

Do models fail worse when their own errors fill the context?

Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.
