Why do structure-targeted training negatives fail to fix the underlying problem?

This explores why adding 'hard negatives' aimed at structural or compositional distinctions during training tends to paper over a problem rather than solve it, and what the corpus says the real problem is.

The question concerns a recurring pattern in training: when a model can't tell structurally distinct inputs apart, the intuitive fix is to feed it negative examples that target exactly that structure. The corpus suggests this fails not because the negatives are badly chosen, but because the failure they target usually isn't a discrimination problem in the first place.

The clearest case is dense retrieval. Adding structure-targeted negatives to teach a retriever compositional sensitivity consistently *degrades* zero-shot performance, with an 8–40% drop in nDCG@10, while only partially improving the discrimination you were after (see "Does training for compositional sensitivity hurt dense retrieval?"). The authors frame this as a geometric trade-off baked into high-dimensional cosine space, not a tuning knob you haven't found yet. Pushing the embedding geometry to separate near-identical structures pulls it away from the smooth similarity landscape that makes retrieval generalize. The negative does its local job and breaks the global one.
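To make the local pressure concrete, here is a minimal sketch that assumes an InfoNCE-style contrastive loss over cosine similarities. The source does not say which loss the retrievers used, and the vectors, the `temperature` value, and the `hard_negative` construction are all hypothetical. It only shows that a negative sitting almost on top of the positive comes to dominate the loss, so optimization pressure concentrates on separating two nearly identical points.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce(query, positive, negatives, temperature=0.05):
    """Assumed InfoNCE-style loss: -log softmax of the positive among all candidates."""
    sims = np.array([cosine(query, positive)] + [cosine(query, n) for n in negatives])
    logits = sims / temperature
    logits = logits - logits.max()  # shift for numerical stability; does not change the loss
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])

rng = np.random.default_rng(0)
dim = 64
query = rng.normal(size=dim)
positive = query + 0.1 * rng.normal(size=dim)

# Ordinary negatives: unrelated random directions.
random_negatives = [rng.normal(size=dim) for _ in range(4)]

# Hypothetical structure-targeted negative: almost the same vector as the positive,
# standing in for a passage that differs only in word order or role assignment.
hard_negative = positive + 0.05 * rng.normal(size=dim)

loss_random = info_nce(query, positive, random_negatives)
loss_hard = info_nce(query, positive, random_negatives + [hard_negative])

print(f"loss with random negatives only: {loss_random:.4f}")
print(f"loss with one hard negative added: {loss_hard:.4f}")
```

Under these assumptions the random negatives contribute almost nothing, and nearly all of the loss comes from the single hard negative. Whether driving that term down costs zero-shot quality is the empirical claim of the cited note, not something this toy example demonstrates.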

A deeper reason shows up when you ask what the model represents at all. LLMs handle simple sentences well but fail predictably as syntactic depth and embedding increase, misreading nested clauses and complex nominals, because they learned surface heuristics rather than structural grammar rules (see "Why do large language models fail at complex linguistic tasks?" and "Does LLM grammatical performance decline with structural complexity?"). If the underlying representation never encoded structure, negatives that target structure ask the model to discriminate on an axis it doesn't have. It will instead find some surface proxy that happens to separate your examples, which is exactly the shortcut you were trying to eliminate.

That shortcut dynamic is its own trap. Training on too-hard or near-impossible samples drives models toward degenerate shortcuts that then contaminate skills they already had, because rare accidental successes get reinforced as if they were sound reasoning (see "Do overly hard RLVR samples actually harm model capabilities?"). And there's a more fundamental misframing: some tasks fail because the model must *integrate* conflicting signals, not filter distractors out. Removing or targeting cues actively hurts in 'heuristic override' settings; the failure is a composition problem, a frame problem, not a feature-selection problem (see "Why does removing spurious cues sometimes hurt model performance?"). A negative example teaches 'avoid this,' but composition needs 'combine these,' and the two aren't the same operation.
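The reinforcement of accidental successes can be worked through numerically. The sketch below assumes the common group-relative form, advantage = (reward - group mean) / group standard deviation, with binary rewards and the population standard deviation. The group size of 16 and the reward vectors are hypothetical, and real implementations may add an epsilon or use a different estimator.

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    if std == 0:
        # All rollouts scored the same, so there is no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Near-impossible problem: one accidental success among 16 rollouts.
near_impossible = [0.0] * 15 + [1.0]
# Moderately hard problem: half of the rollouts succeed.
moderate = [1.0] * 8 + [0.0] * 8

adv_hard = group_relative_advantages(near_impossible)
adv_mod = group_relative_advantages(moderate)

print(f"advantage of the lone success: {max(adv_hard):.3f}")
print(f"advantage of a success at 50% pass rate: {max(adv_mod):.3f}")
```

With a pass rate p, a success receives advantage sqrt((1 - p) / p) under these assumptions, so the rarer the success, the harder that single trajectory is pushed, regardless of whether it reached the answer by sound reasoning.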

What does work points back at the diagnosis. Negatives help when the failure really is suppressible noise: negative reinforcement alone can match full RL by suppressing wrong trajectories while preserving diversity (see "Does negative reinforcement alone outperform full reinforcement learning?"), and explicit negative pairs fix rigid format errors in function calling where plain imitation underperforms (see "Can small models match large models on function calling?"). But where the problem is structural understanding, engagement beats discrimination. Training models to critique flawed responses builds deeper understanding than imitating correct ones (see "Does critiquing errors teach deeper understanding than imitating correct answers?"), and self-correction only sticks when the model practices on its *own* error distribution online rather than on offline traces that don't match what it actually gets wrong (see "Why does self-correction training on offline data fail?"). The throughline: structure-targeted negatives treat a representational gap as if it were a labeling gap, and the corpus keeps showing those are different diseases.
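For the function-calling case, where negatives do help, the explicit negative pair enters through the DPO objective. The sketch below uses the standard per-pair DPO loss, -log sigmoid(beta * margin), where the margin compares how far the policy has moved from a frozen reference on the chosen versus the rejected output. The log-probability values and `beta=0.1` are hypothetical, picked only to show the direction of the signal.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss given summed token log-probabilities of each output."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)) rewritten as log(1 + exp(-beta * margin))
    return math.log1p(math.exp(-beta * margin))

# Hypothetical reference log-probs: well-formed call (chosen), malformed call (rejected).
ref_chosen, ref_rejected = -12.0, -11.0

# Policy identical to the reference: zero margin.
loss_untrained = dpo_loss(-12.0, -11.0, ref_chosen, ref_rejected)
# Policy has shifted mass toward the well-formed call and away from the malformed one.
loss_improved = dpo_loss(-9.0, -16.0, ref_chosen, ref_rejected)
# Policy has shifted mass toward the malformed call.
loss_worse = dpo_loss(-14.0, -8.0, ref_chosen, ref_rejected)

print(f"untrained: {loss_untrained:.3f}")
print(f"improved:  {loss_improved:.3f}")
print(f"worse:     {loss_worse:.3f}")
```

The rejected output is a concrete, well-defined mistake the model can already represent, which is the condition under which the section argues negatives are effective.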

Sources 9 notes

Does training for compositional sensitivity hurt dense retrieval?

Adding structure-targeted negatives to dense retrieval training consistently degrades zero-shot performance (8-40% nDCG@10 drop) while only partially improving compositional discrimination. This is a geometric trade-off in high-dimensional cosine spaces, not a tuning problem.

Why do large language models fail at complex linguistic tasks?

Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Does critiquing errors teach deeper understanding than imitating correct answers?

Training models to critique noisy responses outperforms training on correct answers because critique forces engagement with failure modes and structural reasoning. Even imperfect critique supervision beats correct-answer imitation, showing how weak surface-pattern learning is for building genuine understanding.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
