How does removing a spurious cue change LLM performance?

This explores what happens when you strip a 'spurious' surface cue out of an LLM's input, and the answer turns the usual machine-learning intuition on its head.

The surprising headline is that performance often gets *worse*, not better. In classic shortcut learning, a spurious cue is a crutch, a shortcut the model leans on instead of doing the real work, so removing it is supposed to force honest reasoning and improve generalization. But on heuristic-override tasks, "Why does removing spurious cues sometimes hurt model performance?" finds the opposite: removing the cue degrades the model. The reason is that the model isn't *filtering* a distractor; it's trying to *compose* conflicting signals into one answer. The cue was load-bearing, not decorative. The failure is a frame problem (figuring out which signals matter and how they combine) rather than a feature-selection problem. So 'remove the spurious thing and watch it improve' quietly stops being true.
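To make the 'load-bearing cue' idea concrete, here is a minimal toy sketch of a cue-ablation comparison. Everything in it is an illustrative assumption, not the setup of any cited note: the walk-or-drive task, the 2 km threshold, and the names `Example`, `gold`, `composing_predictor`, and `ablate_cue` are all hypothetical. It only shows the mechanism: when the right answer requires combining the cue with a goal-level constraint, deleting the cue removes information the answer depends on.

```python
from typing import Callable, List, NamedTuple, Optional


class Example(NamedTuple):
    distance_km: Optional[float]  # surface cue (hypothetical); None means ablated
    needs_vehicle: bool           # goal-level constraint (hypothetical)
    label: str                    # gold answer: "walk" or "drive"


def gold(distance_km: float, needs_vehicle: bool) -> str:
    # Toy rule: the answer depends on BOTH signals, so the cue is not pure noise.
    return "drive" if needs_vehicle or distance_km > 2.0 else "walk"


def composing_predictor(ex: Example) -> str:
    # Stand-in for a model that composes the constraint with the cue.
    if ex.needs_vehicle:
        return "drive"
    if ex.distance_km is None:
        return "walk"  # cue missing: fall back to a default guess
    return "drive" if ex.distance_km > 2.0 else "walk"


def ablate_cue(ex: Example) -> Example:
    # Remove the surface cue but keep the gold label unchanged.
    return ex._replace(distance_km=None)


def accuracy(predict: Callable[[Example], str], examples: List[Example]) -> float:
    return sum(predict(ex) == ex.label for ex in examples) / len(examples)


examples = [
    Example(d, v, gold(d, v))
    for d in (0.5, 1.0, 5.0, 10.0)
    for v in (False, True)
]

with_cue = accuracy(composing_predictor, examples)
without_cue = accuracy(composing_predictor, [ablate_cue(ex) for ex in examples])

print(with_cue)     # 1.0
print(without_cue)  # 0.75
```

In this toy, the drop comes entirely from cases where the constraint alone does not settle the answer. A cue that really was spurious, meaning the gold label never depended on it, would show no such drop under the same ablation.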

That reframing makes more sense once you see how heavily LLMs lean on surface cues in the first place. "Do language models ignore goals when surface cues conflict?" tested 14 models on 500 conflict scenarios and found that surface features like distance outweighed the feasibility constraints implied by the goal by a factor of 8.7 to 38 (the Heuristic Dominance Ratio), with decisions largely independent of the stated objective. The cue isn't a side input; it's effectively running the show. And "Why do embedding contexts confuse LLM entailment predictions?" shows that models treat even meaning-flipping linguistic constructions, such as presupposition triggers and non-factive verbs, as flat surface patterns rather than computing their real semantic effect. If a model's competence is built on cues rather than structure, then removing a cue isn't pruning a bad habit. It's removing part of the scaffolding the answer was standing on.

The flip side is that not all cue-handling is the same problem, and the fixes differ. "Why do language models engage with conversational distractors?" shows that models are decent at 'what to do' instructions but bad at 'what to ignore' instructions, and that this gap narrows significantly with surprisingly little training: fine-tuning on just 1,080 synthetic dialogues with distractor turns. So *resisting* an irrelevant cue is a trainable skill, while *integrating* a genuinely relevant one (the heuristic-override case) is a harder reasoning demand that more data doesn't obviously solve. The same word, 'cue', hides two opposite tasks: one you want the model to drop, one you need it to weave in.

The deeper takeaway is that 'spurious' is doing a lot of unexamined work. Whether a cue is noise or signal depends on the task, and LLMs don't reliably tell the difference, which is why the clean shortcut-learning story breaks down here. If you want to follow this somewhere unexpected: the same surface-over-substance pattern shows up in how models get gamed as judges, where fake references and rich formatting inflate scores independent of content quality ("Can LLM judges be tricked without accessing their internals?"). Across both, the lesson is the same: these systems are highly sensitive to surface cues, so removing or adding one rarely does the simple thing you'd predict.

Sources 5 notes

Why does removing spurious cues sometimes hurt model performance?

Removing spurious cues degrades performance in heuristic override tasks, opposite to shortcut learning predictions. The failure mode is integrating conflicting signals rather than ignoring distractors—a frame problem, not feature selection.

Do language models ignore goals when surface cues conflict?

Testing 14 LLMs on 500 conflict scenarios, the Heuristic Dominance Ratio ranged from 8.7× to 38×. Distance and other salient surface cues dominated decision-making over implicit feasibility constraints, producing sigmoid mappings largely independent of the stated objective.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Why do language models engage with conversational distractors?

Fine-tuning on just 1,080 synthetic dialogues with distractor turns significantly improves topic resilience, revealing that the gap is not model capacity but absent training signal. Models learn to follow what-to-do instructions but not what-to-ignore instructions.

Can LLM judges be tricked without accessing their internals?

Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.
