Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Why does removing spurious cues sometimes hurt model performance?

Most models improve when spurious features are removed, but some get worse. This note asks whether that failure reflects a fundamentally different problem from traditional shortcut learning.

Note · 2026-05-01 · sourced from Linguistics, NLP, NLU
How do reasoning models actually fail under pressure, and where exactly do they break?

The literature on shortcut learning describes models that latch onto spurious surface features correlated with labels — lexical-overlap heuristics in NLI, sparse heuristic circuits in arithmetic, content effects in syllogistic reasoning. The standard prescription is to remove the spurious feature: with the cue gone, performance recovers because the model is forced to fall back on the intended computation.

The Heuristic Override Benchmark shows that this prescription fails for its phenomenon: removing the heuristic cue (the distance "50 meters") makes models worse, not better. Twelve of fourteen models drop in accuracy when the spurious cue is removed. This is the opposite of what shortcut learning predicts and signals that something different is happening.

The authors locate the difference structurally. Shortcut learning is about filtering: the model needs to ignore the spurious feature and attend to the relevant one. Heuristic override is about composing: the model needs to integrate two things — a salient surface cue and an unstated feasibility constraint — and prioritize the constraint when they conflict. Both signals are integral to the problem; neither is noise. Removing the cue does not clean the input; it removes one of the two ingredients the composition requires, leaving the model less able to make any decision at all.
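The filtering-versus-composing contrast can be sketched as two toy decision rules. Everything below is illustrative: the field names, labels, and rules are hypothetical stand-ins, not the benchmark's actual items or any model's mechanism.

```python
def shortcut_prediction(item):
    """Shortcut learning: the cue is noise correlated with the label.
    With the cue present, the model copies it; with the cue removed,
    it is forced onto the intended computation, so accuracy recovers."""
    if item.get("cue") is not None:
        return item["cue"]           # latch onto the spurious feature
    return item["intended_label"]    # fall back to the real computation

def override_prediction(item):
    """Heuristic override: the cue and the unstated constraint are both
    ingredients. The model must compose them and let the constraint win
    on conflict; removing the cue removes a required input."""
    cue, constraint = item.get("cue"), item.get("constraint")
    if cue is None or constraint is None:
        return None                  # missing ingredient: no decision
    return constraint if constraint != cue else cue

# Removing the cue helps the shortcut learner...
assert shortcut_prediction({"cue": "wrong", "intended_label": "right"}) == "wrong"
assert shortcut_prediction({"intended_label": "right"}) == "right"
# ...but leaves the compositional task without one of its two inputs.
assert override_prediction({"cue": "go", "constraint": "stop"}) == "stop"
assert override_prediction({"constraint": "stop"}) is None
```

The asymmetry in the last two assertions is the note's point: ablating the cue cleans the input for a filtering task but breaks the input for a composing task.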

This connects the failure to the classical frame problem rather than to feature-level shortcut learning. The challenge is enumerating which unstated conditions are relevant — not detecting and filtering distractors. The two failure modes need different benchmarks, different mitigations, and different theoretical accounts.


Original note title

LLM heuristic override is structurally distinct from shortcut learning because removing the spurious cue degrades rather than improves performance