The Model Says Walk: How Surface Heuristics Override Implicit Constraints in LLM Reasoning
Large language models systematically fail when a salient surface cue conflicts with an unstated feasibility constraint. We study this through a diagnose–measure–bridge–treat framework. Causal-behavioral analysis of the “car wash problem” across six models reveals approximately context-independent sigmoid heuristics: the distance cue exerts 8.7–38× more influence than the goal, and token-level attribution shows patterns more consistent with keyword associations than compositional inference. The Heuristic Override Benchmark (HOB)—500 instances spanning 4 heuristic × 5 constraint families with minimal pairs and explicitness gradients—demonstrates generality across 14 models: under strict evaluation (10/10 correct), no model exceeds 75%, and presence constraints are hardest (44%). A minimal hint (e.g., emphasising the key object) recovers +15 pp on average, suggesting the failure is in constraint inference rather than missing knowledge; 12/14 models perform worse when the constraint is removed (up to −39 pp), revealing conservative bias. Parametric probes confirm the sigmoid pattern generalises to cost, efficiency, and semantic-similarity heuristics; goal-decomposition prompting recovers +6–9 pp by forcing models to enumerate preconditions before answering. Together, these results characterise heuristic override as a systematic reasoning vulnerability and provide a benchmark for measuring progress toward resolving it.
Large language models are rapidly moving from research tools to everyday decision-support systems. People consult them for travel planning, medical triage, legal interpretation, financial advice, and moral judgment (Cheung et al., 2025; Echterhoff et al., 2024; Omar et al., 2024). As the scope of LLM-assisted decision-making widens, so does the potential for harm when the model’s reasoning is flawed in ways that are difficult to anticipate. Unlike factual hallucinations, which can in principle be verified against external knowledge, reasoning errors—cases where the model draws an incorrect conclusion from correctly perceived premises—are harder to detect because the output sounds plausible and internally consistent.
A growing body of work documents shortcut learning—models exploiting surface-level statistical regularities rather than performing the intended computation (Geirhos et al., 2020; Du et al., 2023)—across NLI (McCoy et al., 2019), QA (Ko et al., 2020), mathematical reasoning (Shi et al., 2023; Mirzadeh et al., 2024; Yang et al., 2025), and arithmetic (Nikankin et al., 2024; Branco et al., 2021). Cognitive-bias analogues (anchoring, framing, representativeness, content effects) further compound the problem (Suri et al., 2024; Binz & Schulz, 2023; Bubeck et al., 2023; Wang et al., 2024; Malberg et al., 2025; Echterhoff et al., 2024; Lampinen et al., 2024), and can amplify human biases when users defer to model recommendations (Cheung et al., 2025). Yet this literature overwhelmingly measures shortcut reliance through accuracy—a binary signal that reveals that the model fails but not why.
A recent viral test crystallized this gap with striking clarity. In February 2026, a Mastodon user posed a single-sentence question to four frontier LLMs (Kévin (@knowmadd), 2026): “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” Every model recommended walking; the correct answer is to drive, because you cannot wash a car that is not at the car wash. The question went viral (Allen, 2026), and a subsequent 53-model evaluation found that 42 recommended walking on a single pass, with only 5 answering correctly across ten trials (Opper AI, 2026).
The problem is diagnostic because it is simple: no specialised knowledge, no multi-step arithmetic, no ambiguous premises—just a conflict between a surface heuristic (short distance ⇒ walk) and an implicit constraint (the car must be co-located with the wash). This conflict structure recurs whenever an unstated prerequisite competes with a statistically dominant surface pattern, from medical triage (“mild symptom ⇒ wait”) to legal reasoning (“standard clause ⇒ sign”). Jo (2026) connects the failure to the classical frame problem (McCarthy & Hayes, 1981) and shows that structured prompting can raise single-model accuracy from 30% to 85%, confirming that the bottleneck is not missing information but the order and structure of processing. However, no prior study has provided a systematic analysis that (i) identifies which surface features trigger the heuristic, (ii) measures how robustly it persists under controlled perturbation, or (iii) characterises the reasoning traces that distinguish correct from incorrect responses.
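For concreteness, the multi-trial check behind such evaluations can be sketched in a few lines; `query_model` below is a hypothetical stand-in for any chat-completion call, and reducing a free-form answer to a one-word label is deliberately left to the caller:

```python
CAR_WASH_PROMPT = ("I want to wash my car. The car wash is 50 meters away. "
                   "Should I walk or drive?")

def consistently_correct(query_model, n_trials=10):
    """True only if the model recommends driving on every trial,
    mirroring the ten-trial criterion in the 53-model evaluation.

    query_model: hypothetical callable mapping a prompt string to a
    parsed one-word recommendation ("walk" or "drive").
    """
    return all(query_model(CAR_WASH_PROMPT).strip().lower() == "drive"
               for _ in range(n_trials))
```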
[Table: heuristic families with example cues: proximity (“5 min away,” “next door”); efficiency (“quickest way,” “saves time”); cost (“free option,” “saves money”); semantic similarity (“gas station” for tires). Constraint families with examples: presence (car must be at car wash); can’t carry sofa on foot; can’t drive w/ flat tire; gas station won’t fix tires; store is already closed.]
Causal occlusion. Three findings emerge from span-level perturbation (Figure 1; Table 7 in Appendix E). First, perturbing the distance span shifts every model toward Drive (Δs from −1.2 to −30.3), consistently across all three operators. Second, perturbing the goal produces near-zero or positive effects—for Qwen3-4B, neutral goal replacement yields Δs = +7.5, making Walk more likely when the constraint is removed. Third, the Heuristic Dominance Ratio (HDR) ranges from 8.7× to 38.0×: the distance cue is at least an order of magnitude more influential than the goal. HDR decomposition (Figure 2, left) shows that goal sensitivity is fragile across paraphrases (6.4× range) while distance sensitivity is stable (2.3×).
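One natural formulation of the HDR is the ratio of mean absolute score shifts across perturbation operators; the sketch below is a simplified formulation with illustrative values, not the exact pipeline or the measured data:

```python
import numpy as np

def heuristic_dominance_ratio(deltas_distance, deltas_goal):
    """HDR as the ratio of mean |Δs| under distance-span perturbation
    to mean |Δs| under goal-span perturbation; values >> 1 mean the
    distance cue dominates the goal. One value per operator/paraphrase."""
    return np.mean(np.abs(deltas_distance)) / np.mean(np.abs(deltas_goal))

# Illustrative values only: distance perturbations move the decision
# score far more than goal perturbations, giving HDR >> 1.
print(heuristic_dominance_ratio([-30.3, -25.1, -28.0], [1.2, -0.4, 2.7]))
```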
Token-level attribution. Sentence-level masking confirms |Δs_distance| > |Δs_question| > |Δs_goal| for every model. Token-level masking within the goal span (Appendix E) reveals why: washing-action tokens weakly favour Drive, while “car” and “vehicle” favour Walk; the opposing effects cancel. The largest token effect (|Δs| = 5.8) is 5× smaller than the distance effect (30.3), a pattern more consistent with keyword-level associations than compositional inference.
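The masking procedure itself is simple occlusion; a minimal sketch, assuming `score_fn` returns a scalar decision score s with higher values favouring Walk:

```python
def occlusion_attribution(tokens, score_fn, mask="[MASK]"):
    """Mask one token at a time and record the shift in the decision
    score. Δs < 0 means the masked token had been pushing the model
    toward Walk; Δs > 0 means it had been pushing toward Drive."""
    base = score_fn(" ".join(tokens))
    deltas = {}
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + [mask] + tokens[i + 1:]
        deltas[(i, tok)] = score_fn(" ".join(masked)) - base
    return deltas
```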
Monotonicity curves. All six models produce sigmoid conflict curves tracking the control (Figure 3), differing only in amplitude (|s̄|: < 5 to > 25) and crossover distance (800 m–3 km). This universality indicates a shared heuristic pattern: every model maps distance to decision in an approximately goal-independent manner. Even Qwen3.5-27B, which shows the strongest goal modulation (offset −13.4), merely shifts the sigmoid downward without changing its shape—the goal weakly modulates but never gates the decision.
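A log-distance sigmoid with explicit amplitude and crossover parameters captures the curve shape described above; the fit below uses illustrative data and is one plausible parameterisation, not the exact fitting procedure:

```python
import numpy as np
from scipy.optimize import curve_fit

def conflict_curve(d, amplitude, crossover, slope, offset):
    """Decision score s as a sigmoid in log-distance: s -> offset as
    d -> 0, s -> offset + amplitude as d -> inf, and the curve passes
    its midpoint (offset + amplitude / 2) at d = crossover."""
    z = -slope * (np.log(d) - np.log(crossover))
    return offset + amplitude / (1.0 + np.exp(z))

# Illustrative points (distance in metres; positive s favours Walk).
d = np.array([50.0, 200.0, 800.0, 3000.0, 10000.0])
s = np.array([20.0, 15.0, 2.0, -12.0, -18.0])
params, _ = curve_fit(conflict_curve, d, s,
                      p0=[-40.0, 1000.0, 2.0, 20.0], maxfev=10000)
amplitude, crossover, slope, offset = params
```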
Notably, stronger heuristic cues do not reliably produce lower accuracy: mean strict accuracy is 62.8% for strong, 56.2% for medium, and 59.6% for weak cues (Appendix F.5). This non-monotonic pattern suggests the failure is not simply a matter of being overwhelmed by a strong signal; even weak heuristic cues suffice to disrupt constraint inference, consistent with the bottleneck being in activating the constraint reasoning pathway rather than in competition between heuristic and constraint signals.
The explicitness gradient reveals an inference bottleneck: accuracy jumps +15.3 pp on average (59.2% → 74.5%) from a single subtle hint (e.g., adding emphasis: “get my car washed,” drawing attention to the object that must be present), suggesting models can access the relevant knowledge under facilitated conditions but fail to activate it autonomously. The minimal-pair asymmetry exposes conservative bias: 12 of 14 models perform worse when the constraint is removed (drops of up to 38.5 pp), revealing that many “correct” base answers default to the harder option rather than reasoning about the constraint. Only GPT-OSS-120B (+13.8) and GPT-OSS-20B (+11.0) improve on pairs, consistent with genuine reasoning.
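Both effects reduce to two simple quantities once answers are parsed to labels; a minimal sketch:

```python
def strict_accuracy(trials_per_instance, gold_labels):
    """Strict metric: an instance counts only if all sampled answers
    (10 per instance in HOB) match the gold label."""
    hits = sum(all(answer == gold for answer in trials)
               for trials, gold in zip(trials_per_instance, gold_labels))
    return hits / len(gold_labels)

def pair_asymmetry(acc_constraint_active, acc_constraint_removed):
    """Minimal-pair asymmetry. Negative values (12 of 14 models here)
    mean accuracy drops once the constraint is removed, i.e. the base
    'successes' reflect a conservative default, not constraint use."""
    return acc_constraint_removed - acc_constraint_active
```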
4 Discussion
Unified account. Across four studies, a coherent picture emerges: LLMs apply approximately context-independent heuristic mappings that dominate over implicit goal constraints. Study 1 identifies the pattern (HDR: 8.7–38×); Study 2 demonstrates generality (14 models, no model above 75%); parametric probes confirm the sigmoid extends beyond proximity; and goal-decomposition prompting (+6–9 pp) is consistent with an inference-order bottleneck.
Inference bottleneck. The +15.3 pp explicitness gradient and token-level analysis suggest models possess the relevant world knowledge but fail to activate it unless explicitly cued. Goal-decomposition prompting supports this: forcing precondition enumeration before the heuristic fires converts an implicit constraint into a self-generated hint.
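The exact wording matters less than forcing precondition enumeration before the recommendation; a minimal sketch of such a prompt wrapper (illustrative phrasing, not the verbatim prompt from our experiments):

```python
GOAL_DECOMPOSITION_TEMPLATE = """\
Before answering, work through these steps in order:
1. State the user's goal in one sentence.
2. List every precondition that must hold for the goal to be achievable.
3. Check each precondition against each available option.
4. Only then give your recommendation.

Question: {question}
"""

prompt = GOAL_DECOMPOSITION_TEMPLATE.format(
    question="I want to wash my car. The car wash is 50 meters away. "
             "Should I walk or drive?"
)
```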
Conservative bias. The minimal-pair asymmetry (12/14 models worse when the constraint is removed, by up to 38.5 pp) shows that accuracy on constraint-active instances alone overestimates genuine reasoning. This finding underscores that minimal pairs are essential for any benchmark targeting constraint-sensitive reasoning.
Distinction from shortcut learning. Unlike shortcut learning (Geirhos et al., 2020), where removing the spurious feature restores performance, and distractibility (Shi et al., 2023), where extraneous noise must be filtered, our setting requires composing two integral prompt components: an unstated constraint must override a statistically dominant cue. Our minimal-pair results confirm the distinction: removing the heuristic cue makes models worse, not better—the opposite of the shortcut-learning prediction. This connects to the classical frame problem (McCarthy & Hayes, 1981): the challenge is enumerating which unstated conditions are relevant, not filtering noise.
Deployment implications. This failure is invisible to standard evaluation: models produce fluent, confident, wrong responses. In domains where unstated constraints compete with salient surface features—medical triage, legal reasoning, financial planning—the same pattern can produce systematically incorrect recommendations.
5 Related Work
Shortcut Learning and Heuristic Reliance. Neural models routinely exploit shortcuts—spurious cues correlated with labels but unrelated to intended reasoning (Geirhos et al., 2020; Du et al., 2023)—from lexical-overlap heuristics in NLI (McCoy et al., 2019; Gururangan et al., 2018) to sparse heuristic circuits in arithmetic (Nikankin et al., 2024) and cognitive biases in LLM reasoning (Wang et al., 2024; Lampinen et al., 2024). This persists in generative settings: larger models can exploit ICL shortcuts more (Tang et al., 2023), RLHF introduces task–feature–label correlations (Sun et al., 2024), and no model is universally robust (Yuan et al., 2024; Zhou et al., 2024). However, prior work targets feature-level shortcuts in classification. We focus on reasoning-level heuristic shortcuts—pre-trained templates (“short distance → walk”) that override implicit goal-feasibility constraints in open-ended decisions.
Distractibility and Constraint-Following. Distractor benchmarks (Shi et al., 2023; Mirzadeh et al., 2024; Yang et al., 2025) inject additive noise into self-contained problems, requiring models to filter extraneous information. Constraint benchmarks (Zhou et al., 2023; Chen et al., 2025; Song et al., 2026) test compliance with stated or domain-specific rules. Our setting differs: both the heuristic cue and the hidden constraint are integral to the prompt, so the model must prioritise competing signals—inferring and enforcing a feasibility constraint that is never stated, must be derived from world knowledge, and competes with a salient heuristic.
Commonsense Reasoning and the Frame Problem. Commonsense benchmarks (Levesque et al., 2012; Bisk et al., 2020; Zellers et al., 2019; Clark et al., 2018) test whether models possess world knowledge. We test a complementary failure: models that possess the knowledge yet err because a surface heuristic overpowers it, connecting to the classical frame problem (McCarthy & Hayes, 1981). The car wash problem was tested across 53 models (Opper AI, 2026), with only 5 consistently correct; structured prompting raises accuracy from 30% to 85% but impedes self-correction (Jo, 2026). We generalise these single-instance observations into a systematic benchmark: 500 instances crossing four heuristic families with five constraint families, evaluated across 14 models.
Diagnostic Methodology. Our causal analysis builds on perturbation-based attribution (Zeiler & Fergus, 2014; Ribeiro et al., 2016; Lundberg & Lee, 2017) and counterfactual evaluation (Kaushik et al., 2019), mitigating distribution-shift concerns (Hooker et al., 2019) via multiple replacement operators with agreement requirements. Unlike mechanistic interpretability (Marks et al., 2024; Conmy et al., 2023; Geiger et al., 2021), which targets internal circuits and representations, our approach operates at the input–output level via causal perturbation, applying to API-only systems. We use “causal” in the interventionist sense throughout: we measure the effect of controlled input perturbations on output decisions, which supports behavioral characterisation but not claims about internal mechanisms. Following Singh et al. (2024), we use attribution to characterise the behavioral pattern behind a systematic error; the benchmark’s built-in minimal pairs and controlled gradients serve as counterfactual probes beyond aggregate accuracy.
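The agreement requirement reduces to a simple filter over per-operator effects; a minimal sketch, assuming one Δs per replacement operator:

```python
def agreed_effect(deltas_by_operator):
    """Report a perturbation effect only when every replacement
    operator moves the decision score in the same direction, so no
    single out-of-distribution replacement drives the conclusion."""
    deltas = list(deltas_by_operator.values())
    if all(d > 0 for d in deltas) or all(d < 0 for d in deltas):
        return sum(deltas) / len(deltas)
    return None  # operators disagree: treat the effect as unreliable
```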
6 Conclusion
When salient surface cues conflict with unstated feasibility constraints, LLMs systematically follow the heuristic. We trace this failure from behavioral pattern (approximately context-independent sigmoid heuristics, HDR up to 38×) to generality (no model above 75% strict accuracy across 14 models on the 500-instance HOB benchmark). The explicitness gradient suggests the bottleneck is constraint inference rather than missing knowledge; the minimal-pair asymmetry reveals that many apparent successes mask conservative bias. A simple goal-decomposition prompt—forcing models to enumerate preconditions before answering—recovers +6–9 pp, consistent with the failure being in processing order and offering an initial mitigation direction for future work. We release the HOB benchmark and diagnostic framework to support systematic measurement of progress on this challenge.