LLM Reasoning and Architecture · Language Understanding and Pragmatics · Reinforcement Learning for LLMs

Are models actually reasoning about constraints or just defaulting conservatively?

Do language models genuinely apply constraints when solving problems, or do they simply prefer harder options by default? Minimal pair testing reveals whether apparent reasoning success masks hidden biases.

Note · 2026-05-01 · sourced from Linguistics, NLP, NLU
How do reasoning models actually fail under pressure? How can LLMs fail to know what they seem to understand?

The Heuristic Override Benchmark uses minimal pairs (same surface heuristic, with versus without the implicit constraint) to test whether apparent reasoning successes reflect actual reasoning. The result is striking: twelve of fourteen models perform worse on the no-constraint variant than on the constraint-active variant, with drops of up to 38.5 percentage points. Only two models (GPT-OSS-120B at +13.8 points and GPT-OSS-20B at +11.0) improve when the constraint is removed.
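
As a concrete picture of the setup, here is a minimal sketch of paired scoring. The MinimalPair fields, the answer labels, and the model callable are illustrative assumptions, not the benchmark's published interface; the delta is no-constraint accuracy minus constraint-active accuracy, so the drops reported above correspond to negative values.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class MinimalPair:
    constraint_prompt: str     # constraint active; the harder option is correct
    no_constraint_prompt: str  # constraint removed; the easier option is correct
    harder_answer: str         # e.g. "drive"
    easier_answer: str         # e.g. "walk"

def accuracy_delta(model: Callable[[str], str], pairs: List[MinimalPair]) -> float:
    """No-constraint accuracy minus constraint-active accuracy.

    A negative value is the asymmetry described above: the model does worse
    once the constraint is removed, which suggests its constraint-active
    score did not come from reasoning about the constraint.
    """
    with_c = sum(model(p.constraint_prompt) == p.harder_answer for p in pairs)
    without_c = sum(model(p.no_constraint_prompt) == p.easier_answer for p in pairs)
    return (without_c - with_c) / len(pairs)
```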

This exposes a hidden mechanism behind apparent accuracy. When the constraint is present, the correct answer is the harder one (drive to the car wash that is 50m away, because the car itself has to be there). When the constraint is removed, the correct answer is the easier one (walk to the store that is 50m away). Models that default to recommending the harder option therefore score correctly on constraint-active cases without doing any constraint reasoning. They are not solving the problem; they are reflexively choosing the more conservative option, which happens to coincide with the constraint-required answer.
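
The accident is easy to reproduce in the sketch above. Assuming the drive/walk pairs from the hypothetical MinimalPair setup, a policy that never reads the prompt still looks perfect on the constraint-active half of the benchmark:

```python
def conservative_policy(prompt: str) -> str:
    # Reflexive baseline: always recommends the conservative option,
    # whether or not the implicit constraint appears in the prompt.
    return "drive"

def constraint_active_accuracy(model, pairs) -> float:
    # Looks like constraint reasoning from the outside: the harder answer is
    # correct on every constraint-active item, so this baseline scores 1.0
    # here without ever reading the constraint.
    return sum(model(p.constraint_prompt) == p.harder_answer for p in pairs) / len(pairs)
```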

The minimal-pair asymmetry is the only test that catches this. Single-instance accuracy looks fine: the model recommended driving, and the right answer was driving. But the same model recommends driving even when walking would be correct, because the recommendation was never based on the constraint. The two of fourteen models that improve when the constraint is removed are the only ones whose constraint-active accuracy reflects genuine reasoning about the constraint. The rest are riding a conservative-bias accident that aggregate metrics cannot distinguish from reasoning.
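
One way to express the pair-level check, again on the assumed setup above: a per-pair consistency score that requires a model to get both sides of a pair right. This is an illustrative framing rather than the benchmark's reported metric, but it shows why the reflexive baseline cannot hide here.

```python
def paired_consistency(model, pairs) -> float:
    # A pair counts only when BOTH variants are answered correctly. The
    # conservative_policy above drops to 0.0 on this score: it keeps
    # recommending "drive" after the constraint is removed, when walking
    # is correct. Only a model that tracks the constraint passes both sides.
    hits = sum(model(p.constraint_prompt) == p.harder_answer
               and model(p.no_constraint_prompt) == p.easier_answer
               for p in pairs)
    return hits / len(pairs)
```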


Source: Linguistics, NLP, NLU

Original note title: Conservative bias hides behind apparent reasoning success — most models perform worse when the constraint is removed than when it is present