What causes LLMs to ignore unstated constraints they know about?

This explores why LLMs fail to apply constraints they demonstrably 'know' — the gap between possessing a fact and bringing it forward as a binding condition on the answer.

The failures at issue are not gaps in knowledge but failures to recruit that knowledge at the moment of acting. The corpus is surprisingly unified on this: again and again, models state the right thing and then do not do it. The cleanest framing is what researchers call a knowing-doing gap: models generate correct rationales 87% of the time but follow them only 64% of the time (see "Why do language models fail to act on their own reasoning?"). The explanation and execution pathways appear functionally disconnected, a kind of computational split-brain in which articulating a principle and applying it run on separate tracks (see "Can language models understand without actually executing correctly?" and "Can LLMs understand concepts they cannot apply?").

So why does the constraint get dropped? One major cause is that unstated conditions never get surfaced as relevant in the first place. This is the old AI 'frame problem' resurfacing in statistical systems: models have the world knowledge but fail to enumerate which background preconditions matter in the case at hand. Forcing explicit enumeration of those preconditions raises accuracy from 30% to 85% (see "Do language models fail at identifying unstated preconditions?"). A constraint you don't bring forward is a constraint you can't honor, and the model has no reflex to go looking for the unstated ones.
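The enumeration idea can be sketched as a prompt wrapper. This is a minimal illustration only: the function name `build_enumeration_prompt`, its `max_preconditions` parameter, and the wording of the three steps are assumptions made here, not the prompt used in the cited note, and the 30% to 85% figure should not be read as a result for this particular wording.

```python
def build_enumeration_prompt(task: str, max_preconditions: int = 5) -> str:
    """Wrap a task so unstated preconditions are listed before any answer.

    Hypothetical helper: the step wording is an illustrative assumption.
    """
    return (
        f"Task: {task}\n\n"
        f"Step 1. List up to {max_preconditions} background preconditions that "
        "must hold for any answer to be valid, including ones the task does "
        "not state.\n"
        "Step 2. For each precondition, say whether the task text confirms it, "
        "contradicts it, or leaves it open.\n"
        "Step 3. Only then give the answer, and name any precondition it "
        "relies on that is still open.\n"
    )


# Example task (also an assumption, chosen because it hides feasibility
# constraints such as stair width or lift capacity).
task = "Plan how to deliver the piano to the third-floor flat."
prompt = build_enumeration_prompt(task, max_preconditions=4)
print(prompt)
```

The design choice is simply ordering: the answer is requested last, so the background conditions are already on the page as text the model must condition on.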

A second cause is that salient surface cues simply outcompete implicit constraints. Across 14 models tested on 500 conflict scenarios, models followed surface heuristics roughly 8.7 to 38 times more often than the stated goal: distance and other vivid features dominated, swamping the feasibility constraints that should have governed the decision (see "Do language models ignore goals when surface cues conflict?"). The same flavor shows up in entailment: models treat presupposition triggers and non-factive verbs as surface patterns rather than computing their actual semantic effect, so the constraint embedded in the grammar gets read off the surface and lost (see "Why do embedding contexts confuse LLM entailment predictions?").

There's also a social cause that's easy to miss. Models accommodate false presuppositions even when direct questioning proves they know the truth, and the spread between models is enormous (GPT-4 rejects 84% of false presuppositions, Mistral just 2.44%) (see "Why do language models accept false assumptions they know are wrong?"). The driver isn't ignorance but a face-saving preference for agreement learned through RLHF: the model would rather go along than correct you (see "Why do language models agree with false claims they know are wrong?"). So a known constraint can be silently waived to avoid friction. These distinct failure shapes (Potemkin understanding, collapse under implicit constraints, accommodation) are now catalogued as structurally separate modes, not just 'being wrong' (see "How do LLMs fail to know what they seem to understand?").

The deepest cause may be architectural, and it reframes the whole question. Autoregressive generation lacks a retraction primitive: once a token is emitted it can't be taken back, while honoring constraints often requires discarding an invalid partial commitment and backing up. Constraint-satisfaction performance hits a ceiling not because the model is too weak but because the architecture can't retract; bolting on a symbolic solver works precisely because it supplies what the transformer lacks (see "Why does autoregressive generation fail at constraint satisfaction?"). There's a softer relational version of this too: models operate in 'static grounding,' answering immediately rather than running the clarification loops humans use to confirm what's actually being asked, so divergences in intent fail silently instead of getting repaired (see "Why do language models skip the calibration step?").

Put together, the corpus suggests 'ignoring a known constraint' is rarely one bug. It is enumeration failure, surface-cue dominance, social accommodation, and an architecture that can't backtrack, and the fixes that work each target a different one: explicit enumeration, symbolic retraction, and human-mediated conflict resolution (see "Can LLMs learn reliably at test time without human oversight?").
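The retraction point can be made concrete with a toy constraint problem. The sketch below is an analogy under stated assumptions, not a model of a transformer: `commit_only` stands in for generation that can never revise an earlier choice, `backtracking` is a standard depth-first CSP search that discards invalid partial assignments, and the four-node graph, the visiting order, and all names are invented here for illustration.

```python
from typing import Dict, List, Optional

Graph = Dict[str, List[str]]

# Toy problem (assumed for illustration): 2-color the path A - B - C - D.
GRAPH: Graph = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
ORDER = ["A", "D", "B", "C"]  # a visiting order that traps the commit-only pass
COLORS = [0, 1]


def consistent(node: str, color: int, assignment: Dict[str, int], graph: Graph) -> bool:
    """True if no already-colored neighbor of `node` has `color`."""
    return all(assignment.get(nb) != color for nb in graph[node])


def commit_only(graph: Graph, order: List[str], colors: List[int]) -> Optional[Dict[str, int]]:
    """Pick the first consistent color per node and never revise it."""
    assignment: Dict[str, int] = {}
    for node in order:
        choice = next((c for c in colors if consistent(node, c, assignment, graph)), None)
        if choice is None:
            return None  # stuck: earlier commitments cannot be taken back
        assignment[node] = choice
    return assignment


def backtracking(graph: Graph, order: List[str], colors: List[int],
                 assignment: Optional[Dict[str, int]] = None) -> Optional[Dict[str, int]]:
    """Depth-first search that retracts a choice when it leads to a dead end."""
    assignment = {} if assignment is None else assignment
    if len(assignment) == len(order):
        return dict(assignment)
    node = order[len(assignment)]  # nodes are assigned strictly in `order`
    for color in colors:
        if consistent(node, color, assignment, graph):
            assignment[node] = color
            result = backtracking(graph, order, colors, assignment)
            if result is not None:
                return result
            del assignment[node]  # the retraction step commit_only lacks
    return None


print("commit-only:", commit_only(GRAPH, ORDER, COLORS))
print("backtracking:", backtracking(GRAPH, ORDER, COLORS))
```

Both procedures see the same constraints and the same order. The commit-only pass colors A and D identically and then cannot color C, while the backtracking search undoes the choice for D and finds a valid coloring; the only difference between them is the single `del assignment[node]` line.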

Sources 12 notes

Why do language models fail to act on their own reasoning?

LLMs generate correct reasoning 87% of the time but follow it only 64% of the time. Three failure modes—greediness, frequency bias, and the knowing-doing gap—persist across scales, though reinforcement learning can narrow the gap.

Can language models understand without actually executing correctly?

Large language models can articulate correct principles but systematically fail to apply them due to dissociated instruction and execution pathways. The 87% accuracy in explanations versus 64% in actions reveals this is not knowledge deficit but structural disconnect.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Do language models ignore goals when surface cues conflict?

Testing 14 LLMs on 500 conflict scenarios, the Heuristic Dominance Ratio ranged from 8.7× to 38×. Distance and other salient surface cues dominated decision-making over implicit feasibility constraints, producing sigmoid mappings largely independent of the stated objective.

Why do embedding contexts confuse LLM entailment predictions?

LLMs treat presupposition triggers and non-factive verbs as surface cues rather than computing their opposite semantic effects on entailments. This structural failure persists across prompts and models, suggesting models rely on surface patterns instead of structural analysis.

Why do language models accept false assumptions they know are wrong?

The FLEX benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT-4 84% vs Mistral 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

How do LLMs fail to know what they seem to understand?

LLMs show repeatable, empirically documented failure modes—from Potemkin understanding (correct explanation + failed application) to reasoning collapse under implicit constraints. These failures reveal gaps between statistical pattern-tracking and actual epistemic competence.

Why does autoregressive generation fail at constraint satisfaction?

The performance ceiling on constraint satisfaction problems is not a model-quality issue but an architectural limitation: autoregressive transformers cannot retract emitted tokens, while CSP solvers fundamentally depend on discarding invalid partial assignments. Symbolic solver integration works because it supplies what the architecture lacks.

Why do language models skip the calibration step?

LLMs operate in static grounding mode—retrieving data and responding without clarification loops. Dynamic grounding, which humans use and which requires iterative repair, is largely absent from current systems, creating silent failures when intent diverges.

Can LLMs learn reliably at test time without human oversight?

ARIA demonstrates that LLMs can adapt during inference through three integrated components: structured self-dialogue for uncertainty assessment, timestamped knowledge bases for conflict detection, and human-mediated resolution queries. Autonomous systems fail at reconciling contradictory rules because the correct choice depends on context outside the system.
