Can structured prompting reliably force models to enumerate preconditions?

This explores whether prompts that explicitly demand models list out the hidden conditions a task depends on actually work — and where that forcing function quietly breaks down.

Preconditions here are the unstated background conditions a task depends on. Can structured prompting reliably force models to enumerate them? The short answer the corpus suggests: yes, it works strikingly well as a forcing function, but "reliably" hides some traps worth knowing about. The headline result is dramatic. When models are made to explicitly surface the relevant unstated preconditions before answering, accuracy on what has been called the "modern frame problem" jumps from roughly 30% to 85% (see "Do language models fail at identifying unstated preconditions?"). The interesting part is the diagnosis: the failure was never about missing world knowledge. Models *know* the background conditions; they just don't bring them forward as relevant constraints unless prompted to. Enumeration prompting closes that gap.
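As a rough illustration of what enumeration prompting can look like in practice, here is a minimal sketch of a prompt wrapper. The function name `build_precondition_prompt`, the `max_preconditions` parameter, and the instruction wording are all assumptions made for this example; none of them come from the cited work, which may phrase its prompts quite differently.

```python
def build_precondition_prompt(task: str, max_preconditions: int = 5) -> str:
    """Wrap a task so the model is asked to list preconditions before answering.

    Hypothetical helper: the step wording is illustrative only and is not
    the prompt used in the cited study.
    """
    if max_preconditions < 1:
        raise ValueError("max_preconditions must be at least 1")
    steps = [
        # Step 1 forces the unstated background conditions into the output.
        f"1. List up to {max_preconditions} unstated preconditions the task depends on.",
        # Step 2 asks the model to check each one instead of only naming it.
        "2. For each precondition, state whether it holds in this scenario and why.",
        # Step 3 delays the answer until the preconditions have been examined.
        "3. Only then give the final answer, citing the preconditions you used.",
    ]
    return f"Task: {task.strip()}\n\nBefore answering:\n" + "\n".join(steps)


if __name__ == "__main__":
    print(build_precondition_prompt("Move the cup to the shelf."))
```

The resulting string can be sent to whichever model client is in use. Asking for a per-precondition check in step 2 is a design choice meant to make it harder to list conditions and then ignore them, though it does not guarantee the model actually reasons with them.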

A second line of work shows this generalizes beyond preconditions to reasoning structure more broadly. Borrowing Toulmin's argument model, forcing models to name the warrants and backing behind a claim (the implicit premises plain chain-of-thought lets them skip) catches reasoning failures that standard prompting waves through (see "Can structured argument prompts make LLM reasoning more rigorous?"). The common thread across both: models default to a kind of fluent shortcutting, and explicit structure is what drags the skipped steps into the open.
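One way to make the Toulmin structure checkable is to ask the model for a structured response and then verify that no component was skipped. The sketch below assumes the response has already been parsed into a dictionary keyed by the six standard Toulmin components. The helper name `missing_toulmin_fields` and the dictionary format are hypothetical, and this is not the CQoT method itself, only a simple completeness check in the same spirit.

```python
from typing import List, Mapping

# The six components of Toulmin's argument model.
TOULMIN_FIELDS = ("claim", "grounds", "warrant", "backing", "qualifier", "rebuttal")


def missing_toulmin_fields(argument: Mapping[str, str]) -> List[str]:
    """Return the Toulmin components that are absent or blank in `argument`.

    Hypothetical helper: assumes the model's answer was parsed into a
    mapping from component name to text before this check runs.
    """
    return [
        field
        for field in TOULMIN_FIELDS
        if not argument.get(field, "").strip()
    ]


if __name__ == "__main__":
    draft = {
        "claim": "The bridge is safe to cross.",
        "grounds": "It was inspected last month.",
        "warrant": "",  # the implicit premise the model skipped
    }
    print(missing_toulmin_fields(draft))
```

A non-empty result can be used to re-prompt the model for the skipped components. Note that this only checks that each field is filled in, not that its content is sound.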

Here's the catch that keeps "reliably" honest. Apparent success at constraint-reasoning can be an illusion. When researchers *removed* constraints from problems, twelve of fourteen models got *worse*, dropping by up to 38.5 percentage points, which suggests they had been exploiting a conservative bias (defaulting to the harder, safer option) rather than actually evaluating the constraints (see "Are models actually reasoning about constraints or just defaulting conservatively?"). A model that enumerates preconditions in its output isn't necessarily *using* them. Related work shows that the reasoning a model performs internally can be computed in early layers and then overwritten by format-compliant filler before it reaches the output (see "Do transformers hide reasoning before producing filler tokens?"), so the visible enumeration and the actual computation can come apart.
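The constraint-removal check described above can be sketched as a simple before-and-after accuracy comparison. This is a minimal illustration under stated assumptions: the per-problem results are plain lists of booleans, the function names are hypothetical, and the `threshold` of -5.0 percentage points is an arbitrary placeholder rather than a value from the cited study, whose actual protocol is not described here.

```python
from typing import Sequence


def accuracy(results: Sequence[bool]) -> float:
    """Accuracy in percent for a sequence of per-problem correctness flags."""
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)


def constraint_removal_delta(
    with_constraints: Sequence[bool],
    without_constraints: Sequence[bool],
) -> float:
    """Change in accuracy (percentage points) after constraints are removed.

    A negative value means the model got worse once the constraints
    were taken out of the problems.
    """
    return accuracy(without_constraints) - accuracy(with_constraints)


def suggests_conservative_bias(delta: float, threshold: float = -5.0) -> bool:
    """Flag a drop at least as large as `threshold` (an assumed cutoff)."""
    return delta <= threshold


if __name__ == "__main__":
    original = [True, True, True, False]    # 75% with constraints present
    ablated = [True, False, False, False]   # 25% with constraints removed
    delta = constraint_removal_delta(original, ablated)
    print(delta, suggests_conservative_bias(delta))
```

The intuition is that a model genuinely evaluating constraints should find the problems no harder once the constraints are gone. A sizable drop instead points to a default toward the conservative answer that happened to be correct while the constraints were present.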

There are also hard ceilings on what any prompt can do. Prompting only reorganizes knowledge already in the model; it can't inject what was never there (see "Can prompt optimization teach models knowledge they lack?"). And when a precondition contradicts a strong training-time association, textual prompting alone often can't override the prior: the parametric knowledge wins, and only intervention in the model's internal representations reliably fixes it (see "Why do language models ignore information in their context?"). So structured prompting is best understood as an *activation* tool. It reliably surfaces preconditions the model latently knows and would otherwise skip, which is exactly the frame-problem case. But it doesn't manufacture missing knowledge, can't always beat strong priors, and the enumeration you see isn't proof the model reasoned with it.

Sources 6 notes

Do language models fail at identifying unstated preconditions?

LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.

Can structured argument prompts make LLM reasoning more rigorous?

Applying Toulmin's argument model as explicit prompting steps (CQoT) improves LLM reasoning by forcing models to identify warrants and backing rather than skipping implicit premises. The method catches failures that standard chain-of-thought prompting allows.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
