What makes a background condition relevant to a specific reasoning task?
This explores the frame problem in disguise — how a reasoning system decides which of countless background facts and unstated assumptions actually matter for the task in front of it, and why models so often fail to bring the right ones forward.
Out of everything a model could know, what makes one background condition count as relevant to the task at hand? The most direct answer in the corpus is unsettling: language models usually fail not because they lack the knowledge but because they never surface the right preconditions as constraints. When prompting forces explicit enumeration of unstated assumptions, accuracy jumps from 30% to 85%, a gain of 55 percentage points. The knowledge was there the whole time; what was missing was the act of marking it as load-bearing (Do language models fail at identifying unstated preconditions?).
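As a rough illustration of what "forcing explicit enumeration" can look like, the sketch below wraps a task so that preconditions are listed before any answer is attempted. The function name `build_precondition_prompt`, its `max_preconditions` parameter, and the step wording are all hypothetical; this is not the prompt used in the cited note, only one plausible shape for it.

```python
def build_precondition_prompt(task: str, max_preconditions: int = 5) -> str:
    """Wrap a task so unstated preconditions are listed before the answer.

    Hypothetical template: the wording is an assumption, not taken from
    the study summarized above.
    """
    if max_preconditions < 1:
        raise ValueError("max_preconditions must be at least 1")
    steps = [
        f"Task: {task.strip()}",
        # Enumeration comes first so the conditions exist as explicit text.
        f"Step 1. List up to {max_preconditions} background conditions that "
        "must hold for any answer to be valid, even if the task does not state them.",
        "Step 2. For each condition, say whether the task satisfies it.",
        # The answer step refers back to the list, turning it into constraints.
        "Step 3. Answer the task, treating every condition from Step 1 as a constraint.",
    ]
    return "\n".join(steps)


if __name__ == "__main__":
    print(build_precondition_prompt("Can I boil water on the stove to make tea?"))
```

The ordering is the point of the design: the answer step can only treat a condition as a constraint once the condition has been written out.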
So relevance isn't retrieved, it's constructed, and the corpus suggests models construct it from surface form rather than logical structure. Chain-of-thought studies show that training format shapes reasoning strategy roughly 7.5× more than the actual domain, and that structurally invalid prompts work nearly as well as valid ones (What makes chain-of-thought reasoning actually work?). Push further and even deliberately corrupted reasoning traces teach as well as correct ones, which implies the trace is computational scaffolding, not a chain of genuinely relevant steps (Do reasoning traces need to be semantically correct?). If the content of the 'relevant' conditions can be scrambled without hurting performance, then the model was never really tracking relevance in the human sense.
What does the model track instead? Two things the corpus names directly. First, semantic familiarity: as tasks get harder and working capacity is exceeded, both humans and models fall back on what a statement is *about* rather than its logical form, so plausible-sounding content hijacks the judgment of what matters (Do harder reasoning tasks trigger more semantic bias?). Second, sheer position and volume in the context: reasoning accuracy collapses from 92% to 68% with just 3000 tokens of irrelevant padding, far below the context limit. The model can't reliably separate signal from filler even when the filler is obviously off-task (Does reasoning ability actually degrade with longer inputs?).
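The padding effect can be probed with a small harness that buries the facts a question depends on inside off-task filler. The sketch below is a minimal version under stated assumptions: `pad_context` is a hypothetical helper, whitespace-separated words stand in for model tokens (a real tokenizer would count differently), and nothing here reproduces the benchmark's actual construction.

```python
import random
from typing import List


def pad_context(facts: List[str], filler: List[str],
                target_tokens: int, seed: int = 0) -> str:
    """Scatter `facts` among filler sentences until the text reaches
    roughly `target_tokens` whitespace-separated words.

    Assumption: word count is only a crude proxy for tokenizer tokens.
    """
    filler = [s for s in filler if s.split()]  # drop empty sentences
    if not filler:
        raise ValueError("filler must contain at least one non-empty sentence")
    rng = random.Random(seed)  # seeded so a given setup is reproducible

    padding: List[str] = []
    count = sum(len(f.split()) for f in facts)
    while count < target_tokens:
        sentence = rng.choice(filler)
        padding.append(sentence)
        count += len(sentence.split())

    # Pick sorted insertion points so the facts keep their relative order.
    positions = sorted(rng.randint(0, len(padding)) for _ in facts)
    out = list(padding)
    for offset, (pos, fact) in enumerate(zip(positions, facts)):
        out.insert(pos + offset, fact)
    return " ".join(out)


if __name__ == "__main__":
    facts = ["Alice is taller than Bob.", "Bob is taller than Carol."]
    filler = ["The weather was mild that week.", "Trains ran on time."]
    context = pad_context(facts, filler, target_tokens=3000)
    print(len(context.split()), "words of context")
```

Running the same question against contexts built with increasing `target_tokens` gives a simple length-versus-accuracy curve, with the facts themselves held constant.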
The more interesting lateral finding is that not all parts of a reasoning trace carry equal weight in deciding relevance. 'Thought anchors', meaning planning and backtracking sentences, act as sparse pivots that steer everything downstream, while most sentences barely matter (Which sentences actually steer a reasoning trace?). And at the mechanism level, fewer than 5% of attention heads do the actual work of pulling the relevant fact out of long context; prune them and the model hallucinates even though the information is right there (What mechanism enables models to retrieve from long context?). Relevance, in other words, is concentrated in a small piece of machinery, which is exactly why it's fragile.
The corpus also points at fixes that sidestep the problem rather than solving it head-on. Interleaving reasoning with external action lets the world itself signal what's relevant, grounding each step in feedback instead of the model's guess (Can interleaving reasoning with real-world feedback prevent hallucination?). Memoryless 'Markov-style' decomposition throws out accumulated history so each step depends only on the current sub-problem, which is one way of saying most prior context wasn't relevant anyway (Can reasoning systems forget history without losing coherence?). And the deepest cut: what generalizes across reasoning tasks is *procedural* knowledge, meaning broad, transferable patterns of how to operate, not narrow factual recall (Does procedural knowledge drive reasoning more than factual retrieval?). That reframes the whole question. A background condition becomes relevant not by being the right fact, but by fitting a procedure the model has learned to run. That is also why relevance breaks the moment the task drifts outside the distribution those procedures were learned in (Does chain-of-thought reasoning actually generalize beyond training data?).
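The first of these fixes, interleaving reasoning with external action, can be sketched as a short control loop. Everything named here is an illustrative assumption: `interleave`, `toy_policy`, `toy_tool`, and the `FACTS` table are stand-ins, with the policy playing the role of the model and the tool playing the role of an external lookup. It is not the implementation from the cited work.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str]  # ("act", query) or ("finish", answer)


def interleave(question: str,
               policy: Callable[[str, List[str]], Step],
               tool: Callable[[str], str],
               max_steps: int = 5) -> Tuple[Optional[str], List[str]]:
    """Alternate policy decisions with tool calls.

    Each observation comes from the tool, not from the policy, so later
    steps are grounded in external feedback rather than earlier guesses.
    """
    observations: List[str] = []
    for _ in range(max_steps):
        kind, payload = policy(question, observations)
        if kind == "finish":
            return payload, observations
        if kind != "act":
            raise ValueError(f"unknown step kind: {kind!r}")
        observations.append(tool(payload))
    return None, observations  # step budget exhausted without an answer


# Toy stand-ins (hypothetical): a lookup table as the "world".
FACTS: Dict[str, str] = {"capital of France": "Paris"}


def toy_tool(query: str) -> str:
    return FACTS.get(query, "no result")


def toy_policy(question: str, observations: List[str]) -> Step:
    if not observations:
        return ("act", "capital of France")  # look it up before answering
    return ("finish", observations[-1])      # answer from the observation


if __name__ == "__main__":
    print(interleave("What is the capital of France?", toy_policy, toy_tool))
```

The `max_steps` budget is a design choice in this sketch: a policy that never decides to finish returns `None` instead of looping forever.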
Sources (12 notes)
LLMs struggle not from lacking world knowledge but from failing to bring background conditions forward as relevant constraints. Prompting that forces explicit enumeration of preconditions raises accuracy from 30% to 85%, revealing the frame problem persists in statistical systems.
Research shows training format shapes reasoning strategy 7.5× more than domain, demo position swings accuracy 20%, and invalid CoT prompts work as well as valid ones. CoT is pattern-guided generation, not formal logic.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Content effects intensify as task difficulty increases—from NLI to syllogisms to Wason selection—in both humans and language models. As working capacity is exceeded, both systems fall back on semantic priors instead of logical form.
FLenQA shows reasoning accuracy drops from 92% to 68% at just 3000 tokens of padding, far below context window capacity. The degradation is task-agnostic, uncorrelated with language modeling performance, and persists even with chain-of-thought prompting.
Counterfactual resampling, attention analysis, and causal suppression all identify planning and backtracking sentences as thought anchors—sparse critical points that guide subsequent reasoning. These are functional pivots, not noise.
Fewer than 5% of attention heads across all model families function as retrieval heads. These heads are intrinsic to short-context models, activate dynamically depending on context, and are causally necessary for factuality. Pruning them causes hallucination even though the information is present in context.
ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.
Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.