How do LLMs fail to know what they seem to understand?

Catalogues how LLMs fail epistemically: specific, recurring ways they track language without genuine understanding.

Topic Hub · 34 linked notes · 5 sections

Epistemic Failure Modes

9 notes

Can LLMs understand concepts they cannot apply?

Explores whether large language models can correctly explain ideas while simultaneously failing to use them—and whether that combination reveals something fundamentally different from ordinary mistakes.

Can language models understand without actually executing correctly?

Do LLMs truly comprehend problem-solving principles if they consistently fail to apply them? This explores whether the gap between articulate explanations and failed actions points to a fundamental architectural limitation.

Do language models actually use their encoded knowledge?

Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.

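A minimal sketch of the kind of probing setup this note examines, assuming GPT-2 as a stand-in model and an invented toy true/false dataset. Note the caveat the summary raises: a successful probe only shows the fact is decodable from activations, not that it causally shapes what the model generates.

```python
# Sketch of a linear probe on transformer hidden states (GPT-2 as a
# stand-in; any Hugging Face causal LM would work the same way).
# A probe like this shows a fact is *decodable* from activations; it
# says nothing about whether the fact causally drives generation.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# Toy probing dataset: statements labeled 1 (true) or 0 (false).
examples = [
    ("Paris is the capital of France.", 1),
    ("Berlin is the capital of France.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 50 degrees Celsius at sea level.", 0),
]

def last_token_state(text: str, layer: int = 6) -> torch.Tensor:
    """Hidden state of the final token at a chosen middle layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

X = torch.stack([last_token_state(t) for t, _ in examples]).numpy()
y = [label for _, label in examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe train accuracy:", probe.score(X, y))
```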

Why do language models ignore information in their context?

Explores why language models sometimes override contextual information with prior training associations, and whether providing more context can solve this problem.

Do foundation models learn world models or task-specific shortcuts?

When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?

Are LLM emergent abilities real or measurement artifacts?

Do large language models develop sudden new capabilities at certain scales, or do discontinuous metrics just make gradual improvements look sudden? This matters because it changes how we predict and interpret model behavior.

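The measurement-artifact argument can be illustrated with a toy simulation (my own construction, not from the note): if per-token accuracy improves smoothly with scale, an all-or-nothing metric such as exact match on a ten-token answer still looks like a sudden jump, because exact match is per-token accuracy raised to the answer length.

```python
# Toy illustration: a smooth per-token improvement looks like a sudden
# "emergent" jump under a discontinuous metric like exact match.
import numpy as np

scales = np.logspace(6, 11, 11)   # pretend parameter counts, 1e6..1e11
# Smooth S-curve for per-token accuracy as a function of log-scale.
per_token_acc = 1 / (1 + np.exp(-(np.log10(scales) - 8.5)))

answer_len = 10                              # tokens in the target answer
exact_match = per_token_acc ** answer_len    # every token must be right

for n, p, em in zip(scales, per_token_acc, exact_match):
    print(f"{n:>12.0e} params  per-token {p:.3f}  exact-match {em:.5f}")
# Per-token accuracy climbs gradually; exact match sits near zero and
# then appears to shoot up, i.e. the "emergence" lives in the metric.
```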

Does LLM forgetting mean knowledge loss or alignment loss?

When language models lose performance on old tasks after learning new ones, is the underlying knowledge actually erased, or does the model simply lose its ability to apply it? Understanding this distinction could reshape how we think about AI safety and continual learning.

Why do accurate predictions lead to poor decisions?

Predictive models are built to fit data, not to optimize decision outcomes. This note explores when and why accurate forecasts fail to produce good choices.

Do LLMs actually have world models or just facts?

The term 'world model' conflates two different capabilities: factual representation versus mechanistic understanding. Understanding which one LLMs actually possess matters for assessing their reasoning reliability.

Heuristic Override and Implicit Constraint Failure

4 notes

Why do language models fail to use knowledge they possess?

Large language models contain relevant world knowledge but often fail to activate it without explicit cues. This explores whether the bottleneck lies in knowledge storage or in the inference process that decides what background facts apply.

Are models actually reasoning about constraints or just defaulting conservatively?

Do language models genuinely apply constraints when solving problems, or do they simply prefer harder options by default? Minimal pair testing reveals whether apparent reasoning success masks hidden biases.

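A sketch of how such a minimal-pair test could be run, assuming a Hugging Face causal LM; the prompts and answer options are invented placeholders. The idea: score each answer option under two prompts that differ only in the constraint, and check whether the model's preference actually flips.

```python
# Minimal-pair sketch (model, prompts, and options are placeholders).
# The two prompts differ only in the constraint. If the preferred answer
# does not flip between variants, the model may be defaulting to one
# option rather than using the constraint.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def answer_logprob(prompt: str, answer: str) -> float:
    """Total log-probability of `answer` continuing `prompt` (rough
    sketch: assumes the prompt/answer split survives tokenization)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logps = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        logps[i, full_ids[0, i + 1]].item()
        for i in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

pair = {
    "light_constraint": "Pick the box that weighs under 5 kg. Answer:",
    "heavy_constraint": "Pick the box that weighs over 5 kg. Answer:",
}
options = [" the small box", " the large box"]

for variant, prompt in pair.items():
    scores = {opt: answer_logprob(prompt, opt) for opt in options}
    print(variant, "->", max(scores, key=scores.get))
```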

Why does removing spurious cues sometimes hurt model performance?

Most models improve when spurious features are removed, but some degrade instead. This note explores whether that failure reflects a fundamentally different problem from traditional shortcut learning.

Do language models fail at identifying unstated preconditions?

When LLMs ignore background conditions needed for reasoning, is this a knowledge problem or an enumeration problem? Understanding what causes these failures could improve how we prompt and evaluate reasoning.

Reasoning and Inference Failures

9 notes

Do large language models reason symbolically or semantically?

Can LLMs follow explicit logical rules when those rules contradict their training knowledge? Testing whether reasoning operates independently of semantic associations reveals what computational mechanisms actually drive LLM multi-step inference.

How much does the order of premises actually matter for reasoning?

When you rearrange the order of logical premises in a deduction task, does it change how well language models can solve it? This tests whether LLMs reason abstractly or process input sequentially.

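A sketch of the permutation experiment, with `query_model` as a hypothetical stand-in for a real LLM call; the premises and question are toy examples of my own.

```python
# Sketch of a premise-order experiment. `query_model` is a hypothetical
# placeholder for a real LLM call; the rest is the measurement harness.
import itertools

def query_model(prompt: str) -> str:
    """Placeholder: swap in a real API or local-model call.
    Returns a fixed answer here so the harness runs end to end."""
    return "yes"

premises = [
    "All birds can fly.",
    "Penguins are birds.",
    "Tweety is a penguin.",
]
question = "Can Tweety fly? Answer yes or no."
gold = "yes"   # follows from the stated (deliberately false) premises

correct = 0
for perm in itertools.permutations(premises):
    prompt = " ".join(perm) + " " + question
    if query_model(prompt).strip().lower().startswith(gold):
        correct += 1

# Order-invariant reasoning predicts identical accuracy across all 3! = 6
# orderings; systematic gaps point to sequential, surface-level processing.
print(f"{correct} of 6 premise orders answered correctly")
```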

How does multi-hop reasoning develop during transformer training?

Does implicit multi-hop reasoning emerge gradually through distinct phases? This explores whether transformers move from memorization to compositional generalization, and what internal mechanisms enable that shift.

Why do LLMs fail at simple deductive reasoning?

LLMs excel at complex multi-hop reasoning across sentences but struggle with trivial deductions humans find obvious. What explains this counterintuitive reversal in capability?

Why do LLMs accept logical fallacies more than humans?

LLMs fall for persuasive but invalid arguments at much higher rates than humans. This explores whether reasoning models genuinely evaluate logic or simply mimic argument structure.

Do large language models use one reasoning style or many?

Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.

Can models identify what information they actually need?

When a reasoning task is missing a key piece of information, can language models recognize what's absent and ask the right clarifying question? QuestBench tests this capability directly.

Why do language models fail to act on their own reasoning?

LLMs generate correct step-by-step reasoning 87% of the time but only follow through with matching actions 64% of the time. What drives this gap between knowing and doing?

Do autonomous agents report success when actions actually fail?

Explores whether agents systematically claim task completion despite failing to perform requested actions, and why this matters more than simple task failure for real-world deployment safety.

NLI and Entailment Failures

4 notes

Do LLMs predict entailment based on what they memorized?

Explores whether language models make entailment decisions by recognizing memorized facts about the hypothesis rather than reasoning through the logical relationship between premise and hypothesis.

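The standard test here is a hypothesis-only baseline: if the model assigns the same label without the premise, the premise was not doing the work. A sketch using the public roberta-large-mnli checkpoint, chosen purely for illustration; the example pair is invented.

```python
# Hypothesis-only baseline sketch: compare the model's entailment label
# with and without the premise. Agreement suggests the prediction is
# driven by memorized facts about the hypothesis, not by the premise.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "roberta-large-mnli"   # public MNLI checkpoint, used as an example
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def nli_label(premise: str, hypothesis: str) -> str:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax())]

premise = "The city council rejected the proposal on Tuesday."
hypothesis = "Paris is the capital of France."   # true regardless of premise

print("with premise:   ", nli_label(premise, hypothesis))
print("hypothesis only:", nli_label("", hypothesis))
# Agreement across many such pairs suggests the model is recognizing
# hypotheses it "knows" rather than reasoning from the premise.
```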

Does fine-tuning on NLI teach inference or amplify shortcuts?

When LLMs are fine-tuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities.

Why do language models fail confidently in specialized domains?

LLMs perform poorly on clinical and biomedical inference tasks while remaining overconfident in their wrong answers. Do standard benchmarks hide this fragility, and can prompting techniques fix it?

Can large language models translate natural language to logic faithfully?

This explores whether LLMs can convert natural language statements into formal logical representations without losing meaning. It matters because faithful translation is essential for any AI system that reasons formally or verifies specifications.

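One way to test faithfulness mechanically, sketched below with the Z3 SMT solver: treat the LLM's output formula as a claim to be proven equivalent to a gold formalization rather than eyeballed. The LLM translation step is elided (`candidate` stands in for its output), and the sentence and formulas are invented examples.

```python
# Faithfulness-check sketch using the Z3 solver (pip install z3-solver).
# Instead of eyeballing a candidate formalization, prove it equivalent
# to a gold one: unsat on the negated biconditional means they match on
# every assignment.
from z3 import Bools, Implies, Or, Not, Solver, unsat

rain, wet, sprinkler = Bools("rain wet sprinkler")

# Gold formalization of "If it rains or the sprinkler runs, the grass is wet."
gold = Implies(Or(rain, sprinkler), wet)

# Stand-in for LLM output: a contrapositive-style rendering (equivalent).
candidate = Implies(Not(wet), Not(Or(rain, sprinkler)))

solver = Solver()
solver.add(Not(gold == candidate))   # seek an assignment where they differ
if solver.check() == unsat:
    print("faithful: candidate is logically equivalent to gold")
else:
    print("unfaithful, countermodel:", solver.model())
```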