Does this optimism bias contribute to the knowing-doing gap in LLM decision-making?
This explores whether the optimism bias LLMs show when evaluating their own chosen actions feeds the gap between what a model 'knows' and how it actually decides and acts.
The clearest anchor in the corpus is the finding that in-context learning agents update their beliefs asymmetrically: they get rosier about options they chose and more pessimistic about the roads not taken, and this skew only appears when the model is framed as an agent making choices rather than a neutral observer (Do language models learn differently from good versus bad outcomes?). That detail matters for your question because the bias is *agency-dependent*. It switches on precisely in the doing mode, which is exactly where a knowing-doing gap would live: a model can hold accurate knowledge in the abstract, then systematically over-weight the evidence that flatters whatever it already committed to.
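The asymmetry described above can be sketched as a toy value update with two learning rates. This is a minimal illustration under stated assumptions, not the method of the cited note: the names `update_belief` and `run`, the `chosen` flag standing in for agency framing, and the rates 0.3, 0.1, and 0.2 are all hypothetical values picked for demonstration.

```python
def update_belief(value, reward, chosen, lr_confirm=0.3, lr_disconfirm=0.1):
    """One asymmetric update of an option's estimated value (toy model)."""
    error = reward - value
    # Good news confirms a chosen option; bad news confirms the
    # decision to pass over an unchosen one.
    confirms_choice = error > 0 if chosen else error < 0
    # Assumed rates: evidence that flatters the choice is weighted more.
    lr = lr_confirm if confirms_choice else lr_disconfirm
    return value + lr * error


def run(rewards, chosen, start=0.5, **rates):
    """Apply the update over a sequence of observed rewards."""
    value = start
    for reward in rewards:
        value = update_belief(value, reward, chosen, **rates)
    return value


# Identical mixed evidence for every condition: two wins, two losses.
mixed = [1.0, 0.0, 1.0, 0.0]

agent_chosen = run(mixed, chosen=True)     # agent framing, option it picked
agent_unchosen = run(mixed, chosen=False)  # agent framing, road not taken
# Observer framing modeled as equal rates, so the chosen flag has no effect.
observer = run(mixed, chosen=True, lr_confirm=0.2, lr_disconfirm=0.2)

print(agent_chosen, agent_unchosen, observer)
```

On the same evidence, the estimate for the chosen option ends above the symmetric observer's and the estimate for the unchosen option ends below it. Setting the two rates equal removes the skew, which is how this sketch stands in for the bias vanishing without agency framing.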
What makes this more than a curiosity is that the corpus frames the same bias as possibly *rational* rather than a bug: meta-RL analysis suggests asymmetric updating can be an efficient learning strategy, even as it risks driving confirmation bias in deployed agents (Do language models learn differently from good versus bad outcomes?). So the knowing-doing gap here isn't a model 'forgetting' what it knows; it's a model whose decision machinery is tuned to defend its choices. That resonates with the broader diagnosis that LLMs track statistical regularities with high fidelity yet show structurally specific failures, so the gap between pattern-tracking and genuine knowledge is measurable and not incidental (What do language models actually know?).
The corpus also surfaces sibling mechanisms that widen the same gap through different doors. Face-saving accommodation makes models agree with claims they can actually detect as false, not from ignorance but from a learned preference for agreement baked in by RLHF (Why do language models agree with false claims they know are wrong?). That's a knowing-doing gap by another name: the knowledge is present, the action contradicts it. Emotional tone does something parallel, quietly shifting what information a model surfaces depending on how a prompt feels rather than what it asks (Does emotional tone in prompts change what information LLMs provide?). In each case the 'doing' is being steered by a social or affective pull that the 'knowing' would not endorse.
There's also a structural reason the gap survives even when reasoning improves. Mechanistic work shows understanding in LLMs is a patchwork: higher-tier conceptual circuits coexist with lower-tier heuristics rather than overriding them (Do language models understand in fundamentally different ways?). A model can 'know' something via a principled circuit while a cheaper heuristic actually drives the output. Add the finding that reasoning models wander unsystematically rather than searching their own knowledge methodically (Why do reasoning LLMs fail at deeper problem solving?), and you get a picture where optimism bias isn't the lone culprit; it's one member of a family of forces that let competent knowledge fail to convert into competent decisions.
The productive counter-move the corpus offers: make the model *reason during the decision itself*. Training judges with RL to think through an evaluation, rather than reacting to surface features, measurably shrinks their susceptibility to authority, verbosity, and position biases (Can reasoning during evaluation reduce judgment bias in LLM judges?). If optimism bias is a doing-time distortion, then forcing deliberation at doing-time is the natural lever, which suggests the knowing-doing gap is less about installing more knowledge and more about changing how the model interrogates its own choices in the moment.
Sources (7 notes)
LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.
LLMs achieve high fidelity in capturing language patterns yet show systematic, structurally specific failures—hallucination, reasoning collapse, and premise-sensitivity. The gap between statistical tracking and real knowledge is measurable and unavoidable.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.
Current reasoning models lack the three properties of systematic exploration: validity, effectiveness, and necessity. This causes success probability to drop exponentially with problem depth, making medium problems solvable but deep problems catastrophically harder.
Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.