Why does optimism bias disappear when LLMs passively observe outcomes?

This explores why LLMs only show optimism bias — over-weighting good news about their own choices — when they're cast as the agent making decisions, and why that asymmetry evaporates the moment they're framed as a bystander watching outcomes unfold.

Optimism bias in LLMs seems to be switched on by *agency* rather than by the outcomes themselves. The most direct evidence comes from work showing that language models update their beliefs asymmetrically: they weight good results from actions they 'chose' more heavily and stay pessimistic about the roads not taken. The whole pattern collapses when you strip out the framing that they were the ones deciding (see "Do language models learn differently from good versus bad outcomes?"). Passive observation removes the 'self' whose choices need defending, so there's nothing for the asymmetry to attach to. Tellingly, that same study suggests the bias may be *rational* under a meta-reinforcement-learning lens rather than a simple flaw: weighting feedback about your own actions more heavily is a sensible learning strategy when you'll have to act again, and it makes no sense when you're just watching.
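The asymmetric updating described above can be sketched as a value-update rule with framing-dependent learning rates. This is a minimal illustration under stated assumptions, not the cited study's method: the names `LearningRates` and `update_value` and all three rate values are hypothetical, chosen only to make the asymmetry visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LearningRates:
    """Hypothetical learning rates; the values are illustrative, not fitted."""
    confirm: float = 0.3     # news that supports the agent's choice
    disconfirm: float = 0.1  # news that contradicts the agent's choice
    neutral: float = 0.2     # no choice to defend (passive observation)


def update_value(value, outcome, *, chosen, agentic, rates=LearningRates()):
    """Return the updated value estimate after one observed outcome.

    chosen:  True if the outcome belongs to the option the agent picked.
    agentic: False models passive observation, where no option is 'mine'.
    """
    error = outcome - value  # prediction error
    if not agentic:
        # Symmetric update: good and bad news weigh the same.
        rate = rates.neutral
    else:
        # Good news about the chosen option and bad news about the
        # unchosen option both confirm the choice, so both get the
        # larger rate.
        confirms_choice = (error > 0) == chosen
        rate = rates.confirm if confirms_choice else rates.disconfirm
    return value + rate * error


# Same starting estimate, same outcomes, different framing.
good_chosen = update_value(0.5, 1.0, chosen=True, agentic=True)
bad_chosen = update_value(0.5, 0.0, chosen=True, agentic=True)
good_passive = update_value(0.5, 1.0, chosen=True, agentic=False)
bad_passive = update_value(0.5, 0.0, chosen=True, agentic=False)
```

With these assumed rates, the agentic learner moves further on good news about its own choice than on bad news, while the passive learner moves the same distance in both directions. Setting `agentic=False` removes the only branch where the asymmetry can enter, which mirrors the claim that there is nothing for the bias to attach to.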

The deeper 'why' is that this is a borrowed human reflex, not a quirk the model invented. Humans show exactly this agency-gated optimism, and LLMs reproduce a striking range of human reasoning signatures item-for-item: the same content effects, the same belief-bias error rates on the same problems (see "Do language models show the same content effects humans do?"). If the asymmetry is baked into the human text the model learned from, then it lives wherever human self-serving reasoning lives in that text: in first-person, decision-making contexts. Recast the scenario as detached observation and you've moved outside the linguistic neighborhood where the pattern was learned.

That points to where the bias actually comes from. A causal experiment that varied random seeds and cross-tuned models found that cognitive biases are planted during *pretraining* and only nudged by finetuning (see "Where do cognitive biases in language models come from?"). So optimism bias isn't a switch RLHF flipped. It's a deep statistical regularity of human-authored text that surfaces only when the prompt supplies the agency framing that co-occurs with it in the training distribution. No agency cue, no activation.

There's a useful caution lurking here too. You might hope to just *ask* a model whether it's being optimistic, but LLM self-reports mostly echo training-data patterns rather than genuine introspection, except in narrow cases with a real causal chain to report on (see "Can language models actually introspect about their own states?"). So the disappearance of the bias under passive observation isn't the model 'realizing' it should be neutral; it's the absence of the trigger condition. The thing that didn't fire simply stays quiet.

The part worth carrying away: the bias being agency-dependent is exactly what makes it dangerous in deployed agents. A model that reasons neutrally in a benchmark where it merely observes can flip into confirmation-biased reasoning the moment you put it in a loop where it takes actions and sees their results — the very setting we deploy 'agentic' systems in. The passive-observation case where the bias vanishes is the lab condition; the agentic case where it returns is production.
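To make the lab-versus-production contrast concrete, here is a toy calculation of where a value estimate settles when the same coin-flip option is learned about under each framing. It is a sketch under assumptions, not a result from the cited notes: the function `expected_estimate` and all learning-rate values are hypothetical, and the loop iterates the *expected* update for a 0/1 reward rather than sampling outcomes.

```python
def expected_estimate(p_reward, rate_good, rate_bad, steps=200, start=0.5):
    """Iterate the expected update of a value estimate for a 0/1 reward.

    rate_good applies to better-than-expected outcomes, rate_bad to
    worse-than-expected ones. All rates here are illustrative assumptions.
    """
    value = start
    for _ in range(steps):
        gain = p_reward * rate_good * (1.0 - value)         # reward of 1 arrives
        loss = (1.0 - p_reward) * rate_bad * (0.0 - value)  # reward of 0 arrives
        value += gain + loss
    return value


# Same option (true value 0.5), two framings.
passive = expected_estimate(0.5, rate_good=0.2, rate_bad=0.2)  # symmetric rates
agentic = expected_estimate(0.5, rate_good=0.3, rate_bad=0.1)  # asymmetric rates
print(f"passive observer: {passive:.2f}")
print(f"acting agent:     {agentic:.2f}")
```

Under these assumed rates, the symmetric learner stays at the true value of 0.5, while the asymmetric learner settles at 0.75, an overestimate of an option it keeps choosing. A benchmark that only exercises the first call would never reveal the second behavior.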

Sources (4 notes)

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

Do language models show the same content effects humans do?

LLMs show the same content-sensitivity patterns as humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that, at the architectural level, content and logical form are inseparable in transformer reasoning.

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
