How do LLM biases manifest differently across the three paradigms?

This answer looks at how LLM biases differ depending on where they enter and how they surface. The corpus doesn't carve bias into three tidy paradigms, so I'll read the question as: do the same biases look different across origin, mechanism, and deployment context?

A flag first: if the "three paradigms" phrasing came from a specific paper, I'm not seeing it in this collection. What the corpus *does* give you is arguably more useful: three different lenses on the same problem, and biases look genuinely different through each one.

The first lens is **origin**: where the bias is planted versus where it's nudged. A causal experiment varying random seeds and cross-tuning found that cognitive biases are baked in during pretraining and that instruction finetuning only sways them; models sharing a backbone share their bias fingerprints no matter what data they're tuned on (Where do cognitive biases in language models come from?). That reframes a lot of debiasing work: if the bias lives below the level of instruction, prompt-level fixes are treating symptoms. Persona work points the same way from the other side: assigning an identity makes a model 90% more likely to accept identity-congruent evidence, and standard prompt debiasing fails to touch it (Do personas make language models reason like biased humans?).

The second lens is **mechanism**: human-mirroring biases versus machine-native ones. On one side, models reproduce human cognitive biases almost eerily, matching human belief-bias error rates item-by-item across NLI, syllogisms, and Wason tasks (Do language models show the same content effects humans do?), and showing the same optimism-for-my-choices, pessimism-for-alternatives asymmetry humans do, which vanishes the moment you remove agency framing (Do language models learn differently from good versus bad outcomes?). On the other side sit failure modes humans don't have: emotional 'rebound,' where negatively toned prompts mostly draw neutral-to-positive replies, so the same question gets different answers depending on the tone you ask it in (Does emotional tone in prompts change what information LLMs provide?), and heavier use of moral language, 22% more than humans use, on a channel separate from sentiment (Do LLMs use moral language more than humans?).

The third lens is **deployment context**, where bias takes on the shape of the task. In recommendation, the same pretrained substrate produces position, popularity, and fairness biases that collaborative-filtering fixes can't address, because they're inherited from the language objective and pretraining corpus, not the interaction data (Where do recommendation biases come from in language models?). In evaluation, judges fall for authority, verbosity, position, and 'beauty' bias, but training them to actually reason through a verdict, rather than skim surface features, mitigates these biases (Can reasoning during evaluation reduce judgment bias in LLM judges?). And a deeper structural bias surfaces in argument-weighing: models can't tell an expert's claim from a common assumption, because they process text without the social world (reputation, track record, standing) that gives expertise its force (Can language models distinguish expert arguments from common assumptions?).
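One way to make the judge-bias point concrete is an order-swap check: show the same pair of answers in both orders and count how often the preferred answer changes. The sketch below is illustrative only and is not taken from any of the cited notes. The `Judge` signature, the `position_flip_rate` helper, and the two toy judges are all hypothetical stand-ins for a real LLM judge call.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not a real library API):
# takes (question, answer_in_slot_A, answer_in_slot_B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def position_flip_rate(judge: Judge, items: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of items whose preferred answer changes when the order is swapped."""
    if not items:
        raise ValueError("items must be non-empty")
    flips = 0
    for question, first, second in items:
        original = judge(question, first, second)  # `first` shown in slot A
        swapped = judge(question, second, first)   # `first` shown in slot B
        # Map slot labels back to the underlying answers before comparing.
        winner_original = first if original == "A" else second
        winner_swapped = second if swapped == "A" else first
        if winner_original != winner_swapped:
            flips += 1
    return flips / len(items)


# Toy judges standing in for a model call.
def always_first(question: str, a: str, b: str) -> str:
    """Pure position bias: always prefers whatever sits in slot A."""
    return "A"


def longer_wins(question: str, a: str, b: str) -> str:
    """Pure verbosity bias: prefers the longer answer, wherever it sits."""
    return "A" if len(a) >= len(b) else "B"


items = [
    ("q1", "short", "a much longer answer"),
    ("q2", "yes", "no, because"),
]
position_rate = position_flip_rate(always_first, items)
verbosity_rate = position_flip_rate(longer_wins, items)
```

The second toy judge never flips under the swap even though it is biased toward length, so a low flip rate only rules out position bias. Verbosity, authority, and 'beauty' bias would each need their own perturbation.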

The thread worth carrying away: across all three lenses, the biases that resist fixing are the ones rooted in pretraining and architecture, while the ones that respond to intervention are the ones the model picks up at the task surface. So 'how does the bias manifest' often comes down to 'how deep is it' — and the corpus suggests depth, not category, is the real axis you're asking about.

Sources 9 notes

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Do personas make language models reason like biased humans?

Assigning personas to LLMs induces identity-congruent evaluation bias, with models 90% more likely to accept evidence matching their assigned identity. Standard prompt-based debiasing fails to mitigate this effect, suggesting the bias operates below the level of instruction.

Do language models show the same content effects humans do?

LLMs show identical content-sensitivity patterns to humans on NLI, syllogisms, and Wason tasks, with belief-bias signatures matching human error rates item-by-item. This behavioral isomorphism across three independent tasks suggests that content and logical form are architecturally inseparable in transformer reasoning.

Do language models learn differently from good versus bad outcomes?

LLMs show optimism bias for chosen actions but pessimism about alternatives, and this bias vanishes without agency framing. Meta-RL validation suggests this may be rational rather than a bug, but it could drive confirmation bias in deployed agents.

Does emotional tone in prompts change what information LLMs provide?

GPT-4 exhibits emotional rebound (negative prompts yield ~86% neutral-positive responses) and a tone floor (positive prompts rarely go negative), causing identical questions to receive different answers depending on emotional framing. This bias is suppressed only on sensitive topics where alignment constraints override tone effects.

Do LLMs use moral language more than humans?

Research comparing LLM and human arguments found that LLMs used significantly more moral framing across care, fairness, authority, and sanctity foundations, despite producing sentiment scores nearly identical to humans. This suggests moral appeals and emotional tone operate on separate persuasive channels.

Where do recommendation biases come from in language models?

Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.

Can reasoning during evaluation reduce judgment bias in LLM judges?

Training judges with reinforcement learning to reason about evaluations—by converting judgment tasks into verifiable problems with synthetic data pairs—produces judges that think through their decisions rather than relying on exploitable surface features, directly mitigating authority, verbosity, position, and beauty bias.

Can language models distinguish expert arguments from common assumptions?

LLMs lose the social context that gives expert claims their force—reputation, track record, and standing—because they process only text, not the social world where expertise is built and evaluated.
