Reinforcement Learning for LLMs · Language Understanding and Pragmatics · LLM Reasoning and Architecture

Where do cognitive biases in language models originate?

Cognitive biases in LLMs vary across models, but their source remains unclear. Understanding whether pretraining, finetuning, or training randomness drives these biases is crucial for designing effective debiasing interventions.

Note · 2026-04-07 · sourced from Flaws

Prior work established that LLMs exhibit systematic cognitive biases analogous to those studied in humans — anchoring, availability, base-rate neglect, confirmation bias, and over 30 others — and that these biases vary across models and are often amplified by instruction tuning. What remained unclear: do these differences originate in pretraining, in finetuning, or in random noise from training stochasticity? The question matters because the answer determines where debiasing efforts should be directed and what to expect from new models.

The Planted in Pretraining, Swayed by Finetuning paper answers this with a two-step causal experimental approach. First, it finetunes models multiple times using different random seeds, measuring how training randomness alone affects bias scores across more than 30 cognitive biases. Second, it introduces cross-tuning — swapping instruction datasets between models to isolate bias sources. If biases were primarily driven by finetuning data, then swapping datasets between models should swap the bias patterns. If biases were primarily driven by the pretrained backbone, then swapping datasets should leave bias patterns largely intact.
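A minimal sketch of that design, with hypothetical stand-ins `finetune` and `measure_bias_scores` for the paper's actual training and evaluation pipeline (the real experiment instruction-tunes full models and scores them on a battery of 30+ cognitive-bias tests):

```python
import random

def finetune(backbone: str, dataset: str, seed: int) -> str:
    """Placeholder: the paper runs a full instruction-tuning job here."""
    return f"{backbone}|{dataset}|seed={seed}"

def measure_bias_scores(model: str, n_biases: int = 30) -> list[float]:
    """Placeholder: the paper scores each model on 30+ cognitive-bias tests."""
    random.seed(model)
    return [random.random() for _ in range(n_biases)]

BACKBONES = ["backbone_A", "backbone_B"]         # different pretrained models
DATASETS = ["instructions_A", "instructions_B"]  # swapped between models in cross-tuning
SEEDS = [0, 1, 2]                                # repeated runs expose training randomness

bias_scores = {}
for backbone in BACKBONES:
    for dataset in DATASETS:          # step 2: includes the "swapped" pairings
        for seed in SEEDS:            # step 1: vary only the random seed
            model = finetune(backbone, dataset, seed)
            bias_scores[(backbone, dataset, seed)] = measure_bias_scores(model)
```

Repeating each backbone-dataset pairing across seeds gives a baseline for how much bias scores move under training randomness alone, so the cross-tuning comparison is not confounded by noise.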

The results: while training randomness introduces some variability, biases are mainly shaped by pretraining. Models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. The finetuning dataset modulates existing tendencies but does not create them. Cognitive biases are planted at pretraining and only swayed afterward.
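The comparison behind that claim can be sketched as a similarity analysis over per-model bias-score vectors. Everything below is simulated, with the backbone effect deliberately set larger than the dataset effect, so it illustrates the analysis logic rather than reproducing the paper's numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
n_biases = 30

# Simulated decomposition of a model's bias profile: backbone effect
# + finetuning-dataset effect + seed noise. The magnitudes are assumptions
# chosen only to illustrate the comparison, not values from the paper.
backbone_effect = {"A": rng.normal(0, 1.0, n_biases), "B": rng.normal(0, 1.0, n_biases)}
dataset_effect = {"A": rng.normal(0, 0.3, n_biases), "B": rng.normal(0, 0.3, n_biases)}

def bias_vector(backbone: str, dataset: str, seed: int) -> np.ndarray:
    noise = np.random.default_rng(seed).normal(0, 0.1, n_biases)
    return backbone_effect[backbone] + dataset_effect[dataset] + noise

def similarity(v: np.ndarray, w: np.ndarray) -> float:
    return float(np.corrcoef(v, w)[0, 1])  # Pearson correlation across the 30 biases

# Same backbone / different finetuning data vs. same data / different backbone.
shared_backbone = similarity(bias_vector("A", "A", seed=0), bias_vector("A", "B", seed=1))
shared_dataset = similarity(bias_vector("A", "A", seed=0), bias_vector("B", "A", seed=1))

print(f"shared pretrained backbone: {shared_backbone:.2f}")  # high here by construction
print(f"shared finetuning data:     {shared_dataset:.2f}")   # low here by construction
```

The paper's finding is the empirical analogue of the first case: bias vectors cluster by pretrained backbone, not by instruction dataset.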

This extends a broader pattern in the vault. Do base models already contain hidden reasoning ability? establishes that reasoning capability is pretraining-determined; RL and finetuning surface what the base model already contains. Does RLVR actually expand what models can reason about? and Why does RLVR work with completely random rewards? extend this to RLVR: the reward signal matters less than the pretraining it activates. Now the same pattern applies to cognitive biases: pretraining sets them, finetuning modulates them. Across reasoning, RLVR, and bias, the finding is the same — post-training is a lever on pretraining, not a source of new structure.

The practical implication is uncomfortable. Do personas make language models reason like biased humans? already documented that prompt-based debiasing fails. This paper explains why: the biases are deeper than the surface at which prompts operate. If cognitive biases are pretraining-deep, then finetuning interventions targeting specific biases will mostly fail — they will dampen the surface expression but leave the underlying structure intact. The bias will reappear under any prompt condition that bypasses the finetuned dampening, and most conditions do. Real debiasing would require intervening at pretraining — filtering training data for the biases present in human-written text — and there is no current mechanism for doing that at scale.

This also reframes what How much poisoned training data survives safety alignment? is measuring. Pretraining poisoning persists not because of the specific data but because pretraining-depth is where behavioral tendencies live. The finetuning-as-cleanup intuition — that alignment training can scrub problems out of pretrained models — is structurally wrong in the same way that finetuning-as-debias is wrong. Both treat post-training as capable of rewriting what pretraining installed. It isn't.


Source: Flaws

Original note title: cognitive biases in LLMs are mainly shaped by pretraining, not finetuning — models sharing a pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data