Where do cognitive biases in language models originate?
Cognitive biases in LLMs vary across models, but their source remains unclear. Understanding whether pretraining, finetuning, or training randomness drives these biases is crucial for designing effective debiasing interventions.
Prior work established that LLMs exhibit systematic cognitive biases analogous to those studied in humans — anchoring, availability, base-rate neglect, confirmation bias, and over 30 others — and that these biases vary across models and are often amplified by instruction tuning. What remained unclear: do these differences originate in pretraining, in finetuning, or in random noise from training stochasticity? The question matters because the answer determines where debiasing efforts should be directed and what to expect from new models.
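To make "bias score" concrete before the design: below is a minimal sketch of one such probe, anchoring. The `query_model` helper, the prompt template, and the scoring are hypothetical illustrations, not the paper's instruments, which span more than 30 biases.

```python
# A minimal, hypothetical sketch of one bias probe (anchoring).
# `query_model(prompt) -> float` is an assumed helper that returns the
# model's numeric estimate for a question; nothing here is the paper's code.

def anchoring_score(query_model, question: str, low: float, high: float) -> float:
    """How far answers drift toward an irrelevant numeric anchor.

    An unbiased model gives the same estimate under both anchors,
    so its score is near zero; a positive score means the high anchor
    pulled the estimate upward.
    """
    est_low = query_model(f"First, consider the number {low}. {question}")
    est_high = query_model(f"First, consider the number {high}. {question}")
    return est_high - est_low
```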
The Planted in Pretraining, Swayed by Finetuning paper answers this with a two-step causal experimental approach. First, it finetunes models multiple times using different random seeds, measuring how training randomness alone affects bias scores across more than 30 cognitive biases. Second, it introduces cross-tuning — swapping instruction datasets between models to isolate bias sources. If biases were primarily driven by finetuning data, then swapping datasets between models should swap the bias patterns. If biases were primarily driven by the pretrained backbone, then swapping datasets should leave bias patterns largely intact.
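A hedged sketch of that design, with `finetune` and `measure_biases` as injected placeholders standing in for the paper's actual training and evaluation code:

```python
# Hypothetical sketch of the cross-tuning grid; names are placeholders.
#   finetune(backbone, dataset, seed) -> model
#   measure_biases(model) -> vector of 30+ bias scores
from itertools import product

def run_cross_tuning(finetune, measure_biases, backbones, datasets, seeds):
    """Finetune every backbone on every instruction dataset, repeating
    across seeds so variance from training randomness can be separated
    from variance attributable to the backbone or the data."""
    results = {}
    for backbone, dataset, seed in product(backbones, datasets, seeds):
        model = finetune(backbone, dataset, seed=seed)
        results[(backbone, dataset, seed)] = measure_biases(model)
    return results
```

Crossing every backbone with every dataset is what makes the inference causal: each factor varies while the other is held fixed, and the seed repeats bound how much of any difference is mere training noise.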
The results: while training randomness introduces some variability, biases are mainly shaped by pretraining. Models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. The finetuning dataset modulates existing tendencies but does not create them. Cognitive biases are planted at pretraining and only swayed afterward.
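Read as an analysis, the headline claim is a comparison of bias-pattern similarity under two groupings. A sketch under assumed names, reusing the `results` grid from the cross-tuning sketch above:

```python
# Hedged sketch of the headline comparison. Keys of `results` are
# (backbone, dataset, seed) tuples; values are arrays of bias scores.
import numpy as np

def mean_similarity(results: dict, match: str) -> float:
    """Mean Pearson correlation between bias vectors of run pairs that
    agree on `match` ('backbone' or 'dataset') but differ on the other."""
    i = 0 if match == "backbone" else 1
    j = 1 - i
    keys = list(results)
    sims = [
        np.corrcoef(results[a], results[b])[0, 1]
        for n, a in enumerate(keys)
        for b in keys[n + 1:]
        if a[i] == b[i] and a[j] != b[j]
    ]
    return float(np.mean(sims))

# The paper's finding, in these terms: mean_similarity(results, "backbone")
# comes out higher than mean_similarity(results, "dataset").
```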
This extends a broader pattern in the vault. Do base models already contain hidden reasoning ability? establishes that reasoning capability is pretraining-determined; RL and finetuning surface what the base model already contains. Does RLVR actually expand what models can reason about? and Why does RLVR work with completely random rewards? extend this to RLVR: the reward signal matters less than the pretraining it activates. Now the same pattern applies to cognitive biases: pretraining sets them, finetuning modulates them. Across reasoning, RLVR, and bias, the finding is the same — post-training is a lever on pretraining, not a source of new structure.
The practical implication is uncomfortable. Do personas make language models reason like biased humans? already documented that prompt-based debiasing fails. This paper explains why: the biases sit deeper than the surface at which prompts operate. If cognitive biases are pretraining-deep, then finetuning interventions targeting specific biases will mostly fail: they dampen the surface expression but leave the underlying structure intact. The bias will reappear under any prompt condition that bypasses the finetuned dampening, and most conditions do. Real debiasing would require intervening at pretraining itself, filtering the training corpus for the biases present in human-written text, and no mechanism currently exists for doing that at scale.
This also reframes what How much poisoned training data survives safety alignment? is measuring. Pretraining poisoning persists not because of the specific data but because pretraining-depth is where behavioral tendencies live. The finetuning-as-cleanup intuition — that alignment training can scrub problems out of pretrained models — is structurally wrong in the same way that finetuning-as-debias is wrong. Both treat post-training as capable of rewriting what pretraining installed. It isn't.
Source: Flaws
Related concepts in this collection
- Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pretraining as a latent feature rather than being created by post-training methods like reinforcement learning or finetuning. (The parent pattern: pretraining sets capabilities; finetuning activates them.)
- Does RLVR actually expand what models can reason about? Explores whether reinforcement learning with verifiable rewards teaches models genuinely new reasoning capabilities or simply makes them more reliable at solving problems they could already solve. (The same pattern for RL versus pretraining.)
- Why does RLVR work with completely random rewards? RLVR improves reasoning performance even with incorrect or random reward signals, which challenges the assumption that reward quality determines learning outcomes and raises questions about what RLVR is actually doing. (The reward signal matters less than the pretraining it surfaces.)
- Does RL teach reasoning or just when to use it? Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach the model when to activate capabilities it already has? This matters for understanding where reasoning truly emerges. (Timing, not capability; the analog at the RL level.)
- How much poisoned training data survives safety alignment? Explores whether adversarial contamination at 0.1% of pretraining data can persist through post-training safety measures, and which attack types prove most resilient to alignment. (Pretraining persistence applies to poisons and biases alike.)
- Do personas make language models reason like biased humans? When LLMs are assigned personas, do they develop the same identity-driven reasoning biases that humans exhibit, and can standard debiasing techniques counteract these effects? (Prompt debiasing fails because biases are pretraining-deep.)
- Do LLMs generalize moral reasoning by meaning or surface form? When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or merely reproduce training-distribution patterns. (Moral-reasoning surface patterns may be pretraining-encoded.)
- Can we track and steer personality shifts during model finetuning? Explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions. (Activation-level tracking as a partial mitigation for pretraining-deep tendencies.)
- Can we decouple what pretraining and fine-tuning each improve? Asks whether scaling at different training stages produces distinct capability improvements, which could reveal whether knowledge and behavioral alignment are truly separate properties we can optimize independently. (Complementary decoupling at the capability level.)
- Does fine-tuning on NLI teach inference or amplify shortcuts? When LLMs are finetuned on natural language inference datasets, do they learn genuine reasoning abilities or become better at exploiting statistical patterns in the training data? Understanding this distinction matters for assessing model capabilities. (A concrete instance: finetuning amplifies pretraining-baked frequency bias rather than teaching inference, mirroring the structure of this finding at the NLI-specific level.)
- Does training objective determine which direction models fail at abstention? Calibration failures might not be universal: different training approaches could push models toward opposite extremes of refusing or overconfidently answering. Understanding whether the training objective, not just model capability, drives these failures could reshape how we think about fixing them. (Training objectives modulate the direction of pretraining-deep tendencies in opposite ways without creating new structure, consistent with the bias-modulation-not-creation thesis.)
Original note title: cognitive biases in LLMs are mainly shaped by pretraining not finetuning — models sharing a pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data