Can data filtering during pretraining prevent cognitive biases in language models?

This explores whether cleaning the training data before pretraining can stop language models from absorbing cognitive biases — and the corpus suggests filtering is a weaker lever than it sounds, because biases enter through statistical signatures rather than spottable content.

The most direct answer in the corpus is sobering: biases are largely a *pretraining* phenomenon, not something you patch later. A causal experiment that varied random seeds and cross-tuned models found that models sharing a pretrained backbone show similar bias patterns regardless of the instruction data used for finetuning; biases are planted during pretraining and only swayed afterward (Where do cognitive biases in language models come from?). So the *timing* instinct behind the question is right: if you want to intervene, pretraining is where the leverage is. The harder question is whether filtering the data is a strong enough intervention.

Here the corpus turns skeptical. The most striking finding is that behavioral traits can pass between models through data that has *no semantic relationship to the trait at all*, and the effect persists even after rigorous filtering, because what is transmitted is a statistical signature rather than readable content you could catch and remove (Can language models transmit hidden behavioral traits through unrelated data?). The same note reports that the effect is model-specific and fails across different architectures, so it is not a universal channel. Still, if a trait can survive filtering precisely because it doesn't live in anything a filter can see, then filtering-as-prevention has a structural ceiling. You can scrub the obvious and still ship the bias.
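A toy sketch of why a content filter can be blind to a distributional signal. Everything here is hypothetical: the blocklist words, the number-sequence samples, and the digit-frequency statistic are illustrative stand-ins, not the cited study's data or its actual mechanism.

```python
from collections import Counter

# Hypothetical trait keywords a content filter might search for.
BLOCKLIST = {"trait", "preference"}

def passes_filter(sample: str) -> bool:
    """Return True when no blocklisted word appears in the sample."""
    tokens = sample.lower().replace(",", " ").split()
    return not any(token in BLOCKLIST for token in tokens)

def digit_share(samples: list[str], digit: str) -> float:
    """Fraction of all digit characters in the samples equal to `digit`."""
    counts = Counter(ch for s in samples for ch in s if ch.isdigit())
    total = sum(counts.values())
    return counts[digit] / total if total else 0.0

# Made-up samples: neither set contains any readable trait content.
baseline = ["12, 45, 78", "90, 36, 21"]
teacher = ["77, 17, 71", "27, 70, 47"]

# The filter keeps every teacher sample because nothing matches the blocklist.
kept = [s for s in teacher if passes_filter(s)]
print(len(kept) == len(teacher))

# A simple statistic still separates the two sources after filtering.
print(digit_share(baseline, "7"), digit_share(kept, "7"))
```

The filter inspects words, so it passes all of the teacher samples, while the share of the digit 7 still differs sharply between the two sets. A real statistical signature is likely far subtler than a single digit frequency, which is what makes it hard to filter.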

Two adjacent results sharpen why. First, biases imprint fast and predictably: post-learning keyword priming is forecastable from a keyword's pre-learning probability, with a threshold of roughly 10^-3 separating contexts where priming occurs from those where it doesn't, and as few as three training exposures are enough to establish the effect (Can we predict keyword priming before learning happens?). So a bias doesn't need to be common in the corpus to take hold; light contamination clears the bar. Second, once a prior is baked in, it dominates: models generate outputs that contradict their own context because parametric knowledge from training overrides in-context information, and prompting alone can't fix it; you need causal intervention in the representations (Why do language models ignore information in their context?). Together these say a bias is both easy to plant and hard to talk a model out of afterward.
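The priming result can be read as a simple decision rule. A minimal sketch, with two loud assumptions: the function name and signature are hypothetical, and the direction of the threshold (priming expected when the pre-learning probability is *below* roughly 10^-3) is an assumption, since the note only says the threshold separates the two regimes.

```python
PRIMING_THRESHOLD = 1e-3  # approximate threshold reported in the cited note
MIN_EXPOSURES = 3         # exposures the cited note reports as sufficient

def priming_expected(pre_learning_prob: float, exposures: int,
                     threshold: float = PRIMING_THRESHOLD) -> bool:
    """Rule-of-thumb forecast of whether keyword priming takes hold.

    ASSUMPTION: priming is expected when the keyword was improbable
    before learning, i.e. its probability is below the threshold.
    """
    if not 0.0 <= pre_learning_prob <= 1.0:
        raise ValueError("pre_learning_prob must be a probability in [0, 1]")
    return pre_learning_prob < threshold and exposures >= MIN_EXPOSURES

print(priming_expected(1e-4, exposures=3))
print(priming_expected(1e-2, exposures=3))
```

This is a coarse classifier, not a model of effect size. It only captures the claim that a cheap pre-learning measurement forecasts a post-learning behavior.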

There's also a deeper trap worth naming: the dream of bias-free data assumes you can identify bias as a property of the dataset. The 'theory-free AI' critique argues this is a fallacy: models that look clean and accurate can launder bias through correlation-as-causation, where high accuracy metrics mask the harm rather than reveal it (Can AI models be truly free from human bias?). And biases that look like reasoning are even slipperier: most models exploit a conservative default rather than actually evaluating constraints, and they perform *worse* when the constraint is removed, a bias hiding behind apparent competence (Are models actually reasoning about constraints or just defaulting conservatively?). Filtering can't remove what you can't recognize as bias in the first place.
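The cited note's example of a 95% accurate system can be made concrete with simple arithmetic. A minimal sketch, assuming a hypothetical caseload of 100,000 decisions; the caseload and the function name are illustrative and do not come from the source.

```python
def expected_errors(num_cases: int, accuracy: float) -> int:
    """Expected number of wrong decisions at a given accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be in [0, 1]")
    return round(num_cases * (1.0 - accuracy))

# Hypothetical caseload, chosen only to make the scale visible.
cases = 100_000
print(expected_errors(cases, 0.95))  # 5000 wrong decisions
```

The count covers every kind of error, so it is an upper bound on wrongful convictions specifically. Either way, a headline accuracy figure says nothing about who bears those errors.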

The takeaway: pretraining filtering is necessary but not sufficient. Pretraining is the right stage to act on, but the corpus points toward interventions that work *on the model's internals*, meaning causal edits to representations rather than curation of inputs, as the lever that actually moves entrenched bias. Filtering catches the biases that look like content; the ones that travel as statistics walk right through.

Sources 6 notes

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Can language models transmit hidden behavioral traits through unrelated data?

Research demonstrates that behavioral traits propagate between models via filtered data bearing no semantic relationship to the trait. The effect is model-specific, fails across different architectures, and persists despite rigorous filtering—indicating the mechanism embeds statistical signatures rather than semantic content.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can AI models be truly free from human bias?

Research shows that 'theory-free' AI models mask bigotry behind high accuracy metrics while committing fundamental statistical errors. A 95% accurate criminal justice system would wrongly convict thousands, demonstrating that model sophistication does not validate causal inference.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.
