Can prompt-based debiasing work if biases are embedded in pretraining?
This explores whether prompting tricks can fix model bias when the bias was baked in during pretraining — and the corpus is unusually direct that the answer is mostly no.
The distinction that matters is whether a bias was embedded in pretraining or introduced later during fine-tuning. The corpus has a sharp, somewhat deflating answer: where a bias lives determines what can move it, and prompts operate at the wrong layer to dislodge something planted deep. The cleanest evidence is a causal experiment showing that cognitive biases in LLMs are shaped primarily by pretraining, with finetuning only nudging them: models sharing a pretrained backbone show similar bias patterns regardless of what instruction data is layered on top (see: Where do cognitive biases in language models come from?). If finetuning barely moves these biases, a prompt, which changes nothing about the weights, plausibly has even less leverage.
That ceiling shows up again from a different angle. Prompt optimization can reorganize and surface what a model already learned, but it cannot inject anything new; prompting works strictly inside the pretrained distribution (see: Can prompt optimization teach models knowledge they lack?). A debiasing prompt is essentially trying to suppress a strong learned association by asking nicely, and the corpus shows why that fails mechanistically: when parametric knowledge from training is strong, models override the instructions sitting right in their context. Textual prompting alone can't beat strong priors; overriding them requires causal intervention in the model's internal representations (see: Why do language models ignore information in their context?). A bias embedded in pretraining is exactly that kind of strong prior.
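To make "intervention in the model's internal representations" concrete, here is a minimal sketch of one possible form: projecting a single direction out of a hidden-state vector. The source notes do not say which intervention is meant, so this choice is an assumption for illustration only. The function name `remove_direction`, the toy vectors, and the idea that a bias corresponds to one known direction are all hypothetical.

```python
import math


def remove_direction(hidden, direction):
    """Return `hidden` with its component along `direction` projected out.

    Hypothetical helper: assumes the unwanted association is captured by a
    single known direction, which is a strong simplifying assumption.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    # Scalar projection of the hidden state onto the unit direction.
    coeff = sum(h * u for h, u in zip(hidden, unit))
    # Subtract that component, leaving the orthogonal remainder.
    return [h - coeff * u for h, u in zip(hidden, unit)]


# Toy 3-dimensional "hidden state" and "bias direction" (made-up values).
hidden = [2.0, 1.0, 0.0]
bias_direction = [1.0, 0.0, 0.0]
edited = remove_direction(hidden, bias_direction)
```

The contrast with prompting is that the edit acts on the computation itself rather than on the text the model reads, which is why it can in principle reach a prior that instructions cannot.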
So what does work, if not prompts? The interesting turn is that the corpus points toward methods that sit between prompting and full retraining. Consistency training teaches a model to respond the same way to clean and adversarially wrapped prompts, using its own clean answers as targets; it hardens the model against prompt-level manipulation rather than relying on instructions (see: Can models learn to ignore irrelevant prompt changes?). Proxy-tuning shifts behavior at decoding time while leaving base weights untouched, which preserves stored knowledge better than direct fine-tuning, which tends to corrupt lower-layer storage (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). These suggest the realistic lever is distributional steering at inference, not a clever system prompt.
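The decoding-time idea can be sketched in a few lines. Proxy-tuning is commonly described as adding to the base model's next-token logits the difference between a small tuned "expert" and its untuned counterpart. The source notes do not spell out the formula, so treat the version below as an assumption, and the function names and toy logits as hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base, expert, anti_expert):
    """Shift base logits by (expert - anti_expert), elementwise.

    Assumes all three models share one vocabulary. No weights are changed;
    only the next-token distribution is adjusted at decoding time.
    """
    if not (len(base) == len(expert) == len(anti_expert)):
        raise ValueError("all logit vectors must cover the same vocabulary")
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]


# Toy 3-token vocabulary with made-up logits.
base = [2.0, 1.0, 0.0]          # large untuned model prefers token 0
anti_expert = [1.0, 1.0, 0.0]   # small untuned model
expert = [0.0, 1.0, 2.0]        # small tuned model prefers token 2
steered = proxy_tuned_logits(base, expert, anti_expert)
probs = softmax(steered)
```

In this toy case the preferred token moves from index 0 to index 2 even though the base logits themselves are never modified, which is the sense in which the steering is distributional rather than a change to stored knowledge.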
There's a deeper warning worth surfacing too: the dream of a 'theory-free,' bias-free model is itself a trap. High accuracy can mask bigotry that's really a correlation-causation error, so chasing surface fixes can launder the underlying problem rather than solve it (see: Can AI models be truly free from human bias?). And rather than trying to scrub bias out of the model at all, one line of work reframes the goal: instead of an AI that decides (and anchors humans to its bias), build an AI that guides, highlighting useful aspects of a problem so the human keeps judgment and the machine's bias doesn't propagate downstream (see: Can AI guidance reduce anchoring bias better than AI decisions?).
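The accuracy point is easy to check with arithmetic. The source note behind it mentions a 95% accurate criminal justice system wrongly convicting thousands; the sketch below only shows how a 5% error rate scales with volume. The caseload is a hypothetical number chosen for illustration, and treating every error as a wrongful outcome is a simplification.

```python
def expected_errors(num_cases, accuracy_percent):
    """Number of wrong decisions at a given accuracy, using integer math."""
    if not 0 <= accuracy_percent <= 100:
        raise ValueError("accuracy_percent must be between 0 and 100")
    return num_cases * (100 - accuracy_percent) // 100


# Hypothetical caseload, not taken from the source.
hypothetical_caseload = 100_000
errors = expected_errors(hypothetical_caseload, 95)  # 5000
```

A headline accuracy figure therefore says little by itself about how many people are affected, or about whether the errors cluster on particular groups.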
The thing you didn't expect to learn: the question isn't really 'is my prompt good enough' but 'at which layer does the bias live.' Pretraining-embedded bias is a weights-and-priors problem, and the corpus is consistent that prompts can only rearrange what's already there — so the productive moves are representation-level intervention, decoding-time steering, or redesigning the human-AI division of labor so the bias never gets the final word.
Sources (7 notes)
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Research shows that 'theory-free' AI models mask bigotry behind high accuracy metrics while committing fundamental statistical errors. A 95% accurate criminal justice system would wrongly convict thousands, demonstrating that model sophistication does not validate causal inference.
Learning to Guide eliminates anchoring bias and unassisted hard cases by having machines supply interpretive guidance rather than autonomous decisions, keeping responsibility with humans while improving their judgment through enhanced perception.