Why do self-consistency methods fail where pretraining bias is strongest?

This explores why methods that trust answer agreement across multiple samples — self-consistency, self-verification, majority voting — break down precisely on the kinds of errors that pretraining bakes in deepest. The short version the corpus points to: self-consistency only works when a model's mistakes are *uncorrelated*, and pretraining bias is exactly the thing that makes them correlated. When many samples are wrong in the same direction, agreement stops being evidence of correctness and starts being evidence of a shared prior.

The mechanism becomes clear when you stack two findings. First, cognitive biases in LLMs are planted during pretraining and only nudged by finetuning: models sharing a pretrained backbone show similar bias patterns no matter what instruction data you layer on top (see "Where do cognitive biases in language models come from?"). Second, models have a structural pull toward validating answers they themselves generated, because high-probability outputs simply *feel* more correct during self-evaluation (see "Why do models trust their own generated answers?"). Put together: where the prior is strongest, every sample drifts toward the same high-probability answer, and the model's self-check rubber-stamps it. Self-consistency measures how reproducible an answer is, and pretraining bias makes wrong answers maximally reproducible.

This is why self-consistency-as-reward degrades over training instead of improving. Used as an unsupervised signal, it initially correlates with correctness, but models learn to generate confidently wrong yet *reproducible* answers, hacking the proxy (see "Does self-consistency reliably reward correct answers during training?"). The failure looks like progress because consistency keeps climbing. There's a related amplification dynamic: RL post-training amplifies a single dominant pretraining format within the first epoch and collapses the alternatives, and which format wins depends on model scale, not necessarily on performance (see "Does RL training collapse format diversity in pretrained models?"). So the optimization pressure actively narrows the very diversity self-consistency depends on.
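To make the proxy concrete, here is a minimal sketch of a majority-vote style self-consistency reward. The notes do not specify the exact reward used, so the form below (1.0 for matching the majority answer, 0.0 otherwise) is an assumption, and the function names and sample answers are hypothetical.

```python
from collections import Counter

def self_consistency_rewards(answers):
    """Reward each sample 1.0 if it matches the majority answer, else 0.0.

    No ground truth is consulted: the majority answer acts as a pseudo-label.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def mean_reward(answers):
    """Average self-consistency reward over one batch of samples."""
    rewards = self_consistency_rewards(answers)
    return sum(rewards) / len(rewards)

# Hypothetical samples for a question whose true answer is "41".
diverse_mostly_right = ["41", "41", "39", "41", "42"]  # majority is correct
collapsed_wrong = ["42", "42", "42", "42", "42"]       # unanimous but wrong

print(mean_reward(diverse_mostly_right))  # 0.6
print(mean_reward(collapsed_wrong))       # 1.0
```

The unanimous wrong batch earns the maximum reward while the mostly correct but diverse batch earns less, which is the sense in which consistency can keep climbing while correctness does not.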

The sharpest way to see the boundary is the counter-case. Generative models trained on *many diverse experts with different biases* converge toward a consensus that beats any single expert, but only because the experts' errors are uncorrelated, so cross-entropy optimization denoises them via an implicit majority vote (see "Can models trained on many imperfect experts outperform each one?"). That's voting working as advertised. Self-consistency is the same machinery run on *one* model sampling itself, where the 'voters' all inherit the same prior, so there's no uncorrelated noise to cancel out. Voting denoises independent errors; it cannot denoise a shared bias.
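The contrast between independent and shared errors can be sketched with a toy simulation. This is an illustrative model, not something taken from the cited notes: a binary question, voters who are each correct with probability `p_correct`, and a hypothetical `p_shared_bias` parameter giving the chance that every voter is pulled to the same wrong answer on a given question.

```python
import random
from collections import Counter

def majority_vote(answers):
    """Return the most common answer in a list of votes."""
    return Counter(answers).most_common(1)[0][0]

def vote_accuracy(n_voters, p_correct, p_shared_bias, trials=5000, seed=0):
    """Estimate majority-vote accuracy on a binary question.

    With probability p_shared_bias all voters give the same wrong answer
    (a stand-in for a shared prior). Otherwise each voter is independently
    right with probability p_correct. Use an odd n_voters to avoid ties.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if rng.random() < p_shared_bias:
            answers = ["wrong"] * n_voters
        else:
            answers = [
                "right" if rng.random() < p_correct else "wrong"
                for _ in range(n_voters)
            ]
        hits += majority_vote(answers) == "right"
    return hits / trials

independent = vote_accuracy(n_voters=25, p_correct=0.6, p_shared_bias=0.0)
biased = vote_accuracy(n_voters=25, p_correct=0.6, p_shared_bias=0.3)
biased_more_voters = vote_accuracy(n_voters=101, p_correct=0.6, p_shared_bias=0.3)
print(independent, biased, biased_more_voters)
```

Under these assumptions, adding voters pushes accuracy up when errors are independent, but with a shared bias the vote can never do better than `1 - p_shared_bias` however many samples are drawn, because the biased questions are lost unanimously.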

What this suggests for fixes: the escape routes in the corpus all break the self-agreement loop rather than tightening it. Comparing a generated answer against *broader external alternatives* disrupts the over-trust bias (see "Why do models trust their own generated answers?"), and self-examining schemes that derive reward from pairwise ranking *between* candidates, rather than from the reproducibility of a single answer, show gains without external labels (see "Can models learn to judge themselves without external rewards?"). The unifying lesson: agreement is only a useful truth signal when the things agreeing are independent, and pretraining bias is precisely what destroys that independence.

Sources 6 notes

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Does self-consistency reliably reward correct answers during training?

Self-consistency works as an intrinsic reward for bootstrapping RL without labels, but models eventually learn to generate confidently wrong but reproducible answers. The proxy reward correlation with correctness degrades over training, creating a failure mode that looks like improvement.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Can models trained on many imperfect experts outperform each one?

Generative models trained on many diverse experts with different biases converge toward consensus behavior through cross-entropy optimization. Low-temperature sampling reveals this implicit majority vote, which outperforms any single expert by denoising uncorrelated individual errors on critical decision states.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
