Why does monological training prevent models from overriding statistical priors?

This explores why training models to produce a single confident voice — rather than to argue with themselves — leaves them unable to override the strong statistical patterns baked in during pretraining, so prior associations win out over what's actually in front of them.

This reads the question as asking why models trained to generate one smooth continuation can't push back against their own ingrained statistical habits. The corpus has a surprisingly coherent story about it, spread across notes that never use the phrase 'monological training.' The core mechanism shows up most directly in work on context integration: models generate outputs that contradict the information sitting in their context because parametric knowledge from pretraining simply dominates, and textual prompting alone can't dislodge it; you need causal intervention in the representations themselves (see 'Why do language models ignore information in their context?'). The prior isn't a bias you can talk a model out of; it's structurally privileged over present evidence.

Why can't the model override it on its own? Because the training objective rewards a single forward pass of fluent, agreeable continuation rather than internal disagreement. Models accommodate false presuppositions even when a direct question proves they know the right answer: the knowledge is present but never gets to veto the statistically smooth response (see 'Why do language models accept false assumptions they know are wrong?'). Related work traces this to RLHF specifically: the preference for agreement is learned, a face-saving social reflex distinct from hallucination, where the model picks the conciliatory continuation over the contradicting fact (see 'Why do language models agree with false claims they know are wrong?'). Monological training, in other words, optimizes for going along, never for the dialogical move of stopping to say 'wait, that's wrong.'

The deeper reason the prior wins is that the reasoning itself runs on the prior. When semantic content is stripped from a task, performance collapses even with correct rules in context: models reason through learned token associations, not symbolic manipulation, so their 'reasoning' is constrained to the training distribution's semantics (see 'Do large language models reason symbolically or semantically?'). There's no independent logical engine to override the statistical pull; the pull *is* the engine. You see the same disconnection in 'potemkin' failures, where a model explains a concept correctly, fails to apply it, and even recognizes the failure. Explanation and execution run on functionally separate pathways, so knowing the right answer doesn't route into doing the right thing (see 'Can LLMs understand concepts they cannot apply?'). And models trained only to produce reasoning steps never learn when to disengage, generating elaborate output for ill-posed questions instead of rejecting the premise (see 'Why do reasoning models overthink ill-posed questions?').

Here's the turn the reader might not expect: the corpus suggests the override capacity is often already latent, so the problem is the training method, not the model's ceiling. Base models contain reasoning ability that minimal interventions elicit; post-training selects rather than creates it (see 'Do base models already contain hidden reasoning ability?'). And the methods that *preserve* the ability to override are precisely the ones that stay closer to the base distribution. Decoding-time proxy-tuning beats direct fine-tuning on knowledge tasks because it leaves base weights untouched, where heavy fine-tuning corrupts knowledge storage in lower layers (see 'Can decoding-time tuning preserve knowledge better than weight fine-tuning?'); lower KL drift from the base preserves plasticity for later learning (see 'Does staying close to the base model preserve learning ability?'). The cure for the monological trap isn't more single-voice training; it's training that lets the model practice contradicting itself. Self-correction only sticks when models do online RL on their *own* errors rather than imitating offline correction traces, because they have to actually rehearse the override under their real error distribution (see 'Why does self-correction training on offline data fail?'). The statistical prior wins not because it's unbeatable, but because monological objectives never teach the model the dialogical move of beating it.
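To make the 'stay close to the base' point concrete, here is a minimal sketch of a decoding-time shift and the drift it causes. It assumes proxy-tuning is the usual logit arithmetic (base logits plus the difference between a small tuned expert and its untuned counterpart), and that drift is measured as KL(shifted || base) over one next-token distribution. The function names, the `alpha` scale, and the toy logits are illustrative assumptions, not taken from the notes.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_logits(base, expert, anti_expert, alpha=1.0):
    """Shift base logits by the (expert - anti_expert) difference.

    Base weights are never modified; only the decoding distribution moves.
    `alpha` is a hypothetical scale knob added for illustration.
    """
    return [b + alpha * (e - a) for b, e, a in zip(base, expert, anti_expert)]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Toy next-token logits over a three-token vocabulary (made-up numbers).
base_logits = [2.0, 1.0, 0.1]
expert_logits = [1.0, 1.5, 0.0]       # small tuned model
anti_expert_logits = [1.0, 0.5, 0.0]  # same small model, untuned

shifted = proxy_tuned_logits(base_logits, expert_logits, anti_expert_logits)
p_base = softmax(base_logits)
p_shifted = softmax(shifted)

# Drift of the shifted distribution away from the base distribution.
drift = kl_divergence(p_shifted, p_base)
print(f"shifted logits: {shifted}")
print(f"KL(shifted || base) = {drift:.4f} nats")
```

Setting `alpha=0.0` recovers the base distribution exactly, so the drift is zero; larger values move the output further from the base while the base weights stay untouched.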

Sources (10 notes)

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Why do language models accept false assumptions they know are wrong?

The FLEX benchmark shows that models reject false presuppositions at widely varying and often low rates (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.

Why do language models agree with false claims they know are wrong?

The FLEX benchmark shows that models reject false presuppositions at dramatically different rates (GPT-4: 84%, Mistral: 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Why do reasoning models overthink ill-posed questions?

Reasoning models generate redundant, lengthy responses to questions with missing premises while non-reasoning models correctly identify them as unanswerable. Training optimizes for producing reasoning steps but never teaches models when to disengage.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Why does self-correction training on offline data fail?

SFT on offline correction traces fails because training errors don't match test errors and models collapse into single correction modes. Multi-turn online RL under the model's own error distribution successfully trains self-correction by letting models practice correcting their actual mistakes.
