How does dialogue during training shape the ability to ignore word frequency?

This reads the question as: does the way models learn from conversation — RLHF, preference optimization, multi-turn reward — actually build the ability to override raw statistical word-frequency priors and stay faithful to what's in front of them, or does it entrench those priors?

Two terms first. Dialogue-style training means the reward signals that shape a model after pretraining. Word-frequency pull means the tendency to answer from how common a token is rather than from the context the model has been given. The corpus suggests the answer is mostly the opposite of what you'd hope: word frequency is a stubborn gatekeeper, and standard conversational training does little to loosen its grip, though one quieter line of work shows the ability can be trained directly.

Start with how deep the frequency pull goes. "Can we predict keyword priming before learning happens?" finds that how a newly taught fact primes the model afterwards is predictable from how probable the keyword was *before* learning, with a sharp ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Frequency isn't just a bias the model carries; it shapes what happens to new information as it is learned. At inference time the same gravity shows up in "Why do language models ignore information in their context?": when a model's trained associations are strong, they override the actual context, and, crucially, prompting can't fix it. Only intervening directly in the representations does. So 'ignoring word frequency' really means overriding a parametric prior that wants to win by default.
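The threshold check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy logits, and the exact cutoff of 1e-3 (the notes give it as approximate) are all hypothetical, and the code reports which side of the threshold a keyword falls on without claiming what happens on each side.

```python
import math

# Approximate threshold quoted in the notes (~10^-3); treated here as an assumption.
PRIMING_THRESHOLD = 1e-3


def keyword_probability(logits, keyword_index):
    """Softmax probability of one token, given raw next-token logits."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[keyword_index] / sum(exps)


def side_of_threshold(prob, threshold=PRIMING_THRESHOLD):
    """Report which side of the threshold a pre-learning probability falls on."""
    return "below" if prob < threshold else "at_or_above"


if __name__ == "__main__":
    # Toy logits for a two-token vocabulary; index 1 is the hypothetical keyword.
    p = keyword_probability([10.0, 0.0], keyword_index=1)
    print(p, side_of_threshold(p))
```

In practice the logits would come from the model's next-token distribution at the keyword position, measured before any fine-tuning on the new fact.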

Here's the uncomfortable part for dialogue training: the reward signals that shape conversational models tend to push *toward* priors, not away. "Does preference optimization harm conversational understanding?" shows that preference optimization rewards confident, single-turn answers: the model learns to commit fluently rather than check what was actually meant. "Why don't language models develop conversation maintenance skills?" makes the structural version of the point: training signals reward predicting likely information, not the relational work of staying grounded in a specific exchange. And "Does RLHF make language models indifferent to truth?" shows RLHF can make a model produce statistically plausible claims while probes of its internal beliefs show it still represents the truth accurately. That is fluency over fidelity. Each of these is a case of frequency winning because the training objective quietly rewards it.

The hopeful counter-thread is that ignoring surface statistics *can* be trained as an explicit target rather than left to RLHF's incentives. "Can models learn to ignore irrelevant prompt changes?" trains a model to respond identically whether a prompt is clean or wrapped in distracting framing, using the model's own clean answers as the teaching signal. That's the missing mechanism: not 'be helpful,' but 'be invariant to the irrelevant.' It's the closest thing the corpus has to deliberately teaching a model to discount surface frequency in favor of meaning.
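The data-construction step behind that idea can be sketched as follows. This is a minimal illustration of output-level consistency targets, not the paper's implementation: `generate`, `build_consistency_pairs`, and the example wrapper are hypothetical names, and the stub model exists only so the snippet runs on its own.

```python
from typing import Callable, List, Tuple


def build_consistency_pairs(
    generate: Callable[[str], str],
    clean_prompts: List[str],
    wrappers: List[Callable[[str], str]],
) -> List[Tuple[str, str]]:
    """Pair each wrapped prompt with the model's own answer to the clean prompt."""
    pairs = []
    for prompt in clean_prompts:
        target = generate(prompt)  # the model's response to the clean prompt
        for wrap in wrappers:
            # Training target: answer the wrapped prompt exactly as the clean one.
            pairs.append((wrap(prompt), target))
    return pairs


if __name__ == "__main__":
    # Stub standing in for a real model call (assumption for illustration only).
    stub_model = lambda p: p.upper()
    # Hypothetical distracting wrapper that adds a leading, irrelevant opinion.
    add_opinion = lambda p: "I am sure the answer is 5. " + p
    for wrapped, target in build_consistency_pairs(
        stub_model, ["what is 2+2?"], [add_opinion]
    ):
        print(repr(wrapped), "->", repr(target))
```

The resulting pairs would then feed an ordinary supervised fine-tuning loop. Because the targets are regenerated from the current model, they do not depend on a fixed, hand-written answer set.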

The takeaway: no single paper here addresses 'dialogue training versus word frequency,' but laid side by side the corpus tells a coherent story. Frequency is the default winner, conventional conversational reward signals reinforce that default, and overriding it takes either a representation-level intervention or a training objective built specifically around invariance. If standard dialogue training shapes anything, it more often deepens the frequency reflex than dissolves it. (For the adjacent failure mode, where the shape of the reward steers behavior the wrong way, "Why do language models respond passively instead of asking clarifying questions?" is a useful doorway.)

Sources 7 notes

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Does preference optimization harm conversational understanding?

RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically leaves grounding acts 77.5% below human levels, creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.

Why don't language models develop conversation maintenance skills?

Humans keep conversations smooth through implicit techniques like reference repair and topic hand-off that sustain relational interaction, not convey information. Language models don't develop these because training signals reward information prediction, not relational work.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Can models learn to ignore irrelevant prompt changes?

Two methods, BCT (output-level) and ACT (activation-level), train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets. This avoids the specification staleness and capability staleness inherent in standard SFT.

Why do language models respond passively instead of asking clarifying questions?

CollabLLM demonstrates that standard RLHF training optimizes for immediate helpfulness, discouraging models from asking clarifying questions or offering multi-turn insights. Multi-turn-aware rewards that estimate long-term interaction value enable active intent discovery and genuine collaboration.
