How does surface salience compete with background knowledge in model inference?
This note explores the tug-of-war between surface salience, meaning what is prominent in the text in front of the model (memorized phrasings, stated assumptions, familiar-sounding claims), and background knowledge baked in from training, and asks which one steers the answer when they conflict. The corpus is interesting precisely because it shows the competition cuts both ways, and neither side reliably wins.
In one direction, baked-in priors steamroll the surface. "Why do language models ignore information in their context?" shows models generating answers that contradict their own context whenever the training-time association is strong enough, and shows that plain prompting can't fix it; you have to intervene in the representations themselves. "Do large language models reason symbolically or semantically?" sharpens the why: when you decouple a task's semantics from its logic, performance collapses even with the correct rule sitting right there in the prompt. The model is leaning on familiar token associations, not the rule it was handed.
But flip the framing and the surface wins instead. "Why do language models accept false assumptions they know are wrong?" is the cleanest case: ask a model directly and it knows the fact, but bury a false assumption in the phrasing of a question and it goes along with it. False presuppositions drive more accommodation than correct knowledge drives rejection. "Do LLMs predict entailment based on what they memorized?" is the same pattern in logic's clothing: a model will call something 'entailed' just because the conclusion looks like something it saw in training (it's 'attested'), even when the premise is random noise. Here a salient, familiar-looking string overrides what the model actually knows about the relationship.
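The random-premise check described above can be sketched in a few lines. This is a minimal illustration, not the procedure from the cited work: `entailment_rate`, `randomize_premises`, `biased_predict`, the `ATTESTED` set, and the three example pairs are all hypothetical names and toy data introduced here. The idea is to keep each hypothesis, swap in an unrelated premise, and see whether the predicted entailment rate drops.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (premise, hypothesis)


def entailment_rate(predict: Callable[[str, str], bool], pairs: List[Pair]) -> float:
    """Fraction of (premise, hypothesis) pairs the predictor labels as entailed."""
    if not pairs:
        return 0.0
    return sum(1 for p, h in pairs if predict(p, h)) / len(pairs)


def randomize_premises(pairs: List[Pair], seed: int = 0) -> List[Pair]:
    """Keep each hypothesis but swap in a premise taken from a different example."""
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to swap premises")
    rng = random.Random(seed)
    premises = [p for p, _ in pairs]
    swapped = []
    for i, (_, hypothesis) in enumerate(pairs):
        donors = premises[:i] + premises[i + 1:]
        swapped.append((rng.choice(donors), hypothesis))
    return swapped


# Toy stand-in for a model with attestation bias (an assumption for this demo):
# it ignores the premise and answers "entailed" whenever the hypothesis is a
# string it has "seen" before.
ATTESTED = {"Paris is in France", "Water is a liquid"}


def biased_predict(premise: str, hypothesis: str) -> bool:
    return hypothesis in ATTESTED


# Illustrative pairs only; not drawn from any benchmark.
pairs = [
    ("Paris is the capital of France", "Paris is in France"),
    ("Ice melts into water", "Water is a liquid"),
    ("Mira borrowed the ladder", "Mira returned the ladder"),
]

original_rate = entailment_rate(biased_predict, pairs)
random_rate = entailment_rate(biased_predict, randomize_premises(pairs))
gap = original_rate - random_rate
print(f"original={original_rate:.2f} random={random_rate:.2f} gap={gap:.2f}")
```

For this stub the gap is zero, because the predictor never looks at the premise. A premise-sensitive predictor should show a clearly positive gap; a gap near zero on real data would be the signature the random-premise experiments look for.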
So the real answer to 'who wins' is: whichever signal is more confident, not whichever is more correct. Strong parametric priors beat weak context; salient familiar phrasings beat weakly-held knowledge. That reframes the competition as a calibration problem rather than a knowledge problem, which is why "Can models learn to ignore irrelevant prompt changes?" matters: it trains models to give the same answer to clean and cosmetically-altered prompts, blunting surface salience's grip directly. And it explains why surface-level prompt tricks have a hard ceiling: "Can prompt optimization teach models knowledge they lack?" shows prompting can only reorganize what's already in the weights, never supply what's missing.
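The consistency-training idea mentioned above (same answer for clean and altered prompts) can be sketched as a data-construction step. This is a rough sketch under stated assumptions, not the authors' implementation: `build_consistency_pairs`, `add_leading_opinion`, and `stub_generate` are hypothetical names, and the stub stands in for a real model call. The one detail taken from the source is that the training target for the altered prompt is the model's own response to the clean prompt.

```python
from typing import Callable, Dict, List


def build_consistency_pairs(
    clean_prompts: List[str],
    wrap: Callable[[str], str],
    generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each wrapped prompt with the model's answer to the clean prompt."""
    examples = []
    for prompt in clean_prompts:
        target = generate(prompt)  # model's own response to the clean prompt
        examples.append({"input": wrap(prompt), "target": target})
    return examples


def add_leading_opinion(prompt: str) -> str:
    """Hypothetical wrapper: prepend a salient but wrong user opinion."""
    return "I'm pretty sure the answer is 5. " + prompt


def stub_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call, using canned answers."""
    canned = {"What is 2 + 2?": "4"}
    return canned.get(prompt, "I don't know")


examples = build_consistency_pairs(
    ["What is 2 + 2?"], add_leading_opinion, stub_generate
)
print(examples)
```

Fine-tuning on pairs like these would push the model toward its clean-prompt answer even when the wrapper is present. Because the targets come from the current model rather than a fixed dataset, they track what the model can actually produce.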
The doorway worth walking through: the kind of knowledge changes how the contest plays out. "Does procedural knowledge drive reasoning more than factual retrieval?" finds that reasoning rides on broad, transferable procedural patterns while factual recall depends on narrow, document-specific memorization. That suggests surface salience hijacks the brittle, memorized facts most easily (the place where the model's 'knowledge' is really just a remembered string), while genuinely procedural competence is harder for a salient distractor to knock over.
Sources (7 notes)
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.
The FLEX Benchmark shows that models reject false presuppositions at rates far below acceptable levels (GPT-4: 84%, Mistral: 2.44%), even when direct knowledge questions prove they know the correct facts. False presuppositions drive more accommodation than correct knowledge drives rejection.
McKenna et al. (2023) identified attestation bias: LLMs predict entailment based on whether the hypothesis appears in training data, not whether the premise actually supports it. Random premise experiments show models maintain high entailment predictions when hypotheses are attested, proving they respond to memorized propositions rather than premise-hypothesis relationships.
Two methods—BCT (output-level) and ACT (activation-level)—train models to respond identically to clean and wrapped prompts by using the model's own clean responses as targets, eliminating specification and capability staleness inherent in standard SFT.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.