INQUIRING LINE

Can pretraining-frequency signals alone prevent RAG systems from confabulating about common knowledge?

This explores whether knowing how often something appeared in training data (its rarity or commonness) is enough, by itself, to stop a RAG system from making up plausible-sounding errors about widely-known facts.


This reads the question as: can a single signal, namely how frequently a fact appeared in pretraining, carry the whole burden of preventing confabulation, specifically on common knowledge? The corpus answer is a clean no, and the most direct reason is almost the opposite of what you'd expect. Work on retrieval triggers found that confidence signals and rarity signals catch *orthogonal* failure modes: rarity flags hallucinations about obscure entities the model rarely saw, but it systematically *misses* shaky reasoning about common things, which is exactly the territory your question worries about (see: Should RAG systems use model confidence or data rarity to trigger retrieval?). By construction, a frequency signal says 'this is common, relax' precisely where common-knowledge confabulation happens. So rarity alone isn't just incomplete here; it's blind in the wrong direction.

Why does common knowledge stay dangerous even when it's well-represented? Because frequency cuts both ways. Strong priors from training are what let a model override the document you actually retrieved: it generates from what it 'knows' instead of what's in front of it, and plain prompting can't force it back (see: Why do language models ignore information in their context?). High frequency builds exactly those dominant associations. There's even a measurable threshold to this: post-learning priming becomes predictable from a keyword's pre-learning probability, with a sharp cutoff around 10^-3 separating 'this primes' from 'this doesn't' (see: Can we predict keyword priming before learning happens?). Frequency is a real, predictable lever, but it governs whether knowledge activates, not whether it's true in context.

The deeper trap is that some confabulation isn't a knowledge problem at all. Probing shows models can internally represent the truth and still express falsehoods: RLHF pushes them toward truth-*indifference* rather than truth-*ignorance* (see: Does RLHF make language models indifferent to truth?). No frequency statistic touches that gap, because the failure lives between knowing and saying, not in how much data was seen.

What the corpus suggests actually works is layering signals that watch different failures. Semantic entropy catches confabulation by sampling several answers and measuring how much their *meanings* diverge, a self-referential uncertainty check that needs no frequency table at all (see: Can we detect when language models confabulate?). ReAct-style methods interleave reasoning with live external lookups so errors get corrected by real feedback at each step rather than waved through (see: Can interleaving reasoning with real-world feedback prevent hallucination?). And when a RAG system grows its own corpus, the safeguard is entailment and attribution verification, not a popularity score (see: Can RAG systems safely learn from their own generated answers?).
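
To make the semantic-entropy idea concrete, here is a minimal sketch in Python. It is a count-based approximation under stated assumptions, not the method from the cited note: answers are grouped greedily into meaning clusters, and the entropy is taken over the share of samples in each cluster. The helper `toy_same_meaning` is a hypothetical stand-in that only compares normalized strings; a real system would need a genuine meaning-equivalence check (for example an entailment model), which the notes do not specify.

```python
import math
from typing import Callable, List


def cluster_by_meaning(
    answers: List[str],
    same_meaning: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedily group answers; each answer joins the first cluster whose
    representative (its first member) is judged to mean the same thing."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            if same_meaning(cluster[0], answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster matched, so this answer starts a new one.
            clusters.append([answer])
    return clusters


def semantic_entropy(
    answers: List[str],
    same_meaning: Callable[[str, str], bool],
) -> float:
    """Entropy (in nats) over the fraction of samples in each meaning cluster.
    Zero when every sample agrees; larger when meanings diverge."""
    clusters = cluster_by_meaning(answers, same_meaning)
    total = len(answers)
    return sum(-(len(c) / total) * math.log(len(c) / total) for c in clusters)


def toy_same_meaning(a: str, b: str) -> bool:
    """Hypothetical placeholder: treats answers as equivalent only if they
    match after lowercasing and trimming spaces and periods."""
    def normalize(s: str) -> str:
        return s.lower().strip(" .")
    return normalize(a) == normalize(b)


if __name__ == "__main__":
    agreeing = ["Paris", "paris.", " Paris "]
    diverging = ["Paris", "Lyon", "Marseille"]
    print(semantic_entropy(agreeing, toy_same_meaning))
    print(semantic_entropy(diverging, toy_same_meaning))
```

Note that nothing in this check consults how often the entity appeared in training: the signal comes entirely from disagreement among the model's own samples, which is why it can fire on common entities that a rarity score would wave through.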

The thing worth taking away: frequency is most useful for the case opposite to your question — catching nonsense about rare entities — and is weakest exactly on common knowledge, where strong priors and truth-indifference do the damage. The fix isn't a better frequency signal; it's pairing rarity with an internal-uncertainty or external-grounding signal so the two cover each other's blind spots.
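
The pairing described above can be sketched as a retrieval trigger that fires when either signal raises a flag. This is an illustrative sketch only: the names `TriggerConfig` and `should_retrieve` and both threshold values are assumptions made up for the example, not settings from the cited notes, and the uncertainty score could be any internal-uncertainty measure (a semantic-entropy value, for instance).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriggerConfig:
    # Both thresholds are hypothetical placeholders; tune them on real data.
    rare_below: float = 1e-6       # relative frequency under which an entity counts as rare
    uncertain_above: float = 0.5   # uncertainty score above which the model counts as unsure


def should_retrieve(
    entity_frequency: float,
    uncertainty: float,
    config: TriggerConfig = TriggerConfig(),
) -> bool:
    """Trigger retrieval if the entity is rare OR the model is uncertain.

    The rarity check covers obscure entities; the uncertainty check covers
    shaky answers about common entities, where a rarity check alone stays silent.
    """
    is_rare = entity_frequency < config.rare_below
    is_uncertain = uncertainty > config.uncertain_above
    return is_rare or is_uncertain


if __name__ == "__main__":
    # Common entity, but the model's answers diverge: rarity alone would skip this.
    print(should_retrieve(entity_frequency=1e-2, uncertainty=0.9))
    # Rare entity, confident model: the rarity signal still triggers retrieval.
    print(should_retrieve(entity_frequency=1e-8, uncertainty=0.1))
    # Common entity, confident model: neither signal fires.
    print(should_retrieve(entity_frequency=1e-2, uncertainty=0.1))
```

The OR is the point of the design: each signal is allowed to be blind where the other sees. It does not address truth-indifference, where the model is both confident and wrong; that case still needs external grounding or verification.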

