How do training data distributions constrain what language models can accurately know?
This explores how the makeup of a model's training data — what's over-represented, under-represented, or missing entirely — sets hard limits on what it can reliably know, and why no amount of clever prompting fully gets around that.
The corpus is unusually direct on the hardest version of this claim: there is a ceiling you cannot prompt your way past. "Can prompt optimization teach models knowledge they lack?" shows that prompting only reorganizes what is already in the training distribution; if foundational knowledge was never there, no prompt strategy supplies it. The same boundary shows up from the self-improvement angle: "What stops large language models from improving themselves?" argues that models cannot bootstrap past their own limits through reflection alone, because every reliable correction needs something external to validate it. Training data is not just where knowledge comes from; it is the edge of what the model can do unaided.
The more interesting part is that the constraint is not binary (known vs. unknown) but a gradient of frequency. Things that appear rarely in training are learned shallowly, not just incorrectly. "Why do language models struggle with historical legal cases?" is the cleanest example: models reason worse about older Supreme Court cases because recent cases dominate the corpus, leaving thin representations of older precedent. "Can we predict where language models will fail?" generalizes this into a prediction rule: frame the model as an autoregressive probability machine, and tasks whose correct answers are statistically low-probability (counting letters, reversing the alphabet) become predictably hard even when they are logically trivial. Accuracy tracks distribution density, not logical difficulty.
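The frequency gradient can be made concrete with a toy model. The sketch below is not the method of the cited notes; it is a hypothetical character-bigram model (the names `train_bigram`, `avg_log_prob`, and the synthetic corpus are all assumptions made for illustration). It trains on a corpus where the forward alphabet appears 50 times and the reversed alphabet once, then scores both strings. The two tasks are equally trivial logically, yet the rarer string receives a much lower average log-probability.

```python
import math
from collections import Counter

FORWARD = "abcdefghijklmnopqrstuvwxyz"
BACKWARD = FORWARD[::-1]

def train_bigram(corpus: str):
    """Count character bigrams; returns (pair counts, context counts, vocab size)."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    return pair_counts, context_counts, len(set(corpus))

def avg_log_prob(text: str, model, alpha: float = 1.0) -> float:
    """Average per-transition log-probability with add-alpha smoothing."""
    if len(text) < 2:
        raise ValueError("text must contain at least two characters")
    pair_counts, context_counts, vocab_size = model
    total = 0.0
    for a, b in zip(text, text[1:]):
        p = (pair_counts[(a, b)] + alpha) / (context_counts[a] + alpha * vocab_size)
        total += math.log(p)
    return total / (len(text) - 1)

# Synthetic, deliberately skewed corpus (an assumption, not real training data):
# the forward alphabet is 50x more frequent than the reversed one.
corpus = (FORWARD + " ") * 50 + (BACKWARD + " ")
model = train_bigram(corpus)

forward_score = avg_log_prob(FORWARD, model)
backward_score = avg_log_prob(BACKWARD, model)
print(f"forward:  {forward_score:.3f}")
print(f"backward: {backward_score:.3f}")
```

A real language model is far richer than a bigram table, so this only illustrates the direction of the effect: under a probabilistic model, how often a target appeared in training matters independently of how simple the task is.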
There is also a subtler failure: even when the right information is present, strong training-frequency priors can override it. "Why do language models ignore information in their context?" shows models generating answers that contradict the documents in front of them, because parametric knowledge acquired during training outweighs in-context evidence, and textual prompting alone cannot fix it. What the model learns can also be the surface shape rather than the rule: "Why do large language models fail at complex linguistic tasks?" finds top models misparsing nested grammatical structures, because statistical learning captures common patterns but not the underlying generative rules. Distribution shapes not just coverage but the kind of competence acquired.
The corpus complicates a naive 'just add more data' reading, too. "How do domain training techniques actually reshape model behavior?" shows that adapting a model toward a domain has hidden costs: gains in one area come with quiet degradation in reasoning faithfulness or flexibility, so reshaping the distribution is a trade, not a free win. Distribution constraints are not always about facts, either: "Why do language models agree with false claims they know are wrong?" attributes that behavior not to missing knowledge but to a preference for agreement learned through RLHF. The training distribution shapes disposition as much as content.
What you might not expect: the model may quietly signal when it is off its home turf. "Do language models sparsify their activations under difficult tasks?" finds that activations sparsify in a systematic way as tasks drift out of distribution, a measurable internal fingerprint of 'this is unfamiliar.' Read alongside "Can models learn to abstain when uncertain about predictions?", which shows small models can be trained to abstain when uncertain, a hopeful thread emerges: distribution sets the ceiling, but models can be taught to recognize and flag their own edges rather than confidently guessing past them.
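Both signals in this paragraph are easy to state as measurements. The following is a minimal sketch under loud assumptions: `sparsity` treats a hidden state as a plain list of floats and counts near-zero entries, and `answer_or_abstain` uses a fixed entropy threshold as a simple stand-in for abstention. The cited notes describe models trained with uncertainty-aware objectives, which this thresholding does not reproduce; the function names and the threshold value are hypothetical.

```python
import math
from typing import Mapping, Optional, Sequence

def sparsity(activations: Sequence[float], eps: float = 1e-6) -> float:
    """Fraction of activation entries whose magnitude is at most eps."""
    if not activations:
        raise ValueError("activations must be non-empty")
    near_zero = sum(1 for a in activations if abs(a) <= eps)
    return near_zero / len(activations)

def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def answer_or_abstain(probs: Mapping[str, float], max_entropy: float) -> Optional[str]:
    """Return the most likely label, or None (abstain) if entropy is too high."""
    if entropy(list(probs.values())) > max_entropy:
        return None
    return max(probs, key=lambda label: probs[label])

# Illustrative values only; not measurements from any real model.
confident = {"yes": 0.95, "no": 0.05}
uncertain = {"yes": 0.5, "no": 0.5}
print(sparsity([0.0, 0.0, 1.5, -2.0]))
print(answer_or_abstain(confident, max_entropy=0.5))
print(answer_or_abstain(uncertain, max_entropy=0.5))
```

The design point is only that "flagging an edge" requires some scalar the system can act on, whether that is an internal statistic such as sparsity or an output-side one such as predictive entropy.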
Sources (10 notes)
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
A Supreme Court overruling benchmark (236 pairs) reveals era sensitivity: models perform worse on historical cases than on modern ones. The root cause is over-representation of recent cases in the training corpus, which creates shallower representations of older precedent.
By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Top-tier LLMs like Llama3-70b consistently misidentify embedded clauses, verb phrases, and complex nominals. Performance degrades predictably as syntactic depth increases, revealing that statistical learning captures surface patterns but not deep grammatical rules.
Research shows every adaptation method—from parameter-efficient tuning to knowledge graph curricula—has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.
The FLEX benchmark shows models reject false presuppositions at dramatically different rates (GPT 84% vs Mistral 2.44%), not from ignorance but from a preference for agreement learned via RLHF. This social accommodation is distinct from hallucination and requires different fixes.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.