How does the pretraining distribution shape what LLMs find hard?

This note explores why LLMs find some tasks hard. The recurring answer across the corpus is that difficulty tracks rarity in the training data, not logical complexity in any human sense.

The corpus keeps landing on the same surprising answer: a task is hard for an LLM mostly to the degree that its answer was *improbable* in pretraining, not to the degree that it is logically complex. The clearest statement of this is the "embers of autoregression" work, which treats the model as a probability machine and predicts failures from the target's likelihood alone (Can we predict where language models will fail?). Reversing the alphabet or counting letters is trivial for a child but rare in text, so the model stumbles. Difficulty here is a property of the distribution, not the problem.
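The idea of predicting difficulty from target likelihood can be illustrated with a toy model. The sketch below is not the method of the cited work: it assumes a character-level bigram model with add-one smoothing, trained on a hypothetical corpus in which the alphabet only ever appears in forward order, and it compares the log-probability of the forward and reversed alphabet. The names `train_bigram` and `log_prob` are made up for this example.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count character bigrams in a toy corpus (a stand-in for pretraining text)."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    vocab_size = len(set(corpus))
    return pair_counts, context_counts, vocab_size

def log_prob(text, model, alpha=1.0):
    """Sum of add-alpha smoothed log P(next char | previous char) over the text."""
    pair_counts, context_counts, vocab_size = model
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        numerator = pair_counts[(prev, nxt)] + alpha
        denominator = context_counts[prev] + alpha * vocab_size
        total += math.log(numerator / denominator)
    return total

alphabet = "abcdefghijklmnopqrstuvwxyz"
# Assumption: the toy "pretraining" text contains only the forward alphabet.
model = train_bigram((alphabet + " ") * 50)

forward_lp = log_prob(alphabet, model)
reversed_lp = log_prob(alphabet[::-1], model)
print(f"forward:  {forward_lp:.1f}")
print(f"reversed: {reversed_lp:.1f}")
```

Both strings contain the same letters and are equally simple to describe, yet the reversed one scores far lower because none of its transitions were seen in training. That gap, rather than any notion of logical complexity, is what a likelihood-based account of difficulty would point to.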

Once you see difficulty as *distance from the pretraining distribution*, a lot of scattered findings line up. Curriculum learning gets reframed: fine-tuning on rare data first works because rarity marks where the model's distribution is thin, which is a different thing from where a human would say the material is conceptually hard (Does ordering training data by rarity actually improve language models?). Grammar shows the same fingerprint: models handle simple sentences and fail on deep recursion and embedding, the telltale sign of having absorbed surface frequency heuristics rather than structural rules (Does LLM grammatical performance decline with structural complexity?). And when you fine-tune on inference tasks, the model often just deepens its reliance on corpus-level frequency patterns (hypernyms are more common than hyponyms) instead of learning the actual relation, then breaks exactly on the adversarial cases where frequency and truth disagree (Does fine-tuning on NLI teach inference or amplify shortcuts?).
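A rare-first ordering can be sketched in a few lines. This is a minimal illustration, not the CTFT procedure itself: it assumes rarity is approximated by the mean negative log unigram frequency of whitespace-separated tokens, computed over the example set, whereas a real setup would presumably estimate rarity against the pretraining distribution. The function names are hypothetical.

```python
import math
from collections import Counter

def rarity_scores(examples):
    """Mean negative log unigram frequency per example (higher means rarer)."""
    tokenized = [ex.lower().split() for ex in examples]
    counts = Counter(tok for toks in tokenized for tok in toks)
    total = sum(counts.values())
    scores = []
    for toks in tokenized:
        if not toks:
            scores.append(0.0)  # empty examples get a neutral score
            continue
        surprisal = sum(-math.log(counts[t] / total) for t in toks)
        scores.append(surprisal / len(toks))
    return scores

def rare_first(examples):
    """Return examples ordered from rarest to most common."""
    scores = rarity_scores(examples)
    order = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return [examples[i] for i in order]

examples = ["the cat sat", "the cat ran", "quokka zymurgy"]
ordered = rare_first(examples)
print(ordered)
```

The ordering key is purely distributional: nothing in it asks whether an example is conceptually hard, which is the distinction the reframing above draws.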

The interesting wrinkle is that fine-tuning and RL don't escape this; they sharpen it. RL-tuned models show sharp performance drops on out-of-distribution variants of the same problem, suggesting they tightened template-matching against the training distribution rather than installing a procedure that transfers (Do fine-tuned language models actually learn optimization procedures?). Even distillation inherits the trap: a teacher conditioned on the right answer produces confident, terse traces that students copy, buying in-domain polish at the cost of the epistemic caution that OOD problems demand (Does richer teacher context hurt student generalization?). The distribution's shape propagates downstream, too: LLM recommenders show position, popularity, and fairness biases that come from the pretraining objective and corpus demographics, not from any interaction data (Where do recommendation biases come from in language models?).

What you might not expect: distributional hardness doesn't always look like degradation. As tasks drift out of distribution, hidden states sparsify in a systematic, localized way that seems to *stabilize* performance, a built-in coping mechanism for unfamiliarity rather than a failure (Do language models sparsify their activations under difficult tasks?). And the failures themselves can be strange in ways that a pure "knowledge gap" framing misses. One example is "potemkin" understanding, where a model explains a concept correctly, fails to apply it, and then recognizes its own failure: a split between explanation and execution pathways that is incompatible with human cognition (Can LLMs understand concepts they cannot apply?).
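To make "sparsify" concrete, here is one way sparsity of an activation vector could be measured. The choice of metric is an assumption: this sketch uses the Hoyer sparsity measure, which is 0 for a vector with all entries equal in magnitude and 1 for a vector with a single nonzero entry, and the source does not say which measure the cited work uses. The vectors are made-up stand-ins for hidden states.

```python
import math

def hoyer_sparsity(vector):
    """Hoyer sparsity: (sqrt(n) - L1/L2) / (sqrt(n) - 1), in the range [0, 1]."""
    n = len(vector)
    if n < 2:
        raise ValueError("need at least two dimensions")
    l1 = sum(abs(x) for x in vector)
    l2 = math.sqrt(sum(x * x for x in vector))
    if l2 == 0.0:
        return 0.0  # convention chosen here for the all-zero vector
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

# Hypothetical hidden states: a dense one and a concentrated one.
dense_state = [1.0, 1.0, 1.0, 1.0]
sparse_state = [0.0, 0.0, 0.0, 1.0]

dense_score = hoyer_sparsity(dense_state)
sparse_score = hoyer_sparsity(sparse_state)
print(dense_score, sparse_score)
```

Tracking a score like this across in-distribution and out-of-distribution inputs would be one way to check for the systematic shift described above.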

The payoff for a curious reader is a mental flip. "Hard for an LLM" and "hard for a person" are different axes that only sometimes overlap. The model's difficulty map is drawn by what its corpus saw often versus rarely, which means you can often *predict* where it will fail before you test it. It also explains why the fixes that feel intuitive (more fine-tuning, richer supervision) can quietly make the distributional dependence worse instead of better.

Sources (9 notes)

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Does LLM grammatical performance decline with structural complexity?

LLMs show systematic performance decline as syntactic depth and embedding increase. Simple sentences are handled well while complex structures with recursion and embedding fail consistently, suggesting LLMs learned surface heuristics rather than structural grammar rules.

Does fine-tuning on NLI teach inference or amplify shortcuts?

NLI fine-tuning increases LLM reliance on corpus-level frequency patterns (hypernyms more common than hyponyms) rather than semantic relationships. Models perform worse on adversarial cases where frequency patterns contradict actual entailment labels, showing the shortcut was learned more deeply.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Where do recommendation biases come from in language models?

Wu et al. show that LLM-based recommendation systems exhibit position bias, popularity bias, and fairness bias—unique failure modes stemming from the language model's pretraining objective and corpus demographics rather than interaction data. Mitigation requires LLM-specific approaches, not adapted collaborative filtering techniques.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
