How does distributional shift toward rare inputs change memorization reliance?

This explores what happens to a model's dependence on memorized patterns (vs. genuine generalization) when the inputs it sees drift toward rare, low-frequency cases the model saw little of during training.

The corpus's clearest answer is that memorization reliance *increases* exactly where it is least trustworthy. In chain-of-thought reasoning, token-level memorization breaks down into local, mid-range, and long-range sources. "Local" memorization, which predicts the next token from the immediately preceding ones, accounts for up to two-thirds of reasoning errors, with that share climbing as complexity rises and distributional shift sets in (Where do memorization errors arise in chain-of-thought reasoning?). So the rare-input regime isn't where memorization quietly recedes; it's where the model leans on shallow memorized continuations *more*, and those crutches fail.

There's a structural reason rare inputs are special. Rarity isn't the same as conceptual difficulty; it's a signal of distance from the pre-training distribution. One line of work reframes curriculum learning around exactly this, training on rare data *first* because rarity marks where the model's distribution is weakest, not where the material is pedagogically hard (Does ordering training data by rarity actually improve language models?). And frequency has a hidden directional pull: because general concepts (hypernyms) appear far more often than specific ones (hyponyms), a model's frequency bias quietly drifts outputs toward abstraction, erasing the expert-level specificity that rare inputs often demand (Does word frequency correlate with semantic abstraction?). Rare inputs, in other words, push against the model's strongest grooves.
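The rarity-first ordering can be sketched in a few lines. This is only an illustration under stated assumptions: rarity is approximated here by the mean negative log relative frequency of an example's tokens within the corpus itself, which is a stand-in for distance from the pre-training distribution and not necessarily the measure the cited work uses. The names `rarity_scores` and `rare_first` are hypothetical.

```python
import math
from collections import Counter


def rarity_scores(examples):
    """Score each example by the mean negative log relative frequency of its tokens.

    Assumption: in-corpus token frequency is a crude proxy for how far an
    example sits from the distribution the model knows well.
    """
    tokenized = [ex.lower().split() for ex in examples]
    counts = Counter(tok for toks in tokenized for tok in toks)
    total = sum(counts.values())
    scores = []
    for toks in tokenized:
        nll = [-math.log(counts[t] / total) for t in toks]
        scores.append(sum(nll) / len(nll) if nll else 0.0)
    return scores


def rare_first(examples):
    """Return examples ordered from rarest to most common (stable for ties)."""
    scores = rarity_scores(examples)
    order = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    return [examples[i] for i in order]


corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "a quokka grazed near the samphire",
]
ordered = rare_first(corpus)
```

With this toy corpus, the sentence built from one-off tokens is scheduled first, and the two near-identical common sentences follow. Note that the ordering key is frequency alone; nothing in it judges how conceptually hard an example is.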

The architecture that lets models *handle* the rare tail gracefully is instructive here. Wide & Deep models split the labor: a deep generalization tower covers the common cases, while a wide memorization tower (cross-product features) exists specifically to patch rare items the deep part can't capture. Because the deep part absorbs the bulk, the memorization component can stay small without overfitting (Can one model memorize and generalize better than two?; Can one model handle both memorization and generalization?). That's the optimistic version: memorization is *deliberately* the rare-input specialist. The pessimistic version is what happens when a single model has no such division of labor and the rare-input pressure simply surfaces brittle memorized shortcuts.
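A minimal forward-pass sketch of that division of labor, assuming a binary prediction task: the wide part is a sparse lookup over cross-product features, the deep part pools embeddings through one ReLU layer, and the two logits are summed before a sigmoid. All feature names, weights, and function names below are hypothetical and hand-set for illustration; a real model would learn both parts jointly.

```python
import math


def cross_feature(a, b):
    """Build a cross-product feature key (order-dependent in this sketch)."""
    return f"{a}&{b}"


def wide_logit(features, wide_weights):
    """Memorization: only feature pairs with a stored weight contribute."""
    crosses = [
        cross_feature(a, b)
        for i, a in enumerate(features)
        for b in features[i + 1:]
    ]
    return sum(wide_weights.get(c, 0.0) for c in crosses)


def deep_logit(features, embeddings, out_weights):
    """Generalization: sum-pool embeddings, apply ReLU, project to a logit."""
    dim = len(out_weights)
    pooled = [0.0] * dim
    for f in features:
        vec = embeddings.get(f, [0.0] * dim)
        pooled = [p + v for p, v in zip(pooled, vec)]
    hidden = [max(0.0, p) for p in pooled]
    return sum(h * w for h, w in zip(hidden, out_weights))


def predict(features, wide_weights, embeddings, out_weights, bias=0.0):
    """Joint prediction: wide and deep logits are added, then squashed."""
    z = wide_logit(features, wide_weights)
    z += deep_logit(features, embeddings, out_weights) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical hand-set parameters.
embeddings = {
    "user=new": [0.5, -0.2],
    "item=popular": [0.8, 0.1],
    "item=obscure": [0.0, 0.0],  # deep part has learned nothing useful here
}
out_weights = [1.0, 1.0]
wide_weights = {"user=new&item=obscure": 2.0}  # one memorized rare pair

common = ["user=new", "item=popular"]
rare = ["user=new", "item=obscure"]

p_common = predict(common, wide_weights, embeddings, out_weights)
p_rare_with_wide = predict(rare, wide_weights, embeddings, out_weights)
p_rare_deep_only = predict(rare, {}, embeddings, out_weights)
```

In this toy setup the wide table holds a single entry, and it changes the prediction only for the rare pair; the common case is carried entirely by the deep part.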

What's genuinely surprising is that models seem to have an adaptive response to this stress. As tasks grow unfamiliar and shift out-of-distribution, LLM hidden states *sparsify*: activations become localized and selective in a way that correlates with task unfamiliarity, and this looks like a stabilizing filter rather than a breakdown (Do language models sparsify their activations under difficult tasks?). That hints memorization reliance under shift may be partly self-regulating: the model narrows what it draws on. But there's a threshold quality to memorization too. Keyword priming after learning is predictable from pre-learning probability, with a sharp cutoff around 10⁻³ separating contexts where memorized priming kicks in from those where it stays dormant (Can we predict keyword priming before learning happens?). Rare inputs live near that cliff edge.
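To make "sparsify" concrete, one could score a hidden-state vector with a standard sparsity index. The sketch below uses Hoyer's measure, which is 0 for a vector whose entries all have equal magnitude and 1 for a vector with a single nonzero entry. Whether the cited note uses this particular metric is an assumption; the function name and the toy vectors are hypothetical.

```python
import math


def hoyer_sparsity(activations):
    """Hoyer sparsity: (sqrt(n) - L1/L2) / (sqrt(n) - 1), in [0, 1]."""
    n = len(activations)
    if n < 2:
        raise ValueError("need at least two activations")
    l1 = sum(abs(a) for a in activations)
    l2 = math.sqrt(sum(a * a for a in activations))
    if l2 == 0.0:
        # Convention chosen for this sketch: an all-zero vector scores 0.
        return 0.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


# Toy stand-ins for hidden states on a familiar vs. an unfamiliar input.
dense_state = [0.5, 0.5, 0.5, 0.5]
sparse_state = [0.0, 0.0, 2.0, 0.0]

dense_score = hoyer_sparsity(dense_state)
sparse_score = hoyer_sparsity(sparse_state)
```

Tracking a score like this across inputs of increasing unfamiliarity would be one way to check for the sparsification pattern described above, since the measure is scale-invariant and so is not confounded by overall activation magnitude.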

The practical sting comes from training choices that *amplify* the problem. Distilling from teachers conditioned on the right answer produces confident, concise student traces that suppress uncertainty. That works well in-domain, but it strips out exactly the epistemic caution that rare, out-of-distribution problems require, trading tail robustness for in-distribution polish (Does richer teacher context hurt student generalization?). So the through-line: distributional shift toward rare inputs doesn't reduce memorization reliance. It concentrates it, exposes its thresholds, and rewards architectures and training regimes that quarantine memorization as a specialist tool rather than letting it masquerade as reasoning.

Sources 8 notes

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does ordering training data by rarity actually improve language models?

CTFT fine-tunes LLMs on rare data first because rarity signals distributional weakness, not conceptual difficulty. This reframes curriculum learning as managing distance from pre-training distribution rather than pedagogical scaffolding.

Does word frequency correlate with semantic abstraction?

WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.

Can one model memorize and generalize better than two?

Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.

Can one model handle both memorization and generalization?

Wide & Deep architectures train a sparse cross-product tower and a dense embedding tower together, allowing the wide part to patch only the deep part's weaknesses. This joint approach requires smaller models than ensemble methods.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
