INQUIRING LINE

What distinguishes data that generalizes broadly from task-specific memorization?

This explores how the corpus draws the line between data that teaches transferable skills (broad generalization) and data that just gets memorized for a single task — and what makes the difference.


The cleanest answer in the collection is about *what kind of knowledge* a document carries, not how much. An analysis of five million pretraining documents found that reasoning draws on broad, transferable *procedural* knowledge (the how-to patterns that show up across many diverse sources), while factual recall leans on narrow, document-specific memorization, where the model essentially needs to have seen the exact target fact (see "Does procedural knowledge drive reasoning more than factual retrieval?"). So the distinguishing feature isn't the data point itself but whether the skill it conveys recurs across contexts.

That distinction turns out to be measurable, even mechanical. One line of work pins the memorization *capacity* of GPT-family models at roughly 3.6 bits per parameter; once that capacity fills, the model can't keep memorizing and a phase transition called grokking kicks in: the shift from storing examples to actually generalizing (see "When do language models stop memorizing and start generalizing?"). Memorization and generalization, in other words, aren't a matter of intent; they're what a finite model does before versus after it runs out of room for rote storage.
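To make the capacity figure concrete, here is a minimal arithmetic sketch. It assumes the roughly 3.6 bits-per-parameter estimate quoted above applies uniformly to every parameter; the function names and the `dataset_bits` input (some estimate of a dataset's information content in bits) are hypothetical and used only for illustration.

```python
# Rough figure quoted above for GPT-family models; treat it as an estimate.
BITS_PER_PARAM = 3.6


def memorization_capacity_bits(n_params: int,
                               bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated number of bits a model with n_params can rote-store."""
    return n_params * bits_per_param


def capacity_is_exceeded(n_params: int, dataset_bits: float,
                         bits_per_param: float = BITS_PER_PARAM) -> bool:
    """True when the data holds more information than the model can memorize.

    dataset_bits is a hypothetical input: how to measure it is outside
    the scope of this sketch.
    """
    return dataset_bits > memorization_capacity_bits(n_params, bits_per_param)


if __name__ == "__main__":
    n_params = 125_000_000  # illustrative model size, not taken from the notes
    capacity = memorization_capacity_bits(n_params)
    print(f"capacity: {capacity:.3e} bits (~{capacity / 8 / 1e6:.2f} MB)")
    print("exceeded by 1e9 bits of data:",
          capacity_is_exceeded(n_params, 1_000_000_000))
```

On this reading, the interesting regime is the one where `capacity_is_exceeded` returns `True`: the model can no longer store every example and, per the note cited above, has to shift toward generalizing.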

The failure side is illuminating. Chain-of-thought reasoning that looks like genuine generalization often isn't: it degrades predictably the moment you shift the task, length, or format away from training, producing fluent but logically broken steps that imitate reasoning's *form* without the underlying logic (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Drilling into where those errors come from, a token-level analysis found that 'local' memorization (leaning on the immediately preceding tokens rather than understanding) accounts for up to 67% of reasoning mistakes, and gets worse as problems get harder (see "Where do memorization errors arise in chain-of-thought reasoning?"). A similar caution comes from instruction tuning: models trained on semantically empty or even deliberately wrong instructions perform about as well as those trained on correct ones, suggesting that what transfers is knowledge of the *output format*, not task understanding (see "Does instruction tuning teach task understanding or output format?").

Here's the less obvious part: the field has partly stopped treating memorization and generalization as enemies and started treating them as components to allocate on purpose. Wide & Deep recommender models jointly train a memorizing half (cross-product features that capture rare, specific cases) and a generalizing half (embeddings for common patterns), so each can specialize and stay small (see "Can one model memorize and generalize better than two?" and "Can one model handle both memorization and generalization?"). The same routing instinct shows up in adaptation: splitting learning into slow parameter updates and fast textual context lets task-specific lessons live in prompts while general weights stay stable, recasting catastrophic forgetting as a *misallocation* problem rather than an inevitable cost (see "Can splitting adaptation into two channels reduce forgetting?"). And whether a new fact even sticks is surprisingly predictable: pre-learning keyword probability forecasts whether priming happens, with a sharp threshold near 10^-3 separating data that imprints from data that slides off (see "Can we predict keyword priming before learning happens?"). Across all of it, the dividing line is recurrence and reuse: data generalizes when the skill it carries shows up again elsewhere, and gets memorized when it doesn't.
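The joint-training idea behind Wide & Deep can be sketched in a few lines. This is a deliberately simplified toy, not the published architecture: the "deep" half here is only a dot product of user and item embeddings rather than a feed-forward network, the "wide" half is one weight per (user, item) cross-product feature, and the class name, learning rate, and embedding size are all hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


class WideAndDeepSketch:
    """Toy model: logit = wide (memorized pair weight) + deep (embedding dot)."""

    def __init__(self, dim: int = 4, seed: int = 0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.wide = {}      # one weight per (user, item) cross-product feature
        self.user_emb = {}  # dense vectors shared across all of a user's items
        self.item_emb = {}  # dense vectors shared across all of an item's users

    def _emb(self, table: dict, key: str) -> list:
        # Lazily create a small random embedding the first time a key is seen.
        if key not in table:
            table[key] = [self.rng.uniform(-0.1, 0.1) for _ in range(self.dim)]
        return table[key]

    def logit(self, user: str, item: str) -> float:
        u = self._emb(self.user_emb, user)
        v = self._emb(self.item_emb, item)
        deep = sum(a * b for a, b in zip(u, v))  # generalizing half
        wide = self.wide.get((user, item), 0.0)  # memorizing half
        return wide + deep

    def predict(self, user: str, item: str) -> float:
        return sigmoid(self.logit(user, item))

    def train_step(self, user: str, item: str, label: float, lr: float = 0.1):
        # Gradient of the log loss with respect to the shared logit.
        err = self.predict(user, item) - label
        u = self._emb(self.user_emb, user)
        v = self._emb(self.item_emb, item)
        # Both halves are updated from the same error signal: joint training.
        self.wide[(user, item)] = self.wide.get((user, item), 0.0) - lr * err
        self.user_emb[user] = [a - lr * err * b for a, b in zip(u, v)]
        self.item_emb[item] = [b - lr * err * a for a, b in zip(u, v)]


if __name__ == "__main__":
    model = WideAndDeepSketch()
    before = model.predict("user_1", "item_1")
    for _ in range(50):
        model.train_step("user_1", "item_1", 1.0)
    print(before, model.predict("user_1", "item_1"))
```

Because both halves share one logit and one error signal, the pair-specific weight only needs to absorb whatever the embeddings fail to explain, which is the sense in which the wide part can stay small.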


Sources 9 notes

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Where do memorization errors arise in chain-of-thought reasoning?

STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% vs. 42.6% for a random baseline). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Can one model memorize and generalize better than two?

Wide & Deep models train memorization (cross-product features) and generalization (embeddings) together, allowing each component to specialize: the wide part becomes small because deep handles common cases, and deep doesn't overfit rare items because wide captures them. Ensembling requires both halves full-size.

Can one model handle both memorization and generalization?

Wide & Deep architectures train a sparse cross-product tower and a dense embedding tower together, allowing the wide part to patch only the deep part's weaknesses. This joint approach requires smaller models than ensemble methods.

Can splitting adaptation into two channels reduce forgetting?

Fast-Slow Training routes task-specific lessons into optimized prompts while keeping parameter updates minimal, reaching equivalent performance 1.4–3x faster with substantially less catastrophic forgetting and plasticity loss, demonstrating that forgetting is a misallocation problem rather than an inherent cost.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Next inquiring lines