How do models develop dense representations for familiar training data?

This explores what actually happens inside a model when it meets data it has seen a lot of during training — and why familiarity shows up as denser, busier internal activity.

The corpus has a surprisingly direct answer: density isn't baked into the architecture, it's earned. During pretraining, networks build up rich, dense activation patterns for inputs they've encountered repeatedly and fall back to sparse, thinned-out representations for anything unfamiliar. This happens on its own, without any task-specific fine-tuning, simply as a side effect of repeated exposure (see "Is representational sparsity learned or intrinsic to neural networks?"). Density, in other words, is a fingerprint of how well-trodden a piece of input is.

The flip side makes the picture sharper. When a model hits something out-of-distribution, such as a hard or unfamiliar task, its hidden states become systematically sparser. This isn't a breakdown but a kind of selective filtering that keeps performance stable under unfamiliar load (see "Do language models sparsify their activations under difficult tasks?"). So the same dial runs in both directions: dense for the familiar, sparse for the strange. Familiarity and difficulty sit at opposite ends of one learned spectrum.
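One way to make "dense" versus "sparse" concrete is to count how many entries of a hidden state sit near zero. The sketch below is only an illustration and is not the measurement used in the cited notes: the function name `activation_sparsity`, the threshold `eps`, and the toy vectors are all assumptions made up for this example.

```python
from typing import Sequence


def activation_sparsity(hidden: Sequence[float], eps: float = 1e-3) -> float:
    """Fraction of entries with |value| < eps.

    0.0 means every unit is active (fully dense);
    1.0 means every unit is near zero (fully sparse).
    The threshold eps is an illustrative choice, not a standard value.
    """
    if not hidden:
        raise ValueError("hidden state must be non-empty")
    near_zero = sum(1 for h in hidden if abs(h) < eps)
    return near_zero / len(hidden)


# Toy hidden states, invented for illustration (not real model activations).
familiar = [0.8, -0.5, 0.3, 0.9, -0.7, 0.4, 0.6, -0.2]
unfamiliar = [0.0, 0.0, 1.2, 0.0, 0.0, -0.9, 0.0, 0.0]

print(activation_sparsity(familiar))    # 0.0
print(activation_sparsity(unfamiliar))  # 0.75
```

On a real model the same function could be applied per layer to hidden states captured during a forward pass; comparing the scores for in-distribution and out-of-distribution prompts would be one rough way to look for the dense-to-sparse dial described above.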

There's a capacity story underneath all this. Models don't densify forever: GPT-family models have a measurable memorization ceiling of roughly 3.6 bits per parameter, and once memorization fills that budget a phase transition ("grokking") flips the model from storing specific examples toward genuine generalization (see "When do language models stop memorizing and start generalizing?"). Dense representations for familiar data are part of how that budget gets spent. Consolidation, then, draws on a finite resource; the model is not an infinite sponge.
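To get a feel for the size of that budget, the arithmetic can be written out. This is a back-of-the-envelope sketch that takes the roughly 3.6 bits-per-parameter figure at face value; the function names and the 125M-parameter example are assumptions chosen for illustration.

```python
# Approximate figure quoted in the source note for GPT-family models.
BITS_PER_PARAM = 3.6


def memorization_capacity_bits(num_params: int,
                               bits_per_param: float = BITS_PER_PARAM) -> float:
    """Estimated memorization capacity in bits for a model of the given size."""
    return num_params * bits_per_param


def capacity_megabytes(num_params: int) -> float:
    """Same estimate expressed in megabytes (8 bits per byte, 10**6 bytes per MB)."""
    return memorization_capacity_bits(num_params) / 8 / 1e6


# Hypothetical 125M-parameter model.
print(f"{capacity_megabytes(125_000_000):.2f} MB")  # 56.25 MB
```

Under this estimate, a 125M-parameter model could memorize on the order of tens of megabytes of information before the budget is full, which is small next to a typical pretraining corpus.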

What's interesting is how *structured* this consolidation turns out to be. Pretraining doesn't just thicken activations uniformly; it sorts knowledge into modular subnetworks. Pruning experiments show distinct compositional subroutines living in isolated parts of the network, and pretraining makes that modularity markedly more consistent (see "Do neural networks naturally learn modular compositional structure?"). Depth plays a role too: at small scale, deep-and-thin models outperform balanced designs because stacked layers let abstract concepts compose on top of each other rather than spreading parameters across width (see "Does depth matter more than width for tiny language models?"). So "dense representation" isn't a blur: it's layered, modular, and concept-shaped.

The sting in the tail: these consolidated representations are sticky, sometimes too sticky. Strongly trained parametric knowledge can override what's sitting right in the model's context window, so a model ignores fresh information because its priors won't budge, and plain prompting can't fix it (see "Why do language models ignore information in their context?"). Yet the same representations are fragile under direct weight updates, which is why *how* you fine-tune matters: directly rewriting weights corrupts knowledge stored in lower layers, whereas decoding-time approaches like proxy-tuning leave the base weights untouched and shift mainly style and reasoning (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). The dense representations a model builds for familiar data are valuable enough that the best interventions are the ones that don't disturb them.
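The decoding-time idea can be sketched in a few lines. The example assumes the commonly described proxy-tuning formulation, in which the logit difference between a small tuned model and its untuned counterpart is added to the large base model's logits at each decoding step. The function names and the three-token toy logits are hypothetical, and no real model is involved.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits: Sequence[float],
                             tuned_small_logits: Sequence[float],
                             untuned_small_logits: Sequence[float]) -> List[float]:
    """Shift the base model's logits by (tuned - untuned) from a small model pair.

    The base model's weights are never modified; only the next-token
    distribution is adjusted at decoding time.
    """
    if not (len(base_logits) == len(tuned_small_logits) == len(untuned_small_logits)):
        raise ValueError("all logit vectors must share the same vocabulary size")
    shifted = [b + (t - u) for b, t, u in
               zip(base_logits, tuned_small_logits, untuned_small_logits)]
    return softmax(shifted)


def argmax(values: Sequence[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


# Toy logits over a 3-token vocabulary (invented for illustration).
base = [2.0, 1.0, 0.0]
tuned_small = [0.0, 2.0, 0.0]
untuned_small = [0.0, 0.0, 0.0]

print(argmax(softmax(base)))                                               # 0
print(argmax(proxy_tuned_distribution(base, tuned_small, untuned_small)))  # 1
```

The point of the design is that the offset changes which token wins without touching the base model's parameters, so whatever knowledge those weights store is left as it was.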

Sources 7 notes

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Do language models sparsify their activations under difficult tasks?

As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.

When do language models stop memorizing and start generalizing?

GPT-family models have a measurable memorization capacity of approximately 3.6 bits-per-parameter. When this capacity fills, a phase transition triggers grokking—the shift from memorization to genuine generalization. This capacity is a property of individual models, not training algorithms.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
