Does representational density emerge from training data exposure during pretraining?
This explores whether the 'density' of a model's internal representations (how richly it activates for a given input) is something built up through seeing data during pretraining, rather than baked into the architecture from the start. The corpus answers directly: yes, density is learned through familiarity. Networks develop dense activations for inputs that resemble their training data and fall back to sparse representations for unfamiliar ones, and this split emerges during pretraining itself, before any task-specific tuning (see "Is representational sparsity learned or intrinsic to neural networks?"). Density isn't a fixed property of the network; it's a fingerprint of exposure.
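To make 'density' concrete, here is a minimal sketch that treats it as the fraction of units in a layer whose activation magnitude exceeds a small cutoff. This metric, the `eps` cutoff, and the toy activation vectors are all assumptions for illustration; the source note does not say which density measure was used.

```python
def activation_density(activations, eps=1e-6):
    """Fraction of units whose magnitude exceeds eps (an assumed proxy for density)."""
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if abs(a) > eps)
    return active / len(activations)


# Hypothetical post-ReLU activations of one layer for two inputs.
familiar = [0.8, 0.0, 1.2, 0.4, 0.0, 0.9, 0.3, 0.6]    # resembles training data
unfamiliar = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.2]  # out of distribution

print(activation_density(familiar))    # 0.75
print(activation_density(unfamiliar))  # 0.25
```

Under this proxy, the claim in the note is that the first number tends to exceed the second once pretraining has exposed the model to inputs like `familiar`.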
What makes this interesting is how cleanly it rhymes with a whole cluster of findings about pretraining being the decisive formative stage. Cognitive biases turn out to work the same way: models that share a pretrained backbone show the same bias patterns regardless of what finetuning data they later see, so the biases are planted in pretraining and only nudged afterward (see "Where do cognitive biases in language models come from?"). Even reasoning ability seems to be present in base-model activations already; post-training selects and elicits it rather than creating it (see "Do base models already contain hidden reasoning ability?"). The recurring theme: the substance is laid down by exposure during pretraining, and later stages mostly steer what's already there.
The familiarity mechanism gets even more concrete when you look at how predictable it is. Whether a keyword gets 'primed' after learning is strongly predictable from its probability *before* learning, with a sharp threshold around 10^-3, and as few as three exposures are enough to lock the effect in (see "Can we predict keyword priming before learning happens?"). That's the same story as representational density at a finer grain: prior exposure determines how the model lights up. And it has practical teeth: you can read the statistics of pretraining data to predict failure. Entity co-occurrence patterns from training data flag hallucination risk better than the model's own confidence, because the root cause is unseen *combinations* of things the model never densely encoded (see "Can pretraining data statistics detect hallucinations better than model confidence?").
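The data-side check described above can be sketched as a lookup of entity pairs against co-occurrence counts from the training corpus. This is an illustrative toy, not the QuCo-RAG implementation: the function names, the `min_count` parameter, and the tiny corpus are hypothetical, and entity extraction is assumed to have happened already.

```python
from itertools import combinations


def build_cooccurrence(documents):
    """Count how often each unordered entity pair appears in the same document."""
    counts = {}
    for entities in documents:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts


def unseen_pairs(query_entities, counts, min_count=1):
    """Return query entity pairs that co-occur fewer than min_count times."""
    pairs = combinations(sorted(set(query_entities)), 2)
    return [p for p in pairs if counts.get(p, 0) < min_count]


# Toy corpus: each document is reduced to its extracted entities (hypothetical).
corpus = [
    ["Marie Curie", "radium", "Paris"],
    ["Marie Curie", "Nobel Prize"],
    ["Paris", "Eiffel Tower"],
]
counts = build_cooccurrence(corpus)

risky = unseen_pairs(["Marie Curie", "Eiffel Tower"], counts)
safe = unseen_pairs(["Marie Curie", "radium"], counts)
print(risky)  # [('Eiffel Tower', 'Marie Curie')]
print(safe)   # []
```

A non-empty result marks a combination the model never saw together, which could be used to trigger retrieval regardless of how confident the model's own output looks.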
But exposure isn't monolithic: *what kind* of knowledge you absorb matters. Analysis of five million pretraining documents shows reasoning leans on broad, transferable procedural knowledge drawn from many sources, while factual recall depends on narrow, document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). So 'density from exposure' isn't one uniform dial; familiar procedures generalize, while familiar facts stay pinned to where they were seen. That's a useful corrective to a simple 'more data = denser everywhere' picture.
The flip side worth knowing: if density is learned, you can also damage it. Direct fine-tuning corrupts knowledge stored in lower layers, while decoding-time approaches that leave base weights untouched preserve that pretrained knowledge far better (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). In other words, the dense representations pretraining builds are an asset that aggressive post-training can erode, which is exactly why so much recent work tries to stay close to the base distribution rather than overwrite it.
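One way a decoding-time approach can leave base weights untouched is to shift the base model's next-token logits by the difference between a small tuned model and its untuned counterpart. The sketch below shows that logit arithmetic on toy vectors. It assumes all three models share one vocabulary, and the function names and numbers are hypothetical rather than taken from the source notes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shifted_distribution(base_logits, tuned_logits, untuned_logits):
    """Add the (tuned - untuned) offset of a small model pair to the base logits."""
    if not (len(base_logits) == len(tuned_logits) == len(untuned_logits)):
        raise ValueError("all logit vectors must share one vocabulary")
    combined = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_logits, untuned_logits)
    ]
    return softmax(combined)


# Toy next-token logits over a 3-token vocabulary (made-up values).
base = [2.0, 1.0, 0.0]           # large pretrained model, weights never updated
small_tuned = [0.0, 1.5, 0.0]    # small model after tuning
small_untuned = [0.0, 0.5, 0.0]  # same small model before tuning

probs = shifted_distribution(base, small_tuned, small_untuned)
print(probs)
```

The base model's parameters are only read, never written, so whatever pretraining stored in them is untouched; the tuning signal enters purely as an offset at decoding time.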
Sources (7 notes)
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.
QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.