What happens to representational structure during model pretraining phases?
This explores what actually changes inside a model's internal representations as it learns during pretraining — whether structure like density, modularity, and feature hierarchy is built up by exposure or is intrinsic.
The corpus tells a surprisingly consistent story about what happens to a model's internal representations during pretraining: structure is grown through exposure, not handed to the model at initialization. The clearest example is that representational density is *learned*. As a model sees more of its training data, it develops dense, richly populated activations for familiar inputs and falls back to sparse representations for unfamiliar ones, a pattern that emerges on its own, without any task-specific fine-tuning (see "Is representational sparsity learned or intrinsic to neural networks?"). The flip side shows up at inference: when a trained model hits a harder or out-of-distribution task, its hidden states become substantially sparser, and that sparsification acts as a selective filter that stabilizes performance. The dense/sparse axis looks less like a bug than a learned coping mechanism for unfamiliarity (see "Do language models sparsify their activations under difficult tasks?").
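To make the dense/sparse contrast described above concrete, here is a minimal sketch of two ways one might score how sparse an activation vector is. The cited notes do not say which metric the underlying work uses, so both measures are assumptions chosen for illustration: a simple near-zero fraction and Hoyer's sparsity measure. The two activation vectors are hand-written, not taken from any model.

```python
import math

def near_zero_fraction(activations, eps=1e-3):
    """Fraction of entries whose magnitude is at most eps."""
    return sum(1 for a in activations if abs(a) <= eps) / len(activations)

def hoyer_sparsity(activations):
    """Hoyer sparsity in [0, 1]: 0 for a uniform vector, 1 for a one-hot vector."""
    n = len(activations)
    l1 = sum(abs(a) for a in activations)
    l2 = math.sqrt(sum(a * a for a in activations))
    if l2 == 0.0 or n < 2:
        return 0.0  # convention chosen here for degenerate inputs
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

# Hypothetical hidden states, invented for illustration only.
familiar_input = [0.8, -0.5, 0.6, 0.7, -0.9, 0.4, 0.5, -0.6]
unfamiliar_input = [0.0, 0.0, 1.2, 0.0, 0.0, 0.0, -0.1, 0.0]

for name, acts in [("familiar", familiar_input), ("unfamiliar", unfamiliar_input)]:
    print(name, near_zero_fraction(acts), round(hoyer_sparsity(acts), 3))
```

On real models the same functions would be applied to hidden states captured per layer and averaged over many inputs; how those states are extracted depends on the framework and is left out here.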
Beyond density, pretraining also organizes representations *structurally*. Circuit tracing inside Claude models reveals a four-tier hierarchy: features climb from raw token inputs, to abstract concepts, to functional operations, to outputs. Larger models grow richer abstract tiers, which suggests that scale buys higher-level conceptual machinery rather than just more memorization (see "How do language models organize features across processing layers?"). In parallel, pruning experiments show that networks carve compositional tasks into isolated modular subnetworks that can be ablated independently, and pretraining substantially increases how consistent and reliable that modularity is across architectures and domains (see "Do neural networks naturally learn modular compositional structure?"). So the picture is not a uniform blob getting bigger: density, hierarchy, and modular specialization all consolidate together.
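The idea of a subnetwork that can be ablated independently can be sketched with a toy example. Everything below is hypothetical: the weights are hand-picked so that each output reads from its own pair of hidden units, which is the modular structure the pruning experiments are said to find, not something this code discovers.

```python
def forward(x, unit_mask):
    """Tiny two-layer linear net; unit_mask zeroes out ablated hidden units."""
    # Hidden layer: 4 units reading a 2-d input (hand-picked weights).
    w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
    # Output 0 reads only hidden units 0-1; output 1 reads only units 2-3.
    w_out = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
    hidden = [
        m * sum(w * xi for w, xi in zip(row, x))
        for row, m in zip(w_in, unit_mask)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

x = [2.0, 3.0]
full = forward(x, [1, 1, 1, 1])
ablated = forward(x, [0, 0, 1, 1])  # knock out the subnetwork feeding output 0

# Only the function served by the ablated units changes.
print(full, ablated)
```

In a trained network the interesting question is whether such a clean mask exists at all; the sketch only shows what the ablation test looks like once a candidate mask is in hand.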
Here's the part that reframes the question: a lot of what we associate with later "training" is already laid down during pretraining. Several independent lines of evidence indicate that base models already carry latent reasoning capability in their activations; RL, critique fine-tuning, decoding changes, and SAE feature steering all *elicit* it rather than create it (see "Do base models already contain hidden reasoning ability?"). That's why RL post-training looks more like teaching a model *when* to reason than *how* (see "Does RL post-training create reasoning or just deploy it?"). Some researchers push reasoning structure even earlier, treating chain-of-thought as an exploratory action *during* pretraining with an information-gain reward (the improvement in log-likelihood, with no external verifier). They report substantial gains on math and science benchmarks, which suggests that representational structure for reasoning can be planted in the pretraining phase itself (see "Can chain-of-thought reasoning be learned during pretraining itself?").
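One way to read the information-gain reward mentioned above is as a difference of log-likelihoods: how much more probable the observed continuation becomes once the model has produced a chain of thought. The sketch below assumes that reading. The function name and the probabilities are hypothetical, and how the cited method computes its no-thought baseline is not described here.

```python
import math

def information_gain_reward(p_with_thought, p_without_thought):
    """Log-likelihood improvement from conditioning on a chain of thought.

    Both arguments are probabilities the model assigns to the same observed
    continuation; positive reward means the thought helped prediction.
    """
    return math.log(p_with_thought) - math.log(p_without_thought)

# Hypothetical probabilities, invented for illustration only.
helpful = information_gain_reward(0.6, 0.3)    # thought doubled the probability
unhelpful = information_gain_reward(0.2, 0.4)  # thought halved the probability
print(round(helpful, 3), round(unhelpful, 3))
```

Because the reward comes from the model's own likelihoods, no external verifier or labeled answer is needed, which is what makes it usable on ordinary pretraining text.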
The corollary worth knowing: because so much structure is baked in during pretraining, *how* you touch it afterward matters enormously. Direct weight fine-tuning corrupts knowledge stored in lower layers, while decoding-time proxy tuning leaves the base model's weights untouched and does better on knowledge tasks, evidence that the lower-layer representational scaffold built during pretraining is both valuable and fragile (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). The same pretrained priors can even override what's in front of the model: when parametric associations are strong enough, the model ignores its own context, and prompting alone does not fix it; only intervening directly in the representations breaks the grip (see "Why do language models ignore information in their context?").
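The reason decoding-time proxy tuning cannot damage the base model is visible in the arithmetic. A minimal sketch, assuming the commonly described formulation in which the logit difference between a small tuned model and its untuned counterpart is added to the large base model's logits at each decoding step. The function names and the three-token vocabulary are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_logits(base, expert, anti_expert):
    """Shift base logits by (tuned small model - untuned small model)."""
    return [b + (e - a) for b, e, a in zip(base, expert, anti_expert)]

# Hypothetical next-token logits over a 3-token vocabulary.
base = [2.0, 1.0, 0.0]         # large pretrained model, weights never updated
expert = [0.5, 2.5, 0.0]       # small model after tuning
anti_expert = [0.5, 0.5, 0.0]  # the same small model before tuning

shifted = proxy_tuned_logits(base, expert, anti_expert)
probs = softmax(shifted)
print(shifted, [round(p, 3) for p in probs])
```

The base model only ever contributes a forward pass, so whatever its lower layers store is read but never rewritten; the tuning signal lives entirely in the offset applied to the output distribution.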
The thing you might not have known you wanted to know: pretraining isn't just absorbing facts — it's quietly building a layered, modular, density-graded representational geometry that already contains latent skills, and most of what happens later is selecting and routing through that geometry rather than rewriting it.
Sources (9 notes)
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
As task difficulty increases, LLM hidden states become substantially sparser in a localized, systematic way that correlates with task unfamiliarity and reasoning load. This sparsification acts as a selective filter stabilizing performance under OOD shift rather than a failure mode.
Circuit tracing in Claude models reveals features progress from token-level inputs to abstract concepts to functional operations to outputs. Larger models develop richer abstract features, suggesting scaling enables higher-level conceptual reasoning rather than pattern memorization.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than capability creation. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
RLP treats CoT as an exploratory action during pretraining, using log-likelihood improvement as a verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.