Why does gradient descent discover compositional structure without explicit pressure?

This explores why networks trained by ordinary gradient descent end up with modular, compositional internal organization even though nothing in the loss function explicitly rewards modularity — and whether that 'discovery' is as real as it looks.

The corpus gives you both the optimistic story and its sharp rebuttal in the same breath. On the optimistic side, pruning experiments show that networks implement compositional subroutines as isolated subnetworks: ablate one and only its corresponding function breaks, which means the modular structure is real and localizable, not a story we project onto a blob of weights [Do neural networks naturally learn modular compositional structure?]. Strikingly, the cleanest answer to 'why, without pressure?' may be that no special pressure is needed. Standard MLPs reach compositional generalization through data and model scaling alone, with no architectural priors and no symbolic scaffolding, provided the training distribution covers enough combinations of the underlying task pieces [Can neural networks learn compositional skills without symbolic mechanisms?]. The implicit pressure, in other words, comes from the data's own combinatorial structure, and gradient descent simply finds the compressed representation that exploits it.
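The logic of the ablation test can be sketched on a toy model. This is only an illustration under loud assumptions: the "network" is hand-wired rather than trained, and the names `forward`, `broken_functions`, and the two toy functions (a sum and a difference) are hypothetical, not taken from the pruning experiments themselves.

```python
# Toy illustration only: a hand-wired linear "network" whose hidden units
# are split into two groups, one per function. Nothing here is trained.

def forward(a, b, mask):
    """Compute both outputs with a per-hidden-unit mask (1 = keep, 0 = ablate)."""
    # Hidden units 0-1 serve the "sum" function, units 2-3 serve "diff".
    hidden = [a, b, a, -b]
    hidden = [h * m for h, m in zip(hidden, mask)]
    out_sum = hidden[0] + hidden[1]
    out_diff = hidden[2] + hidden[3]
    return out_sum, out_diff


def broken_functions(mask, inputs):
    """Return the names of the functions whose outputs change under the mask."""
    intact = [1, 1, 1, 1]
    broken = set()
    for a, b in inputs:
        full = forward(a, b, intact)
        ablated = forward(a, b, mask)
        if full[0] != ablated[0]:
            broken.add("sum")
        if full[1] != ablated[1]:
            broken.add("diff")
    return broken


inputs = [(1, 2), (3, 5), (-4, 7)]
ablate_sum_module = [0, 0, 1, 1]   # zero out the units serving "sum"
ablate_diff_module = [1, 1, 0, 0]  # zero out the units serving "diff"

print(broken_functions(ablate_sum_module, inputs))   # only "sum" is affected
print(broken_functions(ablate_diff_module, inputs))  # only "diff" is affected
```

In a trained network the mask would have to be found (for example by pruning) rather than written down by hand; the sketch only shows what "ablate one and only its function breaks" means as a check.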

There's even a geometric fingerprint of this happening. The Polar Probe finds that language models spontaneously encode syntactic type and direction in something like a polar coordinate system, using both distance and angle between embeddings, without ever being trained to build a symbolic grammar [How do language models encode syntactic relations geometrically?]. So structure that looks designed keeps falling out of undirected optimization. A related thread argues this is what language models are doing at the deepest level: compressing purely relational structure (Saussure's *langue*) out of text, with meaning emerging from internal relations rather than external grounding [Can language models learn meaning without engaging the world?]. Composition, on this view, is just what efficient compression of a relational world looks like.
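Why angle adds information beyond distance can be seen in a small sketch. It assumes, purely for illustration, that word embeddings have already been projected into a 2-D plane; the coordinates and the names `distance` and `angle_deg` are made up here and do not reproduce the Polar Probe itself.

```python
import math


def distance(u, v):
    """Euclidean distance between two 2-D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def angle_deg(u, v):
    """Orientation of the vector from u to v, in degrees."""
    return math.degrees(math.atan2(v[1] - u[1], v[0] - u[0]))


# Hypothetical projected embeddings: one head word and two dependents.
head = (0.0, 0.0)
dep_a = (1.0, 0.0)
dep_b = (0.0, 1.0)

# Both dependents sit at the same distance from the head...
print(distance(head, dep_a), distance(head, dep_b))
# ...but at different angles, so a probe that reads angle as well as
# distance can separate two relations a distance-only probe would merge.
print(angle_deg(head, dep_a), angle_deg(head, dep_b))
```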

But the corpus refuses to let you celebrate too early, and this is the part you probably didn't come looking for. 'Linearly decodable' is not the same as 'well-organized.' Models trained with SGD can contain every feature a task needs in cleanly readable form while their actual internal organization is fractured: perfectly accurate, yet quietly brittle under perturbation and distribution shift in ways standard metrics never reveal [Can models be smart without organized internal structure?]. Worse, what looks like learned composition in transformers may be memorized computation subgraphs: the model linearizes and matches subgraphs seen in training, sails through in-distribution tests, then collapses on genuinely novel combinations, with errors compounding across steps [Do transformers actually learn systematic compositional reasoning?]. So gradient descent often 'discovers' the *appearance* of compositional structure, meaning pattern coverage, rather than the recombinable rules themselves.
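One common way to separate pattern coverage from recombinable rules is to test on combinations that never occur in training while every individual piece does. The sketch below builds such a split; the function name `compositional_split` and the verb/modifier vocabulary are hypothetical and not taken from the cited notes.

```python
from itertools import product


def compositional_split(primitives_a, primitives_b, held_out):
    """Split all (a, b) pairs into train/test, holding out chosen combinations.

    Raises ValueError if holding out those pairs would remove a primitive
    from training entirely, since that would test unseen primitives rather
    than unseen combinations.
    """
    all_pairs = list(product(primitives_a, primitives_b))
    test = [p for p in all_pairs if p in held_out]
    train = [p for p in all_pairs if p not in held_out]

    seen_a = {a for a, _ in train}
    seen_b = {b for _, b in train}
    if seen_a != set(primitives_a) or seen_b != set(primitives_b):
        raise ValueError("a held-out primitive never appears in training")
    return train, test


# Hypothetical task pieces.
verbs = ["jump", "walk", "run"]
modifiers = ["twice", "left", "backwards"]
held_out = {("jump", "twice"), ("run", "left")}

train, test = compositional_split(verbs, modifiers, held_out)
print(len(train), "training combinations;", len(test), "novel test combinations")
```

A model that only matches memorized patterns can score well on `train`-like inputs and still fail on `test`, which is the gap the paragraph above describes.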

The deeper diagnosis is the binding problem: networks struggle to segregate entities, keep their representations separate, and reuse learned structure in new arrangements. That is precisely why systematic generalization fails, and why scaling only *partially* rescues it by making compositional representations more likely to emerge [Why do neural networks fail at compositional generalization?]. That reframes the whole question. Gradient descent has no hidden bias toward modularity; it discovers compositionality when, and only when, the data makes a compositional solution the cheapest one to compress, and when it can't, it fakes it convincingly.

If you want to see what happens when you stop leaving this to chance, two notes flank the spectrum. Forcing the issue with sparse-weight training produces genuinely disentangled, human-readable circuits where neurons map to single concepts: modularity by explicit design rather than emergence, at the cost of scale [Can sparse weight training make neural networks interpretable by design?]. At the other extreme, you can abandon a single descent trajectory entirely: swarms of model 'particles' searching weight space can compose new experts with capabilities none of the originals possessed, finding them through collaborative search with no gradient pressure at all [Can language models discover new expertise through collaborative weight search?]. Between forced sparsity and gradient-free search sits the real answer to your question: gradient descent finds compositional structure when the loss landscape rewards it for free, and the open research frontier is telling that genuine discovery apart from its very convincing imitation.
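For readers unfamiliar with gradient-free swarm search, here is a minimal sketch of textbook particle swarm optimization on a toy objective. It is not the method from the cited note, which moves LLM weights and scores them on validation examples; the objective `sphere`, the function `pso`, and all hyperparameter values are assumptions chosen for illustration.

```python
import random


def sphere(x):
    """Toy objective to minimize: sum of squares, with its minimum at the origin."""
    return sum(v * v for v in x)


def pso(objective, dim=3, n_particles=10, steps=50, seed=0,
        inertia=0.7, c_personal=1.5, c_global=1.5):
    """Textbook particle swarm optimization; uses no gradients at all."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]            # each particle's best position so far
    best_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # swarm-wide best so far
    initial_best = g_val

    for _ in range(steps):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own
                # best, and pull toward the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val, initial_best


best_position, final_best, initial_best = pso(sphere)
print("best value before search:", initial_best)
print("best value after search: ", final_best)
```

The only signal the swarm uses is the objective value at each particle's position, which is why this family of methods can work from a small set of evaluation examples instead of backpropagated gradients.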

Sources (9 notes)

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

How do language models encode syntactic relations geometrically?

The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.

Can language models learn meaning without engaging the world?

Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do transformers actually learn systematic compositional reasoning?

Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.

Why do neural networks fail at compositional generalization?

Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures due to three failures: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.
