How do classical mechanics and statistical mechanics provide methodological templates for learning theory?
This explores how physics — both the clockwork predictability of classical mechanics and the average-over-many-particles logic of statistical mechanics — is being borrowed as a model for how to build a theory of deep learning.
The clearest statement in the corpus is the idea of "learning mechanics" as an emerging unifying frame (see "Can deep learning theory unify around training dynamics?"). The move is methodological, not just metaphorical: classical mechanics gives you trajectories, tracking how a system evolves step by step, while statistical mechanics tells you to stop tracking individual particles and instead predict the aggregate, average-case behavior of a huge population. Applied to neural networks, that means studying training dynamics over time and reasoning about typical-case outcomes, rather than chasing the worst-case bounds that older learning theory prized.
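The contrast between the two templates can be made concrete with a toy sketch. Everything here is an illustrative assumption and comes from none of the source notes: the quadratic loss, the function names `train_trajectory` and `ensemble_mean_loss`, and the Gaussian initialization are all hypothetical. The "classical" view follows one training run step by step; the "statistical" view ignores individual runs and looks at the loss curve averaged over many random initializations.

```python
import random


def train_trajectory(w0, lr=0.1, steps=50, target=2.0):
    """Gradient descent on the toy loss 0.5 * (w - target)**2.

    Returns the loss recorded before each update, i.e. one full trajectory.
    """
    w = w0
    losses = []
    for _ in range(steps):
        losses.append(0.5 * (w - target) ** 2)
        w -= lr * (w - target)  # the gradient of the toy loss is (w - target)
    return losses


def ensemble_mean_loss(n_runs=1000, seed=0, **kwargs):
    """Average the loss curve over many runs with random initial weights."""
    rng = random.Random(seed)
    # Assumption for illustration: initial weights drawn from a standard normal.
    runs = [train_trajectory(rng.gauss(0.0, 1.0), **kwargs) for _ in range(n_runs)]
    steps = len(runs[0])
    return [sum(run[t] for run in runs) / n_runs for t in range(steps)]


# "Classical" view: one specific run, followed step by step.
single = train_trajectory(w0=-1.0)

# "Statistical" view: the typical-case curve over a population of runs.
mean_curve = ensemble_mean_loss()
```

In this toy setting every run shrinks its loss by the same factor of 0.81 per step (with `lr=0.1`), so the averaged curve is perfectly regular even though individual starting points differ. Real networks are far less tidy; the sketch only shows what "predict the aggregate instead of the particle" means operationally.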
Why does this template fit? Because deep networks behave like the macroscopic systems physics was built to handle: billions of parameters whose individual values nobody can or wants to track, but whose collective behavior is strikingly regular. Several notes show that regularity emerging on its own. Networks spontaneously break compositional tasks into isolated, modular subnetworks (see "Do neural networks naturally learn modular compositional structure?"), and compositional generalization appears in standard MLPs once data and model size are scaled enough for the training distribution to cover the combinations of task modules (see "Can neural networks learn compositional skills without symbolic mechanisms?"). That's exactly the statistical-mechanics promise: order arising from scale, predictable in aggregate even when any single component is opaque.
The template also reframes what counts as a measurable quantity. Just as thermodynamics gave physics entropy and energy as the right variables, learning theory is hunting for its own. "Epiplexity" tries to measure the structural information a resource-bounded observer can actually extract from data, separating learnable regularity from noise and predicting which datasets transfer broadly (see "What can a bounded observer actually learn from data?"). Energy-Based Transformers go further and literally import the physics object: they assign an energy to each input-prediction pair and do inference by gradient descent toward low energy, generalizing better without domain-specific scaffolding (see "Can energy minimization unlock reasoning without domain-specific training?"). Reasoning-as-energy-minimization is the statistical-mechanics analogy taken fully literally.
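A minimal sketch of inference by energy minimization, under loud assumptions: in Energy-Based Transformers the energy function is learned by a network, whereas here `energy` is a hypothetical hand-written quadratic chosen only so the behavior is easy to check. The names `energy`, `grad_energy_wrt_y`, and `infer` are illustrative and do not come from any library. The idea carried over from the text is just the loop: score an input-prediction pair, then move the prediction downhill in energy.

```python
def energy(x, y):
    """Hypothetical hand-written energy: low when y is close to 3*x + 1.

    In an Energy-Based Transformer this scalar would come from a learned model.
    """
    return 0.5 * (y - (3.0 * x + 1.0)) ** 2


def grad_energy_wrt_y(x, y, eps=1e-6):
    """Central finite-difference estimate of dE/dy for a scalar prediction."""
    return (energy(x, y + eps) - energy(x, y - eps)) / (2 * eps)


def infer(x, y_init=0.0, step_size=0.5, n_steps=30):
    """Refine a candidate prediction by gradient descent on the energy."""
    y = y_init
    for _ in range(n_steps):
        y -= step_size * grad_energy_wrt_y(x, y)
    return y


# Start from an arbitrary guess and let the energy landscape pull it to a minimum.
prediction = infer(x=2.0)
```

With this toy energy the minimum for `x=2.0` sits at `y = 7`, and each step halves the remaining error. The number of refinement steps plays the role of inference-time compute: more steps means a lower-energy prediction.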
It's worth seeing the limits and the rivals, though, because the physics template isn't the only game. A competing tradition borrows from cognitive science rather than physics: Marr's three levels of analysis offer a different structured toolkit for explaining networks (see "Can cognitive science methods unlock how LLMs actually work?"), and a paired representational-plus-causal method argues that aggregate statistics alone leave you with correlations, not mechanisms (see "Can we understand LLM mechanisms with only representational analysis?"). There's also a ceiling the averaging view can run into: the binding problem suggests some failures are architectural, not statistical, and may not fully dissolve with scale, even if scaling partially overcomes them (see "Why do neural networks fail at compositional generalization?"). The interesting takeaway is that learning theory may end up doing what physics itself does: running a fast, average-case statistical account alongside a slower, mechanistic one, and arguing about where each applies.
Sources (8 notes)
Research shows learning mechanics is consolidating as a unified frame for deep learning, modeled on classical and statistical mechanics. It prioritizes average-case predictions, training dynamics, and aggregate statistics over worst-case bounds, mirroring how physics addresses macroscopic systems.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.
Epiplexity formalizes the structural information a computationally bounded observer can extract from data, separating learnable regularity from time-bounded entropy. This task-free measure correlates with out-of-distribution generalization and explains why some datasets enable broader transfer than others.
Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
Cognitive science's 70-year toolkit of behavioral probes, causal interventions, and representational analysis transfers directly to LLM interpretation. Marr's computational, algorithmic, and implementation levels reframe the problem structurally and enable layered rather than monolithic explanation.
Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.
Greff et al. argue that neural networks cannot dynamically bind distributed information into compositional structures because they fail at three things: segregating entities from inputs, maintaining representational separation, and reusing learned structure in novel combinations. Scaling can partially overcome this by enabling compositional representations to emerge.