Which hyperparameter theories best explain universal behaviors across neural networks?

Put differently: which theoretical frame best explains the behaviors that show up again and again across different neural networks, regardless of architecture? The corpus points less toward 'hyperparameter tuning' recipes than toward theories of training dynamics, scaling, and inductive bias.

One caveat up front: the corpus doesn't frame this as 'hyperparameter theory' in the tuning sense (learning rates, batch sizes). The strongest candidate it offers is something different, a theory of *training dynamics*. The clearest statement is the argument that deep learning theory is consolidating around what one line of work calls 'learning mechanics' (see: Can deep learning theory unify around training dynamics?). The pitch borrows from physics: instead of chasing worst-case guarantees, you model the average-case behavior of training as a dynamical system, the way statistical mechanics describes a gas without tracking every molecule. That reframing is what lets a single theory explain behavior across architectures rather than one network at a time.

What would such a theory have to explain? The corpus gives a surprisingly consistent list of universal behaviors. Networks spontaneously break tasks into modular subnetworks, so that pruning one piece knocks out only the corresponding sub-function, and pretraining makes this modularity *more* reliable across architectures and domains (see: Do neural networks naturally learn modular compositional structure?). They develop dense activations on familiar data and sparse ones on unfamiliar inputs, a pattern that emerges from exposure alone with no task-specific tuning (see: Is representational sparsity learned or intrinsic to neural networks?). And plain MLPs reach compositional generalization through scale rather than special architecture, as long as the training data covers the combinations (see: Can neural networks learn compositional skills without symbolic mechanisms?). These are three different phenomena, all emerging from training dynamics rather than from anything hand-designed, which is exactly the kind of regularity a learning-mechanics theory wants to predict.

But here's the twist the corpus keeps returning to: the universal behavior that matters most is often set by *inductive bias*, not raw expressiveness or hyperparameters. Rendle et al. report that MLPs can in theory match a dot-product similarity but in practice fail to, needing excessive capacity and data to approximate something a simple geometric operation does for free (see: Can MLPs learn to match dot product similarity in practice?). Likewise, depth beats width at small scale: deep-thin models compose abstract concepts through layers in a way that contradicts the tidy, largely shape-agnostic scaling predictions you'd expect from Kaplan-style laws (see: Does depth matter more than width for tiny language models?). So a theory of 'which knob explains everything' keeps running into the fact that *shape* and *bias* dominate the knobs.
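
The depth-versus-width point is partly an accounting claim: the same parameter budget can be spent on very different shapes. Here is a minimal sketch of that accounting, assuming plain fully connected layers with biases. The cited result concerns transformer language models, so this is only an analogy, and the function name and layer sizes are hypothetical.

```python
def mlp_param_count(dims):
    """Count weights and biases for a fully connected net with layer sizes `dims`."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(dims, dims[1:]))

# Hypothetical shapes, chosen only so the two budgets land close together.
wide_shallow = [64, 1024, 10]               # one hidden layer of width 1024
deep_thin = [64, 144, 144, 144, 144, 10]    # four hidden layers of width 144

wide_params = mlp_param_count(wide_shallow)
deep_params = mlp_param_count(deep_thin)

print(f"wide-shallow: {wide_params} params, {len(wide_shallow) - 2} hidden layer(s)")
print(f"deep-thin:    {deep_params} params, {len(deep_thin) - 2} hidden layer(s)")
```

The two counts differ by under 5%, so a law that depends on parameter count alone would treat these shapes as near-equivalent. The sketch only counts parameters and says nothing about accuracy, which is where the depth result applies.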

The scaling-law family is the other contender for a universal explanation, and the corpus extends it in a direction you might not expect. The same diminishing-returns pattern that governs reasoning tokens also governs how many search steps a research agent should take. This 'search budget law' mirrors test-time scaling, which suggests these curves are a property of the inference process itself, not of any one model (see: Do search steps follow the same scaling rules as reasoning tokens?). That's the seductive promise of scaling theory: one curve, many systems.
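
To make "diminishing returns" concrete, here is a small sketch under a loudly stated assumption: the corpus does not give a functional form, so the logarithmic curve, the function names, and the constants below are all hypothetical. It only illustrates what it would mean for two budget axes to share a curve shape while differing in scale.

```python
import math

def quality(budget, gain):
    """Assumed log-shaped return curve: each extra unit of budget helps less."""
    return gain * math.log1p(budget)

def marginal_gains(budgets, gain):
    """Improvement from each budget level to the next."""
    scores = [quality(b, gain) for b in budgets]
    return [later - earlier for earlier, later in zip(scores, scores[1:])]

budgets = list(range(1, 9))  # 1, 2, ..., 8 units of inference budget

# Hypothetical constants: same functional form, different scale per axis.
token_gains = marginal_gains(budgets, gain=3.0)   # stands in for reasoning tokens
search_gains = marginal_gains(budgets, gain=1.5)  # stands in for search steps

for b, t, s in zip(budgets[1:], token_gains, search_gains):
    print(f"budget {b}: token gain {t:.3f}, search gain {s:.3f}")
```

Under this assumed form, every additional unit of budget buys less than the previous one on both axes, and the two axes differ only by a constant factor. Whether real agents follow a logarithm, a power law, or something else is an empirical question the sketch does not settle.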

The quiet warning underneath all of this: universal *outputs* don't imply universal *internals*. The Fractured Entangled Representation hypothesis holds that two networks can produce identical answers on every test while organizing their internal structure completely differently, and that no standard benchmark can tell them apart (see: Can AI pass every test while understanding nothing?). If you want the deeper rabbit hole, that's it: any theory claiming to explain 'universal behavior' has to confront the possibility that the behavior is universal while the mechanism producing it is not. Learning mechanics is the corpus's best bet precisely because it studies the process, not just the scoreboard.
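
The gap between identical outputs and different internals already shows up in the most trivial case, a permutation of hidden units. The sketch below is a toy analogue only, not the Fractured Entangled Representation experiments, which concern far more substantive differences in how representations are organized. The weights, inputs, and function names are made up for illustration.

```python
def relu(x):
    return max(0.0, x)

def forward(x, w1, b1, w2):
    """One-hidden-layer ReLU net; returns (scalar output, hidden activations)."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    output = sum(w * h for w, h in zip(w2, hidden))
    return output, hidden

# Network A: three hidden units with hypothetical weights.
w1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.0, 1.0, -0.5]
w2 = [2.0, -1.0, 0.5]

# Network B: the same units in a different order, with output weights reordered to match.
perm = [2, 0, 1]
w1_p = [w1[i] for i in perm]
b1_p = [b1[i] for i in perm]
w2_p = [w2[i] for i in perm]

inputs = [[1.0, 2.0], [-1.0, 0.5], [3.0, -1.0]]
results = []
for x in inputs:
    out_a, hid_a = forward(x, w1, b1, w2)
    out_b, hid_b = forward(x, w1_p, b1_p, w2_p)
    results.append((out_a, out_b, hid_a, hid_b))
    print(f"x={x}: out_a={out_a}, out_b={out_b}, hidden_a={hid_a}, hidden_b={hid_b}")
```

An output-only benchmark cannot distinguish the two networks, yet hidden unit 0 means something different in each. The hypothesis makes a stronger claim than this relabeling, but the toy shows why matching the scoreboard is not evidence of matching mechanism.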

Sources 8 notes

Can deep learning theory unify around training dynamics?

Research shows learning mechanics is consolidating as a unified frame for deep learning, modeled on classical and statistical mechanics. It prioritizes average-case predictions, training dynamics, and aggregate statistics over worst-case bounds, mirroring how physics addresses macroscopic systems.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can neural networks learn compositional skills without symbolic mechanisms?

Standard MLPs achieve compositional generalization through data and model scaling alone, without architectural modifications, provided the training distribution sufficiently covers combinations of task modules. Linear decodability of constituents from hidden activations reliably predicts success.

Can MLPs learn to match dot product similarity in practice?

Rendle et al. show that carefully tuned dot products substantially outperform learned MLP similarities in collaborative filtering. MLPs require excessive capacity and data to match simple geometric similarity, and MLP-scored items cannot be retrieved efficiently at scale, which suggests inductive bias matters more than expressiveness.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Do search steps follow the same scaling rules as reasoning tokens?

Deep research agents improve with more search steps in a pattern mirroring the reasoning-token relationship, with both exhibiting diminishing returns. This reveals a new inference-compute axis beyond model capability alone.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis holds that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.
