Why should deep learning theory prioritize average-case over worst-case analysis?

This explores why deep learning theory is shifting toward predicting how models typically behave (average-case, statistical) rather than proving guarantees about the rarest failure (worst-case bounds), and what that shift buys you.

The corpus frames this shift less as a concession than as a deliberate change of perspective borrowed from physics. The clearest articulation is the emergence of "learning mechanics" as a unifying frame: it models networks the way classical and statistical mechanics model gases, caring about aggregate behavior and training dynamics rather than the pathological single particle (see "Can deep learning theory unify around training dynamics?"). Worst-case bounds in deep learning are notoriously loose and pessimistic: they describe adversarial corners the model almost never visits in practice, so a theory built on them predicts little about the models people actually train and deploy.
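As a toy illustration of that looseness (not taken from the cited notes), consider the overlap between two unit vectors. The Cauchy-Schwarz inequality guarantees only that |x·w| is at most 1, yet for randomly drawn directions in high dimension the typical overlap is on the order of 1/sqrt(d). The dimension, trial count, and seed below are arbitrary assumptions chosen for the sketch.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly on the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def overlap_stats(dim=512, trials=200, seed=0):
    """Return (worst_case_bound, mean_abs_overlap, max_abs_overlap)."""
    rng = random.Random(seed)
    overlaps = []
    for _ in range(trials):
        x = random_unit_vector(dim, rng)
        w = random_unit_vector(dim, rng)
        overlaps.append(abs(sum(a * b for a, b in zip(x, w))))
    # Cauchy-Schwarz: |x.w| <= ||x|| * ||w|| = 1 for unit vectors.
    worst_case_bound = 1.0
    return worst_case_bound, sum(overlaps) / trials, max(overlaps)


if __name__ == "__main__":
    bound, mean_abs, max_abs = overlap_stats()
    print(f"worst-case bound : {bound:.3f}")
    print(f"typical |x.w|    : {mean_abs:.3f}")
    print(f"largest observed : {max_abs:.3f}")
```

The bound is true but says almost nothing about what a random draw looks like, which is the same gap the paragraph above describes between worst-case guarantees and trained models.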

What makes the average-case approach earn its keep is that the field's most useful recent results are empirical regularities, not guarantees. You can predict where an LLM will struggle by treating it as a probability machine and asking which targets sit in low-probability regions, such as reciting the alphabet backwards or counting letters. Worst-case analysis would never surface this typical-case prediction, because logically those tasks are trivial (see "Can we predict where language models will fail?"). Similarly, the finding that deep-and-thin architectures beat balanced designs at sub-billion scale (see "Does depth matter more than width for tiny language models?"), and the finding that RL post-training reliably collapses onto one dominant pretraining format within a single epoch (see "Does RL training collapse format diversity in pretrained models?"), are statements about what training dynamics *tend* to do: the kind of aggregate regularity a mechanics-style theory is built to explain.
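The "probability machine" argument can be sketched with a deliberately tiny stand-in for an LLM. The example below assumes a character-bigram model with add-one smoothing, trained on a hypothetical corpus that only ever shows the alphabet in its usual order. None of this comes from the cited experiments. It only shows why a logically trivial target can still sit in a low-probability region.

```python
import math
import string
from collections import Counter


def train_bigram(corpus):
    """Count character bigrams and the contexts they follow."""
    pair_counts = Counter(zip(corpus, corpus[1:]))
    context_counts = Counter(corpus[:-1])
    return pair_counts, context_counts


def log_prob(text, pair_counts, context_counts, vocab_size):
    """Sum of log P(next char | previous char) with add-one smoothing."""
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        numerator = pair_counts[(prev, nxt)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


ALPHABET = string.ascii_lowercase
# Hypothetical "pretraining data": the alphabet in its usual order, repeated.
CORPUS = " ".join([ALPHABET] * 50)
VOCAB_SIZE = len(set(CORPUS))

PAIRS, CONTEXTS = train_bigram(CORPUS)
forward = log_prob(ALPHABET, PAIRS, CONTEXTS, VOCAB_SIZE)
backward = log_prob(ALPHABET[::-1], PAIRS, CONTEXTS, VOCAB_SIZE)

if __name__ == "__main__":
    print(f"log P(forward alphabet)  = {forward:.1f}")
    print(f"log P(backward alphabet) = {backward:.1f}")
```

Both strings contain the same 26 letters and are equally easy to specify, but the backward one is built entirely from transitions the model has never seen, so its log-probability is far lower.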

The deeper payoff is that average-case thinking redirects attention to *structure* over capacity. Several notes show that the interesting story lives in how representations organize themselves during typical training, not in capacity bounds: networks spontaneously sink compositional subroutines into isolated subnetworks (see "Do neural networks naturally learn modular compositional structure?"), activations grow dense for familiar data and stay sparse for the unfamiliar (see "Is representational sparsity learned or intrinsic to neural networks?"), and even a humble linear model can beat deep collaborative filtering when the right structural constraint is imposed (see "Can a linear model beat deep collaborative filtering?"). These are average-case structural facts that worst-case capacity analysis is blind to.
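The linear-model point is concrete enough to sketch. Assuming the model in question is the shallow autoencoder published as EASE^R (Steck, 2019), the item-item weights have a closed form: invert the regularized Gram matrix, rescale each column, and zero the diagonal so no item predicts itself. The toy interaction matrix and the `l2` value below are made up for illustration.

```python
import numpy as np


def fit_ease(interactions, l2=1.0):
    """Closed-form item-item weights with a zero diagonal.

    interactions: (users x items) array of 0/1 feedback.
    l2: ridge penalty (a hyperparameter; 1.0 here is arbitrary).
    """
    X = np.asarray(interactions, dtype=float)
    gram = X.T @ X + l2 * np.eye(X.shape[1])
    P = np.linalg.inv(gram)
    # Dividing column j by -P[j, j] solves the ridge problem subject to
    # diag(B) = 0, i.e. no item may be used to predict itself.
    B = P / (-np.diag(P))
    np.fill_diagonal(B, 0.0)
    return B


# Hypothetical data: item 1 is consumed by everyone, while items 0 and 2
# never appear together in the same user's history.
TOY = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 1, 1],
]

if __name__ == "__main__":
    B = fit_ease(TOY)
    print(np.round(B, 3))
    # Scores for a user who has only consumed item 0.
    print(np.round(np.array([1.0, 0.0, 0.0]) @ B, 3))
```

In this toy case the weight between items 0 and 2 comes out negative, a small instance of the anti-affinity that the source note calls essential. All of the structure comes from the zero-diagonal constraint and the data, not from depth.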

Here's the twist worth carrying away: average-case can be *too* forgiving if you only look at output. Two networks can produce identical outputs while one carries clean, transferable structure and the other carries a fractured, entangled mess that breaks the moment you push it toward novel contexts (see "Can identical outputs hide broken internal representations?"). So the real argument isn't "average-case instead of worst-case"; it's that the right unit of analysis is *distributional and dynamical*: what the model does across the data it actually meets and the trajectory it actually takes through training. That's also why pushing into the extreme tail backfires in practice: training on near-impossible RLVR problems teaches degenerate shortcuts rather than reasoning (see "Do overly hard RLVR samples actually harm model capabilities?"). The worst case isn't just hard to bound; chasing it can actively corrupt the typical case.
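The source note for the RLVR result attributes the damage to group-relative normalization, which treats rare accidental successes as high-advantage trajectories. A minimal sketch of that mechanism follows, assuming binary rewards and the commonly described advantage (reward minus group mean, divided by group standard deviation). The note does not spell out the exact formula, so the population standard deviation and the `eps` term here are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def lone_success_advantage(group_size):
    """Advantage given to a single success among group_size binary-reward rollouts."""
    rewards = [1.0] + [0.0] * (group_size - 1)
    return group_relative_advantages(rewards)[0]


if __name__ == "__main__":
    for g in (4, 8, 16, 64):
        adv = lone_success_advantage(g)
        print(f"group of {g:>2}: lone success advantage = {adv:.2f}")
```

With one success in G rollouts, the standardized advantage works out to roughly sqrt(G - 1), so the rarer the success, the harder it is reinforced, regardless of whether the rollout reached its answer through sound reasoning or a lucky shortcut.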

Sources 9 notes

Can deep learning theory unify around training dynamics?

Research shows learning mechanics is consolidating as a unified frame for deep learning, modeled on classical and statistical mechanics. It prioritizes average-case predictions, training dynamics, and aggregate statistics over worst-case bounds, mirroring how physics addresses macroscopic systems.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed the predictions on tasks such as reciting the alphabet backwards and counting letters.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can a linear model beat deep collaborative filtering?

EASE^R, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential: structural bias matters more than model capacity.

Can identical outputs hide broken internal representations?

Networks trained with SGD can reproduce outputs perfectly while having a radically different internal structure from evolved networks. Weight perturbations reveal fractured, entangled representations that prevent transfer to novel contexts or creative recombination.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
