What does a human-parseable framework for deep learning look like?

This explores what it would actually take to make deep learning legible to people: not just accurate, but structured in ways a human can inspect, reason about, and trust. The starting premise in the corpus is that legibility isn't optional polish. A human-parseable theory of deep learning is argued to be essential for safety oversight, because catching failure modes and validating explanations depends on humans having frameworks they can reason with, not on whether the AI can explain itself (Can humans understand deep learning before AI does?). So the question is really: what does a network have to look like inside for a person to follow it?

One concrete answer is forced modularity. Train transformers with sparse weights and you get compact circuits where individual neurons map to simple concepts with clean connections, and ablation studies confirm those circuits are both necessary and sufficient for the task, not just decorative (Can sparse weight training make neural networks interpretable by design?). Strikingly, this kind of structure also shows up without being engineered: pruning experiments reveal that networks naturally split compositional tasks into isolated subnetworks, each handling one function, with pretraining making that decomposition more consistent (Do neural networks naturally learn modular compositional structure?). A parseable framework, then, might be less about imposing a diagram from outside and more about coaxing out the modular structure the network already tends toward.
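To make "necessary and sufficient" concrete, here is a minimal sketch of an ablation test. Everything in it is a hypothetical stand-in: the hand-built four-unit XOR network, its weights, the names `CIRCUIT` and `predict`, and the choice of zero-ablation (the cited note does not say which ablation method was used).

```python
# Toy ablation test. The network, weights, and zero-ablation are all
# illustrative assumptions, not the setup used in the cited work.

def relu(v):
    return max(0.0, v)

# Each hidden unit is (w_x1, w_x2, bias); OUT holds its output weight.
HIDDEN = [
    (1.0, 1.0, 0.0),    # unit 0: active when at least one input is on
    (1.0, 1.0, -1.0),   # unit 1: active only when both inputs are on
    (1.0, -1.0, 0.0),   # unit 2: distractor with a small output weight
    (-1.0, 1.0, 0.0),   # unit 3: distractor with a small output weight
]
OUT = [1.0, -2.0, 0.1, -0.1]
CIRCUIT = {0, 1}                      # hypothesised XOR circuit
ALL_UNITS = set(range(len(HIDDEN)))
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def predict(x, keep):
    """Forward pass in which every unit outside `keep` is zero-ablated."""
    total = 0.0
    for i, ((w1, w2, b), w_out) in enumerate(zip(HIDDEN, OUT)):
        if i in keep:
            total += w_out * relu(w1 * x[0] + w2 * x[1] + b)
    return 1 if total > 0.5 else 0

def accuracy(keep):
    hits = sum(predict(x, keep) == y for x, y in XOR_DATA)
    return hits / len(XOR_DATA)

full = accuracy(ALL_UNITS)                    # intact network
without_circuit = accuracy(ALL_UNITS - CIRCUIT)  # necessity check
circuit_only = accuracy(CIRCUIT)              # sufficiency check
print(full, without_circuit, circuit_only)
```

Necessity shows up as the score falling to chance once the circuit is removed; sufficiency shows up as the score holding when everything except the circuit is removed. A real study would run the same two checks on a trained model rather than a hand-wired one.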

But here's the part you might not have expected to care about: identical outputs can hide wildly different internals. The Fractured Entangled Representation hypothesis holds that SGD-trained networks can produce identical outputs across all inputs, and so pass every test, while their internal representations are radically different and incoherent. Standard benchmarks simply cannot see the difference (Can AI pass every test while understanding nothing?). This is the deep reason accuracy alone can never be the framework. A theory-free, correlation-driven model can report high accuracy and still be quietly committing basic causal-inference errors: a criminal justice system that is 95% accurate would still wrongly convict thousands (Can AI models be truly free from human bias?). Human-parseability is the antidote to mistaking a good score for understanding.
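The simplest version of "same outputs, different internals" can be sketched in a few lines. This is not the Fractured Entangled Representation construction itself, only a toy under stated assumptions: two hypothetical one-input ReLU networks, `net_a` and `net_b`, where `net_b` is `net_a` with its hidden units swapped and rescaled. The rescaling works because ReLU satisfies relu(c·z) = c·relu(z) for c > 0.

```python
# Two hypothetical tiny networks that compute the same function while
# holding different hidden activations. All weights are made up.

def relu(v):
    return max(0.0, v)

def hidden(x, net):
    """Hidden-layer activations for scalar input x."""
    return [relu(w * x + b) for w, b in zip(net["W"], net["b"])]

def output(x, net):
    """Scalar output: weighted sum of hidden activations."""
    return sum(v * h for v, h in zip(net["v"], hidden(x, net)))

net_a = {"W": [1.0, -1.0], "b": [0.0, 0.5], "v": [2.0, 4.0]}

# net_b swaps the two units, scales one unit's input side by 2 and the
# other's by 4, and divides the matching output weights to compensate.
net_b = {"W": [-2.0, 4.0], "b": [1.0, 0.0], "v": [2.0, 0.5]}

inputs = [k / 4 for k in range(-8, 9)]   # -2.0 .. 2.0 in steps of 0.25

same_outputs = all(output(x, net_a) == output(x, net_b) for x in inputs)
same_internals = all(hidden(x, net_a) == hidden(x, net_b) for x in inputs)
print(same_outputs, same_internals)
```

An output-only benchmark scores these two networks identically at every input, yet a probe reading "hidden unit 0" would tell two different stories. Telling them apart requires looking at `hidden`, not just `output`.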

The corpus also hints that the most legible systems are often the ones with the strongest structural priors. A single-layer linear autoencoder forbidden from letting items predict themselves beats most deep collaborative-filtering models, because the constraint forces prediction through interpretable item relationships, and structural bias turns out to matter more than raw capacity (Can a linear model beat deep collaborative filtering?). Even architecture choices carry this flavor: deep-and-thin networks win at small scale (MobileLLM reports 2.7–4.3% accuracy gains over balanced designs at 125M–350M parameters) by composing abstract concepts layer by layer, a stacking you can narrate, rather than smearing parameters across width (Does depth matter more than width for tiny language models?).
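To show how a "no item may predict itself" constraint produces a readable item-to-item weight matrix, here is a minimal sketch. It assumes the model is a ridge-regularized linear autoencoder with a zero diagonal, for which a known closed form is B[i][j] = -P[i][j] / P[j][j] off the diagonal, with P = (XᵀX + λI)⁻¹. Whether this matches the cited model exactly is an assumption, and the interaction matrix `X`, the value of `lam`, and all function names are hypothetical.

```python
# Zero-diagonal linear autoencoder on a toy interaction matrix.
# Pure-Python sketch; the closed form used here is an assumption about
# the cited model, and the data is invented for illustration.

def invert(M):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(M)
    A = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [row[n:] for row in A]

def fit_zero_diagonal(X, lam):
    """Item-item weights B with B[i][i] = 0 enforced by construction."""
    n = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    P = invert(gram)
    return [[0.0 if i == j else -P[i][j] / P[j][j] for j in range(n)]
            for i in range(n)]

def score(user_row, B):
    """Predicted affinity for each item: user_row times B."""
    n = len(B)
    return [sum(user_row[i] * B[i][j] for i in range(n)) for j in range(n)]

# Rows are users, columns are items. Items 0 and 1 always co-occur;
# item 2 is consumed by a separate group of users.
X = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]
B = fit_zero_diagonal(X, lam=1.0)
scores = score([1, 0, 0], B)   # a user who has only touched item 0
print(B)
print(scores)
```

Because the diagonal is pinned to zero, a score for item 1 can only come from weights on other items, so each entry of `B` reads directly as "how much item i argues for item j". In this toy data the weight between items 0 and 1 is positive and the weight toward item 2 is zero.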

Put together, a human-parseable framework looks less like a single grand theory and more like a set of design moves that make structure surface: sparsity and constraints that force modularity, architectures whose computation composes in a readable order, and evaluation that probes internal coherence instead of trusting the output. The honest caveat, and the open frontier, is scale: interpretable circuits have only been maintained up to tens of millions of parameters, so whether this legibility survives at frontier scale is still unsolved (Can sparse weight training make neural networks interpretable by design?).

Sources (7 notes)

Can humans understand deep learning before AI does?

Deep learning theory must be developed in forms humans can reason about and evaluate, because human oversight of AI systems depends on frameworks for identifying failure modes and validating explanations—not on whether AI can self-explain.

Can sparse weight training make neural networks interpretable by design?

Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can AI pass every test while understanding nothing?

The Fractured Entangled Representation hypothesis shows that SGD-trained networks can produce identical outputs across all inputs while maintaining radically different internal representations. Standard benchmarks cannot detect this structural difference.

Can AI models be truly free from human bias?

Research shows that 'theory-free' AI models mask bigotry behind high accuracy metrics while committing fundamental statistical errors. A 95% accurate criminal justice system would wrongly convict thousands, demonstrating that model sophistication does not validate causal inference.

Can a linear model beat deep collaborative filtering?

ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
