How does layer removal affect transformers compared to ResNets?
This explores what happens when you delete layers from a transformer — and whether the corpus supports the well-known finding that transformers (like ResNets) tolerate layer removal because their residual connections make each layer an incremental edit rather than a load-bearing pillar.
The honest framing up front: the collection has no note that runs a head-to-head layer-removal experiment on transformers and ResNets, so I can't hand you that direct comparison. What the corpus *does* hold is the mechanism that makes the comparison interesting in the first place, scattered across several notes that never mention ResNets by name.
The shared ingredient is the residual stream. One note reframes the transformer's residual pathway as a channel where knowledge is a continuous *flow* of activations rather than something stored in any single layer (Do transformer models store knowledge or generate it continuously?). That's the same architectural trick ResNets introduced: each layer reads the running sum, adds a small correction, and writes it back. When every layer is an incremental edit on a shared bus rather than an irreplaceable stage in a pipeline, removing one layer perturbs the sum a little instead of severing it. That is the standard explanation for why both families tend to degrade gracefully under ablation instead of breaking outright, though no note in the corpus tests it directly.
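The "shared bus" argument can be made concrete with a toy sketch. It is not taken from any note in the corpus: the weights are small, random, and untrained, and the names `make_layers`, `run_residual`, `run_plain`, and `relative_change` are hypothetical. It only illustrates the arithmetic of skipping one layer when layers add to a running sum versus when each layer replaces the previous output.

```python
import numpy as np

def make_layers(depth, width, scale, seed=0):
    # Assumption: small random weights, so each layer's contribution is a small correction.
    rng = np.random.default_rng(seed)
    return [rng.normal(0.0, scale / np.sqrt(width), size=(width, width))
            for _ in range(depth)]

def run_residual(x, layers, skip=None):
    """Each layer reads the running sum and adds a correction; `skip` drops one layer."""
    h = x.copy()
    for i, w in enumerate(layers):
        if i == skip:
            continue
        h = h + np.tanh(h @ w)
    return h

def run_plain(x, layers, skip=None):
    """Pipeline without skip connections: each layer replaces the previous output."""
    h = x.copy()
    for i, w in enumerate(layers):
        if i == skip:
            continue
        h = np.tanh(h @ w)
    return h

def relative_change(run, x, layers, skip):
    # Distance between full and ablated outputs, relative to the full output's norm.
    full = run(x, layers)
    ablated = run(x, layers, skip=skip)
    return float(np.linalg.norm(full - ablated) / np.linalg.norm(full))

x = np.random.default_rng(1).normal(size=(1, 64))
layers = make_layers(depth=12, width=64, scale=0.1)
residual_change = relative_change(run_residual, x, layers, skip=5)
plain_change = relative_change(run_plain, x, layers, skip=5)
print(f"residual: {residual_change:.3f}  plain: {plain_change:.3f}")
```

In this setup the residual output moves only slightly when a layer is skipped, while the plain pipeline's output changes by more than its own magnitude. Trained networks will behave differently in detail; the sketch shows the mechanism, not a measured result.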
The corpus also suggests *why* some layers are more deletable than others. MobileLLM shares weights between adjacent transformer blocks, recomputing one block twice in place of fetching a second, with no parameter increase and, per the note, a gain in accuracy rather than a loss (Does recomputing weights cost less than moving them on mobile?). If neighboring blocks can stand in for each other that well, they are plausibly also the cheapest to remove, although the note measures sharing, not deletion. Against that, layers do carry distinct jobs: models trained with hidden chain-of-thought tokens compute correct answers in early layers and then overwrite them downstream (Do transformers hide reasoning before producing filler tokens?), and multi-hop reasoning is learned in developmental stages (How do transformers learn to reason across multiple steps?). So removal is unlikely to be uniform: deleting a redundant middle block is probably survivable, while deleting the early layers where the computation actually happens probably is not.
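Block-wise weight sharing, as described for adjacent blocks above, can be sketched in a few lines. This is a hedged illustration and not MobileLLM's code: `block` is a hypothetical toy residual block standing in for a full transformer block, and `forward_shared` is an assumed name. The point is only that effective depth doubles while the stored weights stay the same.

```python
import numpy as np

def block(h, w):
    # Hypothetical toy residual block, standing in for a full transformer block.
    return h + np.tanh(h @ w)

def forward_shared(x, unique_weights, repeats=2):
    """Apply each stored weight matrix `repeats` times in a row before moving on."""
    h = x
    blocks_applied = 0
    for w in unique_weights:
        for _ in range(repeats):
            h = block(h, w)  # same weights reused: nothing new to fetch from memory
            blocks_applied += 1
    return h, blocks_applied

rng = np.random.default_rng(0)
width, n_unique = 32, 4
unique_weights = [rng.normal(0.0, 0.1 / np.sqrt(width), size=(width, width))
                  for _ in range(n_unique)]
x = rng.normal(size=(1, width))

out, effective_depth = forward_shared(x, unique_weights)
stored_params = sum(w.size for w in unique_weights)
unshared_params = effective_depth * width * width  # cost of the same depth without sharing
print(effective_depth, stored_params, unshared_params)
```

Here four stored blocks yield an effective depth of eight at half the parameters an unshared stack of that depth would need. Whether accuracy holds up is an empirical question this sketch does not address.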
The cleanest evidence for *localized* removal effects comes from pruning experiments showing that neural networks decompose tasks into modular subnetworks, where ablating one subnetwork knocks out only its specific function and leaves the rest intact (Do neural networks naturally learn modular compositional structure?). That modularity, which pretraining strengthens and which is observed across architectures, is a structural reason removal tends to produce graceful, targeted degradation rather than collapse.
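What "ablating one subnetwork knocks out only its function" means mechanically can be shown with a mask over hidden units. The sketch below is modular by construction (an assumption made for illustration; the pruning experiments find such structure in trained networks without building it in), and all names in it are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden units 0-3 feed only output A, units 4-7 feed only output B.
# This separation is wired in by hand here, purely for illustration.
w_in = rng.normal(size=(3, 8))
w_out = np.zeros((8, 2))
w_out[:4, 0] = rng.normal(size=4)
w_out[4:, 1] = rng.normal(size=4)

def forward(x, mask):
    # `mask` zeroes out ablated hidden units before the output projection.
    hidden = np.tanh(x @ w_in) * mask
    return hidden @ w_out

x = rng.normal(size=(5, 3))
full = forward(x, np.ones(8))

mask_without_a = np.ones(8)
mask_without_a[:4] = 0.0  # ablate the subnetwork serving output A
ablated = forward(x, mask_without_a)

print("output A after ablation:", ablated[:, 0])
print("output B unchanged:", np.allclose(ablated[:, 1], full[:, 1]))
```

Output A goes to zero while output B is untouched, which is the targeted degradation pattern the pruning note describes.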
The twist worth taking away: depth is not a smooth dial. Scaling self-supervised RL networks toward 1000 layers shows capabilities switching on at *critical thresholds* in discontinuous jumps: walking appears at depth 16, wall-climbing at depth 256 (Does network depth unlock qualitatively new behaviors in RL?). That result concerns networks trained at different depths, not layers pruned from a trained model, so the implication for layer removal is an extrapolation. Still, it is sharper than 'fewer layers, slightly worse': if a behavior only exists above a depth threshold, pulling layers below that line may not dim the capability so much as delete it outright. Graceful most of the time, potentially catastrophic right at the threshold. The residual-stream mechanism helps explain the graceful regime; the threshold result marks where it may stop applying.
Sources (6 notes)
Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice costs less latency than fetching separate weights, gaining accuracy with no parameter increase.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Controlled training reveals transformers learn multi-hop reasoning in three phases: memorization, in-distribution generalization, and cross-distribution reasoning. Successful reasoning correlates with cosine clustering of entity representations, and second-hop generalization requires explicit compositional exposure during training.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Scaling to 1000-layer networks in self-supervised RL produces dramatic capability jumps at specific thresholds—depth 16 enables walking, depth 256 enables wall-climbing—driven by synergistic gains in both exploration and expressivity rather than gradual improvement.