How does adjacent layer sharing differ from non-adjacent weight reuse?

This explores weight sharing in transformers — reusing the same parameters in more than one place — and specifically why it matters whether the reused layers sit next to each other or far apart in the stack.

This explores weight sharing in transformers — reusing the same parameters in more than one place — and specifically whether it matters that the reused layers are neighbors versus spread across the network. The corpus has a sharp, concrete answer for the adjacent case and treats the non-adjacent case mostly by implication, so it's worth saying that up front.

The clearest finding is that adjacent sharing wins for a reason that has nothing to do with the math of the model and everything to do with the hardware. MobileLLM shares weights between two consecutive transformer blocks by running the same block twice in a row — and on memory-bound mobile devices, recomputing that block costs *less* than fetching a separate block's weights from memory Does recomputing weights cost less than moving them on mobile?. The weights for the block you just ran are already sitting in fast cache. Reuse them immediately and you skip a slow memory trip entirely. That locality is the whole trick. Non-adjacent reuse — pulling the same weights back in after several other layers have run — breaks it, because by then those weights have been evicted and you pay the memory-movement cost you were trying to avoid. So the difference isn't really 'adjacent vs. non-adjacent layers'; it's 'recompute-from-cache vs. re-fetch-from-memory.'

There's a deeper architectural reason adjacency is a natural fit, too. The same MobileLLM work shows that for small models, depth beats width — stacking more layers lets the network compose abstract concepts step by step, which matters more than spreading parameters sideways Does depth matter more than width for tiny language models?. If consecutive layers are doing closely related compositional work, reusing a block across that short span costs you little in capability while doubling effective depth for free. Reuse across distant layers asks one set of weights to do two unrelated jobs, which is a harder bargain.

The broader theme the corpus keeps circling is that the real bottleneck in modern models is often compute-vs-memory movement, not raw capacity — long-context work makes the same point from a different angle, finding the limiting factor is the compute to consolidate information into internal state rather than the memory to hold it Is long-context bottleneck really about memory or compute?. Adjacent sharing is one clean exploit of that asymmetry: trade cheap recomputation for expensive data movement. One honest caveat — there's a risk in any aggressive reuse that a model looks fine on metrics while its internal representations are quietly fractured, a failure standard evaluation misses Can models be smart without organized internal structure?. The corpus doesn't have a dedicated study comparing adjacent and non-adjacent reuse head-to-head, so if that exact comparison is what you're after, this is where the trail goes cold — but the locality argument tells you why the adjacent case is the one that pays off.

Sources 4 notes

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice uses less latency than fetching separate weights, gaining accuracy with no parameter increase.

Does depth matter more than width for tiny language models?

MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.

Is long-context bottleneck really about memory or compute?

Research shows the bottleneck is not memory capacity but the compute required to consolidate evicted context into fast weights during offline sleep phases. Performance improves with more consolidation passes, following a test-time scaling pattern on harder reasoning tasks.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

How does adjacent layer sharing differ from non-adjacent weight reuse?

Sources 4 notes

Next inquiring lines