Can longer task training help shorter tasks extrapolate?
When models train on related tasks at different lengths, does solving a longer auxiliary task enable a shorter main task to generalize beyond its training length? This matters for understanding how neural networks transfer learned capabilities across related problems.
The "Extrapolation by Association" paper demonstrates a specific mechanism for out-of-distribution generalization: length generalization — the ability to handle longer inputs than seen during training — can transfer from one task to another.
The setup: train multiple related tasks jointly, where an "auxiliary task" uses longer inputs and a "main task" uses shorter inputs. The finding: the main task generalizes to the length of the longer auxiliary task, even though it was never trained at that length. This works across arithmetic operations, string transformations, and maze navigation — diverse algorithmic domains sharing an underlying structural similarity.
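A minimal sketch of what such a joint training mixture could look like; the copy and reverse tasks, length cutoffs, and field names below are illustrative stand-ins, not the paper's exact setup:

```python
import random

def make_copy_example(length):
    """Auxiliary task (illustrative stand-in): copy a digit string."""
    s = "".join(str(random.randint(0, 9)) for _ in range(length))
    return {"task": "copy", "input": s, "target": s}

def make_reverse_example(length):
    """Main task (illustrative stand-in): reverse a digit string."""
    s = "".join(str(random.randint(0, 9)) for _ in range(length))
    return {"task": "reverse", "input": s, "target": s[::-1]}

def build_joint_dataset(n_per_task=1000, main_max_len=10, aux_max_len=40):
    """Joint training mixture: the main task appears only at short lengths,
    the auxiliary task at lengths up to aux_max_len."""
    data = []
    for _ in range(n_per_task):
        data.append(make_reverse_example(random.randint(1, main_max_len)))
        data.append(make_copy_example(random.randint(1, aux_max_len)))
    random.shuffle(data)
    return data

def build_extrapolation_eval(n=200, main_max_len=10, aux_max_len=40):
    """Evaluate the *main* task at lengths it never saw during training
    (longer than main_max_len, up to the auxiliary task's length)."""
    return [make_reverse_example(random.randint(main_max_len + 1, aux_max_len))
            for _ in range(n)]
```

The key property of the mixture is the asymmetry: only the auxiliary task ever exposes the model to long inputs, so any main-task accuracy on the evaluation set is extrapolation by association rather than interpolation.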
The mechanistic evidence is precise: length generalization transfer correlates with the reuse of the same attention heads between tasks. The model doesn't learn separate length-handling circuitry per task. Instead, it develops shared computational infrastructure that handles the length dimension, and this infrastructure transfers because the related tasks route through the same attention heads.
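One way to quantify head reuse is to score each attention head's importance separately for the two tasks (for instance, by the loss increase when that head is ablated) and measure how much the top-scoring sets overlap. The sketch below assumes such per-task importance matrices have already been computed and uses Jaccard overlap of the top-k heads; this is an assumed metric for illustration, not necessarily the paper's exact analysis:

```python
import numpy as np

def top_heads(importance, k):
    """Indices of the k highest-scoring heads, given a (layers, heads)
    importance matrix, e.g. loss increase when each head is ablated."""
    return set(np.argsort(importance.reshape(-1))[-k:].tolist())

def head_reuse_overlap(importance_main, importance_aux, k=16):
    """Jaccard overlap of the two tasks' top-k heads; higher overlap means
    the tasks route through the same attention heads."""
    a, b = top_heads(importance_main, k), top_heads(importance_aux, k)
    return len(a & b) / len(a | b)

# Illustrative call with random scores for a 12-layer, 12-head model.
rng = np.random.default_rng(0)
print(head_reuse_overlap(rng.random((12, 12)), rng.random((12, 12))))
```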
The pretrained-model finding extends this further: pretrained language models already exhibit similar transfer effects, suggesting that pretraining equips models with "reusable computational scaffolding" that facilitates extrapolation in downstream settings. The scaffolding is not task-specific — it is a general capability for processing longer sequences that was acquired during pretraining and can be activated by fine-tuning on related tasks.
This connects to Do base models already contain hidden reasoning ability? through a shared principle: pretraining installs capabilities that later training surfaces rather than creates. The base model already has the computational scaffolding for length handling; the auxiliary task merely activates it for the main task.
The connection to Do neural networks naturally break tasks into modular parts? is direct: attention head reuse across tasks is a specific instance of modular subnetwork sharing. The decomposition into reusable modules happens naturally, and pretraining encourages it — exactly the compositional generalization thesis applied to the length dimension.
If, as Can neural networks learn compositional skills without symbolic mechanisms? suggests, compositional generalization emerges with scale, then length generalization may follow the same trajectory: more data and larger models produce more transferable attention-head circuits. The practical implication: training on a diverse set of related tasks at varying lengths may be more efficient than training each task independently at the target length.
Source: Context Engineering
Related concepts in this collection
- Do base models already contain hidden reasoning ability? Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning. (Shared principle: pretraining installs capabilities that later training surfaces.)
- Do neural networks naturally break tasks into modular parts? Can standard neural networks decompose complex tasks into separate subroutines implemented in distinct subnetworks, or do they only memorize input-output patterns? Understanding whether compositionality emerges from gradient-based learning matters for interpretability and generalization. (Attention head reuse is a concrete instance of modular subnetwork sharing.)
- Can neural networks learn compositional skills without symbolic mechanisms? Do neural networks need explicit symbolic architecture to compose learned concepts, or can scaling alone enable compositional generalization? This asks whether compositionality is an architectural feature or an emergent property of scale. (Length generalization may share the same scaling dynamics.)
Original note title: length generalization transfers across related tasks via shared attention head reuse — pretraining provides reusable computational scaffolding