LLM Reasoning and Architecture

Can longer task training help shorter tasks extrapolate?

When models train on related tasks at different lengths, does solving a longer auxiliary task enable a shorter main task to generalize beyond its training length? This matters for understanding how neural networks transfer learned capabilities across related problems.

Note · 2026-02-23 · sourced from Context Engineering

The "Extrapolation by Association" paper demonstrates a specific mechanism for out-of-distribution generalization: length generalization — the ability to handle longer inputs than seen during training — can transfer from one task to another.

The setup: train multiple related tasks jointly, where an "auxiliary task" uses longer inputs and a "main task" uses shorter inputs. The finding: the main task generalizes to the length of the longer auxiliary task, even though it was never trained at that length. This works across arithmetic operations, string transformations, and maze navigation — diverse algorithmic domains sharing an underlying structural similarity.
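As a concrete illustration of this setup, here is a minimal data-generation sketch. The task choices (string reversal as the main task, copying as the auxiliary task) and the length caps are hypothetical stand-ins, not the paper's exact configuration:

```python
import random

MAIN_MAX_LEN = 10   # main task trained only up to this length
AUX_MAX_LEN = 40    # auxiliary task trained up to this (longer) length

def make_example(task: str, length: int) -> tuple[str, str]:
    """Build one (input, target) pair for a toy string task."""
    s = "".join(random.choice("ab") for _ in range(length))
    if task == "reverse":   # main task: reverse the string
        return s, s[::-1]
    if task == "copy":      # auxiliary task: copy the string
        return s, s
    raise ValueError(task)

def training_set(n: int) -> list[tuple[str, str, str]]:
    """Joint training mixture: main task kept short, auxiliary task long."""
    data = []
    for _ in range(n):
        if random.random() < 0.5:
            length = random.randint(1, MAIN_MAX_LEN)
            data.append(("reverse", *make_example("reverse", length)))
        else:
            length = random.randint(1, AUX_MAX_LEN)
            data.append(("copy", *make_example("copy", length)))
    return data

# Out-of-distribution evaluation for the main task: lengths the main task
# never saw, but which fall inside the auxiliary task's training range.
eval_lengths = range(MAIN_MAX_LEN + 1, AUX_MAX_LEN + 1)
```

The point of the sketch is the asymmetry: the evaluation lengths for the main task lie strictly outside its own training distribution, so any success there must come from transfer.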

The mechanistic evidence is precise: length generalization transfer correlates with the reuse of the same attention heads between tasks. The model doesn't learn separate length-handling circuitry per task. Instead, it develops shared computational infrastructure that handles the length dimension, and this infrastructure transfers because the related tasks route through the same attention heads.
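One simple way to quantify that kind of reuse, assuming per-task head importance scores obtained elsewhere (for example from ablation deltas; the scores below are invented for illustration and are not from the paper), is the overlap between each task's most important heads:

```python
def top_heads(importance: dict[tuple[int, int], float], k: int) -> set[tuple[int, int]]:
    """Return the k most important (layer, head) pairs for one task."""
    return set(sorted(importance, key=importance.get, reverse=True)[:k])

def head_reuse(imp_a: dict, imp_b: dict, k: int = 3) -> float:
    """Jaccard overlap of the two tasks' top-k important heads."""
    a, b = top_heads(imp_a, k), top_heads(imp_b, k)
    return len(a & b) / len(a | b)

# Hypothetical importance scores, keyed by (layer, head):
main_imp = {(0, 1): 0.90, (1, 3): 0.80, (2, 0): 0.70, (2, 5): 0.10}
aux_imp  = {(0, 1): 0.85, (1, 3): 0.75, (3, 2): 0.60, (2, 5): 0.20}

print(head_reuse(main_imp, aux_imp, k=3))  # → 0.5 (two of four heads shared)
```

Under the paper's claim, pairs of tasks with high reuse scores should be the pairs where length generalization transfers.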

The pretrained-model finding extends this further: pretrained language models already exhibit similar transfer effects, suggesting that pretraining equips models with "reusable computational scaffolding" that facilitates extrapolation in downstream settings. The scaffolding is not task-specific — it is a general capability for processing longer sequences that was acquired during pretraining and can be activated by fine-tuning on related tasks.

This connects to Do base models already contain hidden reasoning ability? through a shared principle: pretraining installs capabilities that later training surfaces rather than creates. The base model already has the computational scaffolding for length handling; the auxiliary task merely activates it for the main task.

The connection to Do neural networks naturally break tasks into modular parts? is direct: attention head reuse across tasks is a specific instance of modular subnetwork sharing. The decomposition into reusable modules happens naturally, and pretraining encourages it — exactly the compositional generalization thesis applied to the length dimension.

If the scaling story from Can neural networks learn compositional skills without symbolic mechanisms? carries over, length generalization may follow the same trajectory — more data and larger models produce more transferable attention head circuits. The practical implication: training on a diverse set of related tasks at varying lengths may be more efficient than training each task independently at the target length.
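A back-of-envelope sketch of that efficiency claim, with purely illustrative numbers (the constants below are assumptions, not figures from the paper):

```python
# Compare token budgets for two training strategies over N related tasks.
TARGET_LEN = 100        # length we want every task to handle
SHORT_LEN = 20          # length at which the non-auxiliary tasks are trained
N_TASKS = 5
EXAMPLES_PER_TASK = 1000

# Strategy A: train every task independently at the target length.
independent_tokens = N_TASKS * EXAMPLES_PER_TASK * TARGET_LEN

# Strategy B: train one long auxiliary task plus the rest at short lengths,
# relying on transfer to cover the gap for the short tasks.
joint_tokens = (EXAMPLES_PER_TASK * TARGET_LEN
                + (N_TASKS - 1) * EXAMPLES_PER_TASK * SHORT_LEN)

print(independent_tokens, joint_tokens)  # → 500000 180000
```

Under these assumptions the joint strategy uses well under half the tokens — but only if the transfer effect actually holds for the task family in question.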



length generalization transfers across related tasks via shared attention head reuse — pretraining provides reusable computational scaffolding