Why do trajectories matter more than individual examples for in-context learning?
Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This note explores why isolated state-action pairs fail where full trajectories succeed.
In-context learning for supervised tasks works by providing a few input-output examples. Naively applying this to sequential decision making (providing a few state-action pairs) fails to enable ICL of new tasks. The key finding: the context must contain full or partial trajectories from the same environment level as the query — not just isolated examples. This property is called trajectory burstiness.
Why the difference matters: in supervised learning, examples can come from different instances because the model only needs to learn the input-output mapping. In sequential decision making, the model must generalize from the same level or environment to handle the wide range of states it may encounter at deployment. A sparse set of state-action pairs does not cover that state space; full trajectories do.
Trajectory burstiness is the probability that a given input sequence contains at least two trajectories from the same level. When this property is present in pre-training data, the model acquires the capacity to learn new tasks from demonstrations at inference time without weight updates.
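To make the definition concrete, here is a minimal sketch in Python of how a trajectory-bursty pre-training sequence might be assembled. The `Trajectory` container, the per-level store `trajs_by_level`, and the probability `p_burst` are hypothetical names for illustration, not the source's implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical container: the full state/action/reward sequence of one
# episode, tagged with the level it was collected on.
@dataclass
class Trajectory:
    level_id: int
    steps: list  # [(state, action, reward), ...]

def build_context(trajs_by_level: dict[int, list[Trajectory]],
                  query: Trajectory,
                  p_burst: float,
                  context_size: int = 2) -> list[Trajectory]:
    """Assemble the in-context demonstrations for one training sequence.

    With probability p_burst, at least one context trajectory comes from
    the SAME level as the query (the bursty case, which together with the
    query gives the sequence two same-level trajectories); otherwise all
    context trajectories come from other levels.
    Assumes at least two levels and at least two trajectories per level.
    """
    context = []
    if random.random() < p_burst:
        # Bursty: add a different trajectory from the query's own level.
        same_level = [t for t in trajs_by_level[query.level_id]
                      if t is not query]
        context.append(random.choice(same_level))
    while len(context) < context_size:
        # Fill the rest with trajectories drawn from other levels.
        other = random.choice([lvl for lvl in trajs_by_level
                               if lvl != query.level_id])
        context.append(random.choice(trajs_by_level[other]))
    random.shuffle(context)
    return context
```

Setting `p_burst = 0` recovers the isolated-examples regime described above, exactly the setting in which ICL of new tasks fails.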
Additional factors that increase ICL performance:
- Larger model and dataset size
- More task diversity in pre-training
- Environment stochasticity (forces generalization over trajectory variation)
- Higher trajectory burstiness in pre-training data
Generalization scope demonstrated: train and test tasks differ greatly, with different states, actions, dynamics, and reward functions. The model transfers from, for example, platform games to maze navigation given only a handful of expert demonstrations. This is substantially harder than prior work that generalizes across reward-function variants of the same environment.
The implication for dataset construction: sequential decision-making ICL requires a data distribution property (trajectory burstiness) that standard language modeling data does not naturally contain. This is a data structural requirement, not just a scale requirement.
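One way to operationalize this as a dataset check, reusing the hypothetical `Trajectory` container from the sketch above: estimate the empirical trajectory burstiness of a candidate corpus, i.e., the fraction of input sequences that contain at least two trajectories from the same level.

```python
from collections import Counter

def empirical_burstiness(sequences: list[list[Trajectory]]) -> float:
    """Fraction of input sequences containing at least two trajectories
    from the same level (the definition of trajectory burstiness above)."""
    def is_bursty(seq: list[Trajectory]) -> bool:
        counts = Counter(t.level_id for t in seq)
        return any(c >= 2 for c in counts.values())
    return sum(is_bursty(seq) for seq in sequences) / len(sequences)
```

By this measure, a standard language-modeling corpus would presumably score near zero, which is the sense in which trajectory burstiness is a structural requirement rather than one that scale alone supplies.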
This connects to Does training data format shape reasoning strategy more than domain? Here the structural property sits at the trajectory level rather than the reasoning-step level, but the principle is the same: data structure determines capability.
Source: Reasoning Architectures
Related concepts in this collection
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  Trajectory burstiness is another case where data structure determines emergent capability.
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  Structural properties of training data drive learning; this applies at both the reasoning-trace and trajectory levels.
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  The context-length requirements of trajectory-bursty inference raise per-query compute costs.
- Can LLMs handle multiple tasks at once during inference?
  Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?
  Task superposition may be the representational mechanism enabling trajectory-bursty ICL: the model maintains multiple task interpretations from in-context trajectories simultaneously before committing to a single policy at generation time.
- Can transformers learn to solve new problems within episodes?
  Explores whether RL-finetuned transformers can develop meta-learning abilities that let them adapt to unseen tasks through in-episode experience alone, without weight updates.
  In-context reinforcement learning (ICRL) is the RL-trained capability that trajectory burstiness enables: same-level trajectories create the meta-learning pressure during training that ICRL exploits at inference time to adapt to unseen environments.
Original note title
trajectory burstiness — same-level trajectories in context — is required for in-context learning of sequential decision-making across new tasks