Why do trajectories matter more than individual examples for in-context learning?
Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This note explores why isolated state-action pairs fail where full trajectories succeed.
In-context learning for supervised tasks works by providing a few input-output examples. Naively applying this to sequential decision making (providing a few state-action pairs) fails to enable ICL of new tasks. The key finding: the context must contain full or partial trajectories from the same environment level as the query — not just isolated examples. This property is called trajectory burstiness.
Why the difference matters: in supervised learning, examples can come from different instances because the model only needs to learn the input-output mapping. In sequential decision making, the model must generalize from the same level or environment to handle the wide range of states it may encounter at deployment. A sparse set of state-action pairs does not cover that state space; full trajectories do.
Trajectory burstiness is the probability that a given input sequence contains at least two trajectories from the same level. When this property is present in pre-training data, the model acquires the capacity to learn new tasks from demonstrations at inference time without weight updates.
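To make the definition concrete, here is a minimal sketch in Python of how a trajectory-bursty pre-training sequence might be assembled. The `Trajectory` container, the per-level store `trajs_by_level`, and the probability `p_burst` are hypothetical names for illustration, not the source's implementation.

```python
import random
from dataclasses import dataclass

# Hypothetical container: the full state/action/reward sequence of one
# episode, tagged with the level it was collected on.
@dataclass
class Trajectory:
    level_id: int
    steps: list  # [(state, action, reward), ...]

def build_context(trajs_by_level: dict[int, list[Trajectory]],
                  query: Trajectory,
                  p_burst: float,
                  context_size: int = 2) -> list[Trajectory]:
    """Assemble the in-context demonstrations for one training sequence.

    With probability p_burst, at least one context trajectory comes from
    the SAME level as the query (the bursty case, which together with the
    query gives the sequence two same-level trajectories); otherwise all
    context trajectories come from other levels.
    Assumes at least two levels and at least two trajectories per level.
    """
    context = []
    if random.random() < p_burst:
        # Bursty: add a different trajectory from the query's own level.
        same_level = [t for t in trajs_by_level[query.level_id]
                      if t is not query]
        context.append(random.choice(same_level))
    while len(context) < context_size:
        # Fill the rest with trajectories drawn from other levels.
        other = random.choice([lvl for lvl in trajs_by_level
                               if lvl != query.level_id])
        context.append(random.choice(trajs_by_level[other]))
    random.shuffle(context)
    return context
```

Setting `p_burst = 0` recovers the isolated-examples regime described above, exactly the setting in which ICL of new tasks fails.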
Additional factors that increase ICL performance:
- Larger model and dataset size
- More task diversity in pre-training
- Environment stochasticity (forces generalization over trajectory variation)
- Higher trajectory burstiness in pre-training data
Generalization scope demonstrated: train and test tasks differ greatly, with different states, actions, dynamics, and reward functions. The model transfers from, for example, platform games to maze navigation given only a handful of expert demonstrations. This is substantially harder than prior work that generalizes across reward-function variants of the same environment.
The implication for dataset construction: sequential decision-making ICL requires a data distribution property (trajectory burstiness) that standard language modeling data does not naturally contain. This is a data structural requirement, not just a scale requirement.
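One way to operationalize this as a dataset check, reusing the hypothetical `Trajectory` container from the sketch above: estimate the empirical trajectory burstiness of a candidate corpus, i.e., the fraction of input sequences that contain at least two trajectories from the same level.

```python
from collections import Counter

def empirical_burstiness(sequences: list[list[Trajectory]]) -> float:
    """Fraction of input sequences containing at least two trajectories
    from the same level (the definition of trajectory burstiness above)."""
    def is_bursty(seq: list[Trajectory]) -> bool:
        counts = Counter(t.level_id for t in seq)
        return any(c >= 2 for c in counts.values())
    return sum(is_bursty(seq) for seq in sequences) / len(sequences)
```

By this measure, a standard language-modeling corpus would presumably score near zero, which is the sense in which trajectory burstiness is a structural requirement rather than one that scale alone supplies.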
This connects to Does training data format shape reasoning strategy more than domain? Here the structural property sits at the trajectory level rather than the reasoning-step level, but the principle is the same: data structure determines capability.
Source: Reasoning Architectures
Related concepts in this collection
- Does training data format shape reasoning strategy more than domain?
  What explains why models trained on multiple-choice data reason differently than those trained on free-form text? The research isolates format and domain effects to measure which one matters more.
  Trajectory burstiness is another case where data structure determines emergent capability.
- What do models actually learn from chain-of-thought training?
  When models train on reasoning demonstrations, do they memorize content details or absorb reasoning structure? Testing with corrupted data reveals which aspects of CoT samples actually drive learning.
  Structural properties of training data drive learning; this applies at both the reasoning-trace and trajectory levels.
- Can we allocate inference compute based on prompt difficulty?
  Does adjusting how much compute each prompt receives, rather than using a fixed budget, improve model performance? Could smarter allocation let smaller models compete with larger ones?
  The context-length requirements of trajectory-bursty inference raise per-query compute costs.
- Can LLMs handle multiple tasks at once during inference?
  Do language models maintain multiple distinct in-context learning tasks simultaneously in their internal representations, and if so, what prevents them from actually generating outputs for more than one task?
  Task superposition may be the representational mechanism enabling trajectory-bursty ICL: the model maintains multiple task interpretations from in-context trajectories simultaneously before committing to a single policy at generation time.
- Can transformers learn to solve new problems within episodes?
  Explores whether RL-finetuned transformers can develop meta-learning abilities that let them adapt to unseen tasks through in-episode experience alone, without weight updates.
  In-context reinforcement learning (ICRL) is the RL-trained capability that trajectory burstiness enables: same-level trajectories create the meta-learning pressure during training that ICRL exploits at inference time to adapt to unseen environments.
Original note title
trajectory burstiness — same-level trajectories in context — is required for in-context learning of sequential decision-making across new tasks