Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Why do trajectories matter more than individual examples for in-context learning?

Can language models learn new sequential decision-making tasks from context alone, and if so, what data properties make this possible? This note explores why isolated state-action pairs fail where full trajectories succeed.

Note · 2026-02-22 · sourced from Reasoning Architectures

In-context learning for supervised tasks works by providing a few input-output examples. Naively applying this to sequential decision making (providing a few state-action pairs) fails to enable ICL of new tasks. The key finding: the context must contain full or partial trajectories from the same environment level as the query — not just isolated examples. This property is called trajectory burstiness.
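A minimal sketch of the contrast between the two prompt formats. The token layout, separators, and field names are illustrative assumptions, not the exact serialization used in the source work; the point is that the supervised prompt contains isolated examples while the sequential prompt contains whole demonstration trajectories from the query's level.

```python
def supervised_prompt(examples, query_x):
    """Few-shot supervised ICL: isolated (x, y) pairs suffice, since
    each example independently constrains the target function."""
    shots = " ".join(f"{x} -> {y}" for x, y in examples)
    return f"{shots} {query_x} ->"

def sequential_prompt(demo_trajectories, query_state):
    """Sequential-decision ICL: the context holds full trajectories
    ((state, action, reward) triples) from the SAME level as the query,
    so the demonstrations cover the states the policy will visit."""
    demos = []
    for traj in demo_trajectories:
        demos.append(" ".join(f"s={s} a={a} r={r}" for s, a, r in traj))
    return " | ".join(demos) + f" || s={query_state} a="
```

The model's job at inference is the same in both cases (predict the next token after the final separator), but only the trajectory format gives it enough state coverage to act in a new level.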

Why the difference matters: In supervised learning, examples can be from different instances — the model learns the function mapping. In sequential decision making, the model must generalize from the same level/environment to handle the wide range of states it may encounter at deployment. A sparse set of state-action pairs doesn't cover the state space; full trajectories do.

Trajectory burstiness is the probability that a given input sequence contains at least two trajectories from the same level. When this property is present in pre-training data, the model acquires the capacity to learn new tasks from demonstrations at inference time without weight updates.
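The definition above can be sketched as a sampling procedure for pre-training sequences. The two-trajectory layout, level names, and the repeat-the-whole-level mechanism are simplifying assumptions for illustration; the key knob is `p_bursty`, the probability that a sequence contains at least two trajectories from the same level.

```python
import random

def sample_context(levels, p_bursty, traj_per_seq=2):
    """Choose which level each trajectory in one pre-training sequence
    comes from. With probability p_bursty the sequence is 'bursty'
    (a level appears at least twice); otherwise all levels are distinct."""
    if random.random() < p_bursty:
        level = random.choice(levels)
        chosen = [level] * traj_per_seq           # same level repeated
    else:
        chosen = random.sample(levels, traj_per_seq)  # all distinct levels
    return chosen

def trajectory_burstiness(sequences):
    """Empirical burstiness: fraction of sequences with a repeated level."""
    bursty = sum(1 for seq in sequences if len(set(seq)) < len(seq))
    return bursty / len(sequences)
```

Measuring `trajectory_burstiness` over a candidate dataset is a cheap way to check whether it satisfies the structural property before training.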

Additional factors that increase ICL performance:

Generalization scope demonstrated: Train and test tasks differ greatly, with different states, actions, dynamics, and reward functions. The model generalizes from, e.g., platform games to maze navigation given a handful of expert demonstrations. This is substantially harder than prior work that generalizes only across reward-function variants of the same environment.

The implication for dataset construction: sequential decision-making ICL requires a data distribution property (trajectory burstiness) that standard language modeling data does not naturally contain. This is a data structural requirement, not just a scale requirement.

This connects to Does training data format shape reasoning strategy more than domain? — here the structural property is at the trajectory level rather than the reasoning step level, but the principle is the same: data structure determines capability.


Source: Reasoning Architectures


trajectory burstiness — same-level trajectories in context — is required for in-context learning of sequential decision-making across new tasks