Agentic and Multi-Agent Systems

Can agents learn beyond what their training data shows?

Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.

Note · 2026-05-03 · sourced from Data

The dominant paradigm for training language agents is supervised fine-tuning on expert-curated demonstrations. This bypasses the need for reward signals: the agent simply learns a state-to-action mapping from a static dataset. But the convenience hides a structural limitation. The agent never interacts with the environment during training and never observes the outcomes of its own actions, so it cannot learn from failure, refine its decision-making, or generalize to unseen situations.
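To make the setup concrete, here is a minimal behavior-cloning sketch, assuming a PyTorch policy and toy stand-ins for the curated dataset (the dimensions, model, and data are illustrative, not from the source). The structural point is what is absent from the loop: the environment never appears, and nothing the agent does ever enters training.

```python
# Minimal behavior-cloning loop (illustrative sketch; all names and sizes
# are assumptions). Note what is missing: no env.step(), no reward, and
# no agent-generated experience -- only the curators' (state, action) pairs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

state_dim, n_actions = 128, 32                        # hypothetical sizes
policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                       nn.Linear(256, n_actions))     # state -> action logits

# Expert demonstrations: a frozen snapshot of (state, action) pairs.
states = torch.randn(10_000, state_dim)               # stand-in for curated states
actions = torch.randint(0, n_actions, (10_000,))      # stand-in for expert actions
loader = DataLoader(TensorDataset(states, actions), batch_size=64, shuffle=True)

opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
for epoch in range(3):
    for s, a in loader:
        loss = nn.functional.cross_entropy(policy(s), a)  # imitate the expert
        opt.zero_grad()
        loss.backward()
        opt.step()
# The policy never acts, so it never observes a consequence of its own choices.
```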

The deeper problem is that the agent's competence is bounded by what the demonstration curators imagined. Every state-action pair in the dataset reflects a scenario someone thought to capture. Scenarios outside that imagination — edge cases, recovery from errors, paths the expert would never take — do not exist in the training signal at all. This means the agent learns the expert's idealized trajectory, not the structure of the environment. When the deployed environment presents anything unfamiliar, the agent has no internal model that can extrapolate, because its training never exposed it to consequences.
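The coverage argument has a standard formal counterpart in imitation learning (Ross & Bagnell, 2010), sketched here in the note's terms: if the cloned policy $\hat{\pi}$ errs with probability $\varepsilon$ per step on states drawn from the expert's own distribution, its expected cost $J$ relative to the expert $\pi^*$ over a horizon of $T$ steps compounds quadratically,

$$
J(\hat{\pi}) - J(\pi^*) \le O(T^2 \varepsilon),
$$

because a single uncovered state puts the agent somewhere the dataset says nothing about, and every subsequent step inherits the error. Interactive schemes that let the agent act and collect corrections during training (e.g. DAgger) recover a linear $O(T\varepsilon)$ dependence, which is the formal version of the claim that interaction, not more demonstrations, is the lever.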

This is a passivity trap. Collecting high-quality human demonstrations at scale is expensive and hard to sustain, but even unlimited expert data would not solve the underlying problem: the agent is bounded by the coverage of the demonstrations rather than by its own capacity to grow from experience. The demonstration paradigm assumes the world stops where the dataset stops.

The implication for agentic AI design is significant: data quantity and even data quality are insufficient. What agents need is the capacity to convert their own actions into learning signals, which is exactly what "Can agents learn from their own actions without external rewards?" proposes, and that requires the agent to be in the environment, not merely trained on a snapshot of it.
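For contrast with the behavior-cloning loop above, here is a sketch of what changes structurally once the agent is in the environment. The environment API and the self-derived filter (hindsight-checking the agent's own rollouts) are illustrative assumptions, not the mechanism of the linked note.

```python
# Contrast sketch (all APIs hypothetical): the agent's own actions, and
# their observed outcomes, become the training data.
def collect_and_learn(policy, env, n_episodes=100):
    for _ in range(n_episodes):
        state, trajectory, done = env.reset(), [], False
        while not done:
            action = policy.act(state)                # the agent's own choice
            next_state, done = env.step(action)       # observed consequence
            trajectory.append((state, action, next_state))
            state = next_state
        # A self-derived signal, not an external reward: keep rollouts whose
        # outcome the agent itself can verify against its starting goal.
        if policy.self_check(trajectory):
            policy.update(trajectory)                 # learn from own experience
```

However the learning signal is ultimately defined, the structural difference from the demonstration loop is that failures and recoveries show up in the data, because the agent generated the data by acting.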



