Agentic and Multi-Agent Systems

Can agents learn beyond what their training data shows?

Explores whether supervised fine-tuning on expert demonstrations creates a hard ceiling on agent competence, or whether agents can generalize to scenarios their curators never captured.

Note · 2026-05-03 · sourced from Data

The dominant paradigm for training language agents is supervised fine-tuning on expert-curated demonstrations. This bypasses the need for reward signals: the agent simply learns a state-to-action mapping from a static dataset. But the convenience hides a structural limitation. The agent never interacts with the environment during training and never observes the outcomes of its own actions, so it cannot learn from failure, refine its decision-making, or generalize to unseen situations.
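To make the setup concrete, here is a minimal behavior-cloning sketch, assuming a PyTorch policy and toy stand-ins for the curated dataset (the dimensions, model, and data are illustrative, not from the source). The structural point is what is absent from the loop: the environment never appears, and nothing the agent does ever enters training.

```python
# Minimal behavior-cloning loop (illustrative sketch; all names and sizes
# are assumptions). Note what is missing: no env.step(), no reward, and
# no agent-generated experience -- only the curators' (state, action) pairs.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

state_dim, n_actions = 128, 32                        # hypothetical sizes
policy = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                       nn.Linear(256, n_actions))     # state -> action logits

# Expert demonstrations: a frozen snapshot of (state, action) pairs.
states = torch.randn(10_000, state_dim)               # stand-in for curated states
actions = torch.randint(0, n_actions, (10_000,))      # stand-in for expert actions
loader = DataLoader(TensorDataset(states, actions), batch_size=64, shuffle=True)

opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
for epoch in range(3):
    for s, a in loader:
        loss = nn.functional.cross_entropy(policy(s), a)  # imitate the expert
        opt.zero_grad()
        loss.backward()
        opt.step()
# The policy never acts, so it never observes a consequence of its own choices.
```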

The deeper problem is that the agent's competence is bounded by what the demonstration curators imagined. Every state-action pair in the dataset reflects a scenario someone thought to capture. Scenarios outside that imagination — edge cases, recovery from errors, paths the expert would never take — do not exist in the training signal at all. This means the agent learns the expert's idealized trajectory, not the structure of the environment. When the deployed environment presents anything unfamiliar, the agent has no internal model that can extrapolate, because its training never exposed it to consequences.
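The coverage argument has a standard formal counterpart in imitation learning (Ross & Bagnell, 2010), sketched here in the note's terms: if the cloned policy $\hat{\pi}$ errs with probability $\varepsilon$ per step on states drawn from the expert's own distribution, its expected cost $J$ relative to the expert $\pi^*$ over a horizon of $T$ steps compounds quadratically,

$$
J(\hat{\pi}) - J(\pi^*) \le O(T^2 \varepsilon),
$$

because a single uncovered state puts the agent somewhere the dataset says nothing about, and every subsequent step inherits the error. Interactive schemes that let the agent act and collect corrections during training (e.g. DAgger) recover a linear $O(T\varepsilon)$ dependence, which is the formal version of the claim that interaction, not more demonstrations, is the lever.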

This is a passivity trap. Collecting high-quality human demonstrations at scale is expensive and hard to sustain, but even unlimited expert data would not solve the underlying problem: the agent is bounded by the coverage of the demonstrations rather than by its own capacity to grow from experience. The demonstration paradigm assumes the world stops where the dataset stops.

The implication for agentic AI design is significant: data quantity and even data quality are insufficient. What agents need is the capacity to convert their own actions into learning signals, which is exactly what "Can agents learn from their own actions without external rewards?" proposes, and that requires the agent to be in the environment, not merely trained on a snapshot of it.
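For contrast with the behavior-cloning loop above, here is a sketch of what changes structurally once the agent is in the environment. The environment API and the self-derived filter (hindsight-checking the agent's own rollouts) are illustrative assumptions, not the mechanism of the linked note.

```python
# Contrast sketch (all APIs hypothetical): the agent's own actions, and
# their observed outcomes, become the training data.
def collect_and_learn(policy, env, n_episodes=100):
    for _ in range(n_episodes):
        state, trajectory, done = env.reset(), [], False
        while not done:
            action = policy.act(state)                # the agent's own choice
            next_state, done = env.step(action)       # observed consequence
            trajectory.append((state, action, next_state))
            state = next_state
        # A self-derived signal, not an external reward: keep rollouts whose
        # outcome the agent itself can verify against its starting goal.
        if policy.self_check(trajectory):
            policy.update(trajectory)                 # learn from own experience
```

However the learning signal is ultimately defined, the structural difference from the demonstration loop is that failures and recoveries show up in the data, because the agent generated the data by acting.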



