Reinforcement Learning for LLMs

Can agents learn from their own actions without external rewards?

Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.

Note · 2026-05-03 · sourced from Data

Most language agents are trained either through supervised fine-tuning on expert demonstrations (which scales poorly and confines the agent to whatever behaviors its dataset happens to contain) or through reinforcement learning (which fails when environments lack verifiable rewards or require long-horizon credit assignment). The early experience paradigm sits between these: the agent proposes its own actions in the environment, and the future states resulting from those actions become supervision signals — without requiring any reward signal at all.

The key move is reframing what counts as "supervision." In SFT, supervision means a human-labeled expert action. In RL, supervision means a scalar reward. In early experience, supervision means the consequence — the next state — that follows the agent's own action. This consequence is always available regardless of whether an environment exposes ground truth, because the environment always responds to actions even when it does not score them. A web form may not tell you whether you filled it out correctly, but it always tells you what happens next.
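A minimal sketch of what this data-collection step could look like, assuming hypothetical `agent.propose_actions` and `env.step` interfaces (these names are illustrative, not from the source): the agent's own proposals are executed and only their consequences are recorded, with no reward anywhere in the loop.

```python
# Hypothetical interfaces: agent.propose_actions and env.step are assumed names.
# The point is the shape of the data: the supervision signal is the next state,
# never a reward.
from dataclasses import dataclass

@dataclass
class Experience:
    state: str       # observation before the agent acts
    action: str      # action the agent itself proposed
    next_state: str  # consequence the environment returns (no score attached)

def collect_early_experience(agent, env, start_states, n_alternatives=3):
    """Execute the agent's own proposals and keep only their consequences."""
    data = []
    for state in start_states:
        for action in agent.propose_actions(state, k=n_alternatives):
            next_state = env.step(state, action)  # the environment always responds,
            data.append(Experience(state, action, next_state))  # even when it does not grade
    return data
```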

Two strategies operationalize this principle: implicit world modeling (using the collected future states to ground the policy in environment dynamics by predicting next states) and self-reflection (comparing the agent's own actions against expert demonstrations to extract lessons from its suboptimal decisions). Both strategies share the premise that the consequences of an agent's own actions constitute experience, even without rewards.
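A hedged sketch of how each strategy could turn the collected experience records into ordinary next-token-prediction targets; the prompt templates and the `reflect` helper below are illustrative assumptions, not the paper's exact formats.

```python
# Illustrative only: prompt templates and the `reflect` helper are assumptions.
# Both strategies produce (prompt, target) pairs trained with the standard
# language-modeling loss on the target tokens.

def world_modeling_example(exp):
    """Implicit world modeling: predict the next state from (state, action)."""
    prompt = f"State:\n{exp.state}\nAction taken:\n{exp.action}\nNext state:"
    return prompt, exp.next_state

def self_reflection_example(exp, expert_action, reflect):
    """Self-reflection: contrast the agent's action with the expert's,
    then train on the extracted lesson followed by the expert action."""
    lesson = reflect(exp.state, exp.action, exp.next_state, expert_action)
    prompt = f"State:\n{exp.state}\nWhat is the right action here, and why?"
    target = f"{lesson}\nAction: {expert_action}"
    return prompt, target
```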

Across eight diverse environments, both strategies consistently outperform pure imitation baselines, match full-data imitation performance with half the expert data or less, and serve as superior warm-starts for subsequent RL. The paradigm is therefore not a substitute for RL but a practical bridge — early experience trains the agent to understand its environment before any reward signal arrives, which means RL fine-tuning starts from a much stronger initialization.
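The resulting training order can be summarized as a short pipeline sketch; the stage functions below (`supervised_finetune`, `finetune_on_consequences`, `rl_finetune`) are hypothetical names standing in for the usual recipes, and only the ordering reflects the claim above.

```python
# Hypothetical stage functions; the ordering is the claim being illustrated:
# reward-free early experience sits between imitation and RL.
def train_agent(model, expert_data, early_experience_data, rl_env=None):
    model = supervised_finetune(model, expert_data)                 # imitation on expert demos
    model = finetune_on_consequences(model, early_experience_data)  # early experience, no rewards
    if rl_env is not None:                                          # rewards become available later
        model = rl_finetune(model, rl_env)                          # RL starts from a stronger init
    return model
```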




early experience is a third paradigm between imitation learning and reinforcement learning — agents convert their own action consequences into supervision without external rewards