Reinforcement Learning for LLMs

Can agents learn from their own actions without external rewards?

Explores whether future states produced by an agent's own decisions can serve as supervision signals, bridging the gap between passive imitation learning and reward-dependent reinforcement learning.

Note · 2026-05-03 · sourced from Data

Most language agents are trained either through supervised fine-tuning on expert demonstrations (which scales poorly and confines the agent to whatever behaviors its dataset happens to contain) or through reinforcement learning (which fails when environments lack verifiable rewards or require long-horizon credit assignment). The early experience paradigm sits between these: the agent proposes its own actions in the environment, and the future states resulting from those actions become supervision signals — without requiring any reward signal at all.

The key move is reframing what counts as "supervision." In SFT, supervision means a human-labeled expert action. In RL, supervision means a scalar reward. In early experience, supervision means the consequence — the next state — that follows the agent's own action. This consequence is always available regardless of whether an environment exposes ground truth, because the environment always responds to actions even when it does not score them. A web form may not tell you whether you filled it out correctly, but it always tells you what happens next.
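A minimal sketch of what this data-collection step could look like, assuming hypothetical `agent.propose_actions` and `env.step` interfaces (these names are illustrative, not from the source): the agent's own proposals are executed and only their consequences are recorded, with no reward anywhere in the loop.

```python
# Hypothetical interfaces: agent.propose_actions and env.step are assumed names.
# The point is the shape of the data: the supervision signal is the next state,
# never a reward.
from dataclasses import dataclass

@dataclass
class Experience:
    state: str       # observation before the agent acts
    action: str      # action the agent itself proposed
    next_state: str  # consequence the environment returns (no score attached)

def collect_early_experience(agent, env, start_states, n_alternatives=3):
    """Execute the agent's own proposals and keep only their consequences."""
    data = []
    for state in start_states:
        for action in agent.propose_actions(state, k=n_alternatives):
            next_state = env.step(state, action)  # the environment always responds,
            data.append(Experience(state, action, next_state))  # even when it does not grade
    return data
```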

Two strategies operationalize this principle: implicit world modeling (using the collected future states to ground the policy in environment dynamics by predicting next states) and self-reflection (comparing the agent's own actions against expert demonstrations to extract lessons from its suboptimal decisions). Both strategies share the premise that the consequences of an agent's own actions constitute experience, even without rewards.
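A hedged sketch of how each strategy could turn the collected experience records into ordinary next-token-prediction targets; the prompt templates and the `reflect` helper below are illustrative assumptions, not the paper's exact formats.

```python
# Illustrative only: prompt templates and the `reflect` helper are assumptions.
# Both strategies produce (prompt, target) pairs trained with the standard
# language-modeling loss on the target tokens.

def world_modeling_example(exp):
    """Implicit world modeling: predict the next state from (state, action)."""
    prompt = f"State:\n{exp.state}\nAction taken:\n{exp.action}\nNext state:"
    return prompt, exp.next_state

def self_reflection_example(exp, expert_action, reflect):
    """Self-reflection: contrast the agent's action with the expert's,
    then train on the extracted lesson followed by the expert action."""
    lesson = reflect(exp.state, exp.action, exp.next_state, expert_action)
    prompt = f"State:\n{exp.state}\nWhat is the right action here, and why?"
    target = f"{lesson}\nAction: {expert_action}"
    return prompt, target
```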

Across eight diverse environments, both strategies consistently outperform pure imitation baselines, match full-data imitation performance with half the expert data or less, and serve as superior warm-starts for subsequent RL. The paradigm is therefore not a substitute for RL but a practical bridge — early experience trains the agent to understand its environment before any reward signal arrives, which means RL fine-tuning starts from a much stronger initialization.
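The resulting training order can be summarized as a short pipeline sketch; the stage functions below (`supervised_finetune`, `finetune_on_consequences`, `rl_finetune`) are hypothetical names standing in for the usual recipes, and only the ordering reflects the claim above.

```python
# Hypothetical stage functions; the ordering is the claim being illustrated:
# reward-free early experience sits between imitation and RL.
def train_agent(model, expert_data, early_experience_data, rl_env=None):
    model = supervised_finetune(model, expert_data)                 # imitation on expert demos
    model = finetune_on_consequences(model, early_experience_data)  # early experience, no rewards
    if rl_env is not None:                                          # rewards become available later
        model = rl_finetune(model, rl_env)                          # RL starts from a stronger init
    return model
```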




early experience is a third paradigm between imitation learning and reinforcement learning — agents convert their own action consequences into supervision without external rewards