INQUIRING LINE

Can early experience replace external rewards as a learning signal?

This explores whether agents can learn from the consequences of their own actions — treating future states as supervision — instead of relying on engineered external reward signals.


This explores whether 'early experience' (letting an agent learn from what happens after its own actions, rather than from hand-built rewards) can stand in for external reward signals. The corpus says: yes, increasingly so, and the most direct evidence frames it as a genuine third option. One line of work positions early experience as a paradigm sitting between imitation learning (copying experts) and reinforcement learning (chasing rewards). Across eight environments, agents using their own future states as supervision match expert-dependent baselines with half the data, and then serve as a stronger warm-start for later RL (see: Can agents learn from their own actions without external rewards?). So it's not just a replacement; it's often a better foundation to build rewards on top of later.
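As a rough illustration of "future states as supervision", the sketch below rolls out an agent's own actions in a toy environment and turns each outcome into a supervised (input, target) pair. Everything here is hypothetical: `toy_step`, `Transition`, the action names, and the counter environment are invented for illustration and are not taken from the cited work. The point is only that no reward is read anywhere.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    state: int
    action: str
    next_state: int


def toy_step(state: int, action: str) -> int:
    """Hypothetical environment: a counter clamped to the range [0, 10]."""
    delta = {"inc": 1, "dec": -1, "stay": 0}[action]
    return max(0, min(10, state + delta))


def collect_early_experience(start_states, actions, rng):
    """Try the agent's own actions and record what happened next.

    No reward signal is consulted; the observed next state is the only feedback.
    """
    data = []
    for state in start_states:
        action = rng.choice(actions)  # stand-in for sampling from a policy
        data.append(Transition(state, action, toy_step(state, action)))
    return data


def to_supervised_pairs(transitions):
    """Frame each transition as 'given (state, action), predict the next state'."""
    return [((t.state, t.action), t.next_state) for t in transitions]


rng = random.Random(0)
transitions = collect_early_experience(range(5), ["inc", "dec", "stay"], rng)
pairs = to_supervised_pairs(transitions)
for inputs, target in pairs:
    print(inputs, "->", target)
```

In a real agent the pairs would feed a supervised objective (for example, next-state prediction) before any reward-based training begins; that training step is deliberately left out of this sketch.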

The deeper reason this works is that an agent's own experience is *richer* than a scalar reward. A reward collapses everything into a single number, but the consequences of an action carry two separate kinds of information: how well it did (evaluative) and how it should change (directive). Scalar rewards throw the second one away (see: Can scalar rewards capture all the information in agent feedback?). Once you keep that richer signal, you can convert raw environment feedback into dense, per-token learning gradients by letting the policy teach itself from retrospective evidence of its mistakes, which makes an external reward model unnecessary (see: Can environment feedback replace scalar rewards in policy learning?). Natural-language critiques do something similar: they break through plateaus that numerical rewards can't, precisely because they explain *why* a failure happened (see: Can natural language feedback overcome numerical reward plateaus?).

There's an even more internal version of this idea: the signal doesn't have to come from the environment at all, but from the agent's own shifting beliefs. Tracking how much an action moves the model toward a solution, measured as the log-ratio of its own successive probability estimates, yields a dense intrinsic reward with no critic and no process reward model. On 20 Questions, smaller models trained this way match or beat larger baselines (see: Can an agent's own beliefs guide credit assignment without critics?). This is the same insight pushed inward: the learning signal was latent in the agent's experience the whole time.
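A minimal sketch of the log-ratio idea, assuming the per-turn credit is simply log p_t minus log p_(t-1), where p_t is the probability the model assigns to the correct answer after turn t. The exact formulation used by ΔBelief-RL is not given in these notes, so treat the function name `belief_shift_rewards` and the example probabilities as illustrative assumptions.

```python
import math


def belief_shift_rewards(target_probs):
    """Per-turn credit as the log-ratio of consecutive belief estimates.

    target_probs[t] is the (hypothetical) probability assigned to the correct
    answer after turn t. Each value must be strictly positive.
    """
    return [
        math.log(curr / prev)
        for prev, curr in zip(target_probs, target_probs[1:])
    ]


# Hypothetical belief trajectory over three turns of a guessing game.
probs = [0.01, 0.05, 0.04, 0.40]
rewards = belief_shift_rewards(probs)
for turn, r in enumerate(rewards, start=1):
    print(f"turn {turn}: {r:+.3f}")
```

Under this assumed form, a turn that lowers the model's belief in the answer gets negative credit, and the per-turn values telescope so that their sum equals the log-ratio of the final belief to the initial one. That is what makes the signal dense without needing a separate critic.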

The cross-current worth knowing is that external rewards may have been doing less than we assumed anyway. Several notes argue that reward-based RL mostly *activates* capabilities already present from pretraining rather than teaching anything new: a single training example, or even spurious rewards, can trigger comparable gains in models with suitable pretraining, and base models can outperform RLVR models at high sampling budgets, that is, at large k in pass@k (see: What does reward learning actually do to model reasoning? and Does RLVR actually expand what models can reason about?). If reward signals are largely surfacing existing skills, then the bar for replacing them with experience is lower than it looks. And the experience signal can be shaped smartly: process successes and failures differently, as concrete demos vs. abstracted lessons (see: Should successful and failed episodes be processed differently?), or lean on negative examples alone, which often match full RL while preserving diversity (see: Does negative reinforcement alone outperform full reinforcement learning?).
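To make the "high sampling budget" comparison concrete, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. Whether the cited notes used this exact estimator is an assumption, and the per-problem counts are made up purely to show how a base model can trail at k=1 yet lead at large k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over problems, given correct-sample counts per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 100                 # samples per problem (hypothetical)
base_counts = [2, 2]    # base model: rarely right, but on both problems
tuned_counts = [60, 0]  # tuned model: often right on one, never on the other

for k in (1, 64):
    base = mean_pass_at_k(base_counts, N, k)
    tuned = mean_pass_at_k(tuned_counts, N, k)
    print(f"k={k}: base={base:.3f} tuned={tuned:.3f}")
```

With these invented counts the tuned model leads at k=1 (0.30 vs 0.02) while the base model leads at k=64 (about 0.87 vs 0.50). This is the shape of the crossover the notes describe: sharper sampling on problems already solvable, with no gain in coverage.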

The honest caveat the corpus implies: 'replace' is too clean. Early experience tends to *precede and improve* reward-based training rather than abolish it, and RL still shows its own structured learning dynamics: first mastering execution, then strategic planning (see: Does RL training follow a predictable two-phase learning sequence?). The interesting takeaway you didn't come looking for: the boundary between 'reward' and 'experience' is dissolving. When you keep the full texture of what an action led to, the reward isn't external anymore; it was inside the experience all along.


Sources (10 notes)

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
