How do process-level rewards compare to environment-extracted next-state signals?

This explores two ways of getting a learning signal richer than a single end-of-task score: process-level rewards (a model or judge scores each reasoning step) versus signals the agent reads directly off the environment after it acts (the next state, the error message, what changed in the world).

Process-level rewards score the *thinking* (was this step good, given the goal?), while environment-extracted next-state signals score the *consequence* (here is the world after you acted). The corpus suggests these are less rivals than two ends of a spectrum, and that the second is gradually making the first less necessary.

The case for next-state signals is that the environment already contains most of what a process reward model has to be trained to produce. One line of work shows agents can treat the future states their own actions lead to as supervision, with no external reward at all, matching expert-dependent baselines on half the data (see "Can agents learn from their own actions without external rewards?"). Another goes further: when you feed an agent retrospective evidence of its own mistakes in-context, the policy *becomes* its own process reward model, turning tokenized environment feedback into dense per-step credit without any separate reward network (see "Can environment feedback replace scalar rewards in policy learning?"). The deeper reason these signals are valuable is that environment feedback carries two things a scalar reward throws away: an evaluative part (how well did that go) and a directive part (what should change). The directive part is exactly what step-level training can recover (see "Can scalar rewards capture all the information in agent feedback?").
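To make the "policy as its own process reward model" idea concrete, here is a minimal sketch under stated assumptions. It assumes per-token credit can be read as the gap between the log-probability the policy gives a token after seeing the environment feedback and the log-probability it gave before. The function name `per_token_credit` and all numbers are hypothetical illustrations, not the method of any cited work.

```python
from typing import List


def per_token_credit(student_logps: List[float],
                     teacher_logps: List[float]) -> List[float]:
    """Hypothetical per-token credit: teacher log-prob minus student log-prob.

    student_logps: log-probs of the sampled tokens under the plain policy.
    teacher_logps: log-probs of the same tokens under the same policy once
                   the environment feedback has been added to its context.
    Negative values mark tokens the feedback-aware policy now disfavors.
    """
    if len(student_logps) != len(teacher_logps):
        raise ValueError("both sequences must score the same tokens")
    return [t - s for s, t in zip(student_logps, teacher_logps)]


# Made-up example: the agent emitted open("missing.txt") and the
# environment answered with a FileNotFoundError.
tokens = ["open", "(", "missing.txt", ")"]
student = [-0.2, -0.1, -0.3, -0.1]   # before seeing the error
teacher = [-0.2, -0.1, -4.0, -0.1]   # after seeing the error in-context

credit = per_token_credit(student, teacher)
blamed = tokens[credit.index(min(credit))]
print(blamed)
```

In this toy case the blame concentrates on the file-name token rather than being spread over the whole action. That is the directive information a single scalar reward would discard.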

Meanwhile the process-reward camp has been busy escaping its own dependency on hand-annotation. You can derive step signals straight from the *structure* of a trajectory (tree topology, expert-aligned actions, tool-call positions) instead of training a separate process reward model (see "Can trajectory structure replace hand-annotated process rewards?"). The cleanest example is tree-search rollouts, where sibling subtrees are compared to convert a single outcome reward into step-level preferences automatically (see "Can tree structure alone convert outcome rewards into process supervision?"). So both camps are converging on the same move: manufacture dense, mid-trajectory signal cheaply rather than paying for step labels.
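The sibling-comparison idea can be sketched in a few lines. This is an illustrative toy, not the Tree-GRPO implementation. It assumes each subtree is scored by the mean outcome reward of its leaves and that a preference is emitted whenever two siblings differ by more than a margin. `Node`, `subtree_value`, and `sibling_preferences` are hypothetical names.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import List, Optional, Tuple


@dataclass
class Node:
    step: str                        # text of the action or reasoning step
    reward: Optional[float] = None   # outcome reward, set on leaves only
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of every scored leaf under this node."""
    if not node.children:
        return [] if node.reward is None else [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def subtree_value(node: Node) -> Optional[float]:
    """Mean leaf reward of the subtree (assumed scoring rule)."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards) if rewards else None


def sibling_preferences(root: Node, margin: float = 0.0) -> List[Tuple[str, str]]:
    """Return (preferred_step, rejected_step) pairs from sibling comparisons."""
    prefs: List[Tuple[str, str]] = []
    stack = [root]
    while stack:
        node = stack.pop()
        scored = [(c, subtree_value(c)) for c in node.children]
        scored = [(c, v) for c, v in scored if v is not None]
        for (a, va), (b, vb) in combinations(scored, 2):
            if va - vb > margin:
                prefs.append((a.step, b.step))
            elif vb - va > margin:
                prefs.append((b.step, a.step))
        stack.extend(node.children)
    return prefs


tree = Node("root", children=[
    Node("call search tool", children=[
        Node("answer A", reward=1.0), Node("answer B", reward=1.0)]),
    Node("guess directly", children=[
        Node("answer C", reward=0.0), Node("answer D", reward=1.0)]),
])
prefs = sibling_preferences(tree)
```

Only outcome rewards at the leaves are supplied. The step-level preferences (here, "call search tool" over "guess directly") fall out of the branching structure alone, and more rollouts per node would give more comparisons.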

Where they genuinely differ is what they're good at. Process judges that *reason about* the reasoning beat classifier-style reward models and need far less data (see "Can judges that reason about reasoning outperform classifier rewards?"), which makes them strong when correctness is about the logic, not the world. Environment-extracted signals win when the world is the arbiter and feedback is cheap to read. And there are hybrids that don't fit either box, such as using the agent's own shifting belief in the answer as a dense intrinsic reward, with no critic and no environment query needed (see "Can an agent's own beliefs guide credit assignment without critics?").
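The belief-based hybrid is easy to illustrate. The source note describes per-turn credit as a log-ratio of sequential probability estimates. The sketch below assumes the simplest reading of that: the credit for a turn is the log of the belief after the turn divided by the belief before it. The function name, the clipping constant `eps`, and the belief values are assumptions made for illustration.

```python
import math
from typing import List


def belief_delta_rewards(beliefs: List[float], eps: float = 1e-8) -> List[float]:
    """Per-turn credit as the log-ratio of consecutive belief estimates.

    beliefs[t] is the agent's estimated probability of the correct answer
    after turn t (beliefs[0] is the prior before any turn).
    """
    # Clip to avoid log(0) when a belief estimate collapses to zero.
    clipped = [min(max(p, eps), 1.0) for p in beliefs]
    return [math.log(curr / prev) for prev, curr in zip(clipped, clipped[1:])]


# Made-up trajectory from a 20 Questions style game: the second question
# was misleading, the third one nearly settled the answer.
beliefs = [0.1, 0.2, 0.1, 0.8]
rewards = belief_delta_rewards(beliefs)
total = sum(rewards)
```

One property of this formulation is that the per-turn rewards telescope: their sum equals the log-ratio of the final belief to the initial one, so dense credit is assigned without changing the total.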

The thing worth taking away: the interesting comparison isn't "which signal is better" but "how cheaply can you fabricate dense feedback," and on that axis the environment is often the most underused free reward model you already have. There's even a stricter version of this: agents have been mathematically shown to repurpose the environment itself as external memory without ever being asked to (see "Do RL agents accidentally use environments as memory?"). A related caution: not all of this signal needs to be positive. Training only on what *not* to do can often match full RL while preserving diversity (see "Does negative reinforcement alone outperform full reinforcement learning?"), a reminder that the richest feedback channel isn't automatically the best teacher.

Sources 9 notes

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Can tree structure alone convert outcome rewards into process supervision?

Tree-GRPO uses branching structure to transform trajectory-level outcome rewards into step-level preference signals through sibling subtree comparison, eliminating the need for separate process reward models or step-level annotation while scaling with computational budget.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Do RL agents accidentally use environments as memory?

Mathematical proof shows that environmental artifacts reduce information needed to represent history in RL agents. Path-following agents naturally develop memory-like behavior through standard reward optimization, satisfying situated cognition criteria without explicit memory objectives.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
