Can an agent's internal probabilities serve as value signals across domains?

This explores whether an agent's own token probabilities — its internal sense of confidence — can substitute for a separately trained reward model, and whether that holds across different tasks. The corpus suggests the answer is increasingly yes, and that this is one of the more interesting shifts happening in how models are trained.

The sharpest example is belief-shift as reward: instead of bolting on a critic network or a process-reward model, you watch how much the agent's probability estimate of the correct answer moves from one turn to the next, and treat that log-ratio as a dense, per-step intrinsic reward (see "Can an agent's own beliefs guide credit assignment without critics?"). The striking part is that on 20 Questions, small models trained this way matched or beat larger baselines *and generalized beyond their training*, which is the kind of cross-domain transfer the question asks about. The signal isn't tied to a hand-built scorer for one task; it's the agent's own evolving belief, which exists for any task where the agent can assign a probability to the answer.
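To make the belief-shift idea concrete, here is a minimal sketch of a per-turn log-ratio reward. It assumes the reward at turn t is simply log(p_t / p_{t-1}), where p_t is the agent's probability for the correct answer after turn t. The note only says "log-ratios of sequential probability estimates", so the exact form, the function name `belief_shift_rewards`, the `eps` floor, and the example belief values are all illustrative assumptions rather than the published method.

```python
import math
from typing import List


def belief_shift_rewards(p_correct: List[float], eps: float = 1e-12) -> List[float]:
    """Per-turn intrinsic rewards from consecutive belief estimates.

    p_correct[t] is the (hypothetical) probability the agent assigns to the
    correct answer after turn t. The reward for moving from turn t-1 to turn t
    is assumed to be log(p_t / p_{t-1}); eps guards against log(0).
    """
    rewards = []
    for prev, curr in zip(p_correct, p_correct[1:]):
        rewards.append(math.log(max(curr, eps)) - math.log(max(prev, eps)))
    return rewards


# Made-up belief trajectory over five turns of a guessing game.
beliefs = [0.05, 0.10, 0.08, 0.40, 0.90]
rewards = belief_shift_rewards(beliefs)

for turn, r in enumerate(rewards, start=1):
    print(f"turn {turn}: reward {r:+.3f}")

# The per-turn rewards telescope: their sum equals log(p_final / p_initial).
total = sum(rewards)
print(f"total: {total:.3f}")
```

Under this assumed form, a turn that lowers the agent's belief in the correct answer (0.10 to 0.08 above) gets a negative reward, and the episode total depends only on the first and last beliefs, so the per-turn values act as a way of distributing that total across steps.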

What makes this feel less like a one-off hack is that the field is converging on it from several directions at once. One synthesis frames late-2025 RL as three substitutable moves, each of which deletes a different piece of the old RLHF machinery using the policy's own computations: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces the explicit reward (see "Can language models replace reward models with internal signals?"). A related trick squeezes self-evaluation into the unused sequence space after a model finishes its answer, so the model learns to compute its own reward during training at zero inference cost (see "Can models learn to evaluate their own work during training?"). Different mechanisms, same underlying bet: the value signal can come from inside.

But the corpus also marks the limits, and this is the part you might not expect. Internal probabilities are a *scalar* read on the agent's state, and one note argues that agent feedback actually carries two orthogonal kinds of information: evaluative (how well did that go) and directive (how should it change). A scalar reward captures the first and throws away the second (see "Can scalar rewards capture all the information in agent feedback?"). So a probability-as-value signal may be telling you *whether* you're getting warmer without telling you *which way* to step. And there's a deeper ceiling: an agent's internal signals can only reach as far as its own experience does. Agents bound to static expert demonstrations stay capped by what the curators imagined, never learning from their own failures (see "Can agents learn beyond what their training data shows?"). That is precisely why approaches that let agents treat the consequences of their *own* actions as supervision can match expert-dependent baselines with half the data (see "Can agents learn from their own actions without external rewards?").

If you want to go deeper, the thread worth pulling is this: internal probabilities work as cross-domain value signals because they're cheap and universal, but they're a compression of something richer. The open question the corpus circles is when that compression is good enough, and the convergence described in "Can language models replace reward models with internal signals?" suggests a lot of researchers are betting it usually is.

Sources (6 notes)

Can an agent's own beliefs guide credit assignment without critics?

ΔBelief-RL uses log-ratios of sequential probability estimates to assign per-turn credit without critic networks or process reward models. Tested on 20 Questions, smaller models trained this way matched or exceeded prior SOTA and larger baselines while generalizing beyond training.

Can language models replace reward models with internal signals?

Late-2025 RL literature independently converges on three patterns that replace different RLHF components: pairwise self-judgment replaces the reward model, internal belief-shift replaces the critic, and rich-feedback self-distillation replaces explicit reward signals. Each emerges from the policy's own computations, making the trained reward classifier optional.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can agents learn beyond what their training data shows?

Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.

Can agents learn from their own actions without external rewards?

Research across eight environments shows that agents can use future states from their own actions as supervision without external rewards, matching expert-dependent baselines with half the data and providing superior warm-starts for subsequent RL training.
