Reinforcement Learning for LLMs

Can we reward reasoning steps without human annotation?

Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Learning to Think" (L2T) addresses the dense process reward problem — how to evaluate the contribution of individual reasoning steps without human annotation or task-specific evaluators — through information theory.

The key problem: existing RL methods for reasoning use only final outcome rewards. Under this sparse feedback, extending the chain incurs no cost. Even a tiny accuracy gain from many extra steps registers as a positive signal. Models develop a "one more thought" bias, consuming more than double the tokens actually needed for correct answers. On simple tasks (e.g., "12 + 5"), overly long chains can reduce accuracy — the redundant computation is not just wasteful but actively harmful.
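
As a toy illustration (not taken from the paper), the failure mode is visible in the reward function itself: an outcome-only reward never sees the chain, so extra steps are free under the training signal.

```python
# Toy illustration (not from the L2T paper): under an outcome-only reward,
# chain length never enters the training signal, so "one more thought" is free.

def outcome_only_reward(chain: list[str], final_answer: str, gold: str) -> float:
    """Sparse ORM-style reward: the chain itself is ignored entirely."""
    return 1.0 if final_answer == gold else 0.0

# A 1-step chain and a 30-step chain that reach the same correct answer
# receive identical reward, so nothing in the gradient ever penalizes the
# 29 redundant steps, even on tasks where they actively hurt accuracy.
short_chain = ["12 + 5 = 17"]
long_chain = ["Let me double-check that..."] * 29 + ["12 + 5 = 17"]
assert (outcome_only_reward(short_chain, "17", "17")
        == outcome_only_reward(long_chain, "17", "17") == 1.0)
```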

L2T proposes a universal dense process reward with two components: a fitting term, the episode's information gain toward the correct answer, which rewards steps that genuinely advance the solution; and a compression term, a penalty on redundant elaboration, which makes each extra episode carry an explicit cost.

The reward is estimated via PAC-Bayes bounds and the Fisher information matrix, providing a tractable approximation with theoretical guarantees. Each query-response interaction is treated as a hierarchical session of multiple episodes. Upon each episode's completion, the reward is immediately computed — no waiting for the final answer.
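
A minimal sketch of the reward's shape, under stated assumptions: the paper estimates the fitting term via PAC-Bayes bounds and the compression term via the Fisher information matrix, while the sketch below substitutes a cheaper proxy, namely the change in the model's log-likelihood of the gold answer after each episode, minus a per-token cost. `gold_loglik`, `episode_rewards`, and `lam` are illustrative names, and the Hugging Face-style `model(...).logits` interface is an assumption, not specified by the paper.

```python
import torch
import torch.nn.functional as F

def gold_loglik(model, context_ids, gold_ids):
    """Log-likelihood of the gold-answer tokens given the context so far."""
    inp = torch.cat([context_ids, gold_ids], dim=-1)
    logits = model(inp).logits[:, :-1]                   # next-token logits
    logp = F.log_softmax(logits, dim=-1)
    token_lp = logp.gather(-1, inp[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, -gold_ids.size(-1):].sum()        # answer span only

def episode_rewards(model, prompt_ids, episodes, gold_ids, lam=0.05):
    """Per-episode dense reward: an information-gain proxy minus a length
    cost, computed the moment each episode closes (no waiting for the
    final answer)."""
    rewards, ctx = [], prompt_ids
    with torch.no_grad():
        prev = gold_loglik(model, ctx, gold_ids)
        for ep in episodes:
            ctx = torch.cat([ctx, ep], dim=-1)           # append this episode
            cur = gold_loglik(model, ctx, gold_ids)
            rewards.append((cur - prev).item() - lam * ep.size(-1))
            prev = cur                                   # gain is incremental
    return rewards
```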

This positions L2T as a third option in the ORM/PRM taxonomy. ORMs provide sparse outcome-only feedback (cheap but uninformative about intermediate steps). PRMs provide dense step-level feedback (informative but dependent on expensive annotation). L2T provides dense information-theoretic feedback (informative and annotation-free), with the trade-off being the computational overhead of Fisher information estimation. The principle that dense process rewards outperform outcome-only signals extends beyond reasoning chains to agentic systems: Does supervising retrieval steps outperform final answer rewards? demonstrates the same finding in agentic RAG, where step-level retrieval rewards substantially improve search-agent training over final-answer-only rewards.

The task-dependence finding matters: moderate chain extensions improve coverage of critical steps on hard problems (Tier 4 multi-stage math), while the same extensions reduce accuracy on simple problems (Tier 1). No fixed chain length is optimal across tasks. This reinforces Can we allocate inference compute based on prompt difficulty? — the budget must be adaptive, and L2T provides the per-episode signal to enable that adaptation.
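
A hedged sketch of how that per-episode signal could drive an adaptive budget: generation continues only while the estimated marginal gain clears a threshold. `generate_episode` and `estimate_gain` are hypothetical stand-ins for a sampler and for L2T's dense reward, not the paper's interface.

```python
# Hedged sketch: a per-episode signal enabling an adaptive compute budget.
# `generate_episode` and `estimate_gain` are hypothetical stand-ins, not
# the paper's actual API.

def adaptive_chain(generate_episode, estimate_gain, max_episodes=16, tau=0.01):
    episodes = []
    for _ in range(max_episodes):
        candidate = generate_episode(episodes)     # propose one more step
        if estimate_gain(episodes, candidate) < tau:
            break                                  # marginal value too low: stop
        episodes.append(candidate)                 # hard problems keep extending
    return episodes
```

Under this stopping rule, a Tier 1 prompt whose gain collapses after the first episode exits immediately, while a Tier 4 problem keeps extending as long as each stage clears the threshold.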

