Reinforcement Learning for LLMs

Can we reward reasoning steps without human annotation?

Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search

"Learning to Think" (L2T) addresses the dense process reward problem — how to evaluate the contribution of individual reasoning steps without human annotation or task-specific evaluators — through information theory.

The key problem: existing RL methods for reasoning use only final outcome rewards. Under this sparse feedback, extending the chain incurs no cost. Even a tiny accuracy gain from many extra steps registers as a positive signal. Models develop a "one more thought" bias, consuming more than double the tokens actually needed for correct answers. On simple tasks (e.g., "12 + 5"), overly long chains can reduce accuracy — the redundant computation is not just wasteful but actively harmful.
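
As a toy illustration (not taken from the paper), the failure mode is visible in the reward function itself: an outcome-only reward never sees the chain, so extra steps are free under the training signal.

```python
# Toy illustration (not from the L2T paper): under an outcome-only reward,
# chain length never enters the training signal, so "one more thought" is free.

def outcome_only_reward(chain: list[str], final_answer: str, gold: str) -> float:
    """Sparse ORM-style reward: the chain itself is ignored entirely."""
    return 1.0 if final_answer == gold else 0.0

# A 1-step chain and a 30-step chain that reach the same correct answer
# receive identical reward, so nothing in the gradient ever penalizes the
# 29 redundant steps, even on tasks where they actively hurt accuracy.
short_chain = ["12 + 5 = 17"]
long_chain = ["Let me double-check that..."] * 29 + ["12 + 5 = 17"]
assert (outcome_only_reward(short_chain, "17", "17")
        == outcome_only_reward(long_chain, "17", "17") == 1.0)
```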

L2T proposes a universal dense process reward with two components: a fitting term, the episode's information gain toward the correct answer, which rewards steps that genuinely advance the solution; and a compression term, a penalty on redundant elaboration, which makes each extra episode carry an explicit cost.

The reward is estimated via PAC-Bayes bounds and the Fisher information matrix, providing a tractable approximation with theoretical guarantees. Each query-response interaction is treated as a hierarchical session of multiple episodes. Upon each episode's completion, the reward is immediately computed — no waiting for the final answer.
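
A minimal sketch of the reward's shape, under stated assumptions: the paper estimates the fitting term via PAC-Bayes bounds and the compression term via the Fisher information matrix, while the sketch below substitutes a cheaper proxy, namely the change in the model's log-likelihood of the gold answer after each episode, minus a per-token cost. `gold_loglik`, `episode_rewards`, and `lam` are illustrative names, and the Hugging Face-style `model(...).logits` interface is an assumption, not specified by the paper.

```python
import torch
import torch.nn.functional as F

def gold_loglik(model, context_ids, gold_ids):
    """Log-likelihood of the gold-answer tokens given the context so far."""
    inp = torch.cat([context_ids, gold_ids], dim=-1)
    logits = model(inp).logits[:, :-1]                   # next-token logits
    logp = F.log_softmax(logits, dim=-1)
    token_lp = logp.gather(-1, inp[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, -gold_ids.size(-1):].sum()        # answer span only

def episode_rewards(model, prompt_ids, episodes, gold_ids, lam=0.05):
    """Per-episode dense reward: an information-gain proxy minus a length
    cost, computed the moment each episode closes (no waiting for the
    final answer)."""
    rewards, ctx = [], prompt_ids
    with torch.no_grad():
        prev = gold_loglik(model, ctx, gold_ids)
        for ep in episodes:
            ctx = torch.cat([ctx, ep], dim=-1)           # append this episode
            cur = gold_loglik(model, ctx, gold_ids)
            rewards.append((cur - prev).item() - lam * ep.size(-1))
            prev = cur                                   # gain is incremental
    return rewards
```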

This positions L2T as a third option in the ORM/PRM taxonomy. ORMs provide sparse outcome-only feedback (cheap but uninformative about intermediate steps). PRMs provide dense step-level feedback (informative but dependent on expensive annotation). L2T provides dense information-theoretic feedback (informative and annotation-free), with the trade-off being the computational overhead of Fisher information estimation. The principle that dense process rewards outperform outcome-only signals extends beyond reasoning chains to agentic systems: Does supervising retrieval steps outperform final answer rewards? demonstrates the same finding in agentic RAG, where step-level retrieval rewards substantially improve search-agent training over final-answer-only rewards.

The task-dependence finding matters: moderate chain extensions improve coverage of critical steps on hard problems (Tier 4 multi-stage math), while the same extensions reduce accuracy on simple problems (Tier 1). No fixed chain length is optimal across tasks. This reinforces Can we allocate inference compute based on prompt difficulty? — the budget must be adaptive, and L2T provides the per-episode signal to enable that adaptation.
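
A hedged sketch of how that per-episode signal could drive an adaptive budget: generation continues only while the estimated marginal gain clears a threshold. `generate_episode` and `estimate_gain` are hypothetical stand-ins for a sampler and for L2T's dense reward, not the paper's interface.

```python
# Hedged sketch: a per-episode signal enabling an adaptive compute budget.
# `generate_episode` and `estimate_gain` are hypothetical stand-ins, not
# the paper's actual API.

def adaptive_chain(generate_episode, estimate_gain, max_episodes=16, tau=0.01):
    episodes = []
    for _ in range(max_episodes):
        candidate = generate_episode(episodes)     # propose one more step
        if estimate_gain(episodes, candidate) < tau:
            break                                  # marginal value too low: stop
        episodes.append(candidate)                 # hard problems keep extending
    return episodes
```

Under this stopping rule, a Tier 1 prompt whose gain collapses after the first episode exits immediately, while a Tier 4 problem keeps extending as long as each stage clears the threshold.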

