Can one statistical measure serve dual purposes in RL training?

Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.

Note · 2026-05-18 · sourced from Reasoning Methods CoT ToT

The cross-rollout variance signal in DRO does double duty. First, it identifies the tokens within a reference answer whose certainty depends on the chain-of-thought, and up-weights those in the dense reward. Second, the same variance computed across a query's rollout group serves as a query-level filter: queries whose rollouts produce too little variance get discarded entirely, because they offer no comparative signal for learning.

The query-filter use is the underappreciated half. Most RL setups treat every query in the batch equally, computing rewards across rollouts and updating the policy. But not every query carries gradient signal. Queries where all rollouts converge to the same answer with similar certainty contribute nothing: the comparative reward is degenerate, and the gradient is noise. Filtering these out before the update concentrates compute on queries where comparative learning is possible.
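To make the degenerate case concrete, here is a minimal sketch. It assumes a group-relative advantage (each rollout's reward centered and scaled within its query's group); the note does not specify DRO's exact baseline, so the function name, the `eps` term, and the reward values are illustrative only.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale one query's rollout rewards.

    Assumption: a group-relative baseline, used here only for illustration.
    """
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + eps)

# All rollouts score the same: every advantage is zero, so the update has no signal.
flat = group_relative_advantages([0.5, 0.5, 0.5, 0.5])

# Rollouts disagree: advantages separate the better rollouts from the worse ones.
spread = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

When rewards are almost, but not exactly, identical, dividing by a tiny standard deviation amplifies whatever small differences remain, which is one way a low-variance query can turn into noise rather than signal.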

The two uses come from the same statistical quantity: cross-rollout variance over reasoning-reflective tokens. The token-level view asks, "which positions in this answer respond to reasoning differences?" The query-level view asks, "does this entire query produce enough variation across rollouts to be worth learning from?" Both are derived from the same self-supervised samples, with no human labels, no process reward model (PRM), and no extra forward passes.
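The sketch below shows how one array of samples could feed both views. It is an illustration under assumptions, not DRO's actual implementation: `ref_logprobs` is assumed to hold the log-probability of each reference-answer token under each rollout's chain-of-thought, and the mean-over-tokens aggregation, the normalization of weights, and the `min_query_variance` threshold are all hypothetical choices.

```python
import numpy as np

def variance_signals(ref_logprobs, min_query_variance=1e-3):
    """Derive token weights and a keep/discard decision from one variance computation.

    ref_logprobs: shape (n_rollouts, n_tokens), one row per rollout (assumed input).
    min_query_variance: hypothetical filtering threshold.
    """
    ref_logprobs = np.asarray(ref_logprobs, dtype=float)
    token_var = ref_logprobs.var(axis=0)  # per-token variance across rollouts

    # Token-level view: up-weight positions whose certainty moves with the reasoning.
    total = token_var.sum()
    if total > 0:
        token_weights = token_var / total
    else:
        # No variation anywhere: fall back to uniform weights.
        token_weights = np.full(token_var.shape, 1.0 / token_var.size)

    # Query-level view: aggregate the same variances into one score and threshold it.
    query_var = float(token_var.mean())
    keep_query = bool(query_var >= min_query_variance)
    return token_weights, query_var, keep_query

# Three rollouts, three reference tokens; only the middle token reacts to the reasoning.
informative = np.array([
    [-0.125, -2.0, -0.25],
    [-0.125, -0.5, -0.25],
    [-0.125, -1.2, -0.25],
])
weights, qvar, keep = variance_signals(informative)

# Every rollout yields the same log-probs: the query would be discarded.
flat_weights, flat_qvar, flat_keep = variance_signals(np.tile([-0.125, -1.0, -0.25], (3, 1)))
```

Both outputs come from the single `token_var` array, so the filter adds no model calls beyond those already needed for the token weights.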

The empirical result is that DRO trains 2–3× faster with better stability than baselines on unverifiable tasks. The decomposition explains why: every gradient update spends compute on queries with measurable signal, and within each query, the gradient concentrates on the tokens that actually carry reasoning sensitivity. Sample efficiency emerges from selecting signal at both granularities: filtering queries and weighting tokens.

The transferable principle: when a self-supervised signal exists, reuse it at multiple aggregation levels. The same statistic that identifies which tokens to weight also identifies which queries to keep. Looking for one such statistic per pipeline is cheap; designing two separate signals (one for filtering, one for weighting) is what makes other dense-reward pipelines expensive.

Related concepts in this collection

Can we identify which tokens actually matter for reasoning? Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?
the same variance used for token-level weighting
Can rubrics and dense rewards work together without hacking? Explores whether reward signals derived from rubrics suffer from exploitation, and whether separating rubric judgments from optimization signals could prevent this failure mode.
DRO's third leg: the rubric gate that handles feasibility
Can we reward reasoning steps without human annotation? Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
complementary self-supervised dense signal

Original note title

cross-rollout variance functions simultaneously as reward signal and query filter — one statistical quantity unlocks sample-efficient RL on unverifiable tasks
