Can one statistical measure serve dual purposes in RL training?

Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.

Note · 2026-05-18 · sourced from Reasoning Methods CoT ToT

The cross-rollout variance signal in DRO does double duty. First, it identifies the tokens within a reference answer whose certainty depends on the chain-of-thought, and up-weights those in the dense reward. Second, the same variance computed across a query's rollout group serves as a query-level filter: queries whose rollouts produce too little variance get discarded entirely, because they offer no comparative signal for learning.

The query-filter use is the underappreciated half. Most RL setups process every query in the batch equally, computing rewards across rollouts and updating the policy. But not every query carries gradient signal. Queries where all rollouts converge to the same answer with similar certainty contribute nothing: the comparative reward is degenerate because every rollout scores about the same, so whatever gradient remains is noise. Filtering these out before the update concentrates compute on queries where comparative learning is possible.
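The degenerate case can be made concrete with a minimal sketch. It assumes a GRPO-style group-relative advantage, where each rollout's reward is standardized against the other rollouts for the same query; the note does not specify the exact normalization, and `group_advantages` and the reward values are hypothetical names and numbers chosen for illustration.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Standardize each rollout's reward within its query group.

    Assumption: a GRPO-style comparative reward; not taken from the note.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]

# All rollouts agree: every advantage collapses to zero, so the query
# contributes no learning signal to the policy update.
converged = group_advantages([0.7, 0.7, 0.7, 0.7])

# Rollouts disagree: advantages separate better rollouts from worse ones.
mixed = group_advantages([0.2, 0.9, 0.4, 0.7])
```

With identical rewards the advantages are all zero, which is the situation the query filter is meant to catch before any update is computed.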

The two uses come from the same statistical quantity: cross-rollout variance over reasoning-reflective tokens. The token-level view asks "which positions in this answer respond to reasoning differences?" The query-level view asks "does this entire query produce enough variation across rollouts to be worth learning from?" Both are derived from the same self-supervised samples, with no human labels, no process reward model (PRM), and no extra forward passes.
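A minimal sketch of how one statistic might feed both views, under stated assumptions: `token_probs[i][t]` is the probability the policy assigns to reference-answer token `t` after rollout `i`'s chain-of-thought, token weights are variances normalized to sum to one, and the query score is the mean per-token variance compared against a hypothetical threshold `min_variance`. None of these specifics come from the note; they only illustrate reusing the same variance at two aggregation levels.

```python
from statistics import mean, pvariance

def token_variances(token_probs):
    """Variance of each reference-answer token's probability across rollouts."""
    num_tokens = len(token_probs[0])
    return [pvariance([rollout[t] for rollout in token_probs])
            for t in range(num_tokens)]

def token_weights(variances):
    """Token-level view: up-weight tokens whose certainty varies with the CoT."""
    total = sum(variances)
    if total == 0:
        # No token responds to reasoning; fall back to uniform weights.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]

def keep_query(variances, min_variance=1e-3):
    """Query-level view: keep the query only if rollouts vary enough.

    Assumption: mean per-token variance as the score; threshold is hypothetical.
    """
    return mean(variances) >= min_variance

def dense_reward(rollout_probs, weights):
    """Variance-weighted certainty of the reference answer for one rollout."""
    return sum(w * p for w, p in zip(weights, rollout_probs))

# Three rollouts, three answer tokens. Only the middle token reacts to the CoT.
sensitive_query = [[0.9, 0.2, 0.95], [0.9, 0.8, 0.95], [0.9, 0.5, 0.95]]
flat_query = [[0.9, 0.9, 0.95]] * 3  # every rollout yields the same certainty

variances = token_variances(sensitive_query)
weights = token_weights(variances)
rewards = [dense_reward(r, weights) for r in sensitive_query]
flat_variances = token_variances(flat_query)
```

In this toy case all the weight lands on the middle token, the dense rewards differ across rollouts, and `keep_query(flat_variances)` rejects the flat query. Both decisions read from the same `token_variances` output, so no second signal has to be computed.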

The empirical result is that DRO trains 2–3× faster with better stability than baselines on unverifiable tasks. The decomposition explains why: every gradient update spends compute on queries with measurable signal, and within each query, the gradient concentrates on the tokens that actually carry reasoning sensitivity. Sample efficiency comes from concentrating signal at both granularities: filtering at the query level and weighting at the token level.

The transferable principle: when a self-supervised signal exists, reuse it at multiple aggregation levels. The same statistic that identifies which tokens to weight also identifies which queries to keep. Looking for one such statistic per pipeline is cheap; designing two separate signals (one for filtering, one for weighting) is what makes other dense-reward pipelines expensive.

Original note title

cross-rollout variance functions simultaneously as reward signal and query filter — one statistical quantity unlocks sample-efficient RL on unverifiable tasks