Can we identify which tokens actually matter for reasoning?
Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?
DRO introduces a clean operational definition of "the tokens that depend on the reasoning." For each token in a reference answer, measure the model's self-certainty under different sampled chain-of-thought prefixes. Most tokens — articles, connectives, lexically expected words — barely change in certainty across rollouts. A small minority show high variance: their certainty depends on which reasoning path was taken. These are the reasoning-reflective tokens. They are not lexically distinctive — they cannot be identified by surface features — but they carry the answer's actual sensitivity to the reasoning chain.
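The variance measurement described above can be sketched in a few lines. This is a minimal illustration, not DRO's implementation: it assumes the per-token certainty scores have already been computed and stored as `certainty[r][t]` (rollout `r`, reference token `t`). The function names, the `top_fraction` cutoff, and the toy numbers are all hypothetical.

```python
from statistics import pvariance


def token_variances(certainty):
    """Variance of each reference token's certainty across rollouts.

    certainty[r][t] is the (assumed precomputed) certainty of reference
    token t when the answer is conditioned on rollout r's chain of thought.
    """
    n_tokens = len(certainty[0])
    return [pvariance([row[t] for row in certainty]) for t in range(n_tokens)]


def reasoning_reflective(certainty, top_fraction=0.25):
    """Indices of the highest-variance tokens (hypothetical top-k cutoff)."""
    var = token_variances(certainty)
    k = max(1, round(top_fraction * len(var)))
    ranked = sorted(range(len(var)), key=lambda t: var[t], reverse=True)
    return sorted(ranked[:k])


# Made-up scores: 4 rollouts x 4 reference tokens.
# Only token 2 changes much depending on the reasoning path.
certainty = [
    [0.98, 0.97, 0.20, 0.99],
    [0.97, 0.98, 0.85, 0.99],
    [0.98, 0.97, 0.35, 0.98],
    [0.99, 0.97, 0.90, 0.99],
]

print(reasoning_reflective(certainty))  # [2]
```

Nothing in the selection looks at the token text itself, which matches the point that these tokens are found by their behavior across rollouts rather than by surface features.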
The implication for reward design is that a uniform average across all reference tokens has a poor signal-to-noise ratio. The average is dominated by tokens whose certainty is determined by language modeling rather than by reasoning, so whatever differential the reasoning chain produces is diluted by tokens that would have appeared regardless. The variance filter isolates the reasoning-bearing fraction of the answer.
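The dilution argument can be made concrete with a short worked equation. The notation is illustrative and not taken from DRO: $T$ is the number of reference tokens, $S$ the reasoning-reflective subset, and $\Delta c_t$ the difference in certainty on token $t$ between two rollouts. It idealizes the reasoning-invariant tokens as contributing approximately zero difference.

```latex
% Illustrative notation (an assumption, not DRO's own):
% T = number of reference tokens, S = reasoning-reflective subset,
% \Delta c_t = certainty difference on token t between two rollouts.
\Delta R_{\text{uniform}}
  = \frac{1}{T}\sum_{t=1}^{T} \Delta c_t
  = \frac{1}{T}\sum_{t \in S} \Delta c_t
    + \underbrace{\frac{1}{T}\sum_{t \notin S} \Delta c_t}_{\approx\, 0}
  \approx \frac{|S|}{T}\,\overline{\Delta c}_S,
\qquad
\overline{\Delta c}_S = \frac{1}{|S|}\sum_{t \in S} \Delta c_t .
```

Under this idealization, an answer with $T = 100$ tokens of which $|S| = 5$ are reasoning-reflective has its reward contrast scaled down to $5\%$ of the mean difference on $S$. In practice the invariant tokens also contribute small fluctuations, which act as noise on top of the shrunken signal.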
Up-weighting these high-variance tokens produces a sharper reward contrast across rollouts in a group. The mechanism is purely statistical — no human annotation, no per-step rubric, no extra model. Cross-rollout variance is computed from the policy's own samples, which makes the method cheap relative to process reward models (PRMs) that require labeled intermediate steps.
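A small sketch shows how up-weighting by variance sharpens the contrast within a group. The weighting rule used here (weights proportional to cross-rollout variance, normalized to sum to 1) is an assumption chosen for simplicity; the source does not specify DRO's exact weighting. All function names and numbers are hypothetical.

```python
from statistics import pvariance


def variance_weights(certainty):
    """Per-token weights proportional to cross-rollout variance.

    Assumed scheme for illustration only. Falls back to uniform weights
    when no token varies at all.
    """
    n_tokens = len(certainty[0])
    var = [pvariance([row[t] for row in certainty]) for t in range(n_tokens)]
    total = sum(var)
    if total == 0:
        return [1.0 / n_tokens] * n_tokens
    return [v / total for v in var]


def weighted_rewards(certainty, weights):
    """One scalar reward per rollout: weighted mean of token certainties."""
    return [sum(w * c for w, c in zip(weights, row)) for row in certainty]


def spread(values):
    """Gap between the best and worst rollout in the group."""
    return max(values) - min(values)


# Made-up scores: 4 rollouts x 4 reference tokens.
certainty = [
    [0.98, 0.97, 0.20, 0.99],
    [0.97, 0.98, 0.85, 0.99],
    [0.98, 0.97, 0.35, 0.98],
    [0.99, 0.97, 0.90, 0.99],
]

uniform = [1.0 / 4] * 4
weights = variance_weights(certainty)

print("uniform spread: ", spread(weighted_rewards(certainty, uniform)))
print("weighted spread:", spread(weighted_rewards(certainty, weights)))
```

On this toy data the weighted rewards separate the rollouts far more than the uniform average does, because nearly all of the weight lands on the one token whose certainty tracks the reasoning path. Everything is computed from the group's own samples, with no labels or auxiliary model.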
The deeper point is that token-level reward density is not the issue: token-level dense rewards have been proposed before. The issue is which tokens to weight, and the answer, "weight tokens by their variance under different reasoning prefixes," turns out to be a self-supervised filter that recovers the reasoning-bearing dimension.
This connects to L2T's information-theoretic dense process rewards as an alternative dense-signal strategy: L2T scores reasoning steps by their contribution to answer correctness; DRO scores tokens by their sensitivity to reasoning. Both replace uniform averaging with a structure-aware signal; both achieve sample efficiency by concentrating the gradient where it matters.
Related concepts in this collection
- Can we reward reasoning steps without human annotation?
  Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains. Can information theory provide dense, automatic feedback for individual reasoning steps?
  alternative dense-reward design at the step level rather than the token level
- Which tokens in reasoning chains actually matter most?
  Do language models internally rank tokens by functional importance? Greedy pruning experiments explore whether models preserve symbolic computation while discarding linguistic scaffolding, and what this reveals about reasoning architecture.
  independent evidence that reasoning chains have token-level structure that uniform averaging hides
- Can rubrics and dense rewards work together without hacking?
  Explores whether reward signals derived from rubrics suffer from exploitation, and whether separating rubric judgments from optimization signals could prevent this failure mode.
  DRO's other half: the rubric-gate that complements R3
- Can one statistical measure serve dual purposes in RL training?
  Explores whether cross-rollout variance can simultaneously weight important tokens and filter low-signal queries, potentially unlocking efficiency gains in reasoning tasks without human labels.
  DRO's third use of the same variance signal
Original note title
reasoning-reflective tokens are identifiable by high cross-rollout variance under different CoT prefixes — most reference tokens are reasoning-invariant and dilute uniformly-averaged signals