
Can we identify which tokens actually matter for reasoning?

Most tokens in an answer are determined by language patterns rather than reasoning. Is there a way to distinguish the small fraction of tokens whose certainty genuinely depends on the chain of thought?

Note · 2026-05-18 · sourced from Reasoning Methods CoT ToT

DRO introduces a clean operational definition of "the tokens that depend on the reasoning." For each token in a reference answer, measure the model's self-certainty under different sampled chain-of-thought prefixes. Most tokens — articles, connectives, lexically expected words — barely change in certainty across rollouts. A small minority show high variance: their certainty depends on which reasoning path was taken. These are the reasoning-reflective tokens. They are not lexically distinctive — they cannot be identified by surface features — but they carry the answer's actual sensitivity to the reasoning chain.
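The variance filter described above can be sketched in a few lines. This is a minimal illustration, not DRO's actual implementation: it assumes the per-token certainties have already been measured and stored as a matrix `certainty[r][t]` (rollout `r`, reference token `t`), and the function names, the toy numbers, and the `top_fraction` cutoff are all hypothetical choices made for the example.

```python
from statistics import pvariance

def token_variances(certainty):
    """Per-token variance of certainty across rollouts.

    certainty[r][t] is the (assumed, pre-computed) certainty of reference
    token t when the answer is conditioned on rollout r's chain of thought.
    """
    n_tokens = len(certainty[0])
    return [pvariance([row[t] for row in certainty]) for t in range(n_tokens)]

def reasoning_reflective(certainty, top_fraction=0.2):
    """Indices of the highest-variance tokens (hypothetical top-k cutoff)."""
    var = token_variances(certainty)
    k = max(1, round(top_fraction * len(var)))
    ranked = sorted(range(len(var)), key=lambda t: var[t], reverse=True)
    return sorted(ranked[:k])

# Toy data: reference answer "The answer is 42 ." under three sampled CoTs.
tokens = ["The", "answer", "is", "42", "."]
certainty = [
    [0.98, 0.95, 0.97, 0.90, 0.99],  # rollout whose reasoning supports "42"
    [0.97, 0.96, 0.98, 0.20, 0.99],  # rollout whose reasoning points elsewhere
    [0.98, 0.94, 0.97, 0.55, 0.99],  # rollout with inconclusive reasoning
]

selected = reasoning_reflective(certainty)
print([tokens[t] for t in selected])
```

In this toy matrix only the certainty of "42" moves with the reasoning path, so it is the one token the filter selects; the function words stay near-constant across rollouts.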

The implication for reward design is that a uniform average across all reference tokens has a poor signal-to-noise ratio. The average is dominated by tokens whose certainty is set by language modeling rather than by reasoning, so whatever differential the reasoning chain produces is diluted by tokens that would have appeared regardless. The variance filter isolates the reasoning-bearing fraction of the answer.
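The dilution argument can be made concrete with a small arithmetic sketch. It assumes, purely for illustration, an answer with a single reasoning-dependent token and `n_invariant` tokens whose certainty is identical in every rollout; the function names and certainty values are hypothetical.

```python
def uniform_reward(token_certainties):
    """Uniform average of per-token certainties over the reference answer."""
    return sum(token_certainties) / len(token_certainties)

def reward_gap(n_invariant, good=0.9, bad=0.2, invariant=0.98):
    """Reward difference between a good and a bad rollout under uniform
    averaging, when only one token's certainty depends on the reasoning."""
    good_rollout = [invariant] * n_invariant + [good]
    bad_rollout = [invariant] * n_invariant + [bad]
    return uniform_reward(good_rollout) - uniform_reward(bad_rollout)

# The gap shrinks as reasoning-invariant tokens are added to the average.
for n in (0, 9, 99):
    print(n, reward_gap(n))
```

Under these assumptions the gap between the two rollouts is 0.7 / (N + 1) for N invariant tokens, so a longer answer shrinks the usable reward contrast even though the reasoning-dependent token is unchanged.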

Up-weighting these high-variance tokens produces a sharper reward contrast across rollouts in a group. The mechanism is purely statistical — no human annotation, no per-step rubric, no extra model. Cross-rollout variance is computed from the policy's own samples, which makes the method cheap relative to process reward models (PRMs) that require labeled intermediate steps.
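A hedged sketch of the up-weighting step follows. The note does not specify the weighting scheme, so weights proportional to cross-rollout variance are an assumption made here for illustration, as are the function names, the `eps` smoothing term, and the toy certainty matrix.

```python
from statistics import pvariance

def variance_weights(certainty, eps=1e-8):
    """Token weights proportional to cross-rollout variance (assumed scheme)."""
    n_tokens = len(certainty[0])
    var = [pvariance([row[t] for row in certainty]) for t in range(n_tokens)]
    total = sum(var) + eps  # eps avoids division by zero if nothing varies
    return [v / total for v in var]

def weighted_rewards(certainty):
    """One reward per rollout: variance-weighted sum of token certainties."""
    w = variance_weights(certainty)
    return [sum(wi * c for wi, c in zip(w, row)) for row in certainty]

def uniform_rewards(certainty):
    """Baseline: plain average of token certainties per rollout."""
    return [sum(row) / len(row) for row in certainty]

# certainty[r][t]: rollout r, reference token t (toy values).
certainty = [
    [0.98, 0.95, 0.97, 0.90, 0.99],
    [0.97, 0.96, 0.98, 0.20, 0.99],
    [0.98, 0.94, 0.97, 0.55, 0.99],
]

u = uniform_rewards(certainty)
w = weighted_rewards(certainty)
print("uniform spread: ", max(u) - min(u))
print("weighted spread:", max(w) - min(w))
```

Everything here is computed from the policy's own rollouts, with no labels or auxiliary model. On the toy data the spread of rewards within the group is wider under variance weighting than under uniform averaging, which is the sharper contrast the paragraph above describes.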

The deeper point is that token-level reward density is not the issue; token-level dense rewards have been proposed before. The issue is which tokens to weight, and the answer, "weight tokens by their variance under different reasoning prefixes," turns out to be a self-supervised filter that recovers the reasoning-bearing dimension without external labels.

This connects to L2T's information-theoretic dense process rewards as an alternative dense-signal strategy: L2T scores reasoning steps by their contribution to answer correctness; DRO scores tokens by their sensitivity to reasoning. Both replace uniform averaging with a structure-aware signal; both achieve sample efficiency by concentrating the gradient where it matters.

Original note title

reasoning-reflective tokens are identifiable by high cross-rollout variance under different CoT prefixes — most reference tokens are reasoning-invariant and dilute uniformly-averaged signals