INQUIRING LINE

Do high-entropy RLVR tokens correspond to MI-peak tokens during inference?

This asks whether the small set of high-entropy 'forking' tokens that RLVR training targets are the same tokens that carry the most information (mutual-information peaks) when a model is actually generating an answer. The corpus has strong material on the first half of that question (what RLVR targets) but only circumstantial material on the correspondence itself.


The question bundles a two-part claim: (1) that RLVR concentrates its learning on a minority of high-entropy tokens, and (2) that those same tokens are the high-information decision points during inference. On the first half, the corpus is direct and emphatic. Only about 20% of tokens show high entropy, and those tokens act as pivotal reasoning decision points: the places where a reasoning trace could branch one way or another. Training on just that 20% matches or even beats full-gradient training, which means the minority is where the learning signal lives (see "Do high-entropy tokens drive reasoning model improvements?"). So the 'forking point' framing is well supported.
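The selection step described above can be sketched as follows. This is a minimal illustration, not the method from the underlying note: the function names, the toy distributions, and the choice to rank positions within a single sequence are all assumptions made here. Only the roughly 20% figure comes from the source.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_positions(entropies: Sequence[float], fraction: float = 0.2) -> List[int]:
    """Indices of the top `fraction` of positions by entropy, in sequence order.

    Hypothetical helper: ranking within one sequence is an assumption;
    a real pipeline might rank across a whole batch instead.
    """
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])


# Toy next-token distributions over a 4-word vocabulary (made up for illustration).
toy_distributions = [
    [1.00, 0.00, 0.00, 0.00],  # fully determined continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a candidate fork
    [0.90, 0.10, 0.00, 0.00],
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
]
entropies = [token_entropy(d) for d in toy_distributions]
forking = high_entropy_positions(entropies, fraction=0.2)
print(forking)
```

In a training loop, a mask built from these indices would restrict the policy-gradient loss to the selected positions, which is the setup the 20% result refers to.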

The second half, whether high entropy lines up with mutual-information peaks at inference, isn't measured head-on in this collection, but several notes circle the same territory under different vocabulary. The throughline across the RLVR work here is that the method doesn't teach new reasoning; it re-weights and sharpens behaviors already latent in the base model. RLVR improves sampling efficiency without expanding the reasoning boundary (see "Does RLVR actually expand what models can reason about?"), activates pretraining strategies rather than installing them (see "What does reward learning actually do to model reasoning?"), and tends to amplify one dominant pretraining format while collapsing the alternatives (see "Does RL training collapse format diversity in pretrained models?"). If RLVR works by concentrating probability mass at exactly the branch points where the outcome is still uncertain, then high-entropy tokens and high-information tokens would be describing the same junctions from two angles: entropy is the model's uncertainty there, and mutual information is how much the eventual answer depends on which way it forks.
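To make the two angles concrete, here is one way to write them down. The notation is an assumption introduced for illustration and appears in none of the notes: x_{<t} is the prefix generated so far, X_t the next token, V the vocabulary, and A the final answer the model eventually produces. The notes do not say which mutual-information estimator they have in mind, so this is only one possible reading.

```latex
% Assumed notation (not from the source notes):
%   x_{<t}: prefix so far,  X_t: next token,  \mathcal{V}: vocabulary,  A: final answer

% Token entropy: the model's uncertainty about the next token
H_t = H(X_t \mid x_{<t})
    = -\sum_{v \in \mathcal{V}} p(v \mid x_{<t}) \log p(v \mid x_{<t})

% Mutual information: how much the final answer depends on the choice made at step t
I(X_t; A \mid x_{<t})
    = H(A \mid x_{<t}) - \sum_{v \in \mathcal{V}} p(v \mid x_{<t}) \, H(A \mid x_{<t}, X_t = v)

% Standard bound: a token cannot carry more information than it has entropy
0 \le I(X_t; A \mid x_{<t}) \le H_t
```

Under this particular reading, high entropy is necessary but not sufficient for high mutual information: a fork between two paraphrases that lead to the same answer has high entropy and near-zero information. Other definitions, for example ones that measure information about a reference answer from hidden states rather than from the model's own sampling distribution, are not constrained by this bound.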

That's a clean story, but the corpus also gives you reasons to be careful about assuming the correspondence is tight. RLVR can improve the local coherence of a trace (fewer logical errors between adjacent steps) without making the global proof valid (see "Does RLVR actually improve mathematical reasoning or just coherence?"). Read in entropy terms, the model can become more confident (lower entropy) at exactly the steps that matter most for the answer (high mutual information) while still being wrong. Entropy reduction and genuine information gain can decouple. And when the training signal is poorly chosen, for example overly hard samples or contaminated rewards, the high-advantage tokens RLVR latches onto can be accidental shortcuts rather than real forking decisions (see "Do overly hard RLVR samples actually harm model capabilities?"), which would put the high-entropy tokens and the truly informative tokens in different places.
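The point about accidental successes is easier to see with numbers. The sketch below assumes a GRPO-style normalization (each reward minus the group mean, divided by the group standard deviation). The source note only says "group-relative normalization", so the exact formula, the function name, and the reward values are assumptions made here.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps).

    Assumed GRPO-style formula using the population standard deviation;
    the source note does not specify the exact normalization.
    """
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical groups of 8 rollouts with binary verifier rewards.
nearly_impossible = [1, 0, 0, 0, 0, 0, 0, 0]  # one lucky success
moderate = [1, 1, 1, 1, 0, 0, 0, 0]           # half the rollouts succeed

rare_advantage = group_relative_advantages(nearly_impossible)[0]
common_advantage = group_relative_advantages(moderate)[0]
print(round(rare_advantage, 3), round(common_advantage, 3))
```

Under these assumptions the lone success in the nearly impossible group receives a much larger advantage than any single success in the moderate group, so whatever tokens happened to produce it, shortcut or not, are reinforced hardest.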

There's a useful adjacent signal too: calibrated token-probability uncertainty turns out to be a more reliable guide than external heuristics for deciding when a model needs help, such as when to retrieve (see "Can simple uncertainty estimates beat complex adaptive retrieval?"). That's indirect evidence that per-token uncertainty really does track the consequential moments in a generation, which is the bet underlying any high-entropy-equals-high-information argument.

The honest bottom line: the corpus strongly establishes that high-entropy tokens are the load-bearing tokens for RLVR, and it makes the entropy↔information correspondence plausible as a mechanism. But it does not contain a note that directly measures mutual information at inference and aligns it against the high-entropy set, and it actively warns that confidence and informativeness can come apart. If you want to chase this further, the forking-points note ("Do high-entropy tokens drive reasoning model improvements?") is the place the correspondence is implicitly assumed, and the coherence-vs-validity note ("Does RLVR actually improve mathematical reasoning or just coherence?") is the place it might break.
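For reference, the missing measurement could look something like the sketch below: take per-token entropies and per-token mutual-information scores for the same generation, keep the top 20% of each, and compare the two sets. Everything here is hypothetical. The corpus contains no such scores, the numbers are invented, and Jaccard overlap is just one reasonable choice of comparison.

```python
from typing import Sequence, Set


def top_fraction(scores: Sequence[float], fraction: float = 0.2) -> Set[int]:
    """Set of positions holding the top `fraction` of scores."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def jaccard_overlap(a: Set[int], b: Set[int]) -> float:
    """Size of the intersection over size of the union (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Invented per-token scores for a 10-token generation (illustration only).
entropy_scores = [0.1, 0.2, 1.9, 0.3, 0.1, 0.4, 0.2, 1.5, 0.1, 0.3]
mi_scores = [0.0, 0.1, 0.9, 0.1, 0.0, 0.8, 0.1, 0.2, 0.0, 0.1]

high_entropy_set = top_fraction(entropy_scores)  # positions 2 and 7
mi_peak_set = top_fraction(mi_scores)            # positions 2 and 5
overlap = jaccard_overlap(high_entropy_set, mi_peak_set)
print(overlap)
```

An overlap near 1 across many generations would support the correspondence; an overlap near the chance level would suggest the two vocabularies describe different tokens.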


Sources (7 notes)

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Does RLVR actually expand what models can reason about?

Pass@k analysis shows base models outperform RLVR models at high k, indicating RLVR doesn't expand solvable problems but rather narrows sampling toward solutions already in the base model's distribution. Distillation, by contrast, genuinely transfers new reasoning patterns.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does RLVR actually improve mathematical reasoning or just coherence?

RLVR post-training measurably reduces logical errors between adjacent reasoning steps, but locally coherent traces can still be globally invalid proofs. The improvement is structural rather than semantic.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can simple uncertainty estimates beat complex adaptive retrieval?

Calibrated token-probability uncertainty consistently beats multi-call adaptive retrieval on single-hop tasks and matches performance on multi-hop, using a fraction of the LM and retriever calls. The model's self-knowledge proves more reliable than external heuristics for deciding when to retrieve.
