Can we improve reasoning by amplifying information at mutual information peaks?

This reads the question as: do reasoning gains come from concentrating learning on the rare high-information moments in a chain of thought — the forking points where the model's uncertainty is highest — rather than treating every token equally?

The corpus says: surprisingly, yes, and the effect is large. Only about 20% of tokens in a reasoning trace are high-entropy 'forking points' where the model genuinely chooses between paths, and restricting reinforcement-learning updates to those tokens matches or even beats updating on all of them (Do high-entropy tokens drive reasoning model improvements?). The minority carries the learning signal; the remaining ~80% of tokens contribute little to it. That is the strongest direct support for the question's premise: the 'peaks' aren't a metaphor, they're a measurable, exploitable minority.
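As a rough illustration of the forking-token idea, the sketch below computes per-position entropy, keeps the top 20% of positions, and averages a REINFORCE-style loss over only those positions. The function names, the toy distributions, and the loss form are assumptions made for illustration; this is not the implementation from the cited note.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def forking_mask(entropies, keep_fraction=0.2):
    """Return a 0/1 mask that keeps the highest-entropy fraction of positions."""
    if not entropies:
        return []
    k = max(1, round(keep_fraction * len(entropies)))
    # Indices of the k highest-entropy positions
    order = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(entropies))]

def masked_policy_loss(logprobs, advantages, mask):
    """REINFORCE-style loss averaged over the kept positions only (assumed form)."""
    kept = sum(mask)
    total = sum(-lp * adv * m for lp, adv, m in zip(logprobs, advantages, mask))
    return total / kept if kept else 0.0

# Hypothetical toy trace: 5 positions over a 3-token vocabulary.
# Only position 1 is a near-uniform "fork"; the rest are near-deterministic.
dists = [
    [0.98, 0.01, 0.01],
    [0.34, 0.33, 0.33],
    [0.90, 0.05, 0.05],
    [0.97, 0.02, 0.01],
    [0.99, 0.005, 0.005],
]
entropies = [token_entropy(d) for d in dists]
mask = forking_mask(entropies, keep_fraction=0.2)
loss = masked_policy_loss([-0.1, -1.1, -0.2, -0.1, -0.05], [1.0] * 5, mask)
```

In a real training loop the same mask would multiply the per-token loss tensor before reduction, so the low-entropy positions receive no gradient.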

But amplifying information isn't the same as amplifying confidence or length, and the corpus is sharp on this distinction. You can use the model's own confidence at the answer span as a reward to rank reasoning traces, which improves step-by-step reasoning while also reversing the calibration damage that human-feedback training tends to cause (Can model confidence work as a reward signal for reasoning?). Confidence here works as a proxy for informativeness. The danger sign is when training optimizes the wrong signal: supervised fine-tuning raises benchmark accuracy while cutting 'Information Gain' by 38.9%. The model learns to produce correct answers through post-hoc rationalization rather than genuine inferential steps, and standard metrics miss it entirely because they only check the final answer (Does supervised fine-tuning improve reasoning or just answers?). So you can degrade the very thing the question wants to amplify while your scoreboard goes up.
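The confidence-ranking idea above can be sketched as turning answer-span confidence into synthetic preference pairs. Everything specific here is an assumption for illustration: the geometric-mean confidence measure, the `min_gap` tie threshold, and the trace dictionary layout are hypothetical and not taken from the cited note.

```python
import math
from itertools import combinations

def answer_span_confidence(answer_token_logprobs):
    """Geometric-mean probability of the answer tokens (an assumed measure)."""
    if not answer_token_logprobs:
        return 0.0
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))

def preference_pairs(traces, min_gap=0.05):
    """Build (chosen_id, rejected_id) pairs from traces ranked by confidence."""
    scored = [(t["id"], answer_span_confidence(t["answer_logprobs"])) for t in traces]
    pairs = []
    for (id_a, conf_a), (id_b, conf_b) in combinations(scored, 2):
        if abs(conf_a - conf_b) < min_gap:
            continue  # near-ties carry little preference signal, so skip them
        pairs.append((id_a, id_b) if conf_a > conf_b else (id_b, id_a))
    return pairs

# Hypothetical traces for one prompt; log-probabilities cover the answer span only.
traces = [
    {"id": "t1", "answer_logprobs": [-0.05, -0.10]},
    {"id": "t2", "answer_logprobs": [-1.20, -0.90]},
    {"id": "t3", "answer_logprobs": [-0.06, -0.11]},
]
pairs = preference_pairs(traces)
```

Note that this ranks by confidence alone. As the paragraph above warns, a signal like this can diverge from genuine information gain, so it would need checking against a step-level measure rather than final-answer accuracy.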

There's a deeper reframing worth knowing: the information you'd want to amplify may already be in the model, waiting to be elicited rather than created. Five independent methods (RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR) all unlock reasoning that is latently present in base-model activations; post-training selects reasoning, it doesn't build it (Do base models already contain hidden reasoning ability?). In the same spirit, verbose and concise reasoning occupy distinct linear directions in activation space, which you can steer along with a single extracted vector and no retraining (Can we steer reasoning toward brevity without retraining?). That suggests 'amplifying at the peaks' might be done at inference time by nudging activations, not just by reweighting the training loss.
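A minimal sketch of single-vector steering, assuming the vector is the difference of mean activations between concise and verbose examples and is added to a hidden state with a scale `alpha`. The difference-of-means extraction, the function names, and the toy activations are illustrative assumptions; the cited note only states that one vector is extracted from paired examples.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(concise_acts, verbose_acts):
    """Difference of means: points from the verbose cluster toward the concise one."""
    mean_concise = mean_vector(concise_acts)
    mean_verbose = mean_vector(verbose_acts)
    return [c - v for c, v in zip(mean_concise, mean_verbose)]

def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Hypothetical 3-dimensional activations from paired concise/verbose traces
concise = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
verbose = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(concise, verbose)
steered = steer([0.5, 0.5, 0.5], vec, alpha=0.5)
```

In practice the addition would happen inside a forward hook at a chosen layer, and `alpha` would be tuned so that length drops without hurting accuracy.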

Two cautions keep this from being a free lunch. First, more signal is not monotonically better: chain-of-thought accuracy follows an inverted U with length, and more capable models do best with shorter chains (Why does chain of thought accuracy eventually decline with length?); pushing thinking tokens from ~1,100 to ~16K dropped accuracy from 87.3% to 70.3% as the model overthought easy problems (Does more thinking time always improve reasoning accuracy?). Second, the fluent reasoning you'd be amplifying can be hollow: chain-of-thought degrades predictably outside its training distribution, imitating the form of reasoning without valid underlying logic (Does chain-of-thought reasoning actually generalize beyond training data?). So the honest answer: amplifying genuine information at the high-entropy forking points has real evidence behind it, but only if your signal tracks information gain rather than confidence-shaped surface correctness, and only up to the point where more becomes noise.

Sources (8 notes)

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
