Can distillation methods extract directional guidance that scalar RL cannot access?
This explores whether token-level distillation can recover the 'how to change' signal that a single scalar reward throws away — and what the corpus says about why directional information slips through RL's fingers in the first place.
Put differently: scalar RL collapses everything it learns into one number per outcome, so can distillation reach into the same feedback and pull out the *directional* part, the 'do it this way instead', that a scalar cannot represent? The corpus has a direct answer and a surprising amount of lateral support for it.
The cleanest statement comes from work showing that natural agent feedback splits into two orthogonal channels: *evaluative* (how well did that action go) and *directive* (how should it change) (see 'Can scalar rewards capture all the information in agent feedback?'). A scalar reward is built to carry the first and structurally cannot carry the second: 'good/bad by this much' has no slot for 'here's the corrected move.' Token-level distillation does have that slot, because it matches a *distribution over next tokens* rather than a single score, so it can recover the directional specifics the reward discarded. The two are complementary rather than competing, which reframes the whole question: distillation isn't a better RL; it accesses a different axis of the same feedback.
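The contrast between a verdict and an instruction can be made concrete with a toy example. The sketch below is illustrative only: the three-token vocabulary, the logits, and the teacher distribution are invented for this example, and the distillation objective is assumed to be the cross-entropy between teacher and student next-token distributions, whose gradient with respect to the student logits is `student_probs - teacher_probs`.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_direction(student_logits, teacher_probs):
    """Descent direction for cross-entropy H(teacher, student) w.r.t. student logits.

    The gradient is (student_probs - teacher_probs), so the direction that
    reduces the loss is (teacher_probs - student_probs): one entry per token.
    """
    student_probs = softmax(student_logits)
    return [t - s for s, t in zip(student_probs, teacher_probs)]

# Hypothetical vocabulary and values, chosen only for illustration.
vocab = ["left", "right", "wait"]
student_logits = [2.0, 0.0, 0.0]      # student strongly prefers "left"
teacher_probs = [0.05, 0.90, 0.05]    # teacher strongly prefers "right"

# Evaluative channel: one number. It says the choice was bad, not what to do instead.
scalar_reward = -1.0

# Directive channel: one number per token, saying which tokens to raise or lower.
direction = distillation_direction(student_logits, teacher_probs)
most_encouraged = vocab[max(range(len(vocab)), key=lambda i: direction[i])]

for token, d in zip(vocab, direction):
    print(f"{token:>5}: {d:+.3f}")
print("scalar reward:", scalar_reward)
print("token the teacher pushes toward:", most_encouraged)
```

The scalar carries a single degree of freedom per outcome, while the distillation signal has as many entries as the vocabulary at every position. That difference in shape is the 'slot' the paragraph above describes.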
Why would scalar RL leave so much on the table? Several notes suggest the problem is structural rather than a matter of tuning. RL updates only a sparse 5–30% slice of parameters, and nearly the same slice across seeds (see 'Does reinforcement learning update only a small fraction of parameters?'), so it nudges a narrow subnetwork rather than re-teaching the model. It also collapses onto a single dominant pretrained format within the first epoch, suppressing alternatives (see 'Does RL training collapse format diversity in pretrained models?'). And when the reward signal is sparse or the problems are too hard, scalar RL doesn't just fail to learn: it learns *degenerate shortcuts* and amplifies them, because group-relative normalization treats a lucky correct answer as a high-advantage trajectory worth repeating (see 'Do overly hard RLVR samples actually harm model capabilities?'). A scalar can't tell 'right for the right reason' from 'right by accident'; directional supervision can.
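A small worked example shows how group-relative normalization can inflate a lucky success. This is a sketch under stated assumptions: it uses the commonly described GRPO-style form, advantage = (reward - group mean) / group standard deviation, with a population standard deviation and a small epsilon. The group size and rewards are invented, and real implementations may differ in these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Assumed form: (r - mean) / (std + eps), using the population std.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 8 rollouts on a nearly impossible problem:
# seven fail, one lands on the right answer (possibly by accident).
rewards = [0, 0, 0, 0, 0, 0, 0, 1]
advantages = group_relative_advantages(rewards)

for i, (r, a) in enumerate(zip(rewards, advantages)):
    print(f"rollout {i}: reward={r} advantage={a:+.3f}")
```

The single success receives an advantage of roughly +2.6, while each failure receives a small negative value. Nothing in the computation looks at how the answer was reached, so a guessed answer and a soundly derived one are reinforced equally.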
That is exactly the gap several methods close by bringing directional signal back in. Adaptive guidance hands the model partial ground-truth solution traces on hard problems instead of waiting for a reward to materialize (see 'Can adaptive guidance from solution traces reduce reward sparsity in RL?'). Process supervision derived from trajectory *structure* (tree topology, expert-aligned actions, tool-call positions) converts a single outcome reward into dense per-step signal without any annotated reward model (see 'Can trajectory structure replace hand-annotated process rewards?'). Both do the directive job: they point at the next move rather than score the outcome. Proxy-tuning makes the case sharpest: by shifting the output distribution at decoding time and leaving base weights untouched, it closes 88–91% of the alignment gap while *beating* direct fine-tuning on knowledge tasks, because direct weight updates corrupt lower-layer knowledge storage that a distributional shift never touches (see 'Can decoding-time tuning preserve knowledge better than weight fine-tuning?').
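The decoding-time shift can be sketched as simple logit arithmetic. The block below assumes the logit-offset form usually associated with proxy-tuning, in which the base model's logits are shifted by the difference between a small tuned model and its untuned counterpart. The notes summarized here do not spell out the formula, so treat it as an assumption; the function name, vocabulary size, and logit values are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits):
    """Shift the base model's next-token logits at decoding time.

    Assumed form: base + (tuned_small - untuned_small). No weights are
    updated; only the output distribution for this step changes.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)

# Hypothetical next-token logits over a 3-token vocabulary.
base_logits = [3.0, 1.0, 0.5]           # large base model prefers token 0
tuned_small_logits = [0.5, 2.0, 0.0]    # small tuned model prefers token 1
untuned_small_logits = [1.5, 0.0, 0.0]  # same small model before tuning

probs = proxy_tuned_probs(base_logits, tuned_small_logits, untuned_small_logits)
base_choice = max(range(3), key=lambda i: base_logits[i])
proxy_choice = max(range(3), key=lambda i: probs[i])
print("base model picks token", base_choice)
print("proxy-tuned decoding picks token", proxy_choice)
```

In this toy setup the preferred token moves from 0 to 1 while `base_logits` is left unchanged, which mirrors the claim that the base model's stored knowledge is not touched by the shift.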
The thing worth carrying away: the gap isn't that RL is weak and distillation is strong. It's that 'reward' and 'demonstration' are different *kinds* of information. A scalar is a verdict; a distribution is an instruction. The frontier methods here all win by recovering the instruction — and the honest caveat from the corpus is that distillation inherits whatever bias lives in the traces it copies, so a directive signal is only as trustworthy as its source.
Sources (7 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
GHPO dynamically provides ground-truth solution traces for hard problems while using standard RL for manageable ones, achieving 5% gains across math benchmarks. This converts wasted compute on impossible problems into learning signal by leveraging traces already present in training data.
Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.
Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.