INQUIRING LINE

Why does step-level expert alignment work when outcome-only RL fails?

This explores why rewarding a model for matching expert moves at each step (a dense, process-level signal) succeeds at teaching hard reasoning where rewarding only the final answer (outcome-only RL) stalls — and what the corpus says is actually going wrong in the outcome-only case.


The corpus points to a single root cause: when every rollout on a hard problem fails, an outcome-only reward is silent. Every trajectory receives the same score, so there is no gradient and nothing to learn from. Step-level expert-similarity rewards solve this by scoring how closely the model's action at each step matches an expert's, so the signal stays dense even when no full attempt succeeds (Can step-wise expert rewards help small models learn hard reasoning?). The reward isn't waiting for a win at the end; it's grading the path.
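To make the silent-reward problem concrete, here is a minimal Python sketch. It assumes a GRPO-style group-relative advantage (reward minus the group mean, divided by the group standard deviation) and uses `difflib.SequenceMatcher` as a stand-in for the expert-similarity score. The function names, the example steps, and the choice of similarity metric are illustrative assumptions, not the implementation from the cited note.

```python
from difflib import SequenceMatcher
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def step_similarity(model_step, expert_step):
    """Similarity in [0, 1] between a model step and the expert step.

    Assumption: plain string similarity stands in for whatever
    expert-alignment measure a real system would use.
    """
    return SequenceMatcher(None, model_step, expert_step).ratio()


# Four rollouts on a hard problem; every final answer is wrong.
outcome_rewards = [0.0, 0.0, 0.0, 0.0]
outcome_adv = group_advantages(outcome_rewards)  # all zeros: no learning signal

# The same four rollouts, scored on one intermediate step against an expert.
expert_step = "factor the quadratic as (x - 2)(x - 3)"
model_steps = [
    "factor the quadratic as (x - 2)(x - 3)",
    "factor the quadratic as (x + 2)(x + 3)",
    "apply the quadratic formula",
    "guess x = 7",
]
step_rewards = [step_similarity(s, expert_step) for s in model_steps]
step_adv = group_advantages(step_rewards)  # non-zero: rollouts are now ranked

print("outcome-only advantages:", outcome_adv)
print("step-level advantages:  ", [round(a, 2) for a in step_adv])
```

With outcome-only rewards, all four advantages are zero and the policy update vanishes. With step-level scores, the rollout whose step matches the expert's gets the largest advantage even though no rollout reached the right answer.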

The deeper trouble is that outcome-only RL doesn't just fail quietly on hard problems; it actively misleads. When a model occasionally stumbles onto a right answer through a degenerate shortcut, group-relative reward normalization treats that rare accidental success as a high-advantage trajectory and reinforces the shortcut (answer repetition, computation skipping) instead of sound reasoning (Do overly hard RLVR samples actually harm model capabilities?). Worse, the damage isn't local: rewarding only final correctness sharpens the whole policy, concentrating probability mass on what already works while draining the diversity needed to ever crack the unsolved problems (Does outcome-based RL diversity loss spread across unsolved problems?). And even when outcome RL appears to improve scores, out-of-distribution tests reveal that it often sharpened template-matching rather than installing a real procedure (Do fine-tuned language models actually learn optimization procedures?).
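The amplification of rare successes can be checked with a few lines of arithmetic. The sketch below assumes binary outcome rewards and a group-relative advantage normalized by the population standard deviation; real implementations may differ in those details, and the helper names are hypothetical. Under these assumptions, a single success in a group of size G receives an advantage of about sqrt(G - 1).

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def lone_success_advantage(group_size):
    """Advantage assigned to the only successful rollout in a group."""
    rewards = [1.0] + [0.0] * (group_size - 1)
    return group_advantages(rewards)[0]


# A success in a group where half the rollouts succeed (a problem of
# moderate difficulty) gets an advantage of about 1.
balanced = group_advantages([1.0] * 8 + [0.0] * 8)[0]

# A lone success on a nearly impossible problem gets a much larger one,
# and it grows with group size (roughly sqrt(G - 1)).
for g in (4, 16, 64):
    print(g, round(lone_success_advantage(g), 3))
print("balanced group:", round(balanced, 3))
```

The normalization cannot tell whether the lone success came from sound reasoning or from a shortcut; it only sees that the reward was rare, and weights the update accordingly.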

Step-level alignment sidesteps all three failure modes because the expert path supplies a reasonable trajectory to imitate before the model has to discover one on its own. The most striking result is that the two methods are complementary, not rivals: running supervised step-wise RL first to build a reasoning foundation, then outcome-based RL to refine it, beats either alone, because the imitation phase creates plausible rollouts that finally make the outcome reward informative (Does sequencing imitation then exploration training improve reasoning?). The dense phase manufactures the very signal the sparse phase was missing.

The corpus also shows that 'step-level signal' is a broader family than expert imitation. You can manufacture process rewards without any expert at all: Monte Carlo tree search ranks solution paths by how often they lead to success, generating dense per-step quality signals that replace human annotation (Can tree search replace human feedback in LLM training?). You can reward the structure of reasoning itself by tagging planning, exploration, and reflection, which cuts repetitive actions by 31% versus outcome-only training while generalizing better (Can RL agents learn to reason better, not just succeed?). And you can shape the signal by filtering which trajectories count: keeping high-quality successes while preserving diverse failures as negative signal lets a 14B model reach frontier math performance (Why do correct code trajectories teach models to tolerate errors?). The through-line across all of these: on a problem you can't yet solve, where the answer lands tells you almost nothing, but how you got there still carries usable signal.
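The idea of ranking paths by how often they lead to success can be sketched without a full tree search. The Python example below assumes a fixed batch of finished rollouts, each a list of step labels plus a success flag, and estimates the value of every step prefix as the success rate of the rollouts that pass through it. This is plain Monte Carlo value estimation over a prefix tree, not the selection and expansion loop of real MCTS and not the cited system's method; the rollout data and function name are made up for illustration.

```python
from collections import defaultdict


def prefix_values(rollouts):
    """Estimate the value of each step prefix from finished rollouts.

    rollouts: list of (steps, success) pairs, where steps is a list of
    step labels and success is a bool.
    Returns a dict mapping each prefix (as a tuple) to the fraction of
    rollouts through that prefix that ended in success.
    """
    visits = defaultdict(int)
    wins = defaultdict(int)
    for steps, success in rollouts:
        for i in range(1, len(steps) + 1):
            prefix = tuple(steps[:i])
            visits[prefix] += 1
            wins[prefix] += int(success)
    return {p: wins[p] / visits[p] for p in visits}


# Hypothetical rollouts on one problem; only the first one succeeds.
rollouts = [
    (["expand", "collect terms", "solve"], True),
    (["expand", "collect terms", "guess"], False),
    (["expand", "drop a term", "solve"], False),
    (["guess"], False),
]

values = prefix_values(rollouts)
for prefix, value in sorted(values.items()):
    print(" -> ".join(prefix), round(value, 2))
```

Even with a single success in the batch, every intermediate step receives its own score: "collect terms" after "expand" rates higher than "drop a term", which is the kind of per-step ranking an outcome-only reward cannot provide.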

The thing you might not have expected: step-level supervision works best not as a permanent replacement for outcome RL but as a curriculum that comes first. It's the scaffolding that makes the sparse-reward phase finally have something to sharpen.


Sources (8 notes)

Can step-wise expert rewards help small models learn hard reasoning?

Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does outcome-based RL diversity loss spread across unsolved problems?

RL that rewards only final answer correctness sharpens the policy globally, concentrating probability mass on correct trajectories for solved problems while simultaneously reducing diversity on unsolved ones. Historical exploration (training diversity via UCB-style bonuses) and batch exploration (test-time diversity via repetition penalties) require structurally different mechanisms.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.

Can RL agents learn to reason better, not just succeed?

RLVMR uses structured meta-reasoning tags (planning, exploration, reflection, monitoring) with programmatic rewards to train agentic RL. This reduces repetitive actions by 31% compared to outcome-only methods while maintaining better generalization than supervised fine-tuning alone.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps and to surpass much larger models while producing cleaner reasoning.

Next inquiring lines