Can step-wise expert rewards help small models learn hard reasoning?

When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide a useful learning signal? This note explores whether intermediate-step alignment can overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.

Note · 2026-05-18 · sourced from Training Fine Tuning

Small open-source models hit a wall on hard multi-step reasoning problems. RLVR (Reinforcement Learning with Verifiable Rewards) fails when the model's success rate is effectively zero: no rollout produces the correct answer, so outcome-only supervision provides no positive signal. SFT (Supervised Fine-Tuning) overfits long demonstrations through rigid token-by-token imitation, particularly on small models where complex teacher traces exceed the student's representational capacity. Both methods fail in the same regime: a small model, a hard problem, and no path to correctness through standard supervision.
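A minimal sketch of why an all-fail batch carries no signal, assuming a mean-centered advantage of the kind used in group-based policy-gradient methods. The note does not specify the optimizer, so the baseline and the function name `centered_advantages` are illustrative assumptions:

```python
def centered_advantages(rewards: list[float]) -> list[float]:
    """Subtract the group mean reward from each rollout's reward.

    Assumption: a simple mean baseline stands in for whatever
    advantage estimator the actual training setup uses.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Hard problem, small model: every rollout gets the binary outcome reward 0.
all_fail = centered_advantages([0.0, 0.0, 0.0, 0.0])

# If even one rollout succeeds, the advantages separate good from bad.
one_success = centered_advantages([0.0, 0.0, 1.0, 0.0])

print(all_fail)
print(one_success)
```

When every advantage is zero, the policy-gradient update vanishes, which is the zero-success-rate failure described above.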

Supervised Reinforcement Learning (SRL) fills the gap. The framework reformulates problem-solving as generating a sequence of logical actions, with the model trained to produce an internal reasoning monologue before committing to each action. Rewards come not from final-answer correctness but from similarity between the model's actions and expert actions extracted from an SFT dataset, computed step-wise as the rollout proceeds.

The reward structure is the key shift. Outcome rewards are sparse and binary — correct or not. Step-wise similarity rewards are dense and smooth — partial credit for partial alignment with expert steps. The model receives useful signal even on problems where it never reaches the correct answer, because the gradient flows from incremental alignment with the demonstrated reasoning path rather than from final-answer matching.
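The contrast between the two reward types can be sketched as follows. This is an illustration under stated assumptions, not the framework's implementation: the similarity metric here is Python's standard `difflib.SequenceMatcher`, and the function names and the toy algebra steps are hypothetical.

```python
from difflib import SequenceMatcher


def outcome_reward(final_answer: str, reference_answer: str) -> float:
    """Sparse, binary reward: 1.0 only when the final answer matches."""
    return 1.0 if final_answer.strip() == reference_answer.strip() else 0.0


def step_similarity_reward(model_action: str, expert_action: str) -> float:
    """Dense reward in [0, 1]: string similarity between one model action
    and the corresponding expert action (assumed metric: difflib ratio)."""
    return SequenceMatcher(None, model_action, expert_action).ratio()


# Hypothetical expert demonstration and an imperfect model rollout.
expert_steps = ["subtract 3 from both sides", "divide both sides by 2"]
model_steps = ["subtract 3 from each side", "multiply both sides by 2"]

# One reward per step, available even though the rollout ends up wrong.
step_rewards = [
    step_similarity_reward(m, e) for m, e in zip(model_steps, expert_steps)
]
final_reward = outcome_reward("x = 16", "x = 4")

print(step_rewards, final_reward)
```

The outcome reward for this rollout is zero, while each step still earns partial credit, so a batch of failed rollouts can still be ranked against one another.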

This also addresses the SFT failure mode. SFT forces token-by-token imitation, which makes long expert traces brittle teaching examples for small models — one wrong predicted token derails the imitation. SRL operates at the action level, decomposing expert demonstrations into manageable steps. The model can be wrong about specific tokens while still receiving credit for action-level alignment.
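One way to picture the action-level decomposition is the sketch below. It assumes that each training instance conditions on the problem plus the expert steps taken so far, and that the model's monologue is separated from its action by a closing tag. Both choices, along with the names `StepExample`, `build_step_examples`, `extract_action`, and the `</think>` tag, are hypothetical and not taken from the note.

```python
from dataclasses import dataclass


@dataclass
class StepExample:
    prompt: str         # problem plus the expert steps taken so far
    target_action: str  # the next expert action to be matched


def build_step_examples(problem: str, expert_steps: list[str]) -> list[StepExample]:
    """Turn one long demonstration into one training instance per step."""
    examples = []
    for k, step in enumerate(expert_steps):
        prefix = "\n".join(expert_steps[:k])
        prompt = f"{problem}\n{prefix}" if prefix else problem
        examples.append(StepExample(prompt=prompt, target_action=step))
    return examples


def extract_action(model_output: str, tag: str = "</think>") -> str:
    """Drop the free-form monologue and keep only the committed action.

    Assumption: the monologue ends with `tag`; if the tag is absent,
    the whole output is treated as the action.
    """
    _, sep, action = model_output.partition(tag)
    return action.strip() if sep else model_output.strip()


examples = build_step_examples(
    "Solve 2x + 3 = 11", ["subtract 3 from both sides", "divide both sides by 2"]
)
action = extract_action("Isolate x first.</think> subtract 3 from both sides")
```

Because only the extracted action is compared with `target_action`, the monologue tokens are free to differ from the expert trace without costing reward.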

The reported empirical result: SRL enables small models to learn problems previously unlearnable by SFT or RLVR. The method is most powerful as a curriculum component: initializing with SRL and then refining with RLVR outperforms either method alone, with SRL building the foundation that RLVR then sharpens.

Related concepts in this collection

Does sequencing imitation then exploration training improve reasoning?
Can Supervised RL (expert imitation) followed by RLVR (outcome rewards) outperform either method alone on hard reasoning tasks? This explores whether curriculum ordering unlocks capabilities neither method achieves independently.
same paper, the curriculum combination

Can curriculum learning approximate expensive process supervision?
Can a reverse curriculum that slides backward from task completion provide step-level insight comparable to human process annotations, but at outcome-supervision cost?
adjacent: another method to bridge SFT and RLVR

Why does teacher-student information asymmetry enable learning signals?
What role does privileged answer access play in making social meta-learning training work? Without asymmetric information, can a conversation between teacher and student function as pedagogy or only as parallel speculation?
adjacent: another method using expert/privileged information for small-model training

Original note title

supervised RL provides step-wise expert-similarity rewards that yield learning signal even when all rollouts fail — bridges the SFT-RLVR gap for small models on hard reasoning
