Reasoning and Learning Architectures

Can step-wise expert rewards help small models learn hard reasoning?

When small models fail on hard multi-step problems, can training them to match expert reasoning steps rather than final answers provide a useful learning signal? This note explores whether intermediate-step alignment can overcome the limitations of both supervised fine-tuning and outcome-based reinforcement learning.

Note · 2026-05-18 · sourced from Training Fine Tuning

Small open-source models hit a wall on hard multi-step reasoning problems. RLVR (Reinforcement Learning with Verifiable Rewards) fails when the model's success rate is effectively zero: no rollout produces the correct answer, so outcome-only supervision provides no positive signal. SFT (Supervised Fine-Tuning) overfits long demonstrations through rigid token-by-token imitation, particularly on small models where complex teacher traces exceed the student's representational capacity. Both methods fail in the same regime: small model, hard problem, and no path to correctness through their standard supervision.

Supervised Reinforcement Learning (SRL) fills the gap. The framework reformulates problem-solving as generating a sequence of logical actions, with the model trained to produce an internal reasoning monologue before committing to each action. Rewards come not from final-answer correctness but from similarity between the model's actions and expert actions extracted from an SFT dataset, computed step-wise as the rollout proceeds.
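The step-wise reward described above can be sketched as follows. This is a minimal illustration, not the framework's actual implementation: the note does not say how similarity is measured, so the use of Python's difflib.SequenceMatcher is an assumption, as are the function names step_reward and stepwise_rewards and the choice to compare steps by position.

```python
from difflib import SequenceMatcher


def step_reward(model_action: str, expert_action: str) -> float:
    """Similarity in [0, 1] between one predicted action and the expert action.

    Assumption: string similarity via difflib stands in for whatever
    similarity measure the real framework uses.
    """
    return SequenceMatcher(None, model_action, expert_action).ratio()


def stepwise_rewards(model_actions, expert_actions):
    """Score each expert step against the model's action at the same index.

    Steps the model never produced are scored against an empty string,
    which yields a reward of 0.0 for any non-empty expert step.
    """
    rewards = []
    for i, expert in enumerate(expert_actions):
        predicted = model_actions[i] if i < len(model_actions) else ""
        rewards.append(step_reward(predicted, expert))
    return rewards
```

An action identical to the expert's scores 1.0, a missing action scores 0.0, and a partially matching action lands in between, which is what gives the rollout partial credit.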

The reward structure is the key shift. Outcome rewards are sparse and binary — correct or not. Step-wise similarity rewards are dense and smooth — partial credit for partial alignment with expert steps. The model receives useful signal even on problems where it never reaches the correct answer, because the gradient flows from incremental alignment with the demonstrated reasoning path rather than from final-answer matching.
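A small numeric sketch makes the contrast concrete. It assumes a policy-gradient setup in which each rollout's advantage is its reward minus the group mean; the note does not specify the optimizer, so that setup is an assumption, and the dense reward values below are made up purely for illustration.

```python
def centered_advantages(rewards):
    """Subtract the group mean from each rollout's reward.

    Assumption: a mean-centered, group-relative advantage; the note does
    not say which policy-gradient estimator is used.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# Four rollouts on a problem the model never solves.
outcome_rewards = [0.0, 0.0, 0.0, 0.0]   # binary: every rollout is wrong
dense_rewards = [0.2, 0.5, 0.35, 0.8]    # hypothetical step-similarity scores

# Every outcome-based advantage is zero, so no rollout is preferred.
outcome_adv = centered_advantages(outcome_rewards)

# Dense advantages differ across rollouts, so the update can favor the
# rollouts that tracked the expert steps more closely.
dense_adv = centered_advantages(dense_rewards)
```

With all-zero outcome rewards there is nothing to distinguish one failed rollout from another, while the dense scores still rank them.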

This also addresses the SFT failure mode. SFT forces token-by-token imitation, which makes long expert traces brittle teaching examples for small models — one wrong predicted token derails the imitation. SRL operates at the action level, decomposing expert demonstrations into manageable steps. The model can be wrong about specific tokens while still receiving credit for action-level alignment.
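One way to picture the action-level decomposition is to turn a single expert demonstration into one training instance per step. The sketch below is an assumption about how such a split could look, not a description of the actual pipeline: the function name decompose, the newline-joined context format, and the idea of conditioning each step on the preceding expert steps are all hypothetical.

```python
def decompose(problem, expert_steps):
    """Split one expert demonstration into per-step training instances.

    Each instance pairs a context (the problem plus all earlier expert
    steps) with the single expert action the model should aim for next.
    """
    examples = []
    for k, step in enumerate(expert_steps):
        context = "\n".join([problem, *expert_steps[:k]])
        examples.append((context, step))
    return examples


# Hypothetical three-step demonstration.
instances = decompose(
    "Solve 2x + 3 = 11.",
    ["Subtract 3 from both sides.", "Divide both sides by 2.", "x = 4"],
)
```

Each instance asks for only one action, so an error in wording at one step does not invalidate the credit available at the others.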

The empirical result: SRL enables small models to learn problems that were previously unlearnable with SFT or RLVR. The method is most powerful as a curriculum component: initializing with SRL and then refining with RLVR outperforms either method alone, with SRL building the foundation that RLVR then sharpens.

Original note title

supervised RL provides step-wise expert-similarity rewards that yield learning signal even when all rollouts fail — bridges the SFT-RLVR gap for small models on hard reasoning