Training Large Language Models for Reasoning through Reverse Curriculum Reinforcement Learning
The core challenge in applying RL to complex reasoning is identifying a sequence of actions that yields positive rewards and providing appropriate supervision for optimization. Outcome supervision provides sparse rewards for final results without identifying error locations, whereas process supervision offers step-wise rewards but requires extensive manual annotation. R3 overcomes these limitations by learning from correct demonstrations. Specifically, R3 progressively slides the start state of reasoning from a demonstration's end to its beginning, facilitating easier model exploration at every stage.
Training a language model specialized in reasoning has proved superior to prompting-based approaches (Uesato et al., 2022; Yu et al., 2023b). However, Supervised Fine-tuning (SFT) focuses on imitating human demonstrations and requires large-scale, diverse annotations to generalize (Lightman et al., 2023; Yuan et al., 2023; Shen et al., 2021). Reinforcement learning (RL) offers a viable alternative that improves reasoning through exploration and learning (Bai et al., 2022; Ouyang et al., 2022; Zheng et al., 2023; Luo et al., 2023).
When applying RL to complex reasoning tasks, the core challenge lies in identifying a sequence of actions that yields positive rewards and providing appropriate supervisory signals for optimization (Sutton et al., 1998). On one hand, as task difficulty increases, so do the complexity and length of the reasoning chain, and LLMs struggle with the accumulation of errors and uncertainties across multiple intermediate steps (Lightman et al., 2023; Yu et al., 2023a; Zhang et al., 2023). As the number of reasoning steps grows, the search space expands exponentially, making it challenging to obtain correct final results (Xie et al., 2023). On the other hand, existing forms of supervision force a trade-off between feedback quality and annotation cost (Uesato et al., 2022). Outcome supervision (OS, Cobbe et al., 2021; Yu et al., 2023a) rewards only the final outcome (top center in Figure 1), but sparse rewards make it difficult to determine which actions led to success or failure (Wang et al., 2023b). Process supervision (PS, Uesato et al., 2022; Lightman et al., 2023) provides detailed feedback at every step of reasoning (top right in Figure 1), but it requires highly skilled annotators to select better reasoning paths, significantly increasing costs (Lightman et al., 2023).
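To make the trade-off concrete, the sketch below contrasts how the two schemes assign rewards to a sampled reasoning chain. The function names and arguments are illustrative assumptions for exposition, not an implementation from this paper.

```python
from typing import List

def outcome_supervision_rewards(num_steps: int, final_answer: str,
                                gold_answer: str) -> List[float]:
    """Outcome supervision: one sparse reward on the last step only,
    with no signal about which intermediate step went wrong."""
    rewards = [0.0] * num_steps
    rewards[-1] = 1.0 if final_answer == gold_answer else 0.0
    return rewards

def process_supervision_rewards(step_labels: List[bool]) -> List[float]:
    """Process supervision: dense per-step rewards, but each label
    comes from a costly human annotation of that step."""
    return [1.0 if ok else 0.0 for ok in step_labels]
```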
In this work, we propose R3: Learning Reasoning through Reverse Curriculum Reinforcement Learning (bottom in Figure 1) to address these limitations. It employs only outcome supervision to achieve an effect similar to process supervision. Specifically, R3 lets the model begin reasoning from a state sampled from a correct demonstration and provides feedback on the generated actions with outcome supervision. By slowly moving the start state from the end of the demonstration to the beginning, the model faces an easy exploration problem at each point, where it is likely to succeed because it has already learned to solve most of the remaining part. In this way, a curriculum of gradually increasing exploration difficulty is created, and we can provide approximately step-by-step supervisory signals for the model.
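A minimal sketch of this reverse curriculum is given below. It assumes a policy object exposing hypothetical `generate` and `rl_update` methods and demonstrations stored as lists of reasoning steps; the stage-to-prefix mapping and all names are illustrative, not the authors' released implementation.

```python
from typing import Callable, Dict, List

def build_start_state(question: str, demo_steps: List[str], k: int) -> str:
    """Prefix the prompt with the first k demonstration steps; the policy
    must generate the remaining reasoning and the final answer itself."""
    return question + "\n" + "\n".join(demo_steps[:k])

def reverse_curriculum_train(policy, data: List[Dict], num_stages: int,
                             answers_match: Callable[[str, str], bool]) -> None:
    for stage in range(num_stages):
        # Fraction of the demonstration the policy must produce on its own:
        # small in early stages (start state near the end of the demo),
        # approaching 1.0 in the final stage (start state at the beginning).
        frac = stage / max(1, num_stages - 1)
        for ex in data:
            steps = ex["demo_steps"]
            k = round((1.0 - frac) * (len(steps) - 1))  # demo steps kept as prefix
            prompt = build_start_state(ex["question"], steps, k)
            completion = policy.generate(prompt)
            # Outcome supervision only: a single binary reward on the final answer.
            reward = 1.0 if answers_match(completion, ex["gold_answer"]) else 0.0
            policy.rl_update(prompt, completion, reward)
```

Because early stages leave only the last step or two for the model to complete, exploration almost always reaches a positive reward, and the credit assignment at each stage effectively localizes to the newly exposed portion of the demonstration.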