Training Language Models to Self-Correct via Reinforcement Learning

Paper · arXiv 2409.12917 · Published September 19, 2024

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM’s self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model’s own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model’s own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt.
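To make the setup concrete, the sketch below illustrates what a two-turn self-correction rollout with a shaped reward could look like. It is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`generate`, `is_correct`, `SELF_CORRECTION_INSTRUCTION`), the specific prompt text, and the bonus coefficient are all hypothetical choices made for exposition.

```python
# Minimal sketch of a two-turn self-correction rollout and a shaped reward,
# in the spirit of the multi-turn setup described above. All helpers here
# are hypothetical placeholders, not the paper's actual implementation.

SELF_CORRECTION_INSTRUCTION = (
    "There may be an error in the previous solution. "
    "Please revise it and provide a final answer."
)

def generate(policy, prompt: str) -> str:
    """Placeholder for sampling a response from the current policy."""
    return policy(prompt)

def is_correct(response: str, reference: str) -> bool:
    """Placeholder answer checker (here, a simple exact match)."""
    return response.strip() == reference.strip()

def self_correction_rollout(policy, problem: str, reference: str, bonus: float = 0.5):
    """Roll out two turns and compute a shaped reward.

    Correctness of the second attempt is the primary signal; the bonus term
    rewards improving from turn 1 to turn 2 (and penalizes regressions),
    which discourages collapsing to "make no edit in the second turn".
    """
    attempt_1 = generate(policy, problem)
    followup_prompt = f"{problem}\n{attempt_1}\n{SELF_CORRECTION_INSTRUCTION}"
    attempt_2 = generate(policy, followup_prompt)

    r1 = float(is_correct(attempt_1, reference))
    r2 = float(is_correct(attempt_2, reference))
    reward = r2 + bonus * (r2 - r1)  # shaped multi-turn reward
    return (attempt_1, attempt_2), reward


if __name__ == "__main__":
    # Toy policy that always answers "42"; both turns are identical here.
    toy_policy = lambda prompt: "42"
    traces, reward = self_correction_rollout(toy_policy, "What is 6 * 7?", "42")
    print(traces, reward)
```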

The most prominent settings in which self-correction has proven effective are those where external input tokens from an environment are available, such as agentic tasks (Liu et al., 2023), code repair (Jain et al., 2024), and tool use (Chen et al., 2023).

Recent work demonstrates that naïvely prompting LLMs for self-correction can degrade performance; to date, there is no major work showing successful intrinsic self-correction via prompting alone.

Our work aims to train for self-correction entirely without the use of larger models or human supervision, with the learner itself generating its own training data. As in prior work, we assume access to a reward function that evaluates the correctness of the model's final response.
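As one purely illustrative instance of such a reward function, a binary answer checker for math-style problems might look like the sketch below; the \boxed{...} extraction convention and the helper names are assumptions made here, not details taken from the paper.

```python
import re
from typing import Optional

# Illustrative reward function: a binary check of a model's final answer
# against a known ground truth. The \boxed{...} extraction convention is an
# assumption for math-style tasks, not necessarily what the paper uses.

def extract_final_answer(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} span, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

def reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the ground truth, else 0.0."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

# Example: reward("... so the result is \\boxed{12}.", "12") == 1.0
```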

Prior work at the intersection of LLMs and multi-turn RL builds machinery for optimizing rewards with value-based (Farebrother et al., 2024; Shani et al., 2024; Snell et al., 2022; Zhou et al., 2024), policy-based (Shao et al., 2024; Xiong et al., 2024), and model-based (Hong et al., 2024) approaches. We do not focus on building new RL machinery; rather, we study how multi-turn RL can be used to instill an effective self-correction behavior.