Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Paper · arXiv 2411.16579 · Published November 25, 2024
Test Time Compute

Training large language models (LLMs) to spend more time on deliberate thinking and reflection before responding is crucial for effectively solving complex reasoning tasks in fields such as science, coding, and mathematics. However, the effectiveness of mechanisms like self-reflection and self-correction depends on the model’s capacity to accurately assess its own performance, which can be limited by factors such as initial accuracy, question difficulty, and the lack of external feedback. In this paper, we explore a two-player paradigm that separates the roles of reasoning and critique models, where the critique model provides step-level feedback to supervise the reasoning (actor) model at both test time and training time.
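As a rough illustration of this two-player loop at inference time, consider the minimal sketch below. The `actor` and `critic` callables, their signatures, and the round-based refinement loop are assumptions for illustration, not the paper’s actual interfaces:

```python
from typing import Callable, List, Tuple

# Minimal sketch of the two-player test-time loop. The `actor` and `critic`
# callables and their signatures are illustrative assumptions:
#   actor(question, feedback) -> list of reasoning steps
#   critic(question, steps)   -> (ok, step_level_feedback)

def critique_guided_inference(
    question: str,
    actor: Callable[[str, str], List[str]],
    critic: Callable[[str, List[str]], Tuple[bool, str]],
    max_rounds: int = 3,
) -> List[str]:
    """Refine a solution for up to `max_rounds` using step-level critiques."""
    feedback = ""
    steps: List[str] = []
    for _ in range(max_rounds):
        # The actor drafts (or redrafts) a step-by-step solution,
        # conditioned on the critic's feedback from the previous round.
        steps = actor(question, feedback)
        ok, feedback = critic(question, steps)
        if ok:  # the critic found no faulty step; accept the solution
            break
    return steps
```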

Motivated by these test-time insights, we bring the critique model into the actor model’s exploration and learning process, proposing a critique-in-the-loop self-improvement method (Section 5). With the supervision of critique models, and by scaling exploration computation for difficult queries, our method improves the actor’s exploration efficiency and solution diversity, alleviating the issue of tail narrowing [34] that reasoning models face during iterative exploration and learning. We perform extensive experiments to demonstrate the effectiveness of our method.
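A minimal sketch of how one round of such critique-in-the-loop data collection might look is given below. The `sample_solution` and `critic_accepts` callables, the per-query difficulty scores, and the linear budget schedule are illustrative assumptions rather than the paper’s implementation:

```python
from typing import Callable, Dict, List, Tuple

# Minimal sketch of critique-in-the-loop exploration for self-improvement.
# `sample_solution`, `critic_accepts`, and the difficulty scores are assumed
# inputs; the linear budget schedule is an illustrative choice.

def build_self_training_set(
    queries: List[str],
    difficulty: Dict[str, float],            # 0.0 (easy) .. 1.0 (hard)
    sample_solution: Callable[[str], str],   # one actor rollout per call
    critic_accepts: Callable[[str, str], bool],
    base_budget: int = 4,
    max_budget: int = 32,
) -> List[Tuple[str, str]]:
    """Collect critique-approved (query, solution) pairs for fine-tuning."""
    data: List[Tuple[str, str]] = []
    for q in queries:
        # Scale exploration compute with difficulty so that hard (tail)
        # queries get enough rollouts to yield usable solutions, which
        # counteracts tail narrowing during iterative self-training.
        budget = int(base_budget + difficulty[q] * (max_budget - base_budget))
        for _ in range(budget):
            solution = sample_solution(q)
            if critic_accepts(q, solution):  # step-level critique as filter
                data.append((q, solution))
    return data
```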

Finally, we propose the self-talk-via-critique method, training a single language model to reflect and self-correct at each step, and demonstrate the potential of this approach. In summary, our main contributions are:

• We introduce AutoMathCritique, an automated and scalable framework for collecting step-level critique data without additional human supervision, which we use to build the large-scale critique dataset MathCritique-76k (a rough sketch of such a pipeline appears after this list).

• We fine-tune the critique model on MathCritique-76k to offer constructive feedback on reasoning paths. We demonstrate and analyze the performance gains from the trained critique models in enhancing the actor’s reasoning at test time, particularly when scaling test-time computation.

• Motivated by the insights from our test-time analysis, we introduce the critique model into the actor’s self-training process and propose the critique-in-the-loop self-improvement method to enhance exploration efficiency and solution diversity, ultimately training better reasoning models.

• We conduct extensive experiments to validate the effectiveness of our method and perform an in-depth analysis of critique models, e.g., their scaling properties and whether test-time computation should be scaled sequentially or in parallel.

• We propose the self-talk-via-critique method and take a preliminary step toward training models that can perform step-level reasoning, reflection, and correction, demonstrating their potential (a rough sketch of this self-talk loop closes this section). We hope our work offers valuable insights for future research on LLM reasoning and scalable supervision.
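To make the AutoMathCritique-style pipeline concrete, here is a rough sketch of automated step-level critique collection. Everything here, the `solve` and `write_critique` callables and the final-answer labeling rule, is a hypothetical reading of the framework, not its actual code:

```python
from typing import Callable, Dict, List

# Hypothetical sketch of automated step-level critique collection: sample
# actor solutions to questions with known answers, label them by final-answer
# correctness, and have an annotator model write a step-level critique for
# each, with no human supervision required.

def collect_critique_data(
    problems: List[Dict[str, str]],          # each: {"question", "answer"}
    solve: Callable[[str], List[str]],       # actor: question -> solution steps
    write_critique: Callable[[str, List[str], bool], str],  # annotator model
    samples_per_problem: int = 4,
) -> List[Dict[str, object]]:
    dataset: List[Dict[str, object]] = []
    for prob in problems:
        for _ in range(samples_per_problem):
            steps = solve(prob["question"])
            # Label by final-answer match; flawed paths are kept on purpose,
            # since critiques of incorrect steps are the most informative.
            correct = bool(steps) and prob["answer"] in steps[-1]
            critique = write_critique(prob["question"], steps, correct)
            dataset.append({
                "question": prob["question"],
                "steps": steps,
                "correct": correct,
                "critique": critique,
            })
    return dataset
```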
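Finally, a rough sketch of the self-talk-via-critique idea. A single model (`step_fn`) alternates between proposing a reasoning step, critiquing it, and, when flagged, rewriting it before moving on; the `STEP:`/`CRITIQUE:`/`REVISED:` prompts and the stopping rule are illustrative assumptions, not the paper’s prompts:

```python
from typing import Callable, List

# Hypothetical sketch of self-talk-via-critique: one model plays all three
# roles (reasoner, critic, corrector) by being prompted differently at each
# step. Prompt markers and the stopping rule are illustrative only.

def self_talk_solve(
    question: str,
    step_fn: Callable[[str], str],  # a single language model behind all roles
    max_steps: int = 10,
) -> List[str]:
    trace: List[str] = []
    context = question
    for _ in range(max_steps):
        step = step_fn(f"{context}\nSTEP:")                 # propose a step
        verdict = step_fn(f"{context}\n{step}\nCRITIQUE:")  # self-critique it
        if "incorrect" in verdict.lower():                  # flagged: rewrite
            step = step_fn(f"{context}\n{step}\n{verdict}\nREVISED:")
        trace.append(step)
        context = f"{context}\n{step}"
        if "final answer" in step.lower():                  # naive stopping rule
            break
    return trace
```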