Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking
Recent advances in large language models (LLMs), such as OpenAI-o1 and DeepSeek-R1, have demonstrated the effectiveness of test-time scaling, where extended reasoning processes substantially enhance model performance. Despite this, current models are constrained by limits on context length and by the efficiency of reinforcement learning (RL) training. To address these issues, we propose a simple yet effective test-time scaling approach: Multi-round Thinking. This method iteratively refines model reasoning by using each round's answer as a prompt for the next round. Extensive experiments across multiple models, including QwQ-32B and DeepSeek-R1, show consistent performance improvements on a range of benchmarks.
Despite these successes, existing methods exhibit critical limitations. Process reward models (PRMs) face challenges in clearly defining fine-grained reasoning steps, verifying the correctness of intermediate reasoning, and mitigating reward hacking (Amodei et al., 2016; Langosco et al., 2023), which makes automated labeling difficult and manual labeling impractical at scale. Similarly, Monte Carlo Tree Search (MCTS) methods struggle with vast search spaces, often trapping models in local optima, and depend heavily on sophisticated scoring models that are hard to train (DeepSeek-AI, 2025).
We introduce a novel Multi-round Thinking approach designed to significantly enhance reasoning capabilities in large language models (LLMs). In contrast to traditional single-step reasoning methods, our approach iteratively refines answers through multiple rounds of inference. Each round takes the answer from the previous iteration (without intermediate reasoning steps) as part of a new input prompt, encouraging independent reconsideration and correction. This iterative process helps models avoid cognitive inertia, analogous to human strategies in overcoming entrenched errors in reasoning.
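The iterative procedure above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `generate` callable, the `Answer:` marker, and the exact re-prompt wording are all assumptions; the text only specifies that each round's prompt includes the previous round's final answer without its intermediate reasoning.

```python
def extract_answer(completion: str) -> str:
    """Keep only the final answer, discarding intermediate reasoning.
    Assumes (hypothetically) that the answer follows an 'Answer:' marker."""
    marker = "Answer:"
    return completion.split(marker)[-1].strip() if marker in completion else completion.strip()


def multi_round_thinking(question: str, generate, rounds: int = 2) -> str:
    """Re-ask the question each round, feeding back only the previous answer."""
    answer = extract_answer(generate(question))
    for _ in range(rounds - 1):
        # Hypothetical re-prompt template: prior answer plus a request to rethink.
        prompt = (
            f"{question}\n"
            f"The assistant's previous answer was: {answer}.\n"
            f"Please reconsider the problem independently and answer again."
        )
        answer = extract_answer(generate(prompt))
    return answer


# Stub "model" for demonstration: hasty on the first pass, correct on reconsideration.
def stub_generate(prompt: str) -> str:
    if "previous answer" in prompt:
        return "Rechecking the arithmetic step by step... Answer: 4"
    return "Some hasty reasoning... Answer: 5"


print(multi_round_thinking("What is 2 + 2?", stub_generate, rounds=2))  # prints 4
```

In practice `generate` would wrap an LLM inference call; the key design point is that only `answer`, never the previous round's chain of thought, is carried into the next prompt, which is what encourages independent reconsideration.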