Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models
To address these issues, we introduce Meta-Reasoner, a framework that dynamically optimizes inference-time reasoning by enabling LLMs to “think about how to think.” Drawing inspiration from human meta-cognition and dual-process theory, Meta-Reasoner operates as a strategic advisor, decoupling high-level guidance from step-by-step generation. It employs contextual multi-armed bandits to iteratively evaluate reasoning progress and select optimal strategies (e.g., backtrack, clarify ambiguity, restart from scratch, or propose alternative approaches), and reallocates computational resources toward the most promising paths. Our evaluations on mathematical reasoning and puzzles highlight the potential of dynamic reasoning chains to overcome inherent challenges in the LLM reasoning process and also show promise in broader applications, offering a scalable and adaptable solution for reasoning-intensive tasks.
1 Introduction
Recent advances in o1-like and r1-like reasoning have enabled large language models (LLMs) to achieve remarkable performance on complex tasks such as mathematics (Patel et al., 2024; Lightman et al., 2023), science (Rein et al., 2023), and logical puzzles (Lei et al., 2024; Yao et al., 2023). By simulating multi-step, human-like deliberation (Yao et al., 2024), these methods allow LLMs to decompose problems into smaller subproblems, test hypotheses, reflect on intermediate results, and iteratively refine their solutions. This extended reasoning process enables systematic exploration of ideas, verification of partial conclusions, and progressive improvement before producing a final answer. Such capabilities are particularly valuable in domains demanding rigorous logical reasoning (Chenghao Yang, 2024).
Despite these advances, o1/r1-like reasoning remains fundamentally challenged by its trial-and-error nature: models generate numerous candidate reasoning paths, discard flawed ones, and gradually converge on solutions. While this flexibility facilitates exploration of diverse strategies, it often incurs substantial computational overhead (Snell et al., 2024; Manvi et al., 2024) and is vulnerable to error propagation, where early mistakes accumulate and compromise subsequent steps (Lei et al., 2024; Yao et al., 2023; Gandhi et al., 2024). Some iterative methods incorporate partial revision or backtracking (Gandhi et al., 2024; Li et al., 2025a), but these approaches tend to be ad-hoc and limited to correcting errors within a narrow reasoning window. Crucially, they lack a systematic mechanism to assess whether an entire reasoning trajectory remains promising or should be abandoned.
As a result, LLMs risk becoming “stuck” on unproductive reasoning paths, wasting valuable computational resources without recognizing when a strategic pivot is necessary. A critical challenge, therefore, is to enable LLMs to manage their reasoning budget more effectively—prioritizing promising directions while adapting or discarding ineffective strategies during inference time.
To address this challenge, we propose Meta-Reasoner, a specialized meta-reasoning module that operates alongside the LLM to enhance its reasoning capabilities. Acting as a high-level advisor, the meta-reasoner dynamically evaluates the reasoning process and provides strategic guidance or redirection when progress stalls. Unlike the LLM, which focuses on detailed stepwise generation, the meta-reasoner maintains a global perspective, assessing overall progress and strategy from a high level. Meta-Reasoner operates in iterative rounds: first, the LLM generates partial chain-of-thought reasoning and a concise “progress report” summarizing its current state. Meta-Reasoner then reviews this report and offers high-level feedback, such as restarting reasoning with a different approach, refining existing ideas, or focusing on specific subproblems. This setup allows Meta-Reasoner to concentrate on overall strategy rather than getting involved in the granular details of the LLM’s reasoning. Overall, Meta-Reasoner helps prevent the LLM from getting stuck or spending resources on unproductive lines of inquiry at inference time.
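To make this round structure concrete, the sketch below traces one advisory round. The helper callables llm_generate, summarize, and meta_advise stand in for model calls; their names and signatures are our assumptions for illustration, not part of the framework's specification.

```python
from typing import Callable

def advisory_round(
    problem: str,
    history: list[str],
    llm_generate: Callable[[str, list[str]], str],  # extends the CoT
    summarize: Callable[[list[str]], str],          # writes the progress report
    meta_advise: Callable[[str], str],              # returns high-level guidance
) -> tuple[list[str], str]:
    """One round of the generate/report/advise loop described above
    (a sketch of the flow, not the authors' implementation)."""
    # 1) The LLM extends its partial chain-of-thought from the current state.
    step = llm_generate(problem, history)
    history = history + [step]
    # 2) The reasoning so far is compressed into a concise progress report,
    #    so the advisor sees a summary rather than the full trace.
    report = summarize(history)
    # 3) The meta-reasoner reviews the report and returns a strategy,
    #    e.g. "restart with a different approach" or "refine the last step".
    strategy = meta_advise(report)
    return history, strategy
```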
Backtracking and Self-Verification
To mitigate the limitations of CoT-like reasoning, recent methods have explored backtracking and self-verification techniques (Yao et al., 2023; Besta et al., 2023; Gandhi et al., 2024). For example, Weng et al. (2023) demonstrate that incorporating a self-verification stage, in which the model re-examines its conclusions using the generated chain of thought, significantly improves performance by detecting errors early. Similarly, Ling et al. (2023) propose generating multiple candidate reasoning chains alongside a verifier mechanism that identifies and backtracks on erroneous steps. These approaches extend beyond post-hoc validation by enabling dynamic strategy adjustments during inference (Lightman et al., 2023), thereby limiting error propagation in lengthy reasoning chains and mitigating infinite reasoning loops. Building on these efforts, our Meta-Reasoner framework employs instructions to (1) restart from scratch with alternative strategies, (2) backtrack to the error point, and (3) continue with targeted suggestions; a minimal encoding of this instruction set is sketched below. Further details on this strategy are provided in §4.3.

Meta-Cognition & Dual-Process Systems
From a cognitive science perspective, meta-cognition involves higher-order processes that allow individuals to monitor, evaluate, and adjust their cognitive strategies (Gao et al., 2024; Yoran et al., 2024). This reflective thinking, often characterized as System 2 in dual-process theories (Havrilla et al., 2024), is vital for tasks requiring careful deliberation and error correction (Didolkar et al., 2024). Drawing on these insights, our Meta-Reasoner framework can be viewed as analogous to dual-process systems: the LLM generates CoT steps akin to System 1, while the Meta-Reasoner provides high-level strategic oversight, analogous to System 2, guiding or redirecting reasoning as needed.
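As the forward reference above notes, the three instruction types can be written down as a fixed action set. This is a minimal sketch; MetaInstruction and its member names are our own choices, as the paper does not prescribe an implementation.

```python
from enum import Enum

class MetaInstruction(Enum):
    """The three instruction types named above, encoded as a fixed
    action set (naming is ours, not the paper's)."""
    RESTART = "restart from scratch with an alternative strategy"
    BACKTRACK = "backtrack to the point where the error occurred"
    CONTINUE = "continue with targeted suggestions"

# Each member can later serve as one arm of the bandit described in §3.
for instruction in MetaInstruction:
    print(f"{instruction.name}: {instruction.value}")
```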
These observations motivate two key research questions in this paper: (1) How can language models dynamically allocate resources during inference to optimize reasoning and planning? (2) What architectural design enables an effective separation between the reasoning process within the LLM and the meta-level guidance that oversees it?
To address these questions, we propose a novel framework, Meta-Reasoner, which endows LLMs with the ability to “think about how to think”. Our framework supervises the reasoning process and dynamically guides the model to focus on more promising reasoning trajectories during inference time. Furthermore, Meta-Reasoner mitigates the limitations of conventional sequential reasoning, which may get stuck in suboptimal paths. We propose a “high-order” reasoning mechanism to balance exploration and exploitation using a Multi-Armed Bandit (MAB) algorithm.
The meta-reasoning framework operates iteratively as illustrated in Figure 1. At each round t, the reasoning process comprises three steps: (1) CoT generation by the LLM, (2) Progress Reporting to summarize the reasoning progress so far (this is partly for efficiency, and partly to help the meta-reasoner focus on its main goal of “advising” rather than being distracted by the details in the CoT), and (3) Strategy Generation by the meta-reasoner to optimize subsequent steps. The selection of a strategy corresponds almost exactly to the well-studied problem of contextual multi-armed bandits, as illustrated in §3. Each strategy can be seen as an arm of the bandit, and the reward of each strategy can be evaluated by the progress of LLM reasoning after applying the strategy. We liken the process of executing and evaluating each strategy to the act of “pulling” an arm. The overall goal of our meta-reasoner is thus to find the best arm, i.e., the strategy that maximizes cumulative reasoning progress across rounds.
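As a minimal, runnable sketch of this selection loop, the code below implements UCB1 over a fixed strategy set. It is deliberately simplified: it drops the contextual component (the framework conditions arm selection on the progress report), and the reward shown in the usage example is a hypothetical progress score, not the paper's scoring scheme.

```python
import math

class StrategyBandit:
    """UCB1 over a fixed strategy set (non-contextual simplification).

    The reward is assumed to be a progress score in [0, 1] judged after
    the chosen strategy is applied to the LLM's reasoning."""

    def __init__(self, strategies: list[str]):
        self.strategies = strategies
        self.counts = [0] * len(strategies)   # pulls per arm
        self.means = [0.0] * len(strategies)  # running mean reward per arm

    def select(self) -> int:
        # Pull every arm once before the UCB rule applies.
        for arm, n in enumerate(self.counts):
            if n == 0:
                return arm
        total = sum(self.counts)
        # UCB1 score: mean reward plus an exploration bonus that decays as
        # an arm accumulates pulls, balancing exploration and exploitation.
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / n)
            for mean, n in zip(self.means, self.counts)
        ]
        return scores.index(max(scores))

    def update(self, arm: int, reward: float) -> None:
        # Incremental update of the pulled arm's mean reward.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Usage: three strategy arms; in the framework the reward would come from
# scoring the progress report after each round (value here is hypothetical).
bandit = StrategyBandit(["restart", "backtrack", "continue"])
arm = bandit.select()
bandit.update(arm, reward=0.4)  # e.g., modest progress observed
```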