Reasoning Language Models: A Blueprint
Reasoning language models (RLMs), such as OpenAI’s o1 and o3, DeepSeek-V3, and Alibaba’s QwQ, have redefined AI’s problem-solving capabilities by extending large language models (LLMs) with advanced reasoning mechanisms. Yet, their high costs, proprietary nature, and complex architectures – uniquely combining Reinforcement Learning (RL), search heuristics, and LLMs – present accessibility and scalability challenges. To address these, we propose a comprehensive blueprint that organizes RLM components into a modular framework, based on a survey and analysis of all RLM works. This blueprint incorporates diverse reasoning structures (chains, trees, graphs, and nested forms), reasoning strategies (e.g., Monte Carlo Tree Search, Beam Search), RL concepts (e.g., policy and value models), supervision schemes (Outcome-Based and Process-Based Supervision), and other related concepts (e.g., Test-Time Compute, Retrieval-Augmented Generation, agent tools). We also provide detailed mathematical formulations and algorithmic specifications to simplify RLM implementation. By showing how schemes like LLaMA-Berry, QwQ, Journey Learning, and Graph of Thoughts fit as special cases, we demonstrate the blueprint’s versatility and unifying potential.
Our blueprint comprehensively encompasses the potential building blocks of RLMs, offering a flexible and modular framework. It incorporates a variety of reasoning structures, such as chains, trees, and graphs, as well as higher-order structures such as hierarchical (or nested) trees, along with numerous operations that transform and advance the reasoning process. The blueprint supports different granularities of reasoning steps, ranging from individual tokens to full sentences or structured segments. Additionally, it enables diverse training schemes, including Outcome-Based Supervision (OBS) and Process-Based Supervision (PBS), and the related Outcome & Process Reward Models (ORMs & PRMs). Next, in order to illustrate the capability of the blueprint to accommodate novel design ideas, we describe several novel schemes and how they fit within the blueprint. One such example is Trace-Based Supervision (TBS), which extends PBS by incorporating labeled traces of traversal paths through entire reasoning structures, rather than just linear chains of reasoning steps. By unifying all these components, our blueprint serves as a versatile framework for constructing, analyzing, and comparing RLM designs.
3.1.1 Inference

The inference process begins when the user provides an input prompt (1), which typically describes the problem or question to be addressed by the RLM. This input serves as the root of the reasoning process and initiates the construction of a reasoning structure (2) that organizes the RLM’s progress. The structure is usually represented as a tree. The root of this tree corresponds to the user’s input, and subsequent nodes are generated to explore the search space – the domain of possible reasoning paths or solutions. The purpose of this reasoning structure is to systematically investigate potential solutions, progressively refining and extending reasoning paths to converge on an optimal or satisfactory answer. An individual point in the search space, represented as a node in the reasoning structure, corresponds to a reasoning step (3). A reasoning step is defined as a coherent and self-contained unit of thought – a sequence of tokens that advances the solution by either exploring a new branch of the problem or building upon existing progress. These steps form the building blocks of the reasoning process. The details of how the structure evolves are usually governed by the MCTS scheme, enhanced with policy and value models (we also distinguish other reasoning strategies, described below). This approach, inspired by methods used in AlphaZero, ensures that the search process is both efficient and directed toward promising solutions. The policy model (4) is responsible for generating new reasoning steps at each node, predicting the next most likely and logical steps to expand the reasoning process. Meanwhile, the value model (5) evaluates the quality of a reasoning path starting at a given node, helping the system prioritize the most promising steps to follow. Sometimes, a reward model (6) is used instead, to assess the quality of an individual node and its corresponding reasoning step. In our blueprint, as detailed in the next section, we abstract these models into a more general notion of operators (7) to enable more flexibility in how they are implemented. The search and reasoning processes continue iteratively until a terminal step is reached (8). This terminal step completes the reasoning chain that forms the final answer to the posed problem.
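To make this flow concrete, the following minimal Python sketch illustrates one possible realization of such an MCTS-style inference loop over a tree of reasoning steps. The names ReasoningNode, policy_generate, value_estimate, and is_terminal are hypothetical placeholders rather than components of any particular RLM implementation; the sketch assumes the policy model proposes textual steps and the value model returns a scalar score.

```python
import math
import random
from dataclasses import dataclass, field

@dataclass
class ReasoningNode:
    step: str                                   # reasoning step held by this node
    parent: "ReasoningNode | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def ucb(self, c: float = 1.4) -> float:
        # Upper-confidence bound balancing exploitation and exploration.
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts_infer(prompt, policy_generate, value_estimate, is_terminal, iterations=100):
    """Hypothetical inference loop: policy_generate proposes candidate next steps,
    value_estimate scores the path ending at a node, is_terminal detects final answers."""
    root = ReasoningNode(step=prompt)
    for _ in range(iterations):
        # 1. Selection: descend along the children with the highest UCB score.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.ucb())
        # 2. Expansion: the policy model generates new reasoning steps.
        if not is_terminal(node):
            node.children = [ReasoningNode(step=s, parent=node)
                             for s in policy_generate(node)]
            if node.children:
                node = random.choice(node.children)
        # 3. Evaluation: the value model scores the reasoning path at this node.
        value = value_estimate(node)
        # 4. Backpropagation: propagate the value up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    # Extract the answer: follow the most-visited children down from the root.
    node, answer = root, []
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
        answer.append(node.step)
    return "\n".join(answer)
```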
The data can be obtained from inference, assuming quality control [52], but also from a dedicated synthetic data generation pipeline that mirrors that of the inference. To collect the data, one executes the respective RLM pipeline for a given input task and gathers the results (6); depending on how detailed the gathering process is, the collected data can contain only outcome-based labels (7), process-based labels (8), or some other variant such as the trace-based labels (9) suggested in our blueprint, which generalize process-based samples by also including information about the operators applied during the task-solution process. All this data becomes part of the replay buffer (10) and is used in the unsupervised training phase.
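As an illustration of the data gathered by such a pipeline, the sketch below shows one plausible way to represent outcome-based, process-based, and trace-based samples, together with a minimal replay buffer. The class and field names are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OutcomeSample:
    # Outcome-based label: only the final answer of the chain is judged.
    prompt: str
    reasoning_chain: List[str]
    final_answer_correct: bool

@dataclass
class ProcessSample:
    # Process-based labels: every reasoning step carries its own score.
    prompt: str
    reasoning_steps: List[str]
    step_scores: List[float]              # one score per reasoning step

@dataclass
class TraceSample(ProcessSample):
    # Trace-based labels additionally record which operator produced each step
    # (e.g., "Generate", "Refine", "Backtrack"), generalizing process-based samples.
    operators: List[str] = field(default_factory=list)

class ReplayBuffer:
    """Minimal buffer collecting the heterogeneous samples produced by the pipeline."""
    def __init__(self, capacity: int = 100_000):
        self.capacity = capacity
        self.samples: list = []

    def add(self, sample) -> None:
        if len(self.samples) >= self.capacity:
            self.samples.pop(0)           # evict the oldest sample
        self.samples.append(sample)
```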
Some schemes define the action space as transitions between different reasoning structures rather than individual steps. This approach changes the nature of the search, as the focus shifts from iteratively constructing a single reasoning path to evaluating and refining entire structures within the search space. Our blueprint accommodates this with the concept of nesting, where a node in the reasoning structure can contain another reasoning structure.
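A minimal sketch of the nesting concept, assuming a simple recursive node type (all names are illustrative):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Structure:
    root: "Node"

@dataclass
class Node:
    step: str                                   # reasoning step at this node
    children: List["Node"] = field(default_factory=list)
    # Nesting: a node may itself contain an entire reasoning structure,
    # e.g., a sub-tree built while refining or evaluating this step.
    nested: Optional[Structure] = None
```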
Journey Learning [113] adds an additional layer of complexity by incorporating a transformation step that “rewires” the search or reasoning structure. This transformation consolidates multiple paths in the tree, synthesizing them into a new form that is used as input for subsequent reasoning iterations.
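One way such a consolidating transformation could be realized is sketched below: several reasoning paths are linearized and merged into a single context that seeds the next reasoning iteration. This is only an illustration of the general idea, not the exact procedure used by Journey Learning.

```python
def consolidate_paths(paths, separator="\n---\n"):
    """Merge several reasoning paths (each a list of step strings) into one
    linear context that seeds the next reasoning iteration."""
    linearized = ["\n".join(path) for path in paths]
    return separator.join(linearized)

# Example: two explored paths are rewired into a single new input context.
merged = consolidate_paths([["step A1", "step A2"], ["step B1"]])
```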
The blueprint specifies a toolbox of components that can be used to build an arbitrary RLM. We identify several classes of such components. First, an RLM includes a reasoning scheme, which specifies a reasoning structure (e.g., a tree) together with a reasoning strategy (e.g., MCTS) of how this structure evolves in order to solve a given input task. Second, there is a set of operators (e.g., Refine) that can be applied to the reasoning structure (as specified by the reasoning strategy) in order to evolve it and make progress towards solving the input task. Operators are specified based on what they do (i.e., what effect they have on the reasoning structure). How this effect is achieved depends on how a given operator is implemented. Here, many operators rely on neural models (e.g., Policy Model), which – together with their training paradigms – form the third class of the blueprint components. Finally, we also distinguish a set of pipelines, i.e., detailed specifications of operations that orchestrate the interaction between the reasoning scheme and the operators in order to achieve a specific objective, such as training, inference, or data generation. Hence, an RLM can be defined as a composition of a reasoning scheme, a set of operators and associated models, and a set of pipelines.
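The following sketch expresses this composition as Python interfaces. All names (Operator, ReasoningScheme, RLM) are illustrative abstractions of the blueprint rather than a concrete library API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol

class Operator(Protocol):
    # An operator transforms the reasoning structure, e.g., Generate or Refine.
    def apply(self, structure) -> None: ...

@dataclass
class ReasoningScheme:
    structure: object                                     # e.g., a tree of reasoning steps
    strategy: Callable[[object, List[Operator]], None]    # e.g., an MCTS loop

@dataclass
class RLM:
    scheme: ReasoningScheme
    operators: List[Operator]
    models: Dict[str, object]              # e.g., {"policy": ..., "value": ...}
    pipelines: Dict[str, Callable]         # e.g., {"inference": ..., "training": ...}

    def run(self, pipeline_name: str, *args, **kwargs):
        # A pipeline orchestrates the scheme, the operators, and the models.
        return self.pipelines[pipeline_name](self, *args, **kwargs)
```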
A reasoning step is a fundamental unit of the reasoning structure – a sequence of tokens that advances the RLM towards the solution. Reasoning steps can vary in length, ranging from a single token to entire segments of text. The variability in their granularity depends on the user's design choice. In existing schemes, a reasoning step is typically conceptualized as a “coherent and self-contained unit of thought”. For instance, in mathematical proofs, this may correspond to an individual logical argument or deduction. The flexibility in defining reasoning steps allows models to adapt to different problem domains, balancing fine-grained and coarse-grained reasoning. Coarse steps, such as logical arguments (or even complete reasoning pathways [169]), simplify the preparation and adoption of training data, enhance interpretability, and – as we discuss in Section 8 – reduce computational overhead. On the other hand, single-token steps enable the utilization of concepts like token entropy [96] to incorporate the model’s uncertainty, as well as the integration of advanced decoding schemes (e.g., speculative decoding [77] or contrastive decoding [80]) explicitly into the RLM design. Yet, while making the reasoning steps more fine-grained allows for a more detailed exploration of solution paths, this increased flexibility results in greater computational demands, particularly when combined with search algorithms such as MCTS.
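For token-level reasoning steps, the uncertainty signal mentioned above can be derived directly from the model's next-token distribution. A minimal sketch, assuming a probability vector over the vocabulary is available:

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution; higher values indicate
    greater model uncertainty and could, e.g., trigger wider branching in the search."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Example: a peaked distribution has low entropy, a flat one has high entropy.
print(token_entropy([0.97, 0.01, 0.01, 0.01]))   # ~0.17 nats
print(token_entropy([0.25, 0.25, 0.25, 0.25]))   # ~1.39 nats
```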
Two primary families of reward models for such process-based tasks are Outcome-Based Reward Models (ORMs) and Process-Based Reward Models (PRMs). Figure 12 compares both classes of models.
Outcome-Based Reward Models (ORMs), first introduced by Uesato et al. [143], evaluate the reasoning process solely based on the final outcome. These models estimate the reward of the final step in the chain, often modeled in the literature as the likelihood of a correct final answer given the entire reasoning chain $P(\mathrm{correct}(z_{T+1}) \mid z_0, \ldots, z_{T+1})$ [83], [143], where $s_{T+1} := z_0, \ldots, z_{T+1}$ is the complete reasoning chain consisting of reasoning steps $z_i$, and $T+1$ marks the last reasoning step. ORMs are particularly ill-suited for evaluating intermediate steps for several reasons. First, the training data and objective are inherently misaligned with step-wise evaluation, as they focus exclusively on final outcomes. Second, ORM evaluations tend to be overly pessimistic for intermediate steps since a subsequent erroneous step can obscure the correctness of earlier steps. This observation aligns with Havrilla et al. [55], who noted that ORMs often underestimate the solvability of a problem from an intermediate state and are prone to a high false-negative rate. Furthermore, ORMs lack robustness against false positives, potentially favoring erroneous reasoning steps and misleading the evaluation process.
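A minimal sketch of how an ORM could be applied, assuming a hypothetical binary scorer orm_model that maps a full chain to an estimate of the probability of a correct final answer:

```python
def score_chain_with_orm(orm_model, reasoning_chain):
    """ORM scoring: the whole chain z_0, ..., z_{T+1} is judged at once,
    yielding one estimate of P(correct(z_{T+1}) | z_0, ..., z_{T+1})."""
    full_text = "\n".join(reasoning_chain)     # s_{T+1} := z_0, ..., z_{T+1}
    return orm_model(full_text)                # a single scalar in [0, 1]
```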
Process-Based Reward Models (PRMs), introduced by Lightman et al. [83] and Uesato et al. [143], evaluate reasoning on a step-by-step basis. These models estimate the reward of a step, which can be seen as the likelihood of correctness for the $t$-th step given its preceding context $P(\mathrm{correct}(z_t) \mid z_0, \ldots, z_t)$, where $s_t := z_0, \ldots, z_t$ is a potentially incomplete reasoning chain, $z_i$ are reasoning steps, and $z_0$ is the query. PRMs provide more fine-grained feedback and can pinpoint errors in the chain. This stepwise evaluation provides dense rewards given partial responses and helps identify where reasoning deviates from correctness, offering improved interpretability and enabling more targeted improvements in reasoning processes. However, PRMs are computationally expensive to train and require extensive annotations of reasoning steps. These annotations, whether provided by humans or other LLMs, often suffer from limitations: human annotations are scarce, costly, and prone to bias, while prompted LLM-generated annotations [146] are typically of lower quality due to their limited self-evaluation capabilities [94].
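By contrast, a PRM can be queried once per step, yielding a dense sequence of rewards. A minimal sketch under analogous assumptions (prm_model is a hypothetical scorer of a chain prefix):

```python
def score_steps_with_prm(prm_model, reasoning_steps):
    """PRM scoring: every prefix z_0, ..., z_t is judged separately,
    producing one reward per reasoning step."""
    rewards = []
    for t in range(1, len(reasoning_steps)):   # index 0 holds the query z_0
        prefix = "\n".join(reasoning_steps[: t + 1])
        rewards.append(prm_model(prefix))      # P(correct(z_t) | z_0, ..., z_t)
    return rewards
```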
A reward model predicts immediate rewards. In RL, this corresponds to the reward obtained for a transition $(s, a, s')$ from state $s$ when taking action $a$, which results in state $s'$. For reasoning, this corresponds to adding a new reasoning step $a$ to the structure. The new structure is then represented by $s'$. Specifically, PRMs – which are preferred over ORMs for MCTS due to the need for action-based evaluation – learn these rewards and can be used to evaluate states (or the transition into a state). This formulation provides a localized, step-level evaluation independent of the overall outcome of the reasoning chain. The reward model is typically trained using labeled data where individual reasoning steps are associated with reward values. While this localized view is advantageous for step-by-step evaluation, it lacks the ability to consider how the current step contributes to the long-term success of the reasoning process. This limitation motivates the introduction of value models.
V-Value Model (V-VM). One variant of a value model is the v-value model, which predicts the expected cumulative future reward of a state, denoted as $V(s)$. This is equivalent to the state value function in reinforcement learning, which evaluates the long-term potential of the current state $s$. A key advantage of V-VMs is their global perspective, as they aggregate future rewards across all possible trajectories originating from the current state. However, V-VMs do not explicitly evaluate individual actions, which may limit their utility in step-level decision-making. Additionally, v-values are often ill-defined at terminal states, where rewards may substitute for state values during training.

Q-Value Model (Q-VM). Another variant of a value model is the q-value model. Q-VMs predict the expected cumulative future reward of taking a specific action $a$ in a given state $s$, denoted as $Q(s, a)$. Unlike V-VMs, Q-VMs explicitly associate values with state-action pairs, offering a more granular evaluation. This granularity makes Q-VMs particularly useful for MCTS, where decisions about which edge (action) to expand at a given node (state) are critical. By directly evaluating actions, Q-VMs align naturally with the selection mechanisms in MCTS, guiding the search toward promising paths. Similar to V-VMs, Q-VMs can also be categorized as PQVMs (Process-based Q-Value Models), OQVMs (Outcome-based Q-Value Models), and O-PQVMs (Outcome-driven Process-based Q-Value Models). The choice between V-VMs and Q-VMs depends on the reasoning task and the specific requirements of the evaluation framework. While V-VMs provide a broader, state-centric evaluation, Q-VMs enable more precise, action-specific guidance. In practice, MCTS often benefits from the use of Q-VMs due to their compatibility with edge-based selection.
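The reward model of the previous paragraph and the two value-model variants are linked by the standard reinforcement-learning identities (with discount factor $\gamma$), which the step-level formulations above instantiate for reasoning structures:

$$
V(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[\, Q(s, a) \,\big],
\qquad
Q(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[\, r(s, a, s') + \gamma \, V(s') \,\big]
$$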