Eliciting Reasoning in Language Models with Cognitive Tools

Paper · arXiv 2506.12115 · Published June 13, 2025
Tags: Reasoning Architectures · Argumentation · Philosophy · Subjectivity · Linguistics, NLP, NLU · LLM Architecture

The recent advent of reasoning models like OpenAI’s o1 was met with excited speculation by the AI community about the mechanisms underlying these capabilities in closed models, followed by a rush of replication efforts, particularly from the open-source community. These speculations were largely settled by the demonstration from DeepSeek-R1 that chains of thought and reinforcement learning (RL) can effectively replicate reasoning on top of base LLMs. However, it remains valuable to explore alternative methods for eliciting reasoning that could help elucidate the underlying mechanisms, as well as offer complementary benefits. Here, we build on the long-standing literature in cognitive psychology and cognitive architectures, which postulates that reasoning arises from the orchestrated, sequential execution of a set of modular, predetermined cognitive operations. Crucially, we implement this key idea within a modern agentic tool-calling framework. In particular, we endow an LLM with a small set of “cognitive tools” encapsulating specific reasoning operations, each executed by the LLM itself. Surprisingly, this simple strategy results in considerable gains in performance on standard mathematical reasoning benchmarks compared to base LLMs, for both closed and open-weight models. For instance, providing our “cognitive tools” to GPT-4.1 increases its pass@1 performance on AIME2024 from 26.7% to 43.3%, bringing it very close to the performance of o1-preview. In addition to its practical implications, this demonstration contributes to the debate regarding the role of post-training methods in eliciting reasoning in LLMs versus the role of inherent capabilities acquired during pre-training, and whether post-training merely uncovers these latent abilities.

While the release notes associated with these reasoning models confirmed that reinforcement learning (RL) played a crucial role in enhancing reasoning capabilities, the specific mechanisms remained opaque, fueling intense speculation within the open research community. Proposed hypotheses ranged from pipelines leveraging curated fine-grained reward labels [Uesato et al., 2022, Lightman et al., 2023, Ma et al., 2023, Wang et al., 2024], to self-correction and algorithmic approaches inspired by Monte Carlo Tree Search [Hosseini et al., 2024, Xie et al., 2024, Liang et al., 2024]. This debate was partially resolved when subsequent work by DeepSeek demonstrated that relatively simple post-training recipes combining “cold-start” supervised fine-tuning on curated reasoning traces with RL optimization on verifiable rewards [Lambert et al., 2025] could produce high-quality reasoning on par with the best closed models [Guo et al., 2025].

Recently, a critical reanalysis of the role of RL in eliciting reasoning in LLMs has added an intriguing new chapter to this story, advancing the view that the inherent capabilities of base models might be as important as RL (if not more so) in enabling reasoning. In particular, Liu et al. [2025] observed that the base models on which open reasoning LLMs are often built, such as Qwen2.5-Base and DeepSeek-V3-Base, already spontaneously demonstrate strong reasoning capabilities and exhibit the “Aha moment” self-reflection patterns that have been taken as indicative of emerging reasoning behavior. Yue et al. [2025] went a step further by showing that the reasoning traces generated by RL-fine-tuned models are already present in the base models’ generated responses when sampled sufficiently often. This observation prompted them to propose that the role of RL is to bias generation toward samples with high reward, thereby harnessing the strong reasoning capabilities already inherent in the base model, rather than infusing new ones.

Given these results and their implication that RL is not strictly necessary for reasoning but merely helps “uncover” reasoning already present in strong base models, it is natural to ask what other strategies might elicit reasoning. Exploring alternative methods could help elucidate, at a theoretical level, the mechanisms underlying reasoning in LLMs, and could offer complementary approaches with additional practical benefits.

Recent work by Kramer and Baumann [2024] pointed out that cognitive psychology and the cognitive sciences in general are the natural disciplines on which to base investigations of the mechanisms underlying reasoning. In particular, those authors took inspiration from the foundational cognitive architectures framework of Anderson et al. [1997], which posits that human reasoning arises from the structured execution of stereotyped cognitive operations orchestrated into sequences suited to problem solving. Kramer and Baumann [2024] proposed a prompt-engineering implementation of these ideas that they called “cognitive prompting”, consisting essentially of structured prompts that enable LLMs to break problems into stages like goal clarification, decomposition, and integration. Cognitive prompting was shown to significantly enhance the arithmetic and commonsense reasoning capabilities of LLMs.

We build upon this work by going one step further in realizing the cognitive-architecture idea that reasoning comes about as the orchestrated execution of modular cognitive operations that can be flexibly structured depending on the context at hand. We argue that the cognitive prompting approach is missing the important element of modularity, i.e., an implementation of cognitive operations that are encapsulated as discrete tools rather than a predetermined monolithic prompt. Modularity has long been proposed as a principle for reducing interference between operations in neural networks (e.g., Soldal [2012]), and it has been associated with compositional generalization in neuroscience studies [Ito et al., 2022]. Taking inspiration from modern agentic AI, we instantiate modular, compartmentalized cognitive operations in LLMs within a tool-calling architecture where each cognitive operation is implemented as a dedicated, self-contained function. But whereas tools in agentic tool-calling frameworks are external functions or APIs (e.g., calculators, search engines) with predefined schemas that LLMs invoke to execute tasks outside their parametric knowledge, our “cognitive tools” encapsulate reasoning operations within the LLM itself. Each cognitive tool’s schema includes a prompt template that isolates a specific cognitive operation. When invoked, the LLM executes this prompt in a sandboxed context, generating a structured intermediate result that is fed back into the main reasoning loop. Unlike general tools, which interface with external systems, cognitive tools modularize the LLM’s internal reasoning processes.
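
To make this distinction concrete, the sketch below contrasts a conventional external tool with a cognitive tool. The schema fields, names, and template wording are illustrative assumptions, not the paper’s exact implementation.

```python
# Contrast: an ordinary external tool vs. a cognitive tool.
# All names and fields here are illustrative assumptions.

# External tool: real code executed outside the model.
def calculator(expression: str) -> str:
    """Evaluate an arithmetic expression the LLM cannot reliably compute."""
    return str(eval(expression))  # illustration only; never eval untrusted input

# Cognitive tool: no external logic. Its "executable" is a prompt template
# that the *same* LLM runs in an isolated context; the structured output is
# fed back into the main reasoning loop.
examine_answer = {
    "name": "examine_answer",
    "description": "Check the current reasoning trace for flaws.",
    "prompt_template": (
        "Examine the reasoning trace below for flaws, wrong assumptions,\n"
        "miscalculations, or overlooked constraints:\n\n{reasoning_trace}"
    ),
}
```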

3 Methodology: Cognitive Tools

We propose using cognitive tools to elicit the reasoning capabilities of LLMs. We identify four cognitive tools: understand question, recall related, examine answer, and backtracking. For a given question, the LLM is encouraged to use these tools as needed, guiding its reasoning so that it solves the problem correctly. The execution pipeline is similar to any tool-calling pipeline: the LLM is prompted to generate a reasoning trace in response to a query until it issues a call to one of the provided tools t. Once such a call is detected, we stop the generation and execute the module that encapsulates the tool t. In our case, each tool is itself a call to an LLM (the same one doing the reasoning) with a tool-specific role. The output of this execution is provided back to the LLM that issued the tool call, which then continues to reason about the problem until the end-of-sequence token. This procedure is related to the token forcing and budget forcing with test-time scaling introduced in the s1 paper [Muennighoff et al., 2025], from which our work also takes inspiration, but it provides additional flexibility: we let the LLM “decide” which tool to call and when to call it, as if the LLM were left to autonomously and flexibly implement budget forcing when deemed appropriate.
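
To make the pipeline concrete, here is a minimal sketch in Python. The `llm_generate` callable, the `CALL[tool_name]` marker, and the abbreviated prompt templates are all assumptions for illustration; the actual implementation may rely on the model’s native tool-calling format.

```python
import re
from typing import Callable

# Abbreviated, hypothetical templates; one per cognitive tool.
TOOL_PROMPTS: dict[str, str] = {
    "understand_question": "Break the problem below into its key components:\n\n{context}",
    "recall_related": "Recall similar solved problems relevant to:\n\n{context}",
    "examine_answer": "Check the reasoning trace below for flaws:\n\n{context}",
    "backtracking": "Find the first incorrect step below and propose alternatives:\n\n{context}",
}

TOOL_CALL = re.compile(r"CALL\[(\w+)\]")  # assumed tool-call marker

def solve(question: str, llm_generate: Callable[[str], str], max_steps: int = 8) -> str:
    """Alternate free generation with sandboxed cognitive-tool executions."""
    trace = question
    for _ in range(max_steps):
        chunk = llm_generate(trace)            # generate until EOS or a tool call
        match = TOOL_CALL.search(chunk)
        if match is None or match.group(1) not in TOOL_PROMPTS:
            return trace + chunk               # no tool requested: done reasoning
        trace += chunk[: match.end()]          # keep text up to the tool call
        # Execute the tool: the *same* LLM runs a tool-specific prompt in an
        # isolated context, and its output is fed back into the main trace.
        tool_output = llm_generate(TOOL_PROMPTS[match.group(1)].format(context=trace))
        trace += f"\n[{match.group(1)} output]\n{tool_output}\n"
    return trace
```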

Understand Question The cognitive architectures literature [Anderson et al., 1997] emphasizes the importance of goal management in replicating human reasoning, which operates by breaking down the problem at hand to identify its key components. We implement this process in what we call the “understand question” tool, whose role is to prompt the LLM to perform exactly this decomposition: identifying the main concepts at play, extracting the relevant information in the question, and highlighting meaningful properties, theorems, and techniques that might be helpful in solving the problem.
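
A plausible prompt template for this tool, reconstructed from the description above (an assumption, not the paper’s exact prompt):

```python
# Hypothetical "understand question" template, based on the description above.
UNDERSTAND_QUESTION_PROMPT = """\
You are a mathematical-reasoning assistant. Do NOT solve the problem yet.
Analyze the following question:

{question}

1. Identify the main concepts involved.
2. Extract the relevant information given in the question.
3. Highlight properties, theorems, or techniques that may be helpful.
Return a structured breakdown of the problem.
"""
```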

Recall Related This tool is inspired by the work of Yasunaga et al. [2024], which introduced a prompting technique that asks a model to recall prior knowledge to guide its own reasoning. In our case, for a given question, the tool provides relevant knowledge in the form of similar questions the model knows how to answer, together with the corresponding answers. The objective is to guide the LLM, through these examples, toward a path it can follow to solve the question at hand.
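
A plausible template in the spirit of analogical prompting [Yasunaga et al., 2024] (again an assumption, not the paper’s exact prompt):

```python
# Hypothetical "recall related" template, based on the description above.
RECALL_RELATED_PROMPT = """\
Recall two or three problems that are similar to the one below and that you
know how to solve. For each, state the problem, the key technique used, and
its full solution. Do NOT solve the target problem itself.

Target problem:
{question}
"""
```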

Examine Answer The role of this cognitive tool is to examine the LLM’s current reasoning trace as it works toward an answer. In other words, it implements a form of “self-reflection”, an operation that has been demonstrated to be effective for reasoning [Shinn et al., 2023]. In practice, this cognitive tool checks the current reasoning trace for possible flaws, wrong assumptions, miscalculations, or constraints that have not been taken into account. It thus helps the LLM reconsider its reasoning, which in turn can help correct what was previously missed.

Backtracking When faced with an incorrect solution to a problem, or upon realizing that one’s train of thought is flawed, the natural next action is to backtrack to a previously correct step and explore alternatives, an idea related to Monte Carlo Tree Search (see e.g. Liang et al. [2024]). This is the idea behind this tool: to enable exploration of more promising reasoning paths. When the LLM decides to use this cognitive tool, the tool prompts the LLM to consider the current reasoning trace, summarizing it and breaking it down into steps. The LLM then proceeds to evaluate which step in the reasoning process is incorrect and follows up by providing alternative approaches or directions for solving the problem.
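
As with the other tools, backtracking is implemented purely as a prompt executed by the same LLM. A plausible template, reconstructed from the description above (an assumption, not the paper’s exact prompt):

```python
# Hypothetical "backtracking" template, based on the description above.
BACKTRACKING_PROMPT = """\
Below is a reasoning trace that may contain an error.

{reasoning_trace}

1. Summarize the trace and break it down into numbered steps.
2. Identify the first step that is incorrect or unjustified.
3. Starting from the last correct step, propose one or more alternative
   approaches or directions for solving the problem.
"""
```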