Agentic Reasoning: Reasoning LLMs with Tools for the Deep Research

Paper · arXiv 2502.04644 · Published February 7, 2025

Agentic Reasoning, a framework1 that enhances large language model (LLM) reasoning by integrating external tool-using agents. Unlike conventional LLM-based reasoning approaches, which rely solely on internal inference, Agentic Reasoning dynamically engages web search, code execution, and structured reasoning-context memory to solve complex problems requiring deep research and multistep logical deduction. Our framework introduces the Mind Map agent, which constructs a structured knowledge graph to track logical relationships, improving deductive reasoning. Additionally, the integration of web-search and coding agents enables real-time retrieval and computational analysis, enhancing reasoning accuracy and decision-making. Evaluations on PhD-level scientific reasoning (GPQA) and domain-specific deep research tasks demonstrate that our approach significantly outperforms existing models, including leading retrieval-augmented generation (RAG) systems and closed-source LLMs.

DeepSeek-R1, for example, relies exclusively on rule-based outcome rewards during training, such as evaluating whether a mathematical solution is correct or a piece of code executes successfully. While this approach has yielded remarkable reasoning capabilities, equaling o1’s performance in domains like math and code, it comes with notable trade-offs. As even the authors acknowledge, this type of training diminishes the model’s ability to articulate its reasoning process. DeepSeek-R1’s responses are often logical and accurate but lack detailed explanations of transitions between ideas or the finer connections between arguments.

Although current reasoning methods excel in structured domains like math and code—where outcomes are easily verifiable—applying these techniques to less structured or subjective tasks remains a significant challenge. Adapting these strategies to areas where answers are not inherently definitive is a key research gap. How can models be trained to handle tasks that require judgment, interpretation, or nuanced understanding rather than binary correctness?

Furthermore, not all problems benefit from formal reasoning approaches. Many fields, such as social sciences, ethics, or experiential disciplines, rely on abstract concepts, conventional wisdom, factual verification, understanding complex logical relationships, or moral reasoning. When models attempt to impose math- or coding-style reasoning onto such areas, they often produce flawed or overly rigid results. Developing approaches that account for these unique requirements is essential for advancing the applicability of reasoning model beyond their current domains.

Deep, thoughtful answers to open-ended questions often require extensive research, repeated verification, information retrieval, computational analysis, and the organization of complex logical relationships— steps fundamental to human reasoning. In this process, humans rely heavily on external tools, such as internet searches for gathering information, computational tools for quantitative analysis, or whiteboards and Mind Maps for organizing thoughts. This raises an intriguing question: can large language models similarly leverage external tools to enhance their reasoning and tackle intensive knowledge work across diverse domains?

This approach enables LLMs to perform multi-step reasoning and tackle complex problems more effectively by delegating specific tasks to these auxiliary agents. Through extensive experimentation with integrating various agents into the reasoning process, we identified three essential agents that prove highly effective for general reasoning across diverse problems.

The code agent, capable of performing computational analyses and coding tasks to support quantitative reasoning. Finally, the memory agent, which we call Mind Map, constructs knowledge graphs based on the reasoning context, enabling the organization of complex logical relationships in a manner similar to human mind mapping. Together, these agents enhance the model’s ability to tackle complex problems with greater efficiency and precision.

two key areas: (1) solving expert-level questions and (2) conducting deep research on real-world expert-level tasks.

For expert-level questions, we tested the model on the GPQA dataset, a PhD-level science multiplechoice QA benchmark with questions authored by domain experts in physics, chemistry, and biology. Our Agentic Reasoning framework achieved impressive accuracy rates: 58% in chemistry, 88% in physics, and 79% in biology, closely rivals the best and newest closed reasoning model, OpenAI o1. For real-world expert-level tasks, Agentic Reasoning was evaluated by domain experts, who noted that it effectively automated several hours of challenging, manual investigation. This highlights its potential to streamline labor-intensive processes and enhance productivity in knowledge-intensive domains.

Our core idea is to enhance the model reasoning by deploying external LLM-based agents during reasoning. The framework enables the reasoning LLM model interacts with external information in an agentic way. During its reasoning process, it could call the external tools to help solve the problem and also with a structured memory, called Mind Map, to store its reasoning context. At its core, an agentic mechanism empowers the model to determine, in real-time, when additional information is required. whenever the model identify the external information is needed during its reasoning, it will proactively embeds specialized tokens into its reasoning tokens. These tokens can be generally categorized to web-search token, coding token, and mind-map calling token. Together with token, the reasoning model would also generate a precise query as a message to interact with these external agents, based on the reasoning context developed so far.

Upon detecting such a token, the reasoning process temporarily halts to extract the query and its reasoning context. Those are then dispatched to external agents, such as search engines or Mind Map, to generate pertinent content. The generation would consider both the message received and the reasoning context to make sure returning the most relevant results. These results are then reintegrated into the reasoning chain, allowing the model to continue its inference with an updated and enriched knowledge.

This iterative retrieval-and-reasoning cycle continues as needed, enabling the model to dynamically refine its conclusions until it reaches a fully reasoned final answer.

2.3 Mind Map Agent We construct a Mind Map to store and structure the real-time reasoning context of the reasoning model. This Mind Map is built by transforming raw reasoning chains into a structured knowledge graph. Specifically, we use a graph-construction LLM to extract entities from the reasoning chain and identify semantic relationships between related entities, following a process similar to that used in GraphRAG (Edge et al., 2024).

The Mind Map serves two primary functions. First, it clusters reasoning context into distinct groups and summarizes each theme. This is achieved by applying community clustering (Edge et al., 2024) on the knowledge graph and using an LLM to generate concise summaries for each group. Second, the knowledge graph can be queried with specific questions, such as “Who was Jason’s maternal great-grandfather?” Using standard retrievalaugmented generation (RAG) on the knowledge graph (Edge et al., 2024), we retrieve and return relevant information.