DeepAgent: A General Reasoning Agent with Scalable Toolsets

Paper · arXiv 2510.21618
Deep Research Agents · LLM Agents · Chain-of-Thought and Reasoning Methods · Tool Use and Computer-Use Agents

Large reasoning models have demonstrated strong problem-solving abilities, yet real-world tasks often require external tools and long-horizon interactions. Existing agent frameworks typically follow predefined workflows, which limit autonomous and global task completion. We introduce DeepAgent, an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution within a single, coherent reasoning process. To manage long-horizon interactions, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories, reducing error accumulation while preserving critical information. To teach general-purpose tool use efficiently and stably, we develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens.

Most existing agents follow predefined workflows (e.g., ReAct and Plan-and-Solve) built on iterative "Reason-Act-Observe" loops. Although effective on simpler tasks, these approaches have several critical limitations: (1) lack of autonomy in execution steps and overall procedure; (2) inability to discover tools dynamically during task execution; (3) no fully autonomous management of interaction memory; and (4) insufficient depth and coherence in reasoning about the entire task. DeepAgent instead thinks, searches for tools, and executes actions autonomously. This paradigm moves away from workflows that rely on fixed task planning and iterative tool use: DeepAgent maintains a global perspective on the entire task rather than deliberating on specific, isolated operations. Tools are not retrieved in advance but discovered dynamically as needed, unlocking the autonomous potential of the large reasoning model.
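
As a rough illustration of on-demand tool discovery, the sketch below lets an agent query a toolset at the moment it needs a capability instead of receiving a fixed tool list up front. The `Tool` class, `search_tools` function, and the three example tools are hypothetical, and word overlap stands in for whatever retriever an actual system would use; none of this is taken from DeepAgent's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # Hypothetical tool record: a name plus a natural-language description.
    name: str
    description: str


def search_tools(query: str, toolset: list[Tool], top_k: int = 2) -> list[Tool]:
    """Rank tools by word overlap between the query and each description.

    Assumption: word overlap is a crude, dependency-free stand-in for a
    real retriever (for example, dense embedding search).
    """
    query_words = set(query.lower().split())
    scored = []
    for tool in toolset:
        overlap = len(query_words & set(tool.description.lower().split()))
        if overlap > 0:
            scored.append((overlap, tool))
    # Highest overlap first; ties are broken by name so results are stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [tool for _, tool in scored[:top_k]]


TOOLSET = [
    Tool("get_weather", "return the current weather forecast for a city"),
    Tool("search_flights", "find flights between two airports on a date"),
    Tool("convert_currency", "convert an amount between two currencies"),
]

# The agent issues a search mid-reasoning, only when it needs a capability.
found = search_tools("weather forecast for Paris", TOOLSET)
print([tool.name for tool in found])
```

Because the search runs inside the reasoning process, the toolset can grow without the full tool list ever having to fit in the prompt.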

To facilitate robust exploration in long-horizon environments, we equip DeepAgent with Autonomous Memory Folding. This strategy allows the agent to dynamically consolidate its reasoning process and interaction history into a structured memory schema. Beyond reducing token overhead, this mechanism enables the agent to "take a breath"—pausing to reconsider strategies and avoid erroneous paths. To minimize information loss during consolidation, we introduce a brain-inspired memory architecture comprising episodic, working, and tool memory, all structured with an agent-usable data schema to ensure the stability and utility of the folded memory.
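
One way to picture the folded memory is as a small structured record that replaces the raw interaction history. The sketch below is a minimal illustration under stated assumptions: the three memory types come from the text above, but every field name, the `fold` function, and the `[FOLDED MEMORY]` marker are hypothetical, and how the summaries themselves are produced is outside the sketch.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EpisodicMemory:
    # Assumed field: high-level log of key events and decisions so far.
    key_events: list[str] = field(default_factory=list)


@dataclass
class WorkingMemory:
    # Assumed fields: the current sub-goal and the near-term plan.
    current_subgoal: str = ""
    next_steps: list[str] = field(default_factory=list)


@dataclass
class ToolMemory:
    # Assumed field: per-tool usage notes, keyed by tool name.
    usage_notes: dict[str, str] = field(default_factory=dict)


@dataclass
class FoldedMemory:
    episodic: EpisodicMemory
    working: WorkingMemory
    tool: ToolMemory


def fold(history: list[str], memory: FoldedMemory) -> list[str]:
    """Replace the raw history with a single serialized memory entry.

    The summaries inside `memory` are supplied by the caller; this function
    only performs the replacement step.
    """
    folded = json.dumps(asdict(memory), sort_keys=True)
    return [f"[FOLDED MEMORY] {folded}"]


history = ["thought 1", "tool call 1", "observation 1", "thought 2"]
memory = FoldedMemory(
    episodic=EpisodicMemory(key_events=["flight search returned no results"]),
    working=WorkingMemory(current_subgoal="retry with a different date",
                          next_steps=["call search_flights again"]),
    tool=ToolMemory(usage_notes={"search_flights": "expects ISO dates"}),
)
context = fold(history, memory)
print(context[0])
```

Serializing to a fixed schema rather than free text is what makes the folded memory machine-readable for the agent on later steps.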

To help DeepAgent master these mechanisms, we propose ToolPO, an end-to-end reinforcement learning (RL) training method tailored for general tool use. Existing agentic RL training in general domains faces two significant challenges: (1) Relying on a multitude of real-world APIs during training can lead to instability, slow execution, and high costs. To avoid this, we use LLM-simulated APIs, which make training more stable and efficient. (2) A sparse reward based solely on the final outcome is often insufficient to guarantee the accuracy of intermediate tool calls. We address this with tool-call advantage attribution, which assigns credit to the specific tokens responsible for correct tool invocations and thereby provides a more fine-grained learning signal.
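
The idea of attributing credit to tool-call tokens can be sketched as follows. This is an illustration under assumptions, not ToolPO's actual objective: the text above does not give the formula, so the group normalization of rewards, the additive combination, the `tool_weight` parameter, and all function names here are hypothetical.

```python
from statistics import mean, pstdev


def group_normalize(rewards: list[float]) -> list[float]:
    """Center and scale rewards within one group of sampled trajectories.

    Assumption: a group-relative baseline; the source does not specify one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


def token_advantages(outcome_adv: float, tool_adv: float,
                     tool_call_mask: list[int],
                     tool_weight: float = 1.0) -> list[float]:
    """Per-token advantage for one trajectory.

    Every token receives the outcome advantage; only tokens inside a tool
    call (mask == 1) additionally receive the tool-call advantage.
    """
    return [outcome_adv + tool_weight * tool_adv * m for m in tool_call_mask]


# Four sampled trajectories for the same task (toy numbers).
outcome_rewards = [1.0, 1.0, 0.0, 0.0]   # did the final answer succeed?
tool_rewards = [1.0, 0.0, 1.0, 0.0]      # were the tool calls correct?
outcome_adv = group_normalize(outcome_rewards)
tool_adv = group_normalize(tool_rewards)

# Trajectory 1 succeeded overall but made an incorrect tool call.
mask = [0, 1, 1, 0]  # tokens 1 and 2 belong to the tool invocation
adv_traj_1 = token_advantages(outcome_adv[1], tool_adv[1], mask)
print(adv_traj_1)  # -> [1.0, 0.0, 0.0, 1.0]
```

In this toy case the outcome reward alone would reinforce every token equally, whereas the masked term cancels the credit on the tokens of the incorrect tool call.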

Since manual collection of expert trajectories is labor-intensive, costly, and difficult to scale, a key challenge lies in automatically constructing high-quality SFT datasets. Automatic construction has been widely explored in prior work [367, 483, 336, 48]. Below, we categorize representative work into two main paradigms: (i) strong-to-weak distillation, which distills correct task-solving trajectories from powerful LLMs (e.g., GPT-5 and DeepSeek-V3.1) into smaller, weaker models; and (ii) iterative self-evolution, which repeatedly fine-tunes the model on data it generates itself, leading to progressive improvement.
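
The second paradigm, iterative self-evolution, can be pictured as a generate, filter, fine-tune loop. The sketch below is a schematic under assumptions: `self_evolve` and its callable parameters are hypothetical names, and the toy stubs at the bottom exist only to exercise the loop; they say nothing about how real models improve.

```python
from typing import Any, Callable


def self_evolve(model: Any,
                tasks: list[Any],
                generate: Callable[[Any, Any], Any],
                is_correct: Callable[[Any, Any], bool],
                fine_tune: Callable[[Any, list], Any],
                rounds: int = 2) -> Any:
    """Repeatedly fine-tune a model on its own correct trajectories."""
    for _ in range(rounds):
        dataset = []
        for task in tasks:
            trajectory = generate(model, task)
            # Keep only trajectories that solve the task.
            if is_correct(task, trajectory):
                dataset.append((task, trajectory))
        model = fine_tune(model, dataset)
    return model


# Toy stand-ins: the "model" is an integer skill level, and a task of
# difficulty d is solved when skill >= d. Purely illustrative.
def toy_generate(skill: int, difficulty: int) -> str:
    return "solved" if skill >= difficulty else "failed"


def toy_is_correct(difficulty: int, trajectory: str) -> bool:
    return trajectory == "solved"


def toy_fine_tune(skill: int, dataset: list) -> int:
    return skill + len(dataset)


final_skill = self_evolve(1, [1, 2, 3, 4], toy_generate, toy_is_correct,
                          toy_fine_tune, rounds=2)
print(final_skill)  # -> 4
```

Strong-to-weak distillation has the same shape, except that `generate` is backed by a separate, stronger model and the loop usually runs once.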