Atom-Searcher: Enhancing Agentic Deep Research via Fine-Grained Atomic Thought Reward
Large language models (LLMs) exhibit remarkable problem-solving abilities, but struggle with complex tasks due to static internal knowledge. Retrieval-Augmented Generation (RAG) enhances access to external information, yet remains limited in multi-hop reasoning and strategic search due to rigid workflows. Recent advancements in agentic deep research empower LLMs to autonomously reason, search, and synthesize information. However, current approaches relying on outcome-based reinforcement learning (RL) face critical issues such as conflicting gradients and reward sparsity, limiting performance gains and training efficiency. To address these issues, we first propose Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units. These units are supervised by Reasoning Reward Models (RRMs), which provide Atomic Thought Rewards (ATR) for fine-grained guidance. Building on this, we propose Atom-Searcher, a novel RL framework for agentic deep research that integrates Atomic Thought and ATR. Atom-Searcher uses a curriculum-inspired reward schedule that prioritizes process-level ATR early in training and gradually transitions to outcome rewards, accelerating convergence on effective reasoning paths.
A new search paradigm, termed the Agentic Deep Research system, has been proposed, enabling autonomous reasoning, on-demand searching, and iterative information synthesis. Demonstrations from recent deep research systems by OpenAI (2025) and Google (2024) reveal several key advantages of this paradigm: 1) Comprehensive Understanding: Effectively handles complex, multi-step queries that challenge traditional methods Wei et al. (2022); 2) Enhanced Synthesis: Integrates diverse and even conflicting sources into coherent, informative outputs Cheng et al. (2025); 3) Reduced User Effort: Automates tedious search processes, easing users' cognitive and manual burden Sami et al. (2024).
Early implementations of agentic deep research relied on prompt engineering Kim et al. (2024) and supervised fine-tuning (SFT) Zhang et al. (2024). Yet prompt-based methods rely heavily on LLMs' instruction-following and long-context capabilities, whereas SFT tends to generalize poorly across domains Chu et al. (2025). More recently, post-training LLMs via reinforcement learning with outcome-based rewards (outcome-based RL) has yielded notable gains in reasoning performance Guo et al. (2025); OpenAI (2024). Building on this insight, recent advances Dai et al. (2025); Yang et al. (2025b,c) (e.g., Search-R1 Jin et al. (2025) and DeepResearcher Zheng et al. (2025)) treat the search tool as part of the environment and apply outcome-based RL to optimize the entire workflow end to end, yielding more performant and generalizable agentic deep research systems. Although outcome-based RL has shown promise, it remains insufficient for fully advancing agentic deep research, for the following reasons: 1) Gradient Conflicts: In the outcome-based RL paradigm, an incorrect final answer causes the entire trajectory to be penalized Lightman et al. (2023), even when intermediate reasoning steps or research strategies are effective. This coarse-grained reward design introduces potential gradient conflicts between intermediate reasoning steps and final answers, hindering the model from discovering better reasoning capabilities and research strategies and thereby limiting its generalization ability. 2) Reward Sparsity: Outcome-based RL relies solely on the final answer to generate rewards Du et al. (2024), so each training sample provides only sparse feedback. This severely limits the efficiency of policy optimization, increasing reliance on larger training datasets and prolonged training schedules.
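To make these two failure modes concrete, the following is a minimal, purely illustrative sketch (hypothetical helper names; not code from Search-R1 or DeepResearcher) of how outcome-based RL broadcasts a single trajectory-level reward across every step of a rollout:

```python
from typing import List

def outcome_based_step_rewards(steps: List[str],
                               final_answer: str,
                               gold_answer: str) -> List[float]:
    """Assign every step of a rollout the same trajectory-level reward.

    A wrong final answer zeroes out the whole trajectory, penalizing
    even effective intermediate reasoning or search steps (gradient
    conflicts), and each sample carries only a single scalar of
    feedback (reward sparsity).
    """
    outcome = 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
    return [outcome] * len(steps)
```

The uniform broadcast is exactly what the fine-grained process rewards introduced below are designed to replace.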
To address these challenges, we begin by introducing Atomic Thought, a novel LLM thinking paradigm that decomposes reasoning into fine-grained functional units, called Atomic Thoughts, guiding LLMs toward clearer and more in-depth reasoning, as illustrated in Figure 2. For example, reasoning operations such as <Reflection> and <Verification> serve as Atomic Thoughts, and their interactions constitute the functional backbone of the reasoning process. To promote generalization, we avoid manually decomposing Atomic Thoughts and instead encourage the model to induce them autonomously from its reasoning processes. Building on this definition, we employ a Reasoning Reward Model (RRM) to score the generated Atomic Thoughts and construct a fine-grained Atomic Thought Reward (ATR). The ATR serves as an auxiliary signal that calibrates the outcome reward, thereby mitigating gradient conflicts during policy optimization. To aggregate the ATR and outcome reward, we design a curriculum-inspired strategy. During the early stages of training, the model is in a solution-path exploration phase: while it may struggle to produce fully correct final answers, it can more easily develop partially correct reasoning traces. Relying solely on outcome rewards at this stage may induce severe gradient conflicts, thus requiring stronger calibration. As training advances, the alignment between reasoning and answers improves, reducing gradient conflicts and calling for weaker calibration to avoid introducing excessive noise. Accordingly, we employ a linearly decaying weighting scheme in which the contribution of the ATR is gradually reduced as training proceeds. In addition, the hybrid reward incorporates process-level signals into the outcome-based reward, alleviating the problem of reward sparsity. Building on the above components, we propose Atom-Searcher, a novel RL framework for agentic deep research, aimed at advancing the performance frontier of agentic deep research models.
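To make the schedule concrete, the sketch below shows one possible realization of the hybrid reward: a convex combination of the mean ATR and the outcome reward with a linearly decaying ATR weight. All names and the exact schedule shape are illustrative assumptions, not the paper's reference implementation.

```python
# Hypothetical sketch of the curriculum-inspired reward aggregation
# described above; the convex combination and the linear schedule are
# our assumptions.
from typing import List

def atr_weight(step: int, total_steps: int,
               w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Linearly decay the ATR contribution over the course of training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return w_start + (w_end - w_start) * frac

def hybrid_reward(outcome_reward: float,
                  atomic_thought_rewards: List[float],
                  step: int, total_steps: int) -> float:
    """Calibrate the outcome reward with the mean RRM score over the
    trajectory's Atomic Thoughts, weighted by the decaying schedule."""
    atr = sum(atomic_thought_rewards) / max(len(atomic_thought_rewards), 1)
    w = atr_weight(step, total_steps)
    return w * atr + (1.0 - w) * outcome_reward
```

A convex combination is only one way to realize the calibration; the property that matters for the curriculum is that the process-level term dominates early, when gradient conflicts are severe, and fades as outcome rewards become reliable.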