Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Note · 2026-02-21 · sourced from Deep Research

The test-time scaling framework (more inference compute yields better answers, up to a saturation threshold) has been documented for reasoning token budgets in chain-of-thought models. The Agentic Deep Research finding extends this to search: more search steps and more retrieval rounds yield better answers. The relationship follows the same shape: monotone gains that flatten past a threshold.
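That shape can be pictured with a simple saturating curve. A minimal sketch, assuming an exponential-saturation form with made-up constants — nothing here is fitted to published measurements:

```python
# Illustrative model of the test-time scaling shape described above.
# The functional form and the constants q_max / rate are assumptions
# for visualization, not values from any paper.

import math

def answer_quality(budget: int, q_max: float = 0.9, rate: float = 0.15) -> float:
    """Saturating quality curve: monotone gains that flatten past a threshold."""
    return q_max * (1 - math.exp(-rate * budget))

for b in (1, 2, 4, 8, 16, 32):
    print(f"budget={b:>2}  quality~{answer_quality(b):.3f}")
```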

This matters because it multiplies the design space for inference-time compute. Before, the question was "how many tokens to think?" Now there are two axes: reasoning budget per query and search budget per query. They are not independent — longer chains may require more retrieval to validate intermediate steps, and more retrieval may require more reasoning to synthesize. The optimal allocation problem gets harder.
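To make the allocation problem concrete, here is a hedged sketch of a grid search over the two axes under a shared compute cap. The coupled quality model, the 500-tokens-per-search cost, and every constant are assumptions chosen only to show the structure of the trade-off:

```python
# Sketch of the two-axis allocation problem: pick reasoning tokens R and
# search rounds S under a joint compute cap. The quality model below is a
# made-up stand-in; a real allocator would use measured quality curves.

import math

def quality(reasoning_tokens: int, search_rounds: int) -> float:
    # Coupled saturating gains (assumption): retrieval is only useful if
    # there is enough reasoning budget to synthesize it, and vice versa.
    r = 1 - math.exp(-reasoning_tokens / 2000)
    s = 1 - math.exp(-search_rounds / 4)
    return r * s

def best_allocation(compute_cap: int, tokens_per_search: int = 500):
    # Each search round costs roughly `tokens_per_search` tokens of compute
    # (retrieval plus synthesis overhead); the remainder goes to reasoning.
    best = None
    for search_rounds in range(compute_cap // tokens_per_search + 1):
        reasoning_tokens = compute_cap - search_rounds * tokens_per_search
        q = quality(reasoning_tokens, search_rounds)
        if best is None or q > best[0]:
            best = (q, reasoning_tokens, search_rounds)
    return best

q, r, s = best_allocation(8000)
print(f"best: quality~{q:.3f} with {r} reasoning tokens and {s} search rounds")
```

Because the two gains multiply rather than add in this toy model, the optimum sits in the interior: past some point the next search round is worth more than additional reasoning tokens, and vice versa.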

The practical implication is that "deep research quality" is not a fixed property of a model — it is a function of the search budget you give it. A mid-sized model with a large search budget can outperform a large model with a restricted one. This shifts cost optimization from training compute to inference architecture, specifically the retrieval loop.

The finding also reframes what "thinking harder" means for agents. For single-turn reasoning models, thinking harder means more tokens per response. For search agents, thinking harder means more search-retrieve-synthesize iterations. The question in "How should we balance parallel versus sequential compute at test time?" applies here too: whether to fan retrieval out across multiple query variants (parallel) or chain queries iteratively (sequential) is the same structural trade-off, operating at the retrieval level.
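A minimal sketch of the two strategies, with a hypothetical `search()` stand-in for the retrieval call; both spend the same budget of n queries but structure it differently:

```python
# Parallel vs. sequential retrieval at the same query budget.
# `search()` and the query-refinement step are hypothetical placeholders.

def search(query: str) -> str:
    return f"<docs for: {query}>"  # placeholder retrieval call

def parallel_retrieval(question: str, n: int) -> list[str]:
    # Fan out n query variants at once; no variant sees the others' results.
    variants = [f"{question} (angle {i})" for i in range(n)]
    return [search(v) for v in variants]

def sequential_retrieval(question: str, n: int) -> list[str]:
    # Chain n rounds; each query is conditioned on what came back before.
    evidence, query = [], question
    for _ in range(n):
        docs = search(query)
        evidence.append(docs)
        query = f"{question} given {docs}"  # hypothetical refinement step
    return evidence
```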

CoRAG (Chain-of-Retrieval Augmented Generation) extends this from agentic search behavior to explicitly trained retrieval models. Training via rejection sampling generates intermediate retrieval chains; test-time compute is controlled via decoding strategy (greedy, best-of-N, or tree search). The same monotonic scaling relationship holds: more retrieval budget yields better answers on multi-hop QA. The test-time scaling law is therefore not specific to reasoning tokens or agentic search; it appears to be a general property of any iterative process with quality-sensitive intermediate steps. See "Can retrieval be scaled like reasoning at test time?".
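Best-of-N, the middle of those three decoding strategies, can be sketched as follows. `generate_chain` and `score_chain` are hypothetical stand-ins for the trained model's sampling and scoring; only the control flow reflects the technique, with N as the test-time compute knob:

```python
# Best-of-N decoding over retrieval chains, in the spirit of CoRAG's
# decoding-time compute control. Both helpers are made-up stand-ins.

import random

def generate_chain(question: str, max_hops: int) -> list[str]:
    # Stand-in for sampling a chain of intermediate (sub-query, retrieval) steps.
    hops = random.randint(1, max_hops)
    return [f"sub-query {i} for '{question}'" for i in range(hops)]

def score_chain(chain: list[str]) -> float:
    return random.random()  # stand-in for a model likelihood / reward score

def best_of_n(question: str, n: int, max_hops: int = 6) -> list[str]:
    # N controls test-time compute: more sampled chains, better best chain.
    chains = [generate_chain(question, max_hops) for _ in range(n)]
    return max(chains, key=score_chain)

print(best_of_n("who advised the advisor of X?", n=8))
```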

Search-R1 and R1-Searcher demonstrate RL-based approaches that teach LLMs to autonomously invoke search during reasoning. Search-R1 (2025) uses retrieved-token masking for stable RL training and a simple outcome-based reward, achieving a 24% improvement (Qwen2.5-7B) over RAG baselines. The model learns multi-turn search with <search> / </search> token pairs. R1-Searcher (2025) introduces a two-stage approach: first a retrieval reward incentivizes the model to issue retrieval operations correctly, then an answer reward encourages effective use of the retrieved knowledge. Both demonstrate that RL training enables test-time scaling of tool calls: models learn to invoke search more frequently and more effectively as task difficulty increases, confirming the search-budget scaling law.
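A hedged sketch of the retrieved-token masking idea: tokens the environment inserted (retrieved passages) are zeroed out of the loss mask so policy gradients flow only through model-generated tokens. The tag names follow Search-R1's trajectory format; the flat (token, flag) representation is a simplification:

```python
# Retrieved-token masking as described for Search-R1: train only on tokens
# the model generated, not on passages the retriever injected. The
# trajectory encoding here is a simplified illustration, not the paper's.

def loss_mask(trajectory: list[tuple[str, bool]]) -> list[int]:
    """trajectory: (token, model_generated) pairs -> 1 if trained on, else 0."""
    return [1 if model_generated else 0 for _, model_generated in trajectory]

traj = [
    ("<think>", True), ("needs", True), ("lookup", True), ("</think>", True),
    ("<search>", True), ("capital", True), ("of", True), ("X", True), ("</search>", True),
    ("<information>", False), ("retrieved", False), ("passage", False), ("</information>", False),
    ("<answer>", True), ("Y", True), ("</answer>", True),
]
print(loss_mask(traj))  # the retrieved span contributes 0 to the RL loss
```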


Original note title: agentic deep research exhibits a test-time scaling law where search budget determines answer quality, creating a new inference-compute axis