Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does search budget scale like reasoning tokens for answer quality?

Explores whether the test-time scaling law that applies to reasoning tokens also governs search-based retrieval in agentic systems. Understanding this relationship could reshape how we allocate inference compute between thinking and searching.

Note · 2026-02-21 · sourced from Deep Research

The test-time scaling framework (more inference compute yields better answers, up to a saturation threshold) has been documented for reasoning token budgets in chain-of-thought models. The Agentic Deep Research finding extends this to search: more search steps and more retrieval rounds yield better answers. The relationship follows the same shape: monotone gains that flatten past a threshold.
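That shape can be pictured with a simple saturating curve. A minimal sketch, assuming an exponential-saturation form with made-up constants — nothing here is fitted to published measurements:

```python
# Illustrative model of the test-time scaling shape described above.
# The functional form and the constants q_max / rate are assumptions
# for visualization, not values from any paper.

import math

def answer_quality(budget: int, q_max: float = 0.9, rate: float = 0.15) -> float:
    """Saturating quality curve: monotone gains that flatten past a threshold."""
    return q_max * (1 - math.exp(-rate * budget))

for b in (1, 2, 4, 8, 16, 32):
    print(f"budget={b:>2}  quality~{answer_quality(b):.3f}")
```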

This matters because it multiplies the design space for inference-time compute. Before, the question was "how many tokens to think?" Now there are two axes: reasoning budget per query and search budget per query. They are not independent — longer chains may require more retrieval to validate intermediate steps, and more retrieval may require more reasoning to synthesize. The optimal allocation problem gets harder.
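To make the allocation problem concrete, here is a hedged sketch of a grid search over the two axes under a shared compute cap. The coupled quality model, the 500-tokens-per-search cost, and every constant are assumptions chosen only to show the structure of the trade-off:

```python
# Sketch of the two-axis allocation problem: pick reasoning tokens R and
# search rounds S under a joint compute cap. The quality model below is a
# made-up stand-in; a real allocator would use measured quality curves.

import math

def quality(reasoning_tokens: int, search_rounds: int) -> float:
    # Coupled saturating gains (assumption): retrieval is only useful if
    # there is enough reasoning budget to synthesize it, and vice versa.
    r = 1 - math.exp(-reasoning_tokens / 2000)
    s = 1 - math.exp(-search_rounds / 4)
    return r * s

def best_allocation(compute_cap: int, tokens_per_search: int = 500):
    # Each search round costs roughly `tokens_per_search` tokens of compute
    # (retrieval plus synthesis overhead); the remainder goes to reasoning.
    best = None
    for search_rounds in range(compute_cap // tokens_per_search + 1):
        reasoning_tokens = compute_cap - search_rounds * tokens_per_search
        q = quality(reasoning_tokens, search_rounds)
        if best is None or q > best[0]:
            best = (q, reasoning_tokens, search_rounds)
    return best

q, r, s = best_allocation(8000)
print(f"best: quality~{q:.3f} with {r} reasoning tokens and {s} search rounds")
```

Because the two gains multiply rather than add in this toy model, the optimum sits in the interior: past some point the next search round is worth more than additional reasoning tokens, and vice versa.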

The practical implication is that "deep research quality" is not a fixed property of a model — it is a function of the search budget you give it. A mid-sized model with a large search budget can outperform a large model with a restricted one. This shifts cost optimization from training compute to inference architecture, specifically the retrieval loop.

The finding also reframes what "thinking harder" means for agents. For single-turn reasoning models, thinking harder means more tokens per response. For search agents, thinking harder means more search-retrieve-synthesize iterations. The question in "How should we balance parallel versus sequential compute at test time?" applies here too: whether to fan retrieval out across multiple query variants (parallel) or chain queries iteratively (sequential) is the same structural trade-off, operating at the retrieval level.
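A minimal sketch of the two strategies, with a hypothetical `search()` stand-in for the retrieval call; both spend the same budget of n queries but structure it differently:

```python
# Parallel vs. sequential retrieval at the same query budget.
# `search()` and the query-refinement step are hypothetical placeholders.

def search(query: str) -> str:
    return f"<docs for: {query}>"  # placeholder retrieval call

def parallel_retrieval(question: str, n: int) -> list[str]:
    # Fan out n query variants at once; no variant sees the others' results.
    variants = [f"{question} (angle {i})" for i in range(n)]
    return [search(v) for v in variants]

def sequential_retrieval(question: str, n: int) -> list[str]:
    # Chain n rounds; each query is conditioned on what came back before.
    evidence, query = [], question
    for _ in range(n):
        docs = search(query)
        evidence.append(docs)
        query = f"{question} given {docs}"  # hypothetical refinement step
    return evidence
```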

CoRAG (Chain-of-Retrieval Augmented Generation) extends this from agentic search behavior to explicitly trained retrieval models. Training via rejection sampling generates intermediate retrieval chains; test-time compute is controlled via decoding strategy (greedy, best-of-N, or tree search). The same monotonic scaling relationship holds: more retrieval budget yields better answers on multi-hop QA. The test-time scaling law is therefore not specific to reasoning tokens or agentic search; it appears to be a general property of any iterative process with quality-sensitive intermediate steps. See "Can retrieval be scaled like reasoning at test time?".
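Best-of-N, the middle of those three decoding strategies, can be sketched as follows. `generate_chain` and `score_chain` are hypothetical stand-ins for the trained model's sampling and scoring; only the control flow reflects the technique, with N as the test-time compute knob:

```python
# Best-of-N decoding over retrieval chains, in the spirit of CoRAG's
# decoding-time compute control. Both helpers are made-up stand-ins.

import random

def generate_chain(question: str, max_hops: int) -> list[str]:
    # Stand-in for sampling a chain of intermediate (sub-query, retrieval) steps.
    hops = random.randint(1, max_hops)
    return [f"sub-query {i} for '{question}'" for i in range(hops)]

def score_chain(chain: list[str]) -> float:
    return random.random()  # stand-in for a model likelihood / reward score

def best_of_n(question: str, n: int, max_hops: int = 6) -> list[str]:
    # N controls test-time compute: more sampled chains, better best chain.
    chains = [generate_chain(question, max_hops) for _ in range(n)]
    return max(chains, key=score_chain)

print(best_of_n("who advised the advisor of X?", n=8))
```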

Search-R1 and R1-Searcher demonstrate RL-based approaches that teach LLMs to autonomously invoke search during reasoning. Search-R1 (2025) uses retrieved-token masking for stable RL training and a simple outcome-based reward, achieving a 24% improvement (Qwen2.5-7B) over RAG baselines. The model learns multi-turn search with <search> / </search> token pairs. R1-Searcher (2025) introduces a two-stage approach: first a retrieval reward incentivizes the model to issue retrieval operations correctly, then an answer reward encourages effective use of the retrieved knowledge. Both demonstrate that RL training enables test-time scaling of tool calls: models learn to invoke search more frequently and more effectively as task difficulty increases, confirming the search-budget scaling law.
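A hedged sketch of the retrieved-token masking idea: tokens the environment inserted (retrieved passages) are zeroed out of the loss mask so policy gradients flow only through model-generated tokens. The tag names follow Search-R1's trajectory format; the flat (token, flag) representation is a simplification:

```python
# Retrieved-token masking as described for Search-R1: train only on tokens
# the model generated, not on passages the retriever injected. The
# trajectory encoding here is a simplified illustration, not the paper's.

def loss_mask(trajectory: list[tuple[str, bool]]) -> list[int]:
    """trajectory: (token, model_generated) pairs -> 1 if trained on, else 0."""
    return [1 if model_generated else 0 for _, model_generated in trajectory]

traj = [
    ("<think>", True), ("needs", True), ("lookup", True), ("</think>", True),
    ("<search>", True), ("capital", True), ("of", True), ("X", True), ("</search>", True),
    ("<information>", False), ("retrieved", False), ("passage", False), ("</information>", False),
    ("<answer>", True), ("Y", True), ("</answer>", True),
]
print(loss_mask(traj))  # the retrieved span contributes 0 to the RL loss
```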


Original note title: agentic deep research exhibits a test-time scaling law where search budget determines answer quality, creating a new inference-compute axis