Agentic and Multi-Agent Systems · Reinforcement Learning for LLMs

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Note · 2026-05-03 · sourced from Tool Computer Use

The TTI paper makes a precise argument about test-time scaling: chain-of-thought scaling and interaction scaling are orthogonal axes, and conflating them misses what agentic tasks actually need. CoT scales per-step compute by generating long reasoning traces before acting. This deepens reasoning but provides zero new information from the environment. In partially observable agentic tasks, deeper reasoning over a wrongly bounded set of options does not help: the model still cannot see hotels it has not browsed.

Interaction scaling instead increases the number of interaction steps the agent takes. This enables behaviors that CoT cannot produce: exploration (browse multiple options before committing), backtracking (retreat from a bad path), and dynamic re-planning (revise the plan based on what the environment revealed). Information gain from the environment is unique to agentic tasks with partial observability, and it comes only from taking more steps, not from spending more compute per step. The sketch below makes the two axes concrete.
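A minimal sketch of where the two axes live in a generic agent loop. The `Env` and `Agent` interfaces and all parameter names here are hypothetical illustrations, not the TTI paper's API: the point is that the reasoning budget only deepens computation over the current observation, while only the environment step returns new information.

```python
from typing import Protocol, Tuple

class Env(Protocol):
    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, bool]:  # -> (observation, done)
        ...

class Agent(Protocol):
    def think(self, obs: str, budget: int) -> str: ...
    def act(self, obs: str, thought: str) -> str: ...

def run_episode(env: Env, agent: Agent, max_steps: int, reasoning_tokens: int) -> str:
    """max_steps sets the interaction horizon; reasoning_tokens sets per-step CoT depth."""
    obs = env.reset()
    for _ in range(max_steps):                                # interaction axis: more env steps
        thought = agent.think(obs, budget=reasoning_tokens)   # CoT axis: deeper per-step reasoning
        action = agent.act(obs, thought)
        obs, done = env.step(action)                          # the only call that yields new information
        if done:
            break
    return obs
```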

Empirically, the claim is supported on two fronts. Even pure prompting-based interaction scaling, with no training at all, yields non-trivial gains in task success on web benchmarks. With training, TTI uses curriculum-based online RL that adaptively adjusts rollout lengths, producing state-of-the-art open-source, open-data web agents on WebVoyager and WebArena from a Gemma 3 12B model. The curriculum aspect matters because TTI shows agents learn to balance exploration and exploitation adaptively: long rollouts when information gathering pays off, short rollouts when the next action is clear.
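One plausible shape of such a curriculum is a staircase schedule over the allowed interaction horizon. The function and constants below are illustrative assumptions, not the paper's exact rule; they only show how a horizon cap could grow as training progresses.

```python
def rollout_horizon(training_step: int, start: int = 8, growth: int = 4,
                    every: int = 100, cap: int = 32) -> int:
    """Lengthen the allowed interaction horizon as online RL training progresses."""
    return min(start + growth * (training_step // every), cap)

# Each training rollout is then truncated at the current horizon, e.g.:
#   horizon = rollout_horizon(training_step)
#   trajectory = run_episode(env, agent, max_steps=horizon, reasoning_tokens=256)
```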

The reframe for the field: test-time scaling is multi-dimensional. CoT and interaction scaling are complementary, not substitutes. Agents that ship with deep reasoning per step but no learned policy for when to keep interacting are leaving capability on the table, and on tasks where exploration matters, the interaction axis dominates. This connects to How should we balance parallel versus sequential compute at test time? as a third axis: parallel sampling, sequential per-step depth, and interaction horizon are three orthogonal dimensions of inference budget. It also generalizes Does search budget scale like reasoning tokens for answer quality?, since search budget is the deep-research instance of interaction scaling.
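Spelled out, the three-axis view of an inference budget might look like the sketch below. The field names and the cost model are assumptions for illustration, not from any paper; the point is only that the three axes multiply independently into total spend.

```python
from dataclasses import dataclass

@dataclass
class InferenceBudget:
    parallel_samples: int    # parallel axis: independent rollouts (e.g. best-of-n)
    reasoning_tokens: int    # sequential axis: per-step CoT depth
    interaction_steps: int   # interaction axis: environment horizon

    def token_cost(self, tokens_per_action: int = 32) -> int:
        """Rough total cost if every step spends the full reasoning budget (a simplifying assumption)."""
        per_rollout = self.interaction_steps * (self.reasoning_tokens + tokens_per_action)
        return self.parallel_samples * per_rollout
```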


Source: Tool Computer Use

Original note title

test-time interaction scaling is a distinct dimension from chain-of-thought: increasing the agent's interaction horizon enables exploration, backtracking, and dynamic re-planning that deeper reasoning cannot