Agentic and Multi-Agent Systems · Reinforcement Learning for LLMs

Does agent interaction time scale separately from reasoning depth?

Can agents improve by taking more environment steps rather than thinking harder per step? This matters because partially observable tasks like web navigation may need exploration and backtracking that deeper reasoning alone cannot provide.

Note · 2026-05-03 · sourced from Tool Computer Use

The TTI paper makes a precise argument about test-time scaling: chain-of-thought scaling and interaction scaling are orthogonal axes, and conflating them misses what agentic tasks actually need. CoT scales per-step compute by generating long reasoning traces before acting. This deepens reasoning but provides zero new information from the environment. In partially observable agentic tasks, deeper reasoning over a wrongly bounded set of options does not help: the model still cannot see hotels it has not browsed.

Interaction scaling instead increases the number of interaction steps the agent takes. This enables behaviors that CoT cannot produce: exploration (browse multiple options before committing), backtracking (retreat from a bad path), and dynamic re-planning (revise the plan based on what the environment revealed). Information gain from the environment is unique to agentic tasks with partial observability, and it comes only from taking more steps, not from spending more compute per step. The sketch below makes the two axes concrete.
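A minimal sketch of where the two axes live in a generic agent loop. The `Env` and `Agent` interfaces and all parameter names here are hypothetical illustrations, not the TTI paper's API: the point is that the reasoning budget only deepens computation over the current observation, while only the environment step returns new information.

```python
from typing import Protocol, Tuple

class Env(Protocol):
    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, bool]:  # -> (observation, done)
        ...

class Agent(Protocol):
    def think(self, obs: str, budget: int) -> str: ...
    def act(self, obs: str, thought: str) -> str: ...

def run_episode(env: Env, agent: Agent, max_steps: int, reasoning_tokens: int) -> str:
    """max_steps sets the interaction horizon; reasoning_tokens sets per-step CoT depth."""
    obs = env.reset()
    for _ in range(max_steps):                                # interaction axis: more env steps
        thought = agent.think(obs, budget=reasoning_tokens)   # CoT axis: deeper per-step reasoning
        action = agent.act(obs, thought)
        obs, done = env.step(action)                          # the only call that yields new information
        if done:
            break
    return obs
```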

Empirically, the claim is supported on two fronts. Even pure prompting-based interaction scaling, with no training at all, yields non-trivial gains in task success on web benchmarks. With training, TTI uses curriculum-based online RL that adaptively adjusts rollout lengths, producing state-of-the-art open-source, open-data web agents on WebVoyager and WebArena from a Gemma 3 12B model. The curriculum aspect matters because TTI shows agents learn to balance exploration and exploitation adaptively: long rollouts when information gathering pays off, short rollouts when the next action is clear.
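One plausible shape of such a curriculum is a staircase schedule over the allowed interaction horizon. The function and constants below are illustrative assumptions, not the paper's exact rule; they only show how a horizon cap could grow as training progresses.

```python
def rollout_horizon(training_step: int, start: int = 8, growth: int = 4,
                    every: int = 100, cap: int = 32) -> int:
    """Lengthen the allowed interaction horizon as online RL training progresses."""
    return min(start + growth * (training_step // every), cap)

# Each training rollout is then truncated at the current horizon, e.g.:
#   horizon = rollout_horizon(training_step)
#   trajectory = run_episode(env, agent, max_steps=horizon, reasoning_tokens=256)
```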

The reframe for the field: test-time scaling is multi-dimensional. CoT and interaction scaling are complementary, not substitutes. Agents that ship with deep reasoning per step but no learned policy for when to keep interacting are leaving capability on the table, and on tasks where exploration matters, the interaction axis dominates. This connects to How should we balance parallel versus sequential compute at test time? as a third axis: parallel sampling, sequential per-step depth, and interaction horizon are three orthogonal dimensions of inference budget. It also generalizes Does search budget scale like reasoning tokens for answer quality?, since search budget is the deep-research instance of interaction scaling.
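Spelled out, the three-axis view of an inference budget might look like the sketch below. The field names and the cost model are assumptions for illustration, not from any paper; the point is only that the three axes multiply independently into total spend.

```python
from dataclasses import dataclass

@dataclass
class InferenceBudget:
    parallel_samples: int    # parallel axis: independent rollouts (e.g. best-of-n)
    reasoning_tokens: int    # sequential axis: per-step CoT depth
    interaction_steps: int   # interaction axis: environment horizon

    def token_cost(self, tokens_per_action: int = 32) -> int:
        """Rough total cost if every step spends the full reasoning budget (a simplifying assumption)."""
        per_rollout = self.interaction_steps * (self.reasoning_tokens + tokens_per_action)
        return self.parallel_samples * per_rollout
```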


Source: Tool Computer Use

Original note title

test-time interaction scaling is a distinct dimension from chain-of-thought: increasing the agent's interaction horizon enables exploration, backtracking, and dynamic re-planning that deeper reasoning cannot