How does interaction horizon differ from chain-of-thought depth?


This explores the difference between two ways of spending test-time compute: stretching a single reasoning chain longer and deeper, versus letting an agent take more steps in an environment before it has to commit. The corpus treats these as genuinely separate axes, not the same dial with a different label. The clearest statement is the finding that agent interaction scaling is *orthogonal* to chain-of-thought scaling (Does agent interaction time scale separately from reasoning depth?): adding more environment steps buys you exploration, backtracking, and replanning, which no amount of per-step verbalization can produce, and this matters most under partial observability, where the model can't see the whole problem at once. Depth makes one guess smarter; horizon lets you take a guess, see what happened, and revise.
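The "guess, observe, revise" contrast can be sketched with a toy hidden-number task. This is only an illustration under stated assumptions, not anything taken from the cited notes: the function names, the 1 to 100 range, and the "too low / too high" feedback are all hypothetical stand-ins for a partially observable environment.

```python
import random

def one_shot_guess(low, high):
    """Depth-only stand-in: commit to one answer without any feedback."""
    return (low + high) // 2

def interactive_search(secret, low, high, horizon):
    """Horizon stand-in: guess, observe feedback, revise, for up to `horizon` steps."""
    for step in range(1, horizon + 1):
        guess = (low + high) // 2
        if guess == secret:
            return guess, step
        # Observation from the (toy) environment: the guess was too low or too high.
        if guess < secret:
            low = guess + 1
        else:
            high = guess - 1
    return None, horizon  # ran out of interaction budget

def success_rate(horizon, trials=1000, low=1, high=100, seed=0):
    """Fraction of random secrets found within `horizon` interaction steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        secret = rng.randint(low, high)
        found, _ = interactive_search(secret, low, high, horizon)
        hits += found == secret
    return hits / trials
```

With `horizon=1` the interactive agent is no better than the one-shot guess, since it gets no chance to use an observation. With `horizon=7` it always succeeds on this range, because seven halvings cover 100 candidates. However cleverly the single guess is chosen, it cannot close that gap without feedback.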

The reason this distinction bites is that chain-of-thought depth has surprisingly hard ceilings. Accuracy along the depth axis follows an inverted U: past some point, more reasoning tokens make answers *worse*, and the optimal chain length shrinks as models become more capable (Why does chain of thought accuracy eventually decline with length?). Worse, the length of a chain often isn't tracking how hard the problem is at all: it correlates with difficulty only in-distribution, mostly reflecting recall of training schemas, and decouples entirely once you go out of distribution (Does longer reasoning actually mean harder problems?). And long chains can quietly drift into theater: fine-tuning can make reasoning steps less reliably drive the final answer (Does fine-tuning disconnect reasoning steps from final answers?), while a more foundational critique argues CoT is constrained imitation of reasoning *shape* rather than genuine inference (Why does chain-of-thought reasoning fail in predictable ways?). So pouring compute into depth runs into diminishing, and sometimes negative, returns.

Interaction horizon sidesteps much of that by grounding each step in something external. ReAct is the canonical illustration: interleaving reasoning with tool queries injects real-world feedback at every turn and prevents the error propagation that pure thinking falls into, with reported gains of 10-34% absolute accuracy over pure chain-of-thought and reinforcement-learning baselines on knowledge-intensive and interactive tasks (Can interleaving reasoning with real-world feedback prevent hallucination?). The lever you pull to scale horizon is different too: curriculum-based RL on *rollout length* rather than on chain length (Does agent interaction time scale separately from reasoning depth?).
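A ReAct-style loop can be sketched in a few lines. This is a minimal illustration, not the ReAct authors' implementation: `TOY_KB`, `lookup`, and `scripted_policy` are hypothetical stand-ins for a real tool and a real model call, and `max_steps` plays the role of the interaction horizon.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for an external tool such as a search API.
TOY_KB: Dict[str, str] = {
    "capital of France": "Paris",
    "river through Paris": "Seine",
}

def lookup(query: str) -> str:
    return TOY_KB.get(query, "no result")

Step = Tuple[str, str, str]  # (thought, action, observation)

def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str, str]:
    """Stand-in for a model call: returns (thought, action, argument)."""
    if not history:
        return ("I need the capital first.", "lookup", "capital of France")
    if len(history) == 1:
        city = history[-1][2]  # use the last observation, not a guess
        return (f"The capital is {city}; now find its river.",
                "lookup", f"river through {city}")
    return ("I have what I need.", "finish", history[-1][2])

def react_loop(question: str,
               policy: Callable[[str, List[Step]], Tuple[str, str, str]],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int) -> Tuple[Optional[str], List[Step]]:
    """Alternate thought -> action -> observation until 'finish' or the horizon ends."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action, arg = policy(question, history)
        if action == "finish":
            return arg, history
        observation = tools[action](arg)  # external feedback enters here
        history.append((thought, f"{action}:{arg}", observation))
    return None, history  # horizon exhausted before committing
```

Calling `react_loop("Which river runs through the capital of France?", scripted_policy, {"lookup": lookup}, max_steps=3)` returns `"Seine"` after two tool calls. The same call with `max_steps=2` returns `None`, because the agent runs out of turns before it can commit. Scaling the horizon means raising that step budget, not lengthening any single thought.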

One thing worth knowing that you might not have gone looking for: depth and horizon fail in mirror-image ways. Depth's characteristic failure is *underthinking*, where models abandon promising reasoning paths too early; a decoding-time penalty on thought-switching patches it (Do reasoning models switch between ideas too frequently?). The exploration story flips it: the way to fix depth-only reasoning isn't to think harder along one path but to force breadth, generating diverse abstractions instead of more solution samples (Can abstractions guide exploration better than depth alone?). That hints that "depth" and "horizon" are both really proxies for a richer variable, the *shape* of the computation. The reasoning-topology taxonomy makes this literal: chains, trees, and graphs are formally distinct structures, and a graph's ability to merge multiple lines (in-degree > 1) lets it do divide-and-conquer synthesis that neither a chain nor a tree can express (Can reasoning topologies be formally classified as graph types?). Seen that way, chain-of-thought depth is the length of one path; interaction horizon is how many times you get to branch, observe, and rejoin.
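The chain / tree / graph distinction can be checked mechanically from node degrees. The sketch below is an illustrative assumption rather than code from the taxonomy itself: `classify_topology` is a hypothetical helper, and it assumes the input is a connected, acyclic directed graph given as `(source, target)` edges, without validating that.

```python
from collections import defaultdict

def classify_topology(edges):
    """Label a reasoning structure as 'chain', 'tree', or 'graph' by node degrees.

    Assumes `edges` describes a connected, acyclic directed graph.
    """
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
        nodes.update((src, dst))
    # Any node that merges two or more lines of reasoning rules out chains and trees.
    if any(indeg[n] > 1 for n in nodes):
        return "graph"
    # Branching without merging is a tree.
    if any(outdeg[n] > 1 for n in nodes):
        return "tree"
    # Neither branching nor merging: a single path.
    return "chain"

chain = [("a", "b"), ("b", "c")]
tree = [("a", "b"), ("a", "c")]
merge = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
```

Here `classify_topology(chain)` gives `"chain"`, `classify_topology(tree)` gives `"tree"`, and `classify_topology(merge)` gives `"graph"`, because node `d` combines two sub-results. That merge step is the divide-and-conquer synthesis the taxonomy attributes to in-degree > 1.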

Sources 9 notes

Does agent interaction time scale separately from reasoning depth?

Test-time interaction—increasing environment steps—enables exploration, backtracking, and replanning that per-step reasoning cannot achieve. Curriculum-based RL on rollout length produces SOTA web agents, showing interaction scaling dominates on tasks with partial observability.

Why does chain of thought accuracy eventually decline with length?

Task accuracy peaks at intermediate CoT length, with optimal length increasing alongside task difficulty but decreasing with model capability. RL training naturally gravitates toward shorter chains as models improve, revealing that simplicity emerges from reward signals rather than explicit training.

Does longer reasoning actually mean harder problems?

Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.

Does fine-tuning disconnect reasoning steps from final answers?

Three faithfulness tests show fine-tuned models generate reasoning chains that less reliably influence final outputs. Early termination, paraphrasing, and filler substitution all produce invariant answers more often after fine-tuning, suggesting reasoning becomes performative rather than functional.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Can interleaving reasoning with real-world feedback prevent hallucination?

ReAct demonstrates that alternating verbal reasoning with external tool queries (Wikipedia API, environment interaction) prevents error propagation by injecting real-world feedback at each step. On knowledge-intensive and interactive tasks, this approach outperforms pure chain-of-thought and reinforcement learning by 10-34% absolute accuracy.

Do reasoning models switch between ideas too frequently?

o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can reasoning topologies be formally classified as graph types?

CoT, ToT, and GoT map precisely to path graphs, trees, and arbitrary directed graphs respectively. The topology is not metaphorical but defines actual computational structure—GoT's in-degree > 1 enables divide-and-conquer synthesis that trees cannot express.
