Where do reasoning agents actually fail during long traces?

Does verifying only final answers miss the real sources of failure in multi-step reasoning? This note explores whether intermediate process checks reveal errors that outcome-level scoring hides.

Note · 2026-05-28 · sourced from Test Time Compute

As reasoning models produce long traces of intermediate decisions and tool calls, the locus of reliability shifts. interwhen makes the framing explicit: verifying only the final answer misses errors that occur early in the trace, so the unit of verification should be the process — intermediate states, tool calls, and policy compliance — checked continuously as the trace unfolds. The paper's agentic results dramatize the gap: pass^4 on the Telecom τ²-bench domain rises from 32% to 87% once intermediate verification is added, because most failures are not wrong final answers but process violations that compound.
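To make the pass^4 figure concrete, here is a minimal sketch of how such a metric can be estimated. It assumes pass^k means "the probability that k independent trials of the same task all succeed" and uses the combinatorial estimator C(c, k) / C(n, k) over n trials with c successes; the note itself does not spell out the definition, so treat that as an assumption. The function names and the trial counts are illustrative and are not taken from the paper.

```python
from math import comb


def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate the chance that k trials drawn without replacement all succeed."""
    if not 0 <= num_successes <= num_trials:
        raise ValueError("num_successes must be between 0 and num_trials")
    if not 1 <= k <= num_trials:
        raise ValueError("k must be between 1 and num_trials")
    # comb(c, k) is 0 when c < k, so a task with too few successes scores 0.
    return comb(num_successes, k) / comb(num_trials, k)


def mean_pass_hat_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (num_trials, num_successes) pairs."""
    return sum(pass_hat_k(n, c, k) for n, c in results) / len(results)


# Illustrative numbers only: a task solved in 6 of 8 trials.
print(round(pass_hat_k(8, 6, 1), 3))  # 0.75
print(round(pass_hat_k(8, 6, 4), 3))  # 0.214
```

Under this assumed definition, a task that succeeds in three of every four runs scores well at k=1 but poorly at k=4, which is why occasional process violations weigh so heavily on a pass^4 number.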

This is a pattern, not a single result. Process-level supervision recurs across the literature as more informative than outcome-level supervision: process reward models score steps, structural-feature supervision derives signal from trajectory shape, and completeness scaffolds force explicit derivation. interwhen's distinctive contribution to the pattern is that it verifies policy compliance — whether the trace obeys a stated policy — not just logical correctness, which extends process verification beyond math and code into agentic domains where "correct" is defined by rules rather than ground-truth answers.
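Policy compliance differs from logical correctness in that a rule can depend on the order of steps rather than on any single value. The sketch below is a hypothetical illustration of that idea, not interwhen's implementation: `Step`, `require_before`, `audit`, and the tool names are all invented for the example. Each rule sees the history plus the incoming step, so a violation is reported at the step where it occurs rather than at the end of the trace.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Optional


@dataclass
class Step:
    tool: str                                  # hypothetical tool name
    args: dict = field(default_factory=dict)   # hypothetical tool arguments


# A rule inspects the steps seen so far plus the new step and returns
# a violation message, or None when the step is compliant.
Rule = Callable[[list[Step], Step], Optional[str]]


def require_before(prerequisite: str, action: str) -> Rule:
    """Build a rule: `action` may only run after `prerequisite` has run."""
    def rule(history: list[Step], step: Step) -> Optional[str]:
        if step.tool == action and all(s.tool != prerequisite for s in history):
            return f"{action} called before {prerequisite}"
        return None
    return rule


def audit(steps: Iterable[Step], rules: list[Rule]) -> Optional[tuple[int, str]]:
    """Check each step as it arrives; return (index, message) of the first violation."""
    history: list[Step] = []
    for index, step in enumerate(steps):
        for rule in rules:
            message = rule(history, step)
            if message is not None:
                return index, message
        history.append(step)
    return None


rules = [require_before("verify_identity", "change_plan")]
print(audit([Step("lookup_account"), Step("change_plan")], rules))
# (1, 'change_plan called before verify_identity')
print(audit([Step("verify_identity"), Step("change_plan")], rules))
# None
```

Because `audit` accepts any iterable, it can consume steps from a generator while the trace is still being produced, which is the "checked continuously" property the note emphasizes.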

The pattern matters because it changes what "reliable" means for an agent. A model can produce the right final answer through a non-compliant or unsafe process, and outcome verification will pass it; process verification will not. This aligns with the vault's recurring finding that final-output signals are systematically misleading about what happened inside the model.

Counterpoint and limit: process verification only helps where the process is checkable. interwhen depends on synthesizable verifiers, and where no verifier exists (open-ended generation, subjective tasks) the reframe offers no leverage. The honest scope is "tasks with formal or policy-expressible correctness criteria," which is broader than math/code but not universal.

Why it matters: it reorients reliability engineering for agents away from answer-grading toward continuous in-process auditing.
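The gap between answer-grading and process auditing, and the limit where no verifier exists, can be sketched with a three-valued verdict. This is an assumption-laden toy: the step names, `outcome_ok`, `process_ok`, and the checker are hypothetical, and returning `None` for "not checkable" is one possible design choice rather than anything the paper prescribes.

```python
from typing import Callable, Optional

# Hypothetical step-level checker: takes a step name, returns True if compliant.
StepCheck = Callable[[str], bool]


def outcome_ok(final_answer: str, expected: str) -> bool:
    """Outcome-level grading: compare only the final answer."""
    return final_answer.strip() == expected.strip()


def process_ok(steps: list[str], check: Optional[StepCheck]) -> Optional[bool]:
    """Return True/False when a checker exists, None when the process is not checkable."""
    if check is None:
        return None
    return all(check(step) for step in steps)


def no_unconfirmed_refund(step: str) -> bool:
    """Hypothetical policy: refunds must not be issued without confirmation."""
    return step != "issue_refund_without_confirmation"


trace = ["lookup_order", "issue_refund_without_confirmation", "report_balance"]

print(outcome_ok("balance: 40", "balance: 40"))  # True
print(process_ok(trace, no_unconfirmed_refund))  # False
print(process_ok(trace, None))                   # None
```

The same trace passes outcome grading and fails the process check, while the `None` case keeps "no verifier available" distinct from "verified compliant" instead of silently treating it as a pass.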


— "interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification", https://arxiv.org/abs/2602.11202

Original note title

reframing reliability as verifying the reasoning process not just the final output