Where do reasoning agents actually fail during long traces?
Does verifying only final answers miss the real sources of failure in multi-step reasoning? This note explores whether intermediate process checks reveal errors that outcome-level scoring hides.
As reasoning models produce long traces of intermediate decisions and tool calls, the locus of reliability shifts from the final answer to the steps that produce it. interwhen makes the framing explicit: verifying only the final answer misses errors that occur early in the trace, so the unit of verification should be the process (intermediate states, tool calls, and policy compliance), checked continuously as the trace unfolds. The paper's agentic results dramatize the gap: pass^4 on the Telecom τ²-bench domain rises from 32% to 87% once intermediate verification is added, because most failures are not wrong final answers but process violations that compound.
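For readers unfamiliar with the metric: in the τ-bench family of benchmarks, pass^k is usually estimated by running each task n times, counting c successes, and averaging C(c, k) / C(n, k) over tasks, which estimates the chance that k independent trials all succeed. The sketch below assumes that definition. The function name `pass_hat_k` and the success counts are invented for illustration and are not taken from the paper.

```python
from math import comb
from typing import Sequence


def pass_hat_k(successes_per_task: Sequence[int], n_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k i.i.d. trials of a task all
    succeed, averaged over tasks (assumed tau-bench-style definition)."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must satisfy 1 <= k <= n_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    total = 0.0
    for c in successes_per_task:
        # comb(c, k) is 0 when c < k, so a task with any shortfall
        # below k successes contributes nothing.
        total += comb(c, k) / comb(n_trials, k)
    return total / len(successes_per_task)


# Hypothetical data: four tasks, four trials each, successes per task.
counts = [4, 4, 3, 0]
print(pass_hat_k(counts, n_trials=4, k=1))  # 0.6875
print(pass_hat_k(counts, n_trials=4, k=4))  # 0.5
```

In this toy data the task that succeeds three times out of four still counts fully against pass^4, which is why the metric is sensitive to intermittent process violations that an average success rate smooths over.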
This is a pattern, not a single result. Process-level supervision recurs across the literature as more informative than outcome-level supervision: process reward models score steps, structural-feature supervision derives signal from trajectory shape, and completeness scaffolds force explicit derivation. interwhen's distinctive contribution to the pattern is that it verifies policy compliance — whether the trace obeys a stated policy — not just logical correctness, which extends process verification beyond math and code into agentic domains where "correct" is defined by rules rather than ground-truth answers.
The pattern matters because it changes what "reliable" means for an agent. A model can produce the right final answer through a non-compliant or unsafe process, and outcome verification will pass it; process verification will not. This aligns with the vault's recurring finding that final-output signals are systematically misleading about what happened inside the model.
Counterpoint and limit: process verification only helps where the process is checkable. interwhen depends on synthesizable verifiers, and where no verifier exists (open-ended generation, subjective tasks) the reframe offers no leverage. The honest scope is "tasks with formal or policy-expressible correctness criteria," which is broader than math/code but not universal.
Why it matters: it reorients reliability engineering for agents away from answer-grading toward continuous in-process auditing.
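The case described above, where a correct final answer passes outcome verification despite a non-compliant process, can be made concrete with a toy trace. This is a minimal sketch under stated assumptions, not interwhen's API: the `Step` type, the tool names `verify_identity` and `change_plan`, and the single policy rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    kind: str          # "tool_call" or "answer"
    name: str          # tool name, or "final" for the answer step
    payload: str = ""  # answer text; unused for tool calls here


# A rule sees the steps so far plus the current step and returns a
# violation message, or None if the step is compliant.
Rule = Callable[[List[Step], Step], Optional[str]]


def identity_before_change(history: List[Step], step: Step) -> Optional[str]:
    """Hypothetical policy: change_plan requires a prior verify_identity."""
    if step.kind == "tool_call" and step.name == "change_plan":
        verified = any(
            s.kind == "tool_call" and s.name == "verify_identity"
            for s in history
        )
        if not verified:
            return "change_plan called before verify_identity"
    return None


def outcome_ok(trace: List[Step], expected: str) -> bool:
    """Outcome-level check: only the final answer is inspected."""
    last = trace[-1]
    return last.kind == "answer" and last.payload == expected


def first_violation(trace: List[Step], rules: List[Rule]) -> Optional[Tuple[int, str]]:
    """Process-level check: test every step as the trace unfolds and
    report the index and message of the earliest violation, if any."""
    for i, step in enumerate(trace):
        for rule in rules:
            message = rule(trace[:i], step)
            if message is not None:
                return i, message
    return None


RULES: List[Rule] = [identity_before_change]

bad_trace = [
    Step("tool_call", "change_plan"),
    Step("answer", "final", "plan changed"),
]
good_trace = [
    Step("tool_call", "verify_identity"),
    Step("tool_call", "change_plan"),
    Step("answer", "final", "plan changed"),
]

print(outcome_ok(bad_trace, "plan changed"))  # True: outcome check passes
print(first_violation(bad_trace, RULES))      # flags step 0
print(first_violation(good_trace, RULES))     # None
```

Both traces end in the same answer, so an answer grader cannot tell them apart; only the per-step check distinguishes them. The sketch also shows the limit noted above: it works only because the policy could be written down as a rule.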
— "interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification", https://arxiv.org/abs/2602.11202
Related concepts in this collection
- Can structured templates make code reasoning more reliable than free-form thinking?
Unstructured chain-of-thought reasoning lets models skip cases and make unsupported claims. This explores whether semi-formal templates requiring explicit premises, evidence traces, and alternative checks can prevent these failure modes.
another process-verification instrument: completeness scaffolds rather than asynchronous verifiers
- Can structured templates replace formal verification for code reasoning?
Formal verification is rigorous but impractical at repository scale. Can natural-language templates with enforced structure provide the same reliability guarantees without the formalization cost? This explores the middle ground between unstructured reasoning and full formalism.
the design-space framing for process checking between unstructured CoT and full formalization
- Does reflection in reasoning models actually correct errors?
When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
why self-verification fails and external process verification is needed instead
- Can verifiers monitor reasoning without slowing generation down?
Explores whether asynchronous verification can catch reasoning errors while keeping token costs near parity with unmonitored reasoning. Matters because current approaches trade between catching early errors and computational overhead.
enables: a concrete architecture for the in-process auditing this reframe demands, with verification run off the generation path
- What should we actually measure in agent evaluation?
Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
extends: carries the process-not-outcome shift from single-trace verification up to whole-agent evaluation
Original note title
reframing reliability as verifying the reasoning process not just the final output