interwhen: A Generalizable Framework for Steering Reasoning Models with Test-time Verification

Paper · arXiv 2602.11202
Test-Time Compute · Reasoning Critiques · Reward Models · Task Planning

Reasoning models produce long traces of intermediate decisions and tool calls, making test-time verification increasingly important for ensuring correctness. Existing approaches either verify only the final answer, which misses early errors, or rely on branch-and-verify strategies that explore multiple trajectories at substantially higher compute cost. We introduce interwhen, a single-trajectory verification framework that steers model behavior by providing feedback on intermediate reasoning traces. It addresses two key challenges. First, given a set of verifiers, obtaining verifiable states from the reasoning trace typically requires prompt engineering or external task decomposition into fixed steps, which can constrain the model's reasoning strategy. Instead, we propose a monitoring system that periodically polls the reasoning trace and forks inference of the reasoning model to recover intermediate states. Verifiers run asynchronously alongside generation, adding negligible overhead on correct executions and intervening only when violations occur. Second, beyond math and code, a central challenge for process verification is the scarcity of verifiers. interwhen addresses this through automatic verifier synthesis from natural-language policy documents. Given a policy, it can generate code-based verifiers, including provably correct verifiers in Lean and Z3. Together, these contributions yield a plug-and-play test-time verification system that can improve task completion and policy compliance of any reasoning agent.

Large language model (LLM)-based agents are being deployed in high-stakes real-world workflows, where safety and reliability are essential. In particular, agentic deployments involve sequences of decisions interleaved with tool calls, database writes, and external API interactions, many of which are irreversible. Similarly, reasoning agents solving complex math or logic problems must follow certain axioms and rules while searching for a solution; otherwise, the final solution is likely to be incorrect. This raises a critical challenge: it is insufficient to verify only the final output of a reasoning agent; the process by which the agent arrived at that output must itself be correct. We call this setting LLM-Process-Modulo, in contrast to LLM-Modulo, which verifies only the final output.

We propose a framework for steering a single reasoning trajectory and ensuring that it is policy-compliant. Rather than prompting the model to reflect or verify, we decouple verification from generation: verifiers execute asynchronously over the model's reasoning trace. We develop code-based verifiers for the task and then, at runtime, use a language model to extract their input variables from a partial reasoning trace. Specifically, at regular intervals, we create a forked execution of the reasoning model that is prompted to extract verifiable states from the trace so far. This design allows any formal or code-based verifier to be plugged in once the state variables are extracted. Further, asynchronous execution interrupts generation only when a violation is detected (or a write operation is being attempted), thus incurring a minimal latency penalty on correct executions.
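The polling-and-forking loop described above can be illustrated with a minimal sketch. It is not the interwhen implementation: `extract_state` is a hypothetical regex stand-in for the forked model call that extracts state variables, `non_negative_balance` is a verifier for an invented policy, and `run_monitored` and `poll_every` are names chosen for illustration. The sketch also stops at the first violation instead of handing feedback back to the model.

```python
import asyncio
import re
from typing import Callable, Optional

State = dict[str, int]
Verifier = Callable[[State], Optional[str]]  # returns a violation message, or None


def extract_state(trace: str) -> State:
    """Hypothetical stand-in for the forked model call that extracts state variables."""
    balances = re.findall(r"balance=(-?\d+)", trace)
    return {"balance": int(balances[-1])} if balances else {}


def non_negative_balance(state: State) -> Optional[str]:
    """Code-based verifier for an invented policy: the balance never goes negative."""
    balance = state.get("balance")
    if balance is not None and balance < 0:
        return f"balance must be non-negative, got {balance}"
    return None


async def check(snapshot: str, verifiers: list[Verifier]) -> list[str]:
    """Extract state from a trace snapshot in a worker thread, then run every verifier."""
    state = await asyncio.to_thread(extract_state, snapshot)
    return [msg for v in verifiers if (msg := v(state)) is not None]


async def run_monitored(
    steps: list[str], verifiers: list[Verifier], poll_every: int = 2
) -> tuple[list[str], list[str]]:
    """Follow one trajectory; return the trace so far and any violations found."""
    trace: list[str] = []
    pending: list[asyncio.Task] = []
    for i, step in enumerate(steps, start=1):
        trace.append(step)
        if i % poll_every == 0:
            # Periodic poll: fork a check on the trace so far without blocking.
            pending.append(asyncio.create_task(check("\n".join(trace), verifiers)))
        await asyncio.sleep(0)  # yield control; stands in for time spent generating
        for task in [t for t in pending if t.done()]:
            pending.remove(task)
            if task.result():
                # Intervene only on a violation; remaining checks are now stale.
                for stale in pending:
                    stale.cancel()
                return trace, task.result()
    # Generation finished: wait for any checks still in flight.
    for result in await asyncio.gather(*pending):
        if result:
            return trace, result
    return trace, []
```

Under these assumptions, a compliant trace runs through every step with no interruption, while a trace containing `balance=-30` returns the violation message for that state. How many further steps are generated before the interruption depends on how quickly the background check completes.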

We evaluate interwhen on eight benchmarks spanning agentic and non-agentic reasoning. On all datasets, interwhen improves task quality and policy compliance rate. On the agentic benchmarks, it substantially increases both task completion rate and compliance rate. For example, pass^4 for the Telecom domain in τ²-bench increases from 32% to 87% for the Qwen3-30B model under solo mode. On the non-agentic datasets, interwhen outperforms baseline test-time scaling methods at each compute budget, and the gain is larger for harder tasks. Adding the completeness verifier reaches 87.70% reward with 90.80% env-pass and 98.80% action-pass, all while using fewer tokens than CoT. Overall, each verifier contributes a clearly attributable slice of the gain, and the final configuration dominates CoT on every column at near-parity token cost.
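The text does not define pass^4. In τ-bench-style evaluation, pass^k is commonly the probability that all k independent trials of a task succeed, estimated per task as C(c, k) / C(n, k) from c successes in n trials and then averaged over tasks. The sketch below assumes that definition; the function name `pass_hat_k` and the example counts are illustrative and are not results from the paper.

```python
from math import comb


def pass_hat_k(successes_per_task: list[int], num_trials: int, k: int) -> float:
    """Estimate pass^k: the chance that k trials of a task all succeed, averaged over tasks."""
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")
    # comb(c, k) is 0 when c < k, so tasks with fewer than k successes contribute 0.
    per_task = [comb(c, k) / comb(num_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Illustrative counts only: four tasks, four trials each.
example_successes = [4, 4, 2, 0]
print(pass_hat_k(example_successes, num_trials=4, k=1))  # 0.625
print(pass_hat_k(example_successes, num_trials=4, k=4))  # 0.5
```

With k equal to the number of trials, only tasks solved on every trial count, which is why pass^k is a stricter reliability measure than a single-trial success rate.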