Agentic Systems and Planning

Can structured reasoning replace code execution for RL rewards?

Can semi-formal reasoning templates make execution-free code verification reliable enough to train RL agents? This matters because execution is expensive and slow in agent training loops.

Note · 2026-05-18 · sourced from Tool Computer Use

Code-agent training has a recurring constraint: real-world deployments often cannot afford full code execution as a verification step. Execution requires sandboxing, environment setup, test infrastructure, and time, and those costs compound across the many rollouts agent RL needs. Recent work has explored execution-free alternatives. SWE-RM trains reward models to approximate test outcomes. Agentic Rubrics decompose verification into LLM-generated criteria. CodeJudge uses LLMs directly as evaluators. All three verify without running the code, but all three rely on unstructured reasoning, which lets the verifier assert conclusions without justifying them.

Agentic Code Reasoning changes the calculus. With semi-formal reasoning templates that act as certificates, execution-free verification can reach reliability levels that previous execution-free methods could not. On patch equivalence verification, a task where the verifier must determine whether two code changes have the same effect, accuracy reaches 93% on real-world agent-generated patches. That number is what matters for RL design: at 93% reward reliability, roughly 7 in 100 judgments are wrong, which puts misjudgment noise on par with the noise in other RL components and leaves the reward signal usable for training.
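
To make the 93% figure concrete, here is a minimal sketch of how verifier error dilutes a binary reward. It assumes the verifier flips the true verdict with the same probability (7%) regardless of whether the patch is actually correct; that symmetry is an assumption for illustration, not something the source reports. The function names are hypothetical.

```python
def expected_observed_reward(true_success_rate: float, verifier_accuracy: float) -> float:
    """Mean reward the trainer sees when a binary verdict is flipped with
    probability (1 - verifier_accuracy), independent of the true label
    (symmetric noise: an assumption of this sketch)."""
    flip = 1.0 - verifier_accuracy
    return true_success_rate * (1.0 - flip) + (1.0 - true_success_rate) * flip


def signal_retention(verifier_accuracy: float) -> float:
    """Factor by which a true success-rate gap between two policies
    shrinks in the observed reward under symmetric noise."""
    return 2.0 * verifier_accuracy - 1.0


if __name__ == "__main__":
    acc = 0.93
    weak = expected_observed_reward(0.40, acc)    # policy that truly succeeds 40% of the time
    strong = expected_observed_reward(0.60, acc)  # policy that truly succeeds 60% of the time
    print(f"observed gap: {strong - weak:.3f}")
    print(f"retention:    {signal_retention(acc):.2f}")
```

Under this model a true gap of 0.20 between two policies shows up as an observed gap of 0.172: the reward still ranks the better policy higher, only with a weaker signal. If the verifier's errors are asymmetric or correlate with what the policy produces, this simple scaling no longer holds.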

The architectural consequence is that a major bottleneck in coding-agent RL, the cost of execution-based reward, has a viable alternative for some task classes. Patch equivalence is one. Fault localization (which the paper also evaluates with semi-formal reasoning) is another. Code question-answering is a third. For these tasks, execution-free verification using structured templates is now a real option rather than a trade of quality for cost.
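
One way to picture a template acting as a certificate is a record whose verdict is accepted only when the supporting fields are filled in. The sketch below is an illustration under stated assumptions: the field names (`premises`, `trace`, `conclusion`), the verdict labels, and the mapping to a reward are all hypothetical and are not the paper's actual template.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

ALLOWED_CONCLUSIONS = {"equivalent", "not_equivalent"}  # hypothetical verdict labels


@dataclass(frozen=True)
class TraceStep:
    location: str     # where in the code the claim is grounded
    observation: str  # what the verifier says happens there


@dataclass(frozen=True)
class Certificate:
    premises: Tuple[str, ...]     # stated facts about the two patches
    trace: Tuple[TraceStep, ...]  # step-by-step justification
    conclusion: str               # must be one of ALLOWED_CONCLUSIONS


def reward_from_certificate(cert: Certificate) -> Optional[float]:
    """Return 1.0 or 0.0 for a structurally complete certificate, or None
    when the verdict is not backed by premises and a trace."""
    if not cert.premises or not cert.trace:
        return None  # a bare verdict is rejected, not rewarded
    if any(not s.location.strip() or not s.observation.strip() for s in cert.trace):
        return None  # every step must point somewhere and say something
    if cert.conclusion not in ALLOWED_CONCLUSIONS:
        return None
    return 1.0 if cert.conclusion == "equivalent" else 0.0
```

The check is purely structural: it confirms that a justification is present, not that it is correct. Its role in this sketch is to stop an unjustified verdict from ever becoming a reward.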

This does not eliminate execution from coding-agent pipelines. Execution remains necessary for some tasks: anything that depends on runtime behavior with side effects, and anything where the formal-language gap is too wide. But for verification tasks that can be expressed as "trace this code and conclude X," structured reasoning is becoming viable. The boundary between execution-required and execution-free shifts.
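
The boundary described above can be sketched as a routing rule in a reward pipeline. The two flags and all names here are hypothetical, chosen to mirror the two conditions in the text; a real pipeline would need its own criteria for deciding them.

```python
from dataclasses import dataclass
from enum import Enum


class Verifier(Enum):
    STRUCTURED_REASONING = "structured_reasoning"
    EXECUTION = "execution"


@dataclass(frozen=True)
class VerificationTask:
    name: str
    traceable_from_source: bool    # can be phrased as "trace this code and conclude X"
    depends_on_side_effects: bool  # needs runtime behavior with side effects


def choose_verifier(task: VerificationTask) -> Verifier:
    """Fall back to execution unless the task is a pure trace-and-conclude check."""
    if task.depends_on_side_effects or not task.traceable_from_source:
        return Verifier.EXECUTION
    return Verifier.STRUCTURED_REASONING


if __name__ == "__main__":
    tasks = [
        VerificationTask("patch equivalence", True, False),
        VerificationTask("database migration check", False, True),
    ]
    for t in tasks:
        print(t.name, "->", choose_verifier(t).value)
```

Defaulting to execution whenever either condition fails keeps the cheaper path limited to the task classes the note actually names.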

Original note title

execution-free code reasoning can approach the reliability needed for RL reward signals when the reasoning is structured