Agentic Code Reasoning

Paper · arXiv 2603.01896
Tool Use and Computer-Use Agents · Chain-of-Thought and Reasoning Methods · Reasoning Architectures

Can LLM agents explore codebases and reason about code semantics without executing the code? We study this capability, which we call agentic code reasoning, and introduce semi-formal reasoning: a structured prompting methodology that requires agents to construct explicit premises, trace execution paths, and derive formal conclusions. Unlike unstructured chain-of-thought, semi-formal reasoning acts as a certificate: the agent cannot skip cases or make unsupported claims. We evaluate across three tasks (patch equivalence verification, fault localization, and code question answering) and show that semi-formal reasoning consistently improves accuracy on all of them. For patch equivalence, accuracy improves from 78% to 88% on curated examples and reaches 93% on real-world agent-generated patches, approaching the reliability needed for execution-free RL reward signals.

Recent work has explored execution-free verification for code agents. SWE-RM trains reward models to approximate test outcomes, Agentic Rubrics decompose verification into LLM-generated criteria, and CodeJudge uses LLMs directly as evaluators. However, these approaches use unstructured reasoning, allowing models to make claims about code behavior without explicit justification. At the other extreme, formal verification approaches translate code or reasoning into formal languages like Lean or Coq, enabling automated proof checking. But fully formal methods require formalizing language semantics, which is impractical for arbitrary repository code spanning multiple frameworks and languages. We introduce semi-formal reasoning, a general approach that bridges this gap. Rather than training specialized models or formalizing semantics, we prompt agents with structured reasoning templates that require explicit evidence for each claim. These templates act as certificates: the agent must state premises, trace relevant code paths, and provide formal conclusions.

The motivating example illustrates how agentic exploration and semi-formal reasoning work together on a real patch equivalence task (django-13670). Two patches both attempt to fix 2-digit year formatting for years before 1000 CE. Standard reasoning incorrectly concludes that the patches are equivalent because it assumes format() is Python’s builtin. With semi-formal analysis, the agent discovers that format is shadowed by a module-level function in Django’s dateformat.py that expects a datetime object, not an integer. As a result, Patch 1 raises an AttributeError while Patch 2 succeeds. Unlike chain-of-thought prompting, which lets the model reason freely, our semi-formal approach requires filling in a structured certificate template with explicit premises, per-test execution traces, and a formal conclusion. This enforces completeness, ensuring the agent cannot skip cases or make unsupported claims, while remaining in natural language rather than requiring a fully formal proof language.
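The shadowing failure described above can be reproduced in a few lines. The following is a minimal sketch, not Django's actual code and not the actual patches: the module-level `format` function, `patch_a`, and `patch_b` are hypothetical stand-ins chosen only to show how a name that looks like the builtin can resolve to something with a different contract.

```python
import builtins
import datetime


def format(value, format_string):
    """Hypothetical module-level helper that shadows the builtin format().

    It expects a date-like object (something with a .year attribute),
    not an integer.
    """
    year = value.year  # raises AttributeError when value is a plain int
    replacements = {"Y": "%04d" % year, "y": "%02d" % (year % 100)}
    return "".join(replacements.get(ch, ch) for ch in format_string)


def patch_a(d):
    # Reads like a call to the builtin format(int, spec), but inside this
    # module the name `format` resolves to the function defined above.
    return format(d.year, "04d")[-2:]


def patch_b(d):
    # Avoids the shadowed name entirely by using %-formatting.
    return "%02d" % (d.year % 100)


if __name__ == "__main__":
    d = datetime.date(476, 9, 4)
    # What a reader who assumes the builtin would predict for patch_a:
    print("assumed:", builtins.format(d.year, "04d")[-2:])
    print("patch_b:", patch_b(d))
    try:
        print("patch_a:", patch_a(d))
    except AttributeError as exc:
        print("patch_a raised AttributeError:", exc)
```

Judged by name alone, both functions appear to return the same two-digit year. The difference only becomes visible once the definition that `format` actually resolves to has been located and read, which is the step the certificate template forces.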

The structured template requires the agent to fill in a function trace table (listing every function examined with file:line locations and verified behavior), data flow analysis (tracing how key variables flow through the code), semantic properties with explicit evidence, and an alternative hypothesis check. This structured format reduces the tendency to guess based on function names, a common failure mode we observed in unstructured reasoning. More broadly, structured agentic reasoning may offer a flexible alternative to classical static analysis tools: rather than encoding analysis logic in specialized algorithms, we can prompt LLM agents with task-specific reasoning templates that generalize across languages and frameworks.
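One way to picture the template is as a record whose sections must all be filled before a conclusion is accepted. The sketch below is an illustration under that assumption, not the paper's actual template or tooling: the class names, field names, and the `missing_sections` check are hypothetical, and the example file location is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceEntry:
    function: str           # function that was examined
    location: str           # "file:line" where its definition was read
    verified_behavior: str  # behavior confirmed from the source


@dataclass
class Certificate:
    premises: List[str] = field(default_factory=list)
    function_trace: List[TraceEntry] = field(default_factory=list)
    data_flow: List[str] = field(default_factory=list)
    semantic_properties: List[str] = field(default_factory=list)
    alternative_hypothesis_check: str = ""
    conclusion: str = ""


def missing_sections(cert: Certificate) -> List[str]:
    """Return the parts of the certificate that are still unfilled."""
    missing = []
    if not cert.premises:
        missing.append("premises")
    if not cert.function_trace:
        missing.append("function_trace")
    for entry in cert.function_trace:
        # A trace row needs a file:line location and a verified behavior.
        path, _, line = entry.location.rpartition(":")
        if not path or not line.isdigit():
            missing.append("location for " + entry.function)
        if not entry.verified_behavior.strip():
            missing.append("verified behavior for " + entry.function)
    if not cert.data_flow:
        missing.append("data_flow")
    if not cert.semantic_properties:
        missing.append("semantic_properties")
    if not cert.alternative_hypothesis_check.strip():
        missing.append("alternative_hypothesis_check")
    if not cert.conclusion.strip():
        missing.append("conclusion")
    return missing


if __name__ == "__main__":
    draft = Certificate(
        premises=["Both patches change two-digit year formatting."],
        # Hypothetical location, for illustration only.
        function_trace=[TraceEntry("format", "pkg/dates.py:12", "")],
        conclusion="Patches are equivalent.",
    )
    print(missing_sections(draft))
```

A check like this only verifies that each section is present. It says nothing about whether the stated evidence is correct, so it complements the agent's reasoning rather than replacing it.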

Future directions. Beyond web environments. State-dependent memory is conceptually agnostic to the environment, and the same idea can be naturally extended to general cases of agentic computer use. Richer state encoding. Our proof-of-concept implementation of state-dependent memory uses basic visual and DOM feature overlap along with simple similarity metrics. A richer encoder can improve both retrieval quality and invariance to superficial changes.