How do execution traces represent state and dynamics in codebase modeling?

This explores how tracing a program's execution captures both its state (what variables and structures hold at a moment) and its dynamics (how that changes as the code runs) — and the most useful thing the corpus has to say is that the line between *running* code and *reasoning about* code is blurrier than you'd expect.

The anchoring idea is that code is special because it's three things at once: executable, inspectable, and stateful (see "Can code become the operational substrate for agent reasoning?"). That triple property is exactly what makes execution traces a good modeling substrate: you can run a step, look at what changed, and carry that state forward into the next step. An agent doesn't just emit code as an answer; it uses the running program as an external memory and a way to verify its own progress. When you see reasoning embedded in explicit algorithms that manage control flow and hand each model call only the state it needs (see "Can algorithms control LLM reasoning better than LLMs alone?"), that's the same instinct: treat the program's evolving state as the thing being modeled, and the steps as the dynamics.
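To make "run a step, look at what changed, carry that state forward" concrete, here is a minimal sketch using Python's standard `sys.settrace` hook. The helper names `trace_execution` and `state_deltas` and the toy function `running_total` are illustrative assumptions, not something taken from the cited notes. Each recorded snapshot is the state; the difference between consecutive snapshots is the dynamics.

```python
import sys


def trace_execution(func, *args):
    """Run func(*args) and record a snapshot of its locals before each line runs."""
    steps = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace frames of other functions
        if event == "line":
            # line offset relative to the `def` line, plus a shallow copy of the locals
            steps.append((frame.f_lineno - code.co_firstlineno, dict(frame.f_locals)))
        elif event == "return":
            steps.append(("return", dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever tracer was active before
    return result, steps


def state_deltas(steps):
    """Dynamics: the variables that are new or changed between consecutive snapshots."""
    deltas = []
    prev = {}
    for _, snapshot in steps:
        changed = {k: v for k, v in snapshot.items() if k not in prev or prev[k] != v}
        deltas.append(changed)
        prev = snapshot
    return deltas


def running_total(values):  # toy function to trace (hypothetical example)
    total = 0
    for v in values:
        total += v
    return total


result, steps = trace_execution(running_total, [1, 2, 3])
deltas = state_deltas(steps)
```

The snapshots are shallow copies, so a mutable object that is modified in place would need a deep copy to show up as a change. That is one reason real tracers tend to be more involved than this sketch.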

Here's the twist worth knowing. You might assume you need to actually execute code to capture its state and dynamics, but the corpus suggests you can often reconstruct the trace by reasoning instead of running. Semi-formal reasoning templates that force an agent to write out premises, walk the code paths, and check evidence reach 93% accuracy on verifying whether two patches do the same thing, without execution (see "Can structured reasoning replace code execution for RL rewards?"). The templates act like a completeness checklist, catching things free-form thinking misses, like one function quietly shadowing another (see "Can structured templates make code reasoning more reliable than free-form thinking?"). In other words, for at least some classes of task, a disciplined *described* trace of state changes can substitute for an *observed* one.
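The "completeness checklist" idea can be sketched as a small data structure. This is a hypothetical illustration, not the template used in the cited work: the class name `PatchEquivalenceTemplate`, its field names, and the sample entries are all assumptions. The point it shows is only that a verdict is not accepted until every required section has been filled in.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PatchEquivalenceTemplate:
    """Hypothetical semi-formal template: a verdict counts only if every section is filled."""
    premises: List[str] = field(default_factory=list)    # what is assumed about inputs and environment
    code_paths: List[str] = field(default_factory=list)  # paths walked through each patch
    evidence: List[str] = field(default_factory=list)    # concrete checks backing the verdict
    verdict: Optional[str] = None                         # e.g. "equivalent" or "not equivalent"

    def missing_sections(self) -> List[str]:
        """Names of the required sections that are still empty."""
        required = {
            "premises": self.premises,
            "code_paths": self.code_paths,
            "evidence": self.evidence,
        }
        return [name for name, items in required.items() if not items]

    def is_complete(self) -> bool:
        """True only when a verdict is present and no required section is empty."""
        return self.verdict is not None and not self.missing_sections()


draft = PatchEquivalenceTemplate(
    premises=["both patches modify the same helper"],
    verdict="equivalent",
)
gaps = draft.missing_sections()  # sections the reasoner skipped

# Filling in the skipped sections is what turns the draft into a usable certificate.
draft.code_paths.append("call site resolves to the module-level function, not the local one")
draft.evidence.append("no other definition with the same name is in scope at the call site")
```

A free-form answer would have stopped at the verdict; the structure is what surfaces the skipped code-path walk, which is where a shadowed function would be noticed.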

But there's a sharp caveat the corpus keeps returning to: a reasoning trace that *looks* like it's modeling execution may not actually be doing so. Traces turn out to be persuasive appearances rather than faithful records: invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize about as well as clean ones (see "Do reasoning traces show how models actually think?"). Across eight models, reflection rarely changes the initial answer, and the trace rarely explains the real computation (see "Can we actually trust reasoning model outputs?"). So when a trace claims to represent program state, that representation is only as trustworthy as the structure forcing it to be, which is exactly why the template-and-certificate approaches matter.

This is where structural signals come in. Instead of trusting a trace's narrative, you can read its *shape*: tree topology, tool-call positions, and expert-aligned actions become dense step-by-step signals about whether the dynamics are sound (see "Can trajectory structure replace hand-annotated process rewards?"). And quality beats quantity: local, step-level confidence catches a breakdown at the exact moment state goes wrong, which global averaging across the whole trace would smooth over (see "Does step-level confidence outperform global averaging for trace filtering?"). The throughline: execution traces are a powerful way to represent state and dynamics, but only when the structure around them keeps the trace honest rather than merely fluent.
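The contrast between local and global confidence is easy to see with numbers. The sketch below is an assumption-laden illustration, not the method from the cited note: the per-step confidence values, the window size of 3, and the 0.5 threshold are all made up for the example. It compares a whole-trace average with the weakest sliding window, and finds the first step at which an early stop could trigger.

```python
from statistics import mean


def global_confidence(step_conf):
    """Average confidence over the whole trace."""
    return mean(step_conf)


def lowest_window_confidence(step_conf, window=3):
    """Mean confidence of the weakest run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return mean(step_conf)
    return min(
        mean(step_conf[i:i + window])
        for i in range(len(step_conf) - window + 1)
    )


def first_breakdown(step_conf, window=3, threshold=0.5):
    """Index of the first step whose trailing window drops below threshold, else None."""
    for end in range(window, len(step_conf) + 1):
        if mean(step_conf[end - window:end]) < threshold:
            return end - 1  # generation could stop here instead of finishing the trace
    return None


# Hypothetical per-step confidences: a short collapse in the middle of the trace.
trace = [0.9, 0.9, 0.9, 0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9]

overall = global_confidence(trace)        # stays above the threshold
weakest = lowest_window_confidence(trace)  # exposes the collapse
stop_at = first_breakdown(trace)           # step index where an early stop would trigger
```

Here the global average stays comfortably above the threshold while the weakest window sits far below it, so a filter based only on the average would keep a trace whose state went wrong midway.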

Sources (8 notes)

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Can structured templates make code reasoning more reliable than free-form thinking?

Semi-formal templates requiring explicit premises, code-path traces, and evidence checks improved patch equivalence accuracy from 78% to 88%, catching cases like function shadowing that free-form reasoning missed. Templates act as completeness certificates without formal verification.

Do reasoning traces show how models actually think?

LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Can we actually trust reasoning model outputs?

Research across eight models shows reflection is mostly confirmatory theater—reflections rarely change initial answers and traces don't faithfully represent reasoning. Calibration degrades under binary reward training, and monitoring mechanisms are easily gamed.

Can trajectory structure replace hand-annotated process rewards?

Tree-GRPO, Supervised RL, and ToolPO each convert sparse outcome rewards into dense step signals by exploiting different structural features—tree topology, expert-aligned actions, and tool-call positions—eliminating the need for annotated process reward models.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
