SYNTHESIS NOTE
Agentic Systems and Tool Use

Does raw token spending actually predict agent performance?

Standard measures of agent effort (tokens, tool calls, operations) may not capture what makes inference-time scaling work. This note explores what actually drives performance gains when agents spend more compute.

Synthesis note · 2026-06-03 · sourced from Test Time Compute

Test-time scaling analyses usually parameterize an agent's effort by raw expenditure: tokens, tool calls, operations, wall time, or cost. But two trajectories with identical token counts can differ sharply in whether their observations were useful. Raw budget is therefore a weak predictor of when more inference-time computation will actually help.

The source paper introduces Effective Feedback Compute (EFC): a trace-level coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, then normalizes by task demand. Empirically the gap is stark. Across controlled simulations, executable code tasks, and real benchmark traces, raw tokens and tool calls explain limited variance (R² = 0.33 and 0.42), and a strong multivariate baseline reaches R² = 0.88. EFC-based coordinates reach 0.94, and the demand-normalized version reaches 0.99.
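The four crediting conditions and the demand normalization can be illustrated with a minimal sketch. This is not the paper's formula, which the note does not reproduce. The `FeedbackEvent` schema, the boolean flags, the unit credit per qualifying event, and the `task_demand` scalar are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FeedbackEvent:
    """One observation returned to the agent (hypothetical schema)."""
    tokens: int        # raw cost of producing and reading this observation
    informative: bool  # did it bear on the task?
    valid: bool        # was it correct and well-formed?
    redundant: bool    # did it repeat feedback already credited?
    retained: bool     # was it still available for later decisions?


def raw_tokens(trace: Sequence[FeedbackEvent]) -> int:
    """Raw expenditure: total tokens regardless of usefulness."""
    return sum(e.tokens for e in trace)


def effective_feedback(trace: Sequence[FeedbackEvent]) -> int:
    """Credit one unit per event that passes all four conditions (assumption)."""
    return sum(
        1
        for e in trace
        if e.informative and e.valid and not e.redundant and e.retained
    )


def normalized_efc(trace: Sequence[FeedbackEvent], task_demand: float) -> float:
    """Effective feedback divided by a task-demand scalar (assumed form)."""
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    return effective_feedback(trace) / task_demand


# Two traces with identical raw budgets but different feedback quality.
good = FeedbackEvent(100, informative=True, valid=True, redundant=False, retained=True)
repeat = FeedbackEvent(100, informative=True, valid=True, redundant=True, retained=True)
lost = FeedbackEvent(100, informative=True, valid=True, redundant=False, retained=False)

trace_a = [good, good, good, repeat]
trace_b = [good, repeat, lost, lost]

print(raw_tokens(trace_a), raw_tokens(trace_b))  # 400 400
print(normalized_efc(trace_a, task_demand=4))    # 0.75
print(normalized_efc(trace_b, task_demand=4))    # 0.25
```

Both traces spend 400 tokens, yet under these assumed flags they differ threefold on the normalized coordinate, which is the kind of separation that a raw-budget axis cannot show.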

The mechanistic payoff: harness interventions matter because they change how efficiently raw budget converts into durable feedback. Under matched raw budgets, improving feedback quality substantially increases success. This reframes "spend more compute": the lever is not the quantity of interaction but the rate at which interaction produces valid, retained evidence. It gives the note "Where does agent reliability actually come from?" a measurable scaling coordinate, and it dovetails with the note "Can externalizing bookkeeping improve search agent performance?", since externalizing bookkeeping raises EFC by keeping feedback non-redundant and retained.
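A toy model can make the matched-budget argument concrete. The sketch below assumes a hypothetical harness in which each tool call returns one observation, a bounded working context evicts its oldest entry, and an external ledger never evicts. The function names, the eviction rule, and the sample observations are illustrative assumptions, not details from the paper.

```python
from collections import deque
from typing import Optional, Sequence


def retained_distinct(observations: Sequence[str],
                      memory_size: Optional[int] = None) -> int:
    """Count distinct observations still available when the run ends.

    memory_size=None models an external ledger that never evicts (assumption).
    An integer models a bounded working context that drops its oldest entry.
    """
    memory: deque = deque(maxlen=memory_size)
    for obs in observations:
        if obs not in memory:  # a repeat of something still held earns no new credit
            memory.append(obs)
    return len(memory)


def conversion_rate(observations: Sequence[str],
                    memory_size: Optional[int] = None) -> float:
    """Retained distinct observations per unit of raw budget (one call each)."""
    if not observations:
        raise ValueError("observations must be non-empty")
    return retained_distinct(observations, memory_size) / len(observations)


# Same six tool calls, hence the same raw budget, under two harnesses.
calls = ["a", "b", "a", "c", "d", "b"]

ledger_rate = conversion_rate(calls)                  # external bookkeeping
bounded_rate = conversion_rate(calls, memory_size=2)  # small working context

print(retained_distinct(calls), retained_distinct(calls, memory_size=2))  # 4 2
print(ledger_rate > bounded_rate)                                         # True
```

In this toy setting the bounded context forgets "b" and later pays for it again, so the same six calls leave only two observations in hand, while the ledger keeps four. The raw budget is identical in both runs; only the conversion rate changes.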

Original note title

agent-harness scaling is governed by effective feedback compute not raw token or tool-call expenditure