SYNTHESIS NOTE

Does raw token spending actually predict agent performance?

Standard measures of agent effort (tokens, tool calls, operations) may not capture what makes inference-time scaling work. This note explores what actually drives performance gains when agents spend more compute.

Synthesis note · 2026-06-03 · sourced from Test Time Compute

Test-time scaling analyses usually parameterize an agent's effort by raw expenditure — tokens, tool calls, operations, wall time, or cost. But two trajectories with identical token counts can differ sharply in whether their observations were useful. Raw budget does not predict when more inference-time computation will actually help.

The paper introduces Effective Feedback Compute (EFC): a trace-level coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, then normalizes by task demand. Empirically the gap is stark. Across controlled simulations, executable code tasks, and real benchmark traces, raw tokens and tool calls explain limited variance in performance (R² = 0.33 and 0.42, respectively). A strong multivariate baseline reaches 0.88, EFC-based coordinates reach 0.94, and the demand-normalized version reaches 0.99.
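The crediting rule described above can be sketched in a few lines. This is a minimal illustration, not the paper's formula: the `FeedbackEvent` schema, the boolean flags, the use of a `key` to detect redundancy, and the choice to count credited events and divide by a scalar `task_demand` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FeedbackEvent:
    """One observation returned to the agent (hypothetical schema)."""
    key: str                  # identifies what was observed; used to detect repeats
    tokens: int               # raw cost of obtaining the observation
    informative: bool = True  # did it bear on the task?
    valid: bool = True        # was the observation correct?
    retained: bool = True     # was it still available for later decisions?


def raw_tokens(trace: Iterable[FeedbackEvent]) -> int:
    """Raw expenditure: every token counts, useful or not."""
    return sum(e.tokens for e in trace)


def effective_feedback_compute(trace: Iterable[FeedbackEvent], task_demand: float) -> float:
    """Credit an event only if it passes all four filters, then normalize.

    Assumption: each credited event contributes one unit, and task_demand is
    the number of such units the task needs. The source does not give the
    exact weighting.
    """
    if task_demand <= 0:
        raise ValueError("task_demand must be positive")
    credited_keys = set()
    for e in trace:
        non_redundant = e.key not in credited_keys
        if e.informative and e.valid and e.retained and non_redundant:
            credited_keys.add(e.key)
    return len(credited_keys) / task_demand


# Two traces with identical raw budgets but different feedback quality.
trace_a = [
    FeedbackEvent("read_spec", 100),
    FeedbackEvent("run_tests", 100),
    FeedbackEvent("read_traceback", 100),
    FeedbackEvent("rerun_tests", 100),
]
trace_b = [
    FeedbackEvent("run_tests", 100),
    FeedbackEvent("run_tests", 100),                        # redundant repeat
    FeedbackEvent("read_docs", 100, valid=False),           # invalid observation
    FeedbackEvent("read_traceback", 100, retained=False),   # dropped from context
]

efc_a = effective_feedback_compute(trace_a, task_demand=4)
efc_b = effective_feedback_compute(trace_b, task_demand=4)
```

Both traces spend 400 tokens, yet under these assumptions the first earns four times the credited feedback of the second, which is the kind of difference a raw-budget coordinate cannot see.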

The mechanistic payoff: harness interventions matter because they change how efficiently raw budget converts into durable feedback. Under matched raw budgets, improving feedback quality substantially increases success. This reframes "spend more compute": the lever is not quantity of interaction but the rate at which interaction produces valid, retained evidence. It gives "Where does agent reliability actually come from?" a measurable scaling coordinate, and it dovetails with "Can externalizing bookkeeping improve search agent performance?": externalizing bookkeeping raises EFC by keeping feedback non-redundant and retained.
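One way to read the matched-budget claim is as a simple factorization. The notation here is an illustrative assumption rather than the paper's definition: $B$ is the raw budget, $\rho_h$ is the rate at which harness $h$ converts raw budget into credited feedback, and $D$ is task demand.

```latex
\mathrm{EFC}_h = \frac{\rho_h \, B}{D},
\qquad
\frac{\mathrm{EFC}_{h_1}}{\mathrm{EFC}_{h_2}} = \frac{\rho_{h_1}}{\rho_{h_2}}
\quad \text{when } B \text{ and } D \text{ are matched.}
```

Under this reading, two harnesses given the same budget on the same task differ in EFC only through their conversion rates, so a harness intervention helps exactly to the extent that it raises $\rho_h$.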

Related concepts in this collection 3

Does agent efficiency really break down into three distinct components?
Can we understand agent efficiency as three independent optimization problems (memory, tool use, and planning), each with separate cost drivers? This matters because it could explain why point optimizations keep missing the bigger picture.
complements: EFC is the outcome-side coordinate to those input-side axes

What should we actually measure in agent evaluation?
Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
EFC operationalizes "context efficiency" as a predictive quantity

How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
EFC explains why raw-compute scaling curves disagree across harnesses
