Does raw token spending actually predict agent performance?
Standard measures of agent effort (tokens, tool calls, operations) may not capture what makes inference-time scaling work. This note explores what actually drives performance gains when agents spend more compute.
Test-time scaling analyses usually parameterize an agent's effort by raw expenditure: tokens, tool calls, operations, wall time, or cost. But two trajectories with identical token counts can differ sharply in whether their observations were useful. Raw budget alone therefore does not predict when more inference-time computation will actually help.
The paper introduces Effective Feedback Compute (EFC): a trace-level coordinate that credits feedback only when it is informative, valid, non-redundant, and retained for subsequent decisions, then normalizes by task demand. Empirically the gap is stark: across controlled simulations, executable code tasks, and real benchmark traces, raw tokens and tool calls explain limited variance (R²=0.33 and 0.42), and a strong multivariate baseline reaches R²=0.88, while EFC-based coordinates reach 0.94 and the demand-normalized version reaches 0.99.
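The four gates and the demand normalization can be made concrete with a small sketch. This is an illustration only, not the paper's formula: the `FeedbackEvent` fields, the `signature` key used to detect redundancy, the choice to count each credited event as one unit, and the `demand` value are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeedbackEvent:
    """One observation in an agent trace (all fields are hypothetical)."""
    tokens: int        # raw cost of obtaining the observation
    signature: str     # assumed key for "what this observation tells the agent"
    informative: bool  # did it bear on the decision at hand?
    valid: bool        # was the observation correct / well-formed?
    retained: bool     # was it still available to later decisions?


def effective_feedback(events):
    """Count events that pass all four gates.

    An event is redundant if an earlier *credited* event had the same signature.
    """
    credited_signatures = set()
    credited = 0
    for event in events:
        redundant = event.signature in credited_signatures
        if event.informative and event.valid and event.retained and not redundant:
            credited += 1
            credited_signatures.add(event.signature)
    return credited


def normalized_efc(events, demand):
    """Credited feedback divided by an assumed task-demand constant."""
    if demand <= 0:
        raise ValueError("demand must be positive")
    return effective_feedback(events) / demand


# Two traces with the same raw token spend but different feedback quality.
trace_a = [
    FeedbackEvent(500, "test_login fails", True, True, True),
    FeedbackEvent(500, "stack trace in auth.py", True, True, True),
    FeedbackEvent(500, "config value missing", True, True, True),
    FeedbackEvent(500, "test_login passes", True, True, True),
]
trace_b = [
    FeedbackEvent(500, "test_login fails", True, True, True),
    FeedbackEvent(500, "test_login fails", True, True, True),       # redundant repeat
    FeedbackEvent(500, "stale cache output", True, False, True),    # invalid
    FeedbackEvent(500, "stack trace in auth.py", True, True, False),  # not retained
]

raw_a = sum(e.tokens for e in trace_a)
raw_b = sum(e.tokens for e in trace_b)
efc_a = normalized_efc(trace_a, demand=4)
efc_b = normalized_efc(trace_b, demand=4)
print(raw_a, raw_b, efc_a, efc_b)
```

Both traces spend the same number of tokens, yet under these assumed gates only one of the four observations in `trace_b` is credited. A real implementation would need graded rather than boolean judgments and a principled estimate of task demand.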
The mechanistic payoff: harness interventions matter because they change how efficiently raw budget converts into durable feedback. Under matched raw budgets, improving feedback quality substantially increases success. This reframes "spend more compute": the lever is not the quantity of interaction but the rate at which interaction produces valid, retained evidence. It gives the related note "Where does agent reliability actually come from?" a measurable scaling coordinate, and it dovetails with "Can externalizing bookkeeping improve search agent performance?": externalizing bookkeeping raises EFC by keeping feedback non-redundant and retained.
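A toy model makes the matched-budget claim tangible. It is a sketch under assumptions that do not come from the paper: each tool call independently yields one unit of credited feedback with probability `yield_rate` (a stand-in for harness quality), and the task succeeds once `demand` units have accumulated. The specific numbers are arbitrary.

```python
from math import comb


def success_probability(budget_calls, yield_rate, demand):
    """P(at least `demand` of `budget_calls` calls yield credited feedback).

    Binomial tail under the assumed independent-yield model.
    """
    return sum(
        comb(budget_calls, k)
        * yield_rate ** k
        * (1 - yield_rate) ** (budget_calls - k)
        for k in range(demand, budget_calls + 1)
    )


budget = 20   # identical raw budget for both harnesses
demand = 8    # assumed units of credited feedback the task requires

weak_harness = success_probability(budget, yield_rate=0.3, demand=demand)
strong_harness = success_probability(budget, yield_rate=0.6, demand=demand)
print(f"weak: {weak_harness:.3f}  strong: {strong_harness:.3f}")
```

With the raw budget held fixed, raising only the conversion rate moves the modeled success probability from well below one half to well above it, which is the sense in which the rate, not the quantity, is the lever.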
Related concepts in this collection 3
- Does agent efficiency really break down into three distinct components?
Can we understand agent efficiency as three independent optimization problems—memory, tool use, and planning—each with separate cost drivers? This matters because it could explain why point optimizations keep missing the bigger picture.
complements: EFC is the outcome-side coordinate to those input-side axes
- What should we actually measure in agent evaluation?
Current agent benchmarks reduce performance to a single success metric, potentially hiding critical differences in how agents operate. What dimensions beyond task accuracy should evaluation frameworks capture?
EFC operationalizes "context efficiency" as a predictive quantity
- How should we balance parallel versus sequential compute at test time?
Test-time compute can prioritize breadth (trying many approaches) or depth (refining one approach). Which strategy works better, and does the answer depend on the problem?
EFC explains why raw-compute scaling curves disagree across harnesses
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Scaling Laws for Agent Harnesses via Effective Feedback Compute
- How we built our multi-agent research system
- Artifacts as Memory Beyond the Agent Boundary
- LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
- Towards a Science of Scaling Agent Systems
- Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets
- LLMs Corrupt Your Documents When You Delegate
- Evaluation and Benchmarking of LLM Agents: A Survey
Original note title
agent-harness scaling is governed by effective feedback compute not raw token or tool-call expenditure