How should benchmarks measure agent efficiency across all three cost dimensions?

This explores how to score an agent not on a single efficiency number but across its three distinct cost dimensions — and why a benchmark that collapses them hides real tradeoffs.

The corpus is unusually direct about why these dimensions resist being merged. The starting move is recognizing that agent efficiency isn't one thing: research decomposes it into three structurally independent axes (memory compression, tool-learning efficiency, and planning optimization), each carrying its own cost profile in tokens, latency, and steps (see "Does agent efficiency really break down into three distinct components?"). Because the axes are orthogonal, improving one (say, fewer planning steps) doesn't reduce another (token-heavy memory), so a benchmark that reports a single efficiency score will systematically reward agents that happen to be cheap on whichever dimension its aggregation weights most heavily.
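To make the weighting problem concrete, here is a minimal sketch of how a collapsed score can flip a ranking. Everything in it is an assumption for illustration: the two agents, their numbers, the `CostProfile` type, and the weighted-sum formula are hypothetical and do not come from any benchmark in the sources.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostProfile:
    """One agent's cost on the three dimensions (lower is better)."""
    tokens: int
    latency_s: float
    steps: int


def collapsed_score(profile, weights, scale):
    """Weighted sum of normalized costs: a hypothetical single-number score."""
    w_tokens, w_latency, w_steps = weights
    return (
        w_tokens * profile.tokens / scale.tokens
        + w_latency * profile.latency_s / scale.latency_s
        + w_steps * profile.steps / scale.steps
    )


def rank(agents, weights, scale):
    """Agent names ordered from cheapest to most expensive collapsed score."""
    return sorted(agents, key=lambda name: collapsed_score(agents[name], weights, scale))


# Illustrative numbers only: one agent is cheap on steps, the other on tokens.
AGENTS = {
    "lean_planner": CostProfile(tokens=40_000, latency_s=30.0, steps=5),
    "lean_memory": CostProfile(tokens=10_000, latency_s=45.0, steps=20),
}
SCALE = CostProfile(tokens=40_000, latency_s=45.0, steps=20)  # per-dimension maxima

TOKEN_HEAVY = (0.8, 0.1, 0.1)  # a benchmark that mostly counts tokens
STEP_HEAVY = (0.1, 0.1, 0.8)   # a benchmark that mostly counts steps

if __name__ == "__main__":
    print(rank(AGENTS, TOKEN_HEAVY, SCALE))
    print(rank(AGENTS, STEP_HEAVY, SCALE))
```

With these made-up numbers the same two agents swap places depending only on the weights, which is the sense in which a single score rewards whatever the benchmark happens to emphasize.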

The deeper argument is that single-score evaluation isn't just imprecise; it's misleading. Capability itself is a vector across separable axes, and models that top one axis often rank low on others, which makes any one-number ranking a poor predictor of deployment readiness (see "Does a single benchmark score actually predict agent readiness?"). The same logic that breaks capability scoring breaks efficiency scoring: a good benchmark should publish trajectory quality, memory hygiene, context efficiency, and verification cost as distinct readouts rather than averaging them into false confidence (see "What should we actually measure in agent evaluation?"). So the answer to 'how should benchmarks measure all three cost dimensions' is partly: don't fuse them. Report tokens, latency, and steps side by side, per axis.
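One way to picture "side by side, per axis" is a small report that keeps a separate row for each axis and a separate column for each cost dimension. This is only a sketch under stated assumptions: the event format (a dict with `axis`, `tokens`, `latency_s`, `steps`), the axis labels, and the function names are hypothetical, and a real harness would need its own way of attributing cost to an axis.

```python
AXES = ("memory", "tool_learning", "planning")
DIMENSIONS = ("tokens", "latency_s", "steps")


def per_axis_report(events):
    """Sum each cost dimension separately for each axis; nothing is averaged."""
    report = {axis: {dim: 0 for dim in DIMENSIONS} for axis in AXES}
    for event in events:
        row = report[event["axis"]]
        for dim in DIMENSIONS:
            row[dim] += event[dim]
    return report


def format_report(report):
    """Render the report as a plain-text table, one row per axis."""
    lines = [f"{'axis':<14}{'tokens':>10}{'latency_s':>12}{'steps':>8}"]
    for axis in AXES:
        row = report[axis]
        lines.append(
            f"{axis:<14}{row['tokens']:>10}{row['latency_s']:>12.1f}{row['steps']:>8}"
        )
    return "\n".join(lines)


# Hypothetical trace of one run, with each event attributed to an axis.
EVENTS = [
    {"axis": "memory", "tokens": 1200, "latency_s": 1.5, "steps": 1},
    {"axis": "planning", "tokens": 300, "latency_s": 0.5, "steps": 2},
    {"axis": "tool_learning", "tokens": 800, "latency_s": 2.0, "steps": 1},
    {"axis": "memory", "tokens": 900, "latency_s": 2.0, "steps": 1},
]

if __name__ == "__main__":
    print(format_report(per_axis_report(EVENTS)))
```

The design choice is that no cell is ever combined with another: a reader who wants a single number has to pick the weights explicitly rather than inherit them from the benchmark.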

What makes this more than a bookkeeping point is a finding that quietly undermines naive token-counting. In multi-agent settings, roughly 80% of performance variance comes from token budget rather than coordination intelligence (see "How does test-time scaling work at the agent level?"), meaning that if you only measure task success, you're often just measuring spend. And the token denominator itself is unstable: in one 115-day case study of a long-running deployment, 82.9% of tokens were cache reads, which argues that the meaningful cost unit shifts from cost-per-token to cost-per-completed-artifact once context persists and gets reused (see "Do persistent agents really cost less per token?"). A benchmark that prices every token equally will misrank an agent that is expensive per-call but cheap per-finished-task.
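The misranking can be worked through with simple arithmetic. The sketch below is illustrative only: the prices, token counts, and artifact counts are hypothetical, and the only figure borrowed from the sources is the 82.9% cache-read share, used here as the cached fraction of the persistent agent's tokens.

```python
# Hypothetical prices in dollars per million tokens (not any provider's real rates).
FRESH_PRICE = 3.0
CACHED_PRICE = 0.3


def run_cost(fresh_tokens, cached_tokens, fresh_price, cached_price):
    """Total dollar cost when fresh and cached tokens are priced separately."""
    return (fresh_tokens * fresh_price + cached_tokens * cached_price) / 1_000_000


def cost_per_artifact(total_cost, artifacts_completed):
    """Cost divided by finished artifacts; infinite if nothing was completed."""
    if artifacts_completed == 0:
        return float("inf")
    return total_cost / artifacts_completed


# Persistent agent: 1M tokens, 82.9% of them cache reads, 4 artifacts finished.
persistent = {"fresh": 171_000, "cached": 829_000, "artifacts": 4}
# Stateless agent: 400k tokens, nothing cached, 2 artifacts finished.
stateless = {"fresh": 400_000, "cached": 0, "artifacts": 2}


def per_artifact(agent, cached_price):
    cost = run_cost(agent["fresh"], agent["cached"], FRESH_PRICE, cached_price)
    return cost_per_artifact(cost, agent["artifacts"])


# Flat pricing treats cache reads like fresh tokens.
flat = {name: per_artifact(a, FRESH_PRICE)
        for name, a in (("persistent", persistent), ("stateless", stateless))}
# Cache-aware pricing charges cache reads at the discounted rate.
aware = {name: per_artifact(a, CACHED_PRICE)
         for name, a in (("persistent", persistent), ("stateless", stateless))}

if __name__ == "__main__":
    print("flat pricing:", flat)
    print("cache-aware pricing:", aware)
```

Under these assumed numbers, flat pricing makes the stateless agent look cheaper per artifact, while cache-aware pricing reverses the order, so the ranking depends on the choice of denominator and price model rather than on the agents themselves.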

Here's the lateral payoff the reader may not expect: the three efficiency axes, despite being independent, converge on shared structural pressures, namely context bounding, minimizing external calls, and controlled search (see "Do efficiency techniques across agent components reveal shared structural constraints?"). That convergence is a gift to benchmark designers, because it suggests the three cost dimensions can be probed with a common vocabulary even while scored separately. It also reframes where efficiency comes from: reliable agents externalize memory, skills, and protocols into a harness layer rather than burning model capacity re-solving the same problems (see "Where does agent reliability actually come from?"), so efficiency benchmarks should be testing the harness, not just the model.

Finally, the corpus hints at what efficient design looks like once you measure honestly, which is itself an argument for these benchmarks existing. Small language models handle most repetitive agentic subtasks at 10–30× lower cost, making heterogeneous SLM-default architectures the economically rational pattern (see "Can small language models handle most agent tasks?"), and multi-agent teams can deactivate their weakest members at inference time via contribution scoring (see "Can multi-agent teams automatically remove their weakest members?"). Neither optimization shows up if your benchmark only watches task success; each becomes visible only when latency, steps, and tokens are all on the scoreboard.

Sources (9 notes)

Does agent efficiency really break down into three distinct components?

Research identifies memory compression, tool-learning efficiency, and planning optimization as three structurally independent components, each with distinct cost profiles (tokens, latency, and steps). Improving one axis does not automatically improve the others, so design has to consider all three together.

Does a single benchmark score actually predict agent readiness?

Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens into a harness layer rather than relying on model scale alone: memory (state persistence), skills (procedural components), and protocols (structured interaction). The harness unifies these externalized components and eliminates the need for the model to solve the same problems repeatedly.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.

Can multi-agent teams automatically remove their weakest members?

DyLAN's three-step importance scoring mechanism (propagation, aggregation, selection) quantifies individual agent contributions and automatically removes uninformative agents during inference, optimizing team composition without task-specific tuning.
