What metrics replace throughput per token for agent deployment?
This explores what we should measure instead of tokens-per-second once agents run as persistent, long-horizon systems rather than single prompt-response calls.
The corpus suggests that the denominator itself changes, not just the metric. The most direct answer is that the meaningful unit shifts from the token to the completed artifact. A 115-day case study found that 82.9% of tokens were cache reads, so counting raw tokens badly misrepresents what the work actually cost; when context persists and is reused, the honest denominator is finished pieces of work, not individual tokens (see: Do persistent agents really cost less per token?).
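To make the change of denominator concrete, here is a minimal sketch of cost-per-artifact accounting. The `TokenUsage` fields, the `PRICE_PER_MTOK` values, and the example token counts are all hypothetical illustrations rather than figures from the case study; only the 82.9% cache-read share comes from the source.

```python
from dataclasses import dataclass


@dataclass
class TokenUsage:
    fresh_input: int  # uncached input tokens
    cache_read: int   # input tokens served from a prompt cache
    output: int       # generated tokens


# Hypothetical prices in dollars per million tokens (assumed, not from the source).
PRICE_PER_MTOK = {"fresh_input": 3.00, "cache_read": 0.30, "output": 15.00}


def total_tokens(usage: TokenUsage) -> int:
    return usage.fresh_input + usage.cache_read + usage.output


def cache_read_share(usage: TokenUsage) -> float:
    """Fraction of all tokens that were cache reads."""
    return usage.cache_read / total_tokens(usage)


def total_cost(usage: TokenUsage) -> float:
    """Dollar cost when each token class is billed at its own rate."""
    weighted = (
        usage.fresh_input * PRICE_PER_MTOK["fresh_input"]
        + usage.cache_read * PRICE_PER_MTOK["cache_read"]
        + usage.output * PRICE_PER_MTOK["output"]
    )
    return weighted / 1_000_000


def cost_per_artifact(usage: TokenUsage, artifacts_completed: int) -> float:
    """Dollar cost divided by finished pieces of work."""
    if artifacts_completed <= 0:
        raise ValueError("artifacts_completed must be positive")
    return total_cost(usage) / artifacts_completed


# Made-up usage whose cache-read share matches the 82.9% reported in the source.
usage = TokenUsage(fresh_input=100_000, cache_read=829_000, output=71_000)
```

Under these assumed prices, the cache reads are most of the token volume but a small part of the bill, which is why a flat per-token figure says little about what each artifact cost.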
But cost-per-artifact is only one axis. A recurring theme is that a single number, whether throughput or task success, hides the multidimensional behavior that actually determines whether an agent is deployable. One line of work argues capability is a *vector* across separable axes: task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models that top one axis often rank low on another, which makes single-score rankings systematically misleading (see: Does a single benchmark score actually predict agent readiness?). A closely related argument reframes evaluation around the *trajectory* rather than the endpoint, proposing benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost (see: What should we actually measure in agent evaluation?). Notice the overlap: "context efficiency" and "verification cost" are throughput-adjacent metrics, but normalized against useful progress rather than raw generation speed.
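As a small illustration of why a capability vector resists a single ranking, the sketch below compares two models axis by axis. The axis names follow the list above; the model names and every score are made up for illustration, and `dominates` and `ranking` are hypothetical helpers, not part of any cited benchmark.

```python
from typing import Dict, List

AXES = (
    "task_success",
    "privacy_compliance",
    "long_horizon_retention",
    "mode_shift_behavior",
    "ecosystem_readiness",
)

Scores = Dict[str, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    return all(a[x] >= b[x] for x in AXES) and any(a[x] > b[x] for x in AXES)


def ranking(models: Dict[str, Scores], axis: str) -> List[str]:
    """Model names sorted from best to worst on a single axis."""
    return sorted(models, key=lambda name: models[name][axis], reverse=True)


# Made-up scores in [0, 1]; not measurements from any source.
models = {
    "model_a": dict(zip(AXES, (0.9, 0.4, 0.6, 0.5, 0.7))),
    "model_b": dict(zip(AXES, (0.7, 0.9, 0.5, 0.6, 0.4))),
}
```

Here neither model dominates the other, and the per-axis rankings disagree, so any single score would have to hide a trade-off that matters for deployment.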
The reason these replacements matter becomes sharp when you look at multi-agent systems, where token spending is exposed as a confound rather than a virtue. Several notes converge on the finding that roughly 80% of multi-agent performance variance is explained by token budget alone, not coordination intelligence. Such systems can burn 15× more tokens than a single agent, and coordination yields negative returns above roughly 45% accuracy (see: Does token spending drive multi-agent research performance?; Are multi-agent systems actually intelligent coordination or just token spending?; How does test-time scaling work at the agent level?). If more tokens almost mechanically buy more performance, then throughput-per-token tells you nothing about whether the architecture is good; it only tells you how hard you stepped on the gas. The useful metric becomes performance *per dollar* or *per artifact* with token spend held constant, which is exactly why heterogeneous designs that route most subtasks to small models at 10–30× lower cost look economically rational (see: Can small language models handle most agent tasks?).
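One way to check whether token spend is acting as a confound in your own runs is to measure how much of the variance in scores a token-only fit explains. The sketch below computes the coefficient of determination for a simple linear fit of score on token budget. This is an assumed reading of "variance explained"; the notes do not say how the 80% figure was computed, and `r_squared` is a hypothetical helper.

```python
from typing import Sequence


def r_squared(tokens: Sequence[float], scores: Sequence[float]) -> float:
    """R^2 of a one-variable least-squares fit of scores on tokens.

    For simple linear regression this equals the squared Pearson correlation.
    """
    if len(tokens) != len(scores) or len(tokens) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(tokens)
    mean_t = sum(tokens) / n
    mean_s = sum(scores) / n
    s_tt = sum((t - mean_t) ** 2 for t in tokens)
    s_ss = sum((s - mean_s) ** 2 for s in scores)
    s_ts = sum((t - mean_t) * (s - mean_s) for t, s in zip(tokens, scores))
    if s_tt == 0 or s_ss == 0:
        raise ValueError("tokens and scores must each vary")
    return (s_ts * s_ts) / (s_tt * s_ss)
```

If this value is high across architectures, comparing them at their native token budgets mostly compares spend; comparing them at a matched budget is the fairer test of the design itself.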
There's a more surprising candidate metric hiding here, too: how much *learning signal* the deployment generates. One line argues that every agent action emits a next-state signal (a user reply, a tool output, an error, a changed GUI) that can train the policy directly, turning deployment itself into a training loop (see: Can agent deployment itself generate training signals automatically?). Under that lens, a deployed agent's value isn't only artifacts produced but usable signal per interaction. And on the efficiency side, methods like shared-prefix tree rollouts measure *distinct trajectories per token budget*, squeezing more independent learning out of the same compute, while tool-call credit assignment improves *sample efficiency* by attributing reward to the tokens that mattered (see: Can shared-prefix trees reduce redundancy in agent rollouts?; Can simulated APIs and token-level credit assignment train better tool-using agents?).
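The "distinct trajectories per token budget" idea can be shown with back-of-the-envelope arithmetic. The sketch below assumes a deliberately simplified setup that is not taken from the cited note: every trajectory is one shared prefix of `prefix_len` tokens followed by one suffix of `suffix_len` tokens, and the tree branches only once, at the end of the prefix. Both function names are hypothetical.

```python
def chains_within_budget(budget: int, prefix_len: int, suffix_len: int) -> int:
    """Independent chains: each trajectory regenerates the prefix and a suffix."""
    if prefix_len < 0 or suffix_len <= 0:
        raise ValueError("prefix_len must be >= 0 and suffix_len must be > 0")
    return budget // (prefix_len + suffix_len)


def tree_branches_within_budget(budget: int, prefix_len: int, suffix_len: int) -> int:
    """Shared-prefix tree: the prefix is generated once, then only suffixes are paid for."""
    if prefix_len < 0 or suffix_len <= 0:
        raise ValueError("prefix_len must be >= 0 and suffix_len must be > 0")
    if budget < prefix_len:
        return 0
    return (budget - prefix_len) // suffix_len
```

For example, with an assumed budget of 10,000 tokens, a 1,500-token prefix, and 500-token suffixes, independent sampling fits 5 trajectories while the single-branch tree fits 17. Real tree rollouts branch at several depths, so this only indicates the direction of the effect.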
So the replacement isn't one metric but a small family, organized by what you actually care about: cost-per-artifact (economics), the capability vector and trajectory-quality measures (deployability), performance-at-fixed-token-budget (architecture honesty), and signal-per-interaction (learning value). The thread connecting all of them is that they normalize against *useful work accomplished*, which is precisely what raw throughput-per-token erases.
Sources (10 notes)
A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.
Capability decomposes into task success, privacy compliance, long-horizon retention, mode-shift behavior, and ecosystem readiness. Models ranked highest on one axis often rank lower on others, making single-score evaluations systematically misleading for real deployment.
Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.
Anthropic's internal evals show token spending alone accounts for 80% of performance variance in multi-agent research systems. Model capability upgrades deliver larger gains than doubling token budget, suggesting efficiency matters as much as quantity.
Research shows token usage explains 80% of multi-agent performance variance, systems use 15× more tokens than single agents, and coordination yields negative returns above 45% accuracy. Performance gains come from token distribution, not coordination sophistication.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
Every agent action produces a next-state signal (user reply, tool output, error, GUI change) that can train the policy directly. This universal signal source eliminates the need for separate training datasets across conversations, terminal tasks, SWE, and tool use.
Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.
ToolPO replaces costly real-API interactions with LLM-simulated ones and assigns credit directly to tool-invocation tokens rather than spreading outcome rewards across trajectories. This combination improves training stability and sample efficiency for tool-using agents.