How should we measure context efficiency and verification cost in agents?

This explores how to measure two slippery things in AI agents: how well they use their context window, and what it costs to check their work. The corpus suggests both belong inside a richer evaluation than 'did the task pass?'

Context efficiency and verification cost are best treated not as abstract metrics but as the missing dimensions of agent evaluation. The starting point is that single-score, did-it-pass evaluation hides exactly these things. One note argues that evaluation has to move from one-shot task success to trajectory quality, naming context efficiency and verification cost as first-class benchmarks alongside memory hygiene, because a single number collapses multi-dimensional behavior and breeds false confidence in deployment readiness (What should we actually measure in agent evaluation?).

On context efficiency, the corpus points away from cost per token as the unit to optimize. A 115-day deployment study found that 82.9% of tokens were cache reads, which means the honest denominator is completed artifacts, not raw tokens: measure cost per finished result, not per word generated (Do persistent agents really cost less per token?). Multi-agent work reinforces this from the other side: roughly 80% of performance variance comes from token budget rather than coordination cleverness, so any efficiency metric that doesn't normalize for tokens spent will mistake spending for skill (How does test-time scaling work at the agent level?). Efficiency also has structural levers, not just accounting ones. Agents that autonomously fold interaction history into episodic, working, and tool memory schemas cut token overhead directly (Can agents compress their own memory without losing critical details?), and the broader pattern is that reliability comes from externalizing memory and skills into a harness so the model stops re-solving the same context problem every turn (Where does agent reliability actually come from?). A useful efficiency measure, then, asks how much reusable structure the agent builds, not just how lean a single trace looks.
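One way to make "cost per finished result" concrete is sketched below. It is a minimal illustration, not a method taken from the notes: the `UsageRecord` fields, the function names, and the per-million-token prices are all assumptions chosen for the example, and real pricing will differ by provider.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (assumed for illustration only).
PRICE_PER_MTOK = {"fresh_input": 3.00, "cached_input": 0.30, "output": 15.00}


@dataclass
class UsageRecord:
    fresh_input_tokens: int   # input tokens processed without a cache hit
    cached_input_tokens: int  # input tokens served as cache reads
    output_tokens: int        # tokens the model generated
    artifacts_completed: int  # finished results, however the team defines them


def total_tokens(u: UsageRecord) -> int:
    return u.fresh_input_tokens + u.cached_input_tokens + u.output_tokens


def total_cost_usd(u: UsageRecord) -> float:
    """Price each token class separately, since cache reads are usually cheaper."""
    weighted = (
        u.fresh_input_tokens * PRICE_PER_MTOK["fresh_input"]
        + u.cached_input_tokens * PRICE_PER_MTOK["cached_input"]
        + u.output_tokens * PRICE_PER_MTOK["output"]
    )
    return weighted / 1_000_000


def cache_read_share(u: UsageRecord) -> float:
    """Fraction of all tokens that were cache reads."""
    total = total_tokens(u)
    return u.cached_input_tokens / total if total else 0.0


def cost_per_artifact(u: UsageRecord) -> float:
    """Cost normalized by finished results rather than by tokens."""
    if u.artifacts_completed <= 0:
        raise ValueError("no completed artifacts; cost per artifact is undefined")
    return total_cost_usd(u) / u.artifacts_completed


def tokens_per_artifact(u: UsageRecord) -> float:
    """Token spend per finished result, to keep spending from passing as skill."""
    if u.artifacts_completed <= 0:
        raise ValueError("no completed artifacts; tokens per artifact is undefined")
    return total_tokens(u) / u.artifacts_completed
```

Reporting `cost_per_artifact` next to `tokens_per_artifact` keeps both normalizations visible: the first reflects what a result actually cost once cache reads are discounted, the second shows how much budget was spent to get it.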

Verification cost is where the corpus gets most concrete, and the key insight is that cheaper verification often catches *more*, not less. Checking the reasoning process mid-trace rather than scoring only the final answer lifted task success from 32% to 87%, because most failures turn out to be process violations, not wrong endpoints (Where do reasoning agents actually fail during long traces?). That reframes 'cost' away from 'how expensive is a check' toward 'where does a check buy the most error coverage per dollar.' Two notes then show how to make verification cheap enough to run continuously: execution-free reasoning over code reaches 93% accuracy on patch-equivalence checks without running anything (Can structured reasoning replace code execution for RL rewards?), and treating code as an executable, inspectable, stateful medium lets agents verify their own progress as a side effect of doing the work (Can code become the operational substrate for agent reasoning?).
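The "error coverage per dollar" framing can be sketched as a simple ranking over candidate checks. This is an illustrative assumption about how one might operationalize it, not a procedure described in the notes: the `Check` type, its fields, and the function names are hypothetical, and the inputs would have to come from a labeled sample of traces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Check:
    name: str
    total_cost_usd: float  # what it cost to run this check over the sample
    errors_caught: int     # distinct real errors it flagged in the same sample


def errors_per_dollar(check: Check) -> float:
    """Error coverage bought per dollar spent on this check."""
    if check.total_cost_usd <= 0:
        raise ValueError("check cost must be positive")
    return check.errors_caught / check.total_cost_usd


def rank_checks(checks: List[Check]) -> List[Check]:
    """Order checks from most to least error coverage per dollar."""
    return sorted(checks, key=errors_per_dollar, reverse=True)
```

A ranking like this only compares checks measured on the same traces. It also ignores overlap between checks, so two checks that catch the same errors would both look valuable; a fuller version would count marginal coverage instead.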

The lateral surprise is what happens when you neglect verification in multi-agent settings. Coordination degrades predictably with scale not because agents can't talk, but because they accept neighbor information *without verifying it*, letting a single error propagate through the network. Verification isn't an evaluation afterthought; it's the thing whose absence makes systems fail at scale (Why do multi-agent systems fail to coordinate at scale?). There is also a routing-level efficiency angle: most agentic subtasks are repetitive enough that small models handle them at 10–30× lower cost, so a real context-efficiency metric should also ask whether you're spending a large model's context budget on work a small one could finish (Can small language models handle most agent tasks?). Measured well, efficiency and verification stop being overhead line items and become the two dials that most decide whether an agent is trustworthy.
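The routing question can be approximated with back-of-the-envelope arithmetic. The sketch below assumes a fixed per-subtask cost for the large model and a constant cost ratio between the two models; the function names and parameters are hypothetical, and the share of subtasks a small model can really handle is something to measure, not assume.

```python
def blended_cost(n_subtasks: int, small_share: float,
                 large_cost_per_task: float, cost_ratio: float) -> float:
    """Total cost when `small_share` of subtasks go to a small model.

    `cost_ratio` is how many times cheaper the small model is per subtask
    (the source note cites a 10-30x range; the value used is an assumption).
    """
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    small_cost_per_task = large_cost_per_task / cost_ratio
    n_small = n_subtasks * small_share
    n_large = n_subtasks - n_small
    return n_small * small_cost_per_task + n_large * large_cost_per_task


def savings_fraction(small_share: float, cost_ratio: float) -> float:
    """Fraction of the all-large-model cost saved by routing."""
    baseline = blended_cost(1, 0.0, 1.0, cost_ratio)
    return 1.0 - blended_cost(1, small_share, 1.0, cost_ratio) / baseline
```

Under these assumptions, routing 80% of subtasks to a model that is 10× cheaper saves 72% of the all-large-model cost. The estimate leaves out the cost of misrouted subtasks that must be retried on the large model, which a real metric would need to include.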

Sources (10 notes)

What should we actually measure in agent evaluation?

Single-score evaluation collapses multi-dimensional agent behavior and creates false confidence in deployment readiness. Research shows agents need benchmarks for trajectory quality, memory hygiene, context efficiency, and verification cost to reflect actual system performance.

Do persistent agents really cost less per token?

A 115-day case study found that 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Can agents compress their own memory without losing critical details?

DeepAgent's autonomous memory folding consolidates interaction history into episodic, working, and tool memory schemas. This reduces token overhead while letting agents pause to reconsider strategies—the autonomy and structure together avoid degradation that plagues poorly designed consolidation.

Where does agent reliability actually come from?

Research shows reliable LLM agents externalize three cognitive burdens—memory (state persistence), skills (procedural components), and protocols (structured interaction)—into a harness layer rather than relying on model scale alone. The harness unifies these externalities and eliminates the need for the model to solve the same problems repeatedly.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.

Can code become the operational substrate for agent reasoning?

Research shows code uniquely enables agents to externalize reasoning, execute policies, model environments, and verify progress through its simultaneous executability, inspectability, and statefulness across task steps.

Why do multi-agent systems fail to coordinate at scale?

The AgentsNet benchmark shows that agents fail to coordinate strategies either by agreeing too late or by adopting strategies without informing neighbors. Agents accept neighbor information without verification, which enables error propagation even though they remain capable of detecting direct conflicts.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
