How much does external API latency dominate total agent execution cost?

This explores whether slow round-trips to outside services (search engines, tool/function APIs, UI calls) are what actually drives an agent's cost. The corpus answers sideways: it treats external calls as a dominant cost driver, but measures that cost more in tokens and task-completion time than in raw network latency.

The corpus doesn't measure wall-clock latency head-on, but it converges hard on a related claim: external calls are the expensive part, and a surprising amount of agent research is really about getting rid of them. The strongest single signal is Do efficiency techniques across agent components reveal shared structural constraints? There, independent lines of work on memory, tool use, and planning all rediscover the same principle of 'minimizing external calls,' which suggests round-trips to outside systems are a structural cost pressure, not an incidental one.

Where the corpus gets concrete, the cost shows up as task-completion time rather than network milliseconds. The AXIS framework in Can API-first agents outperform UI-based agent interaction? cuts task completion time by 65–70% specifically by replacing long sequences of UI interactions with direct API calls. The interesting twist is that the slow path isn't the API; it's the chatty back-and-forth the API lets you skip. The bottleneck is the *number* of interaction steps, and a single well-chosen call collapses many slow ones.
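The step-count argument can be sketched as simple arithmetic. This is a toy model, not the AXIS method: it assumes steps run strictly one after another, and the `Step` type, the helper names, and every timing below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One agent interaction: a model turn followed by an external round-trip."""
    model_seconds: float     # hypothetical time for the model to pick an action
    external_seconds: float  # hypothetical time for the UI action or API call


def total_seconds(steps: list[Step]) -> float:
    """Sequential steps simply add up; nothing overlaps in this toy model."""
    return sum(s.model_seconds + s.external_seconds for s in steps)


def time_reduction(baseline: list[Step], alternative: list[Step]) -> float:
    """Fraction of baseline time saved by the alternative path."""
    base = total_seconds(baseline)
    return (base - total_seconds(alternative)) / base


# Illustrative numbers only, not measurements from any paper.
ui_path = [Step(model_seconds=2.0, external_seconds=0.5) for _ in range(8)]
api_path = [Step(model_seconds=2.0, external_seconds=3.0)]  # one slower call

print(f"UI path:   {total_seconds(ui_path):.1f}s over {len(ui_path)} steps")
print(f"API path:  {total_seconds(api_path):.1f}s over {len(api_path)} step")
print(f"Reduction: {time_reduction(ui_path, api_path):.0%}")
```

Even with the single API call made six times slower than one UI action, the API path wins in this sketch, because each step it removes also removes a model turn.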

Two training-side notes attack external-call cost so aggressively that they delete the API entirely. Can LLMs replace search engines during agent training? shows a 14B model can generate search results from internal knowledge well enough to skip real search APIs during RL, and Can simulated APIs and token-level credit assignment train better tool-using agents? replaces costly real-API interactions with LLM-simulated ones. You don't simulate away something that's cheap and fast: the fact that 'fake the API with another model' is a winning move tells you real external calls were the dominant burden in that loop.

The quieter counterpoint is that for *inference*, the corpus keeps pointing at tokens, not latency, as the cost denominator. How does test-time scaling work at the agent level? finds 80% of multi-agent performance variance comes from token budget, and Do persistent agents really cost less per token? argues the right unit is completed artifacts (with 82.9% of tokens served from cache). So 'cost' splits in two: the compute/token bill, which the corpus measures carefully, and external-call latency, which it treats as a thing to engineer around rather than a thing to quantify.
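A small worked example makes the change of denominator concrete. Only the 82.9% cache-read share comes from the note cited above; the token count, both prices, the artifact count, and the function names are assumptions made up for this sketch.

```python
def token_bill(total_tokens: int, cache_read_fraction: float,
               fresh_price: float, cached_price: float) -> float:
    """Bill for a run in which cache reads are priced differently from fresh tokens."""
    cached_tokens = total_tokens * cache_read_fraction
    fresh_tokens = total_tokens - cached_tokens
    return fresh_tokens * fresh_price + cached_tokens * cached_price


def cost_per_artifact(bill: float, completed_artifacts: int) -> float:
    """Divide the bill by finished work rather than by tokens."""
    if completed_artifacts <= 0:
        raise ValueError("need at least one completed artifact")
    return bill / completed_artifacts


# Hypothetical inputs, except the 0.829 cache-read share quoted in the text.
TOTAL_TOKENS = 10_000_000
FRESH_PRICE = 3e-6    # assumed dollars per uncached token
CACHED_PRICE = 3e-7   # assumed dollars per cache-read token
ARTIFACTS = 40        # assumed number of completed artifacts

bill = token_bill(TOTAL_TOKENS, 0.829, FRESH_PRICE, CACHED_PRICE)
flat_bill = token_bill(TOTAL_TOKENS, 0.0, FRESH_PRICE, CACHED_PRICE)

print(f"bill with cache reads:   ${bill:.2f}")
print(f"bill with nothing cached: ${flat_bill:.2f}")
print(f"cost per artifact:        ${cost_per_artifact(bill, ARTIFACTS):.3f}")
```

Note that external-call latency appears nowhere in this calculation, which mirrors how the corpus accounts for inference cost.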

The useful surprise: nobody in this collection has actually published 'external API latency is X% of total cost.' What they've published is a stack of techniques whose entire reason for existing is to avoid, batch, cache, or simulate external calls — including using cheaper small models for the repetitive tool-shaped work in Can small language models handle most agent tasks?. The dominance of external calls is the unstated premise behind the whole efficiency literature, even where no one stops to put a number on it.
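Of the four moves listed above, caching is the easiest to show in a few lines. The sketch below is a minimal illustration using the standard library's `functools.lru_cache`; `external_search` is a hypothetical stand-in that never touches the network, and the query strings are invented.

```python
from functools import lru_cache

real_calls = 0  # how many times the stand-in external service was actually hit


def external_search(query: str) -> str:
    """Hypothetical stand-in for a slow external API; no network is used."""
    global real_calls
    real_calls += 1
    return f"results for {query!r}"


@lru_cache(maxsize=256)
def cached_search(query: str) -> str:
    """Repeat queries are answered from memory instead of a new round-trip."""
    return external_search(query)


queries = ["agent cost", "tool latency", "agent cost", "agent cost"]
answers = [cached_search(q) for q in queries]

print(f"{len(queries)} lookups, {real_calls} real calls")
```

An in-process cache like this only helps when queries repeat exactly and stale results are acceptable, so it is one option among the four rather than a general fix.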

Sources 7 notes

Do efficiency techniques across agent components reveal shared structural constraints?

Techniques for memory, tool learning, and planning independently converge on shared principles: context bounding, minimizing external calls, and controlled search. This convergence suggests these reflect fundamental structural pressures in agentic computation rather than component-specific optimizations.

Can API-first agents outperform UI-based agent interaction?

The AXIS framework shows that prioritizing API calls over sequential UI interactions cuts task completion time by 65–70% while maintaining 97–98% accuracy and reducing cognitive workload by 38–53%. A self-exploration mechanism automatically discovers and constructs APIs from existing applications, solving the bootstrapping problem.

Can LLMs replace search engines during agent training?

ZeroSearch and SSRL demonstrate that LLMs can generate relevant documents and search results from internal knowledge, with 14B simulators matching or exceeding real search engines. Curriculum degradation and test-time scaling optimize this approach for training without API costs.

Can simulated APIs and token-level credit assignment train better tool-using agents?

ToolPO replaces costly real-API interactions with LLM-simulated ones and assigns credit directly to tool-invocation tokens rather than spreading outcome rewards across trajectories. This combination improves training stability and sample efficiency for tool-using agents.

How does test-time scaling work at the agent level?

Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.

Do persistent agents really cost less per token?

A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.

Can small language models handle most agent tasks?

SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs used selectively) the economically rational design pattern.
