How do cache-dominant workflows change the marginal cost of agent tasks?
This note explores what happens to the economics of agent work once most of the tokens flowing through a system are cache reads rather than fresh computation. The corpus suggests that the unit you should be measuring shifts from tokens to completed work.
The question comes down to a change of denominator. Once an agent's context persists and gets reused, the cost of any single task stops being "how many tokens did this take" and becomes "how many finished pieces of work did the accumulated context produce." A 115-day case study makes this concrete: 82.9% of all tokens were cache reads, not fresh generation (see "Do persistent agents really cost less per token?"). In that regime, the marginal cost of the next task falls sharply, because the expensive part, building the context, is already paid for. The meaningful cost unit becomes the completed artifact.
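To make the denominator change concrete, here is a minimal arithmetic sketch. The per-token prices, the token counts, and the function names `task_cost` and `cost_per_artifact` are illustrative assumptions, not figures or methods from the case study; only the 82.9% cache-read share comes from the text above.

```python
# Illustrative prices in dollars per million tokens. These are assumptions
# for the sketch, not figures from the case study.
FULL_PRICE = 3.00        # freshly processed tokens
CACHE_READ_PRICE = 0.30  # tokens served from cache


def task_cost(context_tokens: int, new_tokens: int, cached: bool) -> float:
    """Dollar cost of one task under the assumed prices above."""
    context_price = CACHE_READ_PRICE if cached else FULL_PRICE
    return (context_tokens * context_price + new_tokens * FULL_PRICE) / 1_000_000


def cost_per_artifact(total_cost: float, artifacts_completed: int) -> float:
    """Total spend divided by the number of finished pieces of work."""
    if artifacts_completed <= 0:
        raise ValueError("artifacts_completed must be positive")
    return total_cost / artifacts_completed


if __name__ == "__main__":
    # Token split chosen to mirror the 82.9% cache-read share.
    cold = task_cost(829_000, 171_000, cached=False)
    warm = task_cost(829_000, 171_000, cached=True)
    print(f"cold start: ${cold:.4f}  warm cache: ${warm:.4f}")
    # Hypothetical: ten warm tasks that together yield four finished artifacts.
    print(f"per artifact: ${cost_per_artifact(10 * warm, 4):.4f}")
```

Under these assumed prices the warm task costs roughly a quarter of the cold one. The ratio depends entirely on the gap between cache-read and full prices, so treat the numbers as a shape, not a forecast.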
The corpus shows several different mechanisms that produce this cache-dominance, and they're worth seeing side by side. One is reuse of reasoning structure: shared-prefix tree rollouts branch many distinct trajectories off a common cached prefix, so you get more genuinely different attempts per token budget than running independent chains from scratch (see "Can shared-prefix trees reduce redundancy in agent rollouts?"). Another is reuse of working memory: recursive subtask trees with rule-based KV-cache pruning sustain accurate reasoning even when manipulating 90% of the cache, which lets one model do what used to need a whole multi-agent system (see "Can recursive subtask trees overcome context window limits?"). A third is reuse of learned procedure: agents that extract and compound reusable sub-task routines post relative gains of 24.6% on Mind2Web and 51.1% on WebArena, and the gains grow as tasks drift further from training, so the cached routine is doing more of the marginal work each time (see "Can agents learn reusable sub-task routines from past experience?").
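The first mechanism, shared-prefix branching, can be illustrated with a simple token count. This is a simplified sketch that assumes a single shared prefix and equal-length continuations; real tree rollouts branch at several depths. The function names are hypothetical.

```python
def independent_chain_tokens(prefix_len: int, suffix_len: int, n: int) -> int:
    """Tokens spent when each of n trajectories regenerates the prefix."""
    return n * (prefix_len + suffix_len)


def shared_prefix_tokens(prefix_len: int, suffix_len: int, n: int) -> int:
    """Tokens spent when n trajectories branch off one shared prefix."""
    return prefix_len + n * suffix_len


def trajectories_within_budget(budget: int, prefix_len: int,
                               suffix_len: int, shared: bool) -> int:
    """How many full trajectories fit in a fixed token budget."""
    if suffix_len <= 0:
        raise ValueError("suffix_len must be positive")
    if shared:
        # The prefix is paid once; the rest of the budget goes to branches.
        return max(0, (budget - prefix_len) // suffix_len)
    return budget // (prefix_len + suffix_len)


if __name__ == "__main__":
    budget, prefix, suffix = 10_000, 1_500, 500
    print(trajectories_within_budget(budget, prefix, suffix, shared=False))
    print(trajectories_within_budget(budget, prefix, suffix, shared=True))
```

With a 10,000-token budget, a 1,500-token prefix, and 500-token continuations, the independent scheme fits 5 trajectories and the shared-prefix scheme fits 17. The longer the prefix relative to the continuation, the larger the advantage.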
Here's the part you might not expect: this reframes a lot of the multi-agent coordination debate. Research finds that 80% of multi-agent performance variance comes from token budget, not coordination intelligence, so performance is mostly a function of spend. Shared-KV-cache approaches are one way to decouple the gains from that spend (see "How does test-time scaling work at the agent level?"). So cache-dominance isn't just a cost optimization; it weakens the assumed link between "smarter system" and "more tokens." If most of your tokens are cache reads, paying for an elaborate agent swarm buys you less than you'd think.
The natural companion move is to make the non-cached fraction cheap too. Since most agentic subtasks are repetitive and well-defined, small language models handle them at 10–30× lower cost than frontier models, with the big model called only selectively (see "Can small language models handle most agent tasks?"). Cache-dominance lowers the marginal cost of reusing context; heterogeneous model sizing lowers the marginal cost of the fresh work that remains. Together they push agent economics toward something closer to amortized infrastructure than per-call billing, which is a quietly large shift in how you'd budget, price, or design these systems.
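The effect of heterogeneous sizing on the remaining fresh work can be sketched as a blended cost. The sketch assumes routing is decided up front, so an escalated subtask pays only the large-model price. The escalation rate and the function name `blended_subtask_cost` are hypothetical; only the 10–30× cost ratio comes from the text above.

```python
def blended_subtask_cost(llm_cost: float, cost_ratio: float,
                         escalation_rate: float) -> float:
    """Expected cost per subtask with a small model as the default.

    llm_cost: cost of running one subtask on the large model.
    cost_ratio: how many times cheaper the small model is (e.g. 10 to 30).
    escalation_rate: assumed fraction of subtasks routed to the large model.
    """
    if cost_ratio <= 0:
        raise ValueError("cost_ratio must be positive")
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be in [0, 1]")
    slm_cost = llm_cost / cost_ratio
    return (1.0 - escalation_rate) * slm_cost + escalation_rate * llm_cost


if __name__ == "__main__":
    # Normalize the large-model cost to 1.0 and assume 20% escalation.
    for ratio in (10, 30):
        print(ratio, round(blended_subtask_cost(1.0, ratio, 0.2), 4))
```

With a 10× ratio and an assumed 20% escalation rate, the blended cost is 0.28 of the all-large-model cost. At that point the escalated fraction dominates, which is why the routing rate matters as much as the price ratio.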
Sources (6 notes)
A 115-day case study found 82.9% of tokens were cache reads. When context persists and is reused, the meaningful cost denominator becomes completed artifacts, not individual tokens.
Tree-structured rollouts that branch from shared prefixes produce more distinct trajectories within a fixed token budget than independent chain sampling. This improves advantage estimation statistics and enables longer-horizon tasks within the same compute constraint.
The Thread Inference Model demonstrates that reasoning structured as recursive subtask trees with rule-based KV cache pruning sustains accurate reasoning beyond context limits, even when manipulating 90% of the cache. This enables single models to replace multi-agent systems by handling full recursive reasoning internally.
Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.
Research shows 80% of multi-agent performance variance comes from token budget, not coordination intelligence. LatentMAS and shared-KV-cache approaches offer ways to decouple performance gains from token costs.
SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.