Agentic Systems and Planning

Do short benchmarks predict how models perform over long workflows?

Standard LLM benchmarks measure single-turn performance, but real workflows involve sustained delegation across many turns. The question explores whether top benchmark performers maintain accuracy through longer interaction chains.

Note · 2026-05-18 · sourced from Flaws

Most LLM benchmarks evaluate single-turn or short multi-turn interactions. DELEGATE-52 extends evaluation to relays of 50 round trips and finds that short-interaction performance does not predict how the same model behaves under sustained delegation. Models that perform comparably on a single edit can diverge dramatically by relay 25.

This is a methodological finding, not a model finding. The standard practice (pick the top scorer on benchmark X, deploy it in workflow Y) implicitly assumes that capability is roughly stationary across interaction lengths. The relay results show that this assumption fails. Each model exhibits a degradation curve, and that curve has its own shape parameters (slope, decay rate, recovery behavior under interrupted sessions) that benchmarks built for short tasks cannot expose.
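To make "shape parameters" concrete, here is a minimal sketch of how a decay rate could be estimated from per-relay scores. It assumes an exponential decay model, score_r ≈ s0 * exp(-k * r), which is an illustrative choice and not something the paper is known to use. The function names and the synthetic scores are hypothetical.

```python
import math


def fit_degradation(scores):
    """Fit score_r ~= s0 * exp(-k * r) by least squares on log(score).

    scores[i] is the measured score after relay i + 1; all must be > 0.
    Returns (s0, k): the extrapolated relay-0 score and the per-relay
    decay rate. The exponential form is an assumption of this sketch.
    """
    if len(scores) < 2:
        raise ValueError("need at least two relay scores")
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be positive to take logs")

    xs = list(range(1, len(scores) + 1))
    ys = [math.log(s) for s in scores]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Ordinary least squares for the line log(score) = intercept + slope * r.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def half_life(k):
    """Number of relays after which the fitted score halves."""
    return math.log(2) / k if k > 0 else math.inf


# Synthetic, made-up scores for illustration only (not benchmark data).
synthetic_scores = [0.9 * math.exp(-0.02 * r) for r in range(1, 51)]
s0, k = fit_degradation(synthetic_scores)
```

On the synthetic curve the fit recovers the parameters used to generate it (s0 near 0.9, k near 0.02), which corresponds to a half-life of roughly 35 relays. A single-turn benchmark only ever observes a point near s0, so two models with the same s0 and different k look identical to it.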

The implication is that "long-horizon performance" deserves status as a distinct evaluation axis, not as a property to be inferred from single-step competence. A model with strong relay-50 retention but mediocre single-turn polish may be more useful for delegated work than the inverse. The paper argues this directly: capability research has been investing heavily in memory management while leaving the underlying long-interaction degradation profile under-measured.

For practitioners, this changes the deployment question from "which model scores highest on X" to "which model maintains accuracy through the interaction length my workflow requires." For benchmark designers, it argues for relay-style evaluations as a default rather than an add-on.
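The shift in the deployment question can be sketched as a selection rule that looks at measured accuracy at the relay depth a workflow actually needs, not at relay 1. The model names, curves, and threshold below are hypothetical and made up for illustration; they are not results from DELEGATE-52.

```python
def pick_model(curves, required_relays, min_score):
    """Pick the model with the best measured score at the required depth.

    curves maps a model name to its per-relay scores (index 0 = relay 1).
    Models not measured out to required_relays are skipped, since this
    sketch deliberately avoids extrapolating beyond the data.
    Returns the chosen model name, or None if no model qualifies.
    """
    qualified = {}
    for name, scores in curves.items():
        if len(scores) < required_relays:
            continue
        score = scores[required_relays - 1]
        if score >= min_score:
            qualified[name] = score
    if not qualified:
        return None
    return max(qualified, key=qualified.get)


# Hypothetical curves: model_a starts higher but degrades faster.
curves = {
    "model_a": [max(0.0, 0.95 - 0.03 * (r - 1)) for r in range(1, 51)],
    "model_b": [max(0.0, 0.88 - 0.005 * (r - 1)) for r in range(1, 51)],
}

best_single_turn = pick_model(curves, required_relays=1, min_score=0.5)
best_at_relay_25 = pick_model(curves, required_relays=25, min_score=0.5)
```

With these invented curves, the single-turn winner is model_a and the relay-25 winner is model_b, which is the kind of ranking reversal the note describes.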

Original note title

short-interaction LLM benchmarks do not predict long-horizon delegated-workflow performance