Do short benchmarks predict how models perform over long workflows?

Standard LLM benchmarks mostly measure single-turn or short multi-turn performance, but real workflows involve sustained delegation across many turns. This note asks whether top benchmark performers maintain accuracy through longer interaction chains.

Note · 2026-05-18 · sourced from Flaws

Most LLM benchmarks evaluate single-turn or short multi-turn interactions. DELEGATE-52 extends evaluation to 50-round-trip relays and finds that short-interaction performance does not predict how the same model behaves under sustained delegation. Models that perform comparably on a single edit can diverge dramatically by relay 25.

This is a methodological finding, not a model finding. The standard practice — pick the top scorer on benchmark X, deploy it in workflow Y — implicitly assumes that capability is roughly stationary across interaction lengths. The relay results show the assumption fails. Models exhibit a degradation curve, and that curve has its own shape parameters (slope, decay rate, recovery behavior under interrupted sessions) that benchmarks built for short tasks cannot expose.
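
To make the idea of a degradation curve with its own shape parameters concrete, here is a minimal sketch. It assumes, purely for illustration, that per-relay accuracy decays roughly exponentially. The function name, the exponential form, and both sample curves are hypothetical; they are not DELEGATE-52's method or results.

```python
import math

def fit_exponential_decay(scores):
    """Fit score_k ~ a * exp(-r * k) by least squares on log(score).

    `scores` holds per-relay accuracy values in (0, 1]; index 0 is relay 1.
    Returns (a, r): the fitted starting level and the per-relay decay rate.
    The exponential form is an assumption made for this sketch.
    """
    ks = list(range(1, len(scores) + 1))
    logs = [math.log(s) for s in scores]
    n = len(ks)
    mean_k = sum(ks) / n
    mean_log = sum(logs) / n
    cov = sum((k - mean_k) * (y - mean_log) for k, y in zip(ks, logs))
    var = sum((k - mean_k) ** 2 for k in ks)
    slope = cov / var
    intercept = mean_log - slope * mean_k
    return math.exp(intercept), -slope

# Made-up curves over 50 relays: near-identical at relay 1, far apart later.
model_a = [0.95 * math.exp(-0.01 * k) for k in range(1, 51)]
model_b = [0.96 * math.exp(-0.04 * k) for k in range(1, 51)]

for name, curve in [("model_a", model_a), ("model_b", model_b)]:
    a, r = fit_exponential_decay(curve)
    print(f"{name}: relay 1 = {curve[0]:.3f}, relay 25 = {curve[24]:.3f}, decay rate = {r:.3f}")
```

In this toy setup the two models are almost indistinguishable at relay 1, so a short benchmark would rank them as equals; only the fitted decay rate separates them.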

The implication is that "long-horizon performance" deserves status as a distinct evaluation axis, not as a property to be inferred from single-step competence. A model with strong relay-50 retention but mediocre single-turn polish may be more useful for delegated work than the inverse. The paper argues this directly: capability research has been investing heavily in memory management while leaving the underlying long-interaction degradation profile under-measured.

For practitioners, this changes the deployment question from "which model scores highest on X" to "which model maintains accuracy through the interaction length my workflow requires." For benchmark designers, it argues for relay-style evaluations as a default rather than an add-on.
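
The reframed deployment question can be sketched as ranking models at the interaction length the workflow actually needs instead of at turn 1. The model names and accuracy values below are invented for illustration, and `rank_at_length` is a hypothetical helper, not part of any benchmark toolkit.

```python
def rank_at_length(curves, relay):
    """Rank model names by accuracy at the given relay (1-indexed), best first.

    `curves` maps a model name to its list of per-relay accuracy values.
    """
    return sorted(curves, key=lambda name: curves[name][relay - 1], reverse=True)

# Hypothetical per-relay accuracy for two models (not measured data).
curves = {
    "polished": [0.98, 0.90, 0.70, 0.50],  # strong single-turn, steep decline
    "steady": [0.93, 0.92, 0.90, 0.89],    # weaker single-turn, strong retention
}

short_ranking = rank_at_length(curves, relay=1)  # what a short benchmark would report
long_ranking = rank_at_length(curves, relay=4)   # what a 4-relay workflow needs
print(short_ranking, long_ranking)
```

With these made-up numbers the ranking flips between relay 1 and relay 4, which is the situation in which picking the top short-benchmark scorer selects the wrong model.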

Related concepts in this collection

Do frontier LLMs silently corrupt documents in long workflows? Explores whether advanced language models introduce undetectable errors when delegated multi-step tasks, and whether degradation continues accumulating beyond initial rounds of processing.
same paper, the underlying phenomenon
Are LLM and agent benchmarks really measuring different things? Do LLM benchmarks and agent benchmarks test fundamentally different capabilities, or are they two modes of the same model? Understanding this shapes how we evaluate and develop AI systems.
same paper, complementary methodology implication
Do models fail worse when their own errors fill the context? As a model's prior mistakes accumulate in context, does subsequent accuracy degrade predictably? And can scaling or architectural changes prevent this self-contamination effect?
adjacent: known long-horizon failure mode

Original note title

short-interaction LLM benchmarks do not predict long-horizon delegated-workflow performance
