How does error accumulation in workflows scale across multiple model calls?
This explores what happens to errors when a task is split across many model calls — whether mistakes stay isolated or compound as the chain gets longer.
The corpus is fairly blunt: errors don't just add up, they feed on themselves. The starkest finding is that frontier models silently corrupt about 25% of document content across long delegated relay tasks, and crucially the damage *doesn't plateau*: it keeps compounding through 50 round-trips without the model noticing [Do frontier LLMs silently corrupt documents in long workflows?]. The mechanism is what one note calls the self-conditioning effect: once a model's own earlier mistakes sit in its context window, they bias the next step, producing non-linear degradation rather than a steady drip [Do models fail worse when their own errors fill the context?]. So the answer to 'how does it scale' is: worse than linearly, because each call inherits the contaminated output of the last.
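To make "worse than linearly" concrete, here is a toy model, not something taken from the notes. It assumes each call adds a small amount of new damage, and that this amount grows in proportion to the damage already sitting in context. The function name `damage_curve` and the parameters `base_rate` and `feedback` are hypothetical, and the numbers are illustrative rather than fitted to the 25% finding.

```python
def damage_curve(steps, base_rate, feedback):
    """Cumulative damage after each call in a toy relay (hypothetical model).

    base_rate: damage a call would add with a perfectly clean context.
    feedback:  how strongly existing damage inflates the next call's damage
               (0.0 means errors are independent of context).
    """
    damage = 0.0
    curve = []
    for _ in range(steps):
        # Self-conditioning assumption: prior damage raises the per-call rate.
        step_rate = base_rate * (1.0 + feedback * damage)
        damage = min(1.0, damage + step_rate)  # damage is a fraction, cap at 1
        curve.append(damage)
    return curve


# Illustrative parameters only.
independent = damage_curve(steps=50, base_rate=0.002, feedback=0.0)
self_conditioned = damage_curve(steps=50, base_rate=0.002, feedback=20.0)

print(f"independent errors after 50 calls:      {independent[-1]:.3f}")
print(f"self-conditioned errors after 50 calls: {self_conditioned[-1]:.3f}")
```

With `feedback=0.0` every call adds the same amount, so damage grows in a straight line. With any positive `feedback` the per-call increment itself grows, which is the shape the self-conditioning note describes.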
What's surprising is what *doesn't* fix it. Making the model bigger doesn't help: scaling fails to address self-conditioning, and only test-time compute (thinking models that reason before committing) reduces the effect, by keeping error-poisoned context from steering the next move [Do models fail worse when their own errors fill the context?]. There's even a counterintuitive failure where training on *correct* code trajectories teaches models to tolerate the errors they passed through along the way. That is why some methods filter trajectories asymmetrically, screening the successful ones for quality while keeping diverse failures as negative signal [Why do correct code trajectories teach models to tolerate errors?].
The most interesting cross-current is that the architecture of the workflow matters more than the strength of any single call. One line of work shows you can run *million-step* tasks with zero accumulated errors, but only by decomposing into the smallest possible subtasks and voting at each step to catch mistakes before they propagate. The twist: small, non-reasoning models suffice when the decomposition is extreme enough, which inverts the usual instinct to throw a bigger model at a hard problem [Can extreme task decomposition enable reliable execution at million-step scale?]. The corpus repeatedly favors small models doing narrow, well-defined steps over one large model carrying a long chain [Can small language models handle most agent tasks?].
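The effect of voting at every step can be sketched with a small simulation. This is an illustration under stated assumptions, not the method from the note: `noisy_step` is a hypothetical stand-in for one model call on a minimal subtask, the 5% error rate is made up, and plain majority voting replaces whatever voting and error-flagging rule the original work uses.

```python
import random
from collections import Counter


def noisy_step(rng, correct, error_rate):
    """Hypothetical model call: usually right, sometimes off by a little."""
    if rng.random() < error_rate:
        return correct + rng.choice([-2, -1, 1, 2])  # one of several wrong answers
    return correct


def voted_step(rng, correct, error_rate, samples):
    """Run the same subtask several times and keep the most common answer."""
    votes = Counter(noisy_step(rng, correct, error_rate) for _ in range(samples))
    return votes.most_common(1)[0][0]


def run_chain(rng, steps, error_rate, samples):
    """True only if every step of the chain produced the right answer."""
    for i in range(steps):
        if voted_step(rng, i, error_rate, samples) != i:
            return False  # one uncaught error spoils the whole chain
    return True


def success_rate(steps, error_rate, samples, trials=200, seed=0):
    """Fraction of simulated chains that finish with no errors."""
    rng = random.Random(seed)
    wins = sum(run_chain(rng, steps, error_rate, samples) for _ in range(trials))
    return wins / trials


single = success_rate(steps=100, error_rate=0.05, samples=1)
voted = success_rate(steps=100, error_rate=0.05, samples=5)
print(f"100-step chain, one call per step:   {single:.2%} succeed")
print(f"100-step chain, five votes per step: {voted:.2%} succeed")
```

The design point is that voting happens inside the chain, so a wrong answer is outvoted before the next step can build on it. The sketch also assumes errors are independent across samples; correlated errors would weaken the vote, which may be why the note mentions flagging them separately.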
If the disease is silent compounding, the treatment the corpus points to is verification *during* the chain, not after it. Checking intermediate reasoning states rather than just the final answer lifted task success from 32% to 87%, because most failures turn out to be process violations that final-answer scoring never sees [Where do reasoning agents actually fail during long traces?]. Step-level confidence catches breakdowns that averaging across the whole trace masks, and lets you stop early before a bad trace burns more calls [Does step-level confidence outperform global averaging for trace filtering?]. And in multi-agent setups the errors take on distinct shapes (role flipping, flake replies, infinite loops, conversation drift) that stem from agents lacking a persistent goal and a stable role across turns [Why do autonomous LLM agents fail in predictable ways?].
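The difference between step-level and whole-trace confidence is easy to see on a single made-up trace. In this sketch the confidence values, the window size, and the 0.6 threshold are all hypothetical. In practice the per-step scores might come from token log-probabilities or a verifier, which is an assumption here and not something the notes specify.

```python
def global_confidence(step_confidences):
    """Average confidence over the whole trace."""
    return sum(step_confidences) / len(step_confidences)


def first_weak_window(step_confidences, window, threshold):
    """Index of the step at which to stop the trace, or None if it stays healthy.

    Slides a fixed-size window over the trace and stops at the first window
    whose average confidence drops below the threshold.
    """
    for end in range(window, len(step_confidences) + 1):
        recent = step_confidences[end - window:end]
        if sum(recent) / window < threshold:
            return end - 1  # last step inside the weak window
    return None


# Hypothetical trace: confident start, a breakdown in the middle, confident end.
trace = [0.95, 0.94, 0.96, 0.30, 0.25, 0.35, 0.95, 0.96, 0.94, 0.95]

print(f"global average: {global_confidence(trace):.3f}")
print(f"stop at step:   {first_weak_window(trace, window=3, threshold=0.6)}")
```

On this trace the global average sits above the threshold, so a whole-trace filter would keep it, while the sliding window flags the breakdown midway and would let the caller abandon the trace before spending the remaining calls.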
The takeaway you might not have expected: error accumulation is less a property of the model than of how you wire the calls together. Frontier models quietly corrupt a quarter of a document in a long relay, while small models in a heavily decomposed workflow run a million steps with zero errors. The difference is whether the workflow lets errors re-enter the context or catches and votes them out at each hop.
Sources (8 notes)
[Do frontier LLMs silently corrupt documents in long workflows?] Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
[Do models fail worse when their own errors fill the context?] Error accumulation in context causes non-linear performance degradation in long-horizon tasks. Model scaling does not fix this; only test-time compute through thinking models reduces the effect by preventing error-contaminated context from biasing reasoning.
[Why do correct code trajectories teach models to tolerate errors?] GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.
[Can extreme task decomposition enable reliable execution at million-step scale?] MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.
[Can small language models handle most agent tasks?] SLMs handle the repetitive, well-defined language tasks that constitute most agent work at 10–30× lower cost than LLMs, making heterogeneous architectures (SLMs by default, LLMs selective) the economically rational design pattern.
[Where do reasoning agents actually fail during long traces?] Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
[Does step-level confidence outperform global averaging for trace filtering?] Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
[Why do autonomous LLM agents fail in predictable ways?] Research identifies role flipping, flake replies, infinite loops, and conversation deviation as LLM-specific failures in multi-agent cooperation. These occur because LLMs lack persistent goal representation and stable role identity.