Can messy multi-agent transcripts become better training data than clean outputs?
This note asks whether the noise in multi-agent transcripts (dead ends, corrections, failed attempts, disagreements) carries more useful training signal than polished final answers. The corpus has surprisingly strong evidence on both sides.
Read this way, the question is a bet against the instinct to clean up your data: are the messy intermediate steps of agents arguing, failing, and backtracking actually richer to learn from than the tidy outputs we usually keep? The corpus suggests a qualified yes, with one sharp caveat about what kind of mess.
The strongest case for mess comes from two findings that attack the value of "clean" from opposite ends. First, curated expert demonstrations turn out to be a ceiling, not a floor: agents trained only on polished static datasets can't learn from their own failures and never generalize past what the curator already imagined, so competence is capped by the dataset author rather than the agent (see: Can agents learn beyond what their training data shows?). Clean outputs, in other words, encode the answer but throw away the search that found it. Second, and more startling, deliberately corrupted reasoning traces train models about as well as correct ones, and sometimes generalize *better* out of distribution, which implies the trace works as computational scaffolding rather than as a transcript of correct thinking (see: Do reasoning traces need to be semantically correct?). If even broken reasoning teaches, then the polish we prize may be cosmetic.
There's also a diversity argument hiding here. RL post-training tends to collapse onto a single dominant format from pretraining within the first epoch, suppressing the alternatives, and the winning format depends on model scale, not on which format actually performs best (see: Does RL training collapse format diversity in pretrained models?). Messy multi-agent transcripts preserve exactly the variety that this collapse destroys: multiple solution paths, competing framings, the record of roads not taken. The mess is the diversity.
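One rough way to make "the mess is the diversity" measurable is Shannon entropy over the solution formats that appear in a set of transcripts. The sketch below is illustrative only: the function name and the format labels are assumptions introduced here, and the sources do not specify this metric.

```python
import math
from collections import Counter
from typing import Iterable


def format_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy in bits of a collection of format labels.

    0.0 means every sample uses the same format (full collapse);
    higher values mean more formats are represented more evenly.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum p * log2(1/p) over the observed formats.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical labels: one per solution path found in a transcript set.
diverse = ["code", "prose", "table", "proof"]
collapsed = ["code", "code", "code", "code"]

diverse_bits = format_entropy(diverse)
collapsed_bits = format_entropy(collapsed)
```

Comparing the value before and after a training run would show whether alternatives are being suppressed, under the assumption that formats can be labeled reliably in the first place.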
But the mess only pays off when failure is *legible*. Reflexion shows agents improving across episodes by storing verbal self-diagnoses in memory, without any parameter updates. It works because the feedback signal is unambiguous (success/failure), which stops the agent from rationalizing its mistakes; the failures have to be clearly labeled as failures to become useful (see: Can agents learn from failure without updating their weights?). The counter-case sharpens this: MetaGPT found that agents coordinating through standardized structured artifacts beat agents exchanging free-form natural language, precisely because conversational noise had to be eliminated for coordination to work (see: Does structured artifact sharing outperform conversational coordination?). And agentic evaluation systems show how uncontained intermediate state turns toxic: errors in the memory module cascaded into everything downstream (see: Can agents evaluate AI outputs more reliably than language models?).
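The loop described above, where a binary outcome gates what is written to memory, can be sketched as follows. This is a minimal toy under stated assumptions, not Reflexion's actual implementation: `EpisodicMemory`, `run_episodes`, and the guessing task are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class EpisodicMemory:
    """Holds verbal self-diagnoses verbatim, one per failed episode."""
    reflections: List[str] = field(default_factory=list)


def run_episodes(
    act: Callable[[List[str]], int],
    succeeded: Callable[[int], bool],
    reflect: Callable[[int], str],
    max_episodes: int,
) -> Tuple[Optional[int], EpisodicMemory]:
    """Retry a task, feeding stored reflections back into each new attempt.

    Returns the 1-based episode that succeeded (or None) plus the memory.
    No model weights are involved; only the memory changes between episodes.
    """
    memory = EpisodicMemory()
    for episode in range(1, max_episodes + 1):
        action = act(memory.reflections)
        if succeeded(action):  # unambiguous success/failure signal
            return episode, memory
        # Only a clearly labeled failure produces a stored self-diagnosis.
        memory.reflections.append(reflect(action))
    return None, memory


# Toy stand-ins (assumptions): the "agent" guesses a hidden number and
# avoids any guess that a past reflection marked as failed.
TARGET = 3


def toy_act(reflections: List[str]) -> int:
    tried = {int(r.split()[-1]) for r in reflections}
    return next(g for g in range(10) if g not in tried)


def toy_reflect(action: int) -> str:
    return f"FAILED: do not repeat guess {action}"


solved_at, memory = run_episodes(toy_act, lambda a: a == TARGET, toy_reflect, 10)
```

The design point is that `reflect` is only called after `succeeded` returns `False`, so nothing ambiguous ever enters memory.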
So the corpus resolves the question into a distinction the question itself hides: messy-as-*exploratory* (failed paths, alternative solutions, diagnosed errors) is valuable training signal that clean outputs strip away; messy-as-*unstructured* (ambiguous coordination chatter, uncontained errors) actively degrades both training and runtime. The win condition for using transcripts as training data isn't cleaning them — it's instrumenting them, so that every dead-end is labeled as a dead-end and every failure is legible as a failure.
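What "instrumenting" a transcript might look like in practice can be sketched with a small labeled-step schema. Everything here is an assumption made for illustration: the `Outcome` labels, the `Step` fields, and the rule of dropping unlabeled steps are not drawn from any of the cited systems.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Outcome(Enum):
    SUCCESS = "success"
    DEAD_END = "dead_end"    # explored path that was abandoned
    FAILURE = "failure"      # attempt that was checked and found wrong
    UNLABELED = "unlabeled"  # chatter with no verified outcome


@dataclass
class Step:
    agent: str
    content: str
    outcome: Outcome = Outcome.UNLABELED
    diagnosis: Optional[str] = None  # why it failed, if known


def to_training_records(steps: List[Step]) -> List[Dict[str, Optional[str]]]:
    """Keep exploratory mess (labeled dead ends and failures) and
    drop unstructured mess (steps with no outcome label)."""
    records = []
    for step in steps:
        if step.outcome is Outcome.UNLABELED:
            continue
        records.append({
            "agent": step.agent,
            "content": step.content,
            "label": step.outcome.value,
            "diagnosis": step.diagnosis,
        })
    return records


# Hypothetical transcript fragment.
transcript = [
    Step("planner", "Try a greedy split.", Outcome.DEAD_END, "misses overlapping cases"),
    Step("critic", "Hmm, not sure about that."),  # unlabeled chatter
    Step("coder", "Greedy version fails test 3.", Outcome.FAILURE, "off-by-one at boundary"),
    Step("coder", "Dynamic programming version passes.", Outcome.SUCCESS),
]

records = to_training_records(transcript)
```

Note that the failed and abandoned steps survive the filter; only the step with no outcome label is discarded.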
Sources (6 notes)
Agents trained on static expert datasets cannot learn from their own failures or generalize beyond demonstrated scenarios because they never interact with environments during training. Competence is capped by what curators imagined, not by agent capacity.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.
MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.
Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.