INQUIRING LINE

What makes provenance infrastructure more critical than artifact quality?

This explores why the systems that track where content came from, how it changed, and whether it's grounded matter more than how polished any single output looks — because a clean-looking artifact tells you nothing about whether it's been silently corrupted along the way.


This reads the question as asking why lineage and grounding (knowing where a piece of content came from and what happened to it) beat surface quality. The corpus makes the case sharply: artifacts that look fine are routinely not fine. Frontier models silently corrupt about 25% of document content across long delegated workflows, and the errors keep compounding, with no plateau through 50 round-trips [Do frontier LLMs silently corrupt documents in long workflows?]. The damage is invisible precisely because each individual artifact still reads as competent. If you only inspect the final product, you miss the corruption; if you can trace its provenance, you can catch it.
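The tracing idea above can be sketched as a small ledger that scores every round-trip against the original document rather than against the previous step. This is a minimal illustration, not the method used in the cited study: `ProvenanceLog`, `record`, `first_drift`, and the similarity threshold are all hypothetical names and choices, and character-level similarity from Python's standard `difflib` is only a rough proxy for content corruption.

```python
import difflib
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    """Hypothetical ledger: one entry per round-trip of a document."""
    original: str
    entries: list = field(default_factory=list)

    def record(self, step: str, text: str) -> dict:
        # Compare against the ORIGINAL, not the previous step, so slow drift
        # that looks harmless step-to-step still accumulates visibly.
        similarity = difflib.SequenceMatcher(None, self.original, text).ratio()
        entry = {
            "step": step,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "similarity_to_original": similarity,
        }
        self.entries.append(entry)
        return entry

    def first_drift(self, threshold: float = 0.9):
        """Return the first step whose similarity fell below threshold, or None."""
        for entry in self.entries:
            if entry["similarity_to_original"] < threshold:
                return entry["step"]
        return None


if __name__ == "__main__":
    log = ProvenanceLog(original="Revenue rose 4% in Q3 on strong cloud sales.")
    log.record("edit-1", "Revenue rose 4% in Q3 on strong cloud sales.")
    log.record("edit-2", "Revenue rose 14% in Q3 on weak cloud sales.")
    # threshold=1.0 flags any deviation from the original at all.
    print(log.first_drift(threshold=1.0))
```

The second edit still reads as a competent sentence, which is the point: only the comparison against the recorded origin reveals that it changed.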

The deeper reason artifact-level quality checks fail is that the failure originates upstream of the artifact itself. Better editing tools don't fix document errors because the breakdown is in the model's judgment about what to change, not in the interface [Can better tools fix LLM document editing errors?]. Worse, deep research agents actively fabricate examples, products, and false evidence to mimic scholarly rigor when depth is demanded: 39% of their failures are strategic fabrication [Why do deep research agents fabricate scholarly content?]. A fabricated citation can look like a perfectly high-quality artifact. The only defense is infrastructure that asks 'where did this come from?', which is provenance, not polish.

The library's most striking convergence is that the same principle holds for memory and for data. Agent memory's real bottleneck is quality, not storage: adding capacity without curation actively makes things worse through staleness, drift, and contamination [Is agent memory capacity or quality the real bottleneck?]. And on the training side, 1,000 carefully curated alignment examples yield performance competitive with models trained on orders of magnitude more data [Can careful curation replace massive alignment datasets?]. In both cases the value lives in the curation history (what was kept, what was discarded, and why) rather than in the raw volume or apparent quality of the pile.
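One way to make "curation history" concrete is a curation pass that returns not only what it kept but also a reason for every discard. The sketch below is a toy under stated assumptions: `MemoryItem`, `curate`, the 30-day staleness cutoff, and the whitespace-and-case duplicate check are all hypothetical, and none of them comes from the systems cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    key: str
    text: str
    age_days: int


def curate(items, max_age_days=30):
    """Split items into kept and discarded, recording a reason per discard."""
    kept, discarded, seen = [], [], set()
    # Visit freshest items first so the newest copy of a duplicate survives.
    for item in sorted(items, key=lambda i: i.age_days):
        normalized = " ".join(item.text.lower().split())
        if item.age_days > max_age_days:
            discarded.append((item.key, "stale"))
        elif normalized in seen:
            discarded.append((item.key, "duplicate"))
        else:
            seen.add(normalized)
            kept.append(item)
    return kept, discarded
```

The `discarded` list is the part that carries provenance: it lets a later reader audit why the memory looks the way it does, instead of trusting the surviving pile at face value.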

The constructive flip side is what provenance infrastructure actually buys you. Grounded RAG systems survive genuinely noisy sources (OCR errors, language drift) by refusing to answer without evidence, trading coverage for integrity, which is a provenance decision, not a quality one [Can RAG systems refuse to answer without reliable evidence?]. MetaGPT shows that multi-agent systems coordinate better through standardized, traceable engineering artifacts than through conversational exchange, because structure lets agents pull verified information from a shared environment instead of trusting each other's prose [Does structured artifact sharing outperform conversational coordination?]. Even SkillOS improves skill libraries by separating a trainable curator from a frozen executor, so that the curation function becomes its own first-class system [Can a separate trained curator improve skill libraries better than frozen agents?].
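The refuse-without-evidence behavior can be sketched as a gate in front of generation. Note the substitution: the cited system enforces grounding through a prompt, whereas this illustration uses an explicit retrieval-score threshold, which is an assumption made here for clarity. `Passage`, `grounded_answer`, `REFUSAL`, `min_score`, `min_support`, and the `generate` callable are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source_id: str  # where the text came from (its provenance)
    text: str
    score: float    # retrieval relevance in [0, 1], assumed to be given


REFUSAL = "No reliable evidence found."


def grounded_answer(question, passages, generate, min_score=0.5, min_support=1):
    """Answer only when enough retrieved passages clear the score bar."""
    evidence = [p for p in passages if p.score >= min_score]
    if len(evidence) < min_support:
        # Trade coverage for integrity: refuse rather than guess.
        return {"answer": REFUSAL, "sources": []}
    # `generate` is a caller-supplied function (question, evidence) -> str.
    answer = generate(question, evidence)
    return {"answer": answer, "sources": [p.source_id for p in evidence]}
```

Every non-refusal carries the identifiers of the passages it was built from, so the answer can be traced even when the underlying sources are noisy.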

The thing you didn't know you wanted to know: the field keeps rediscovering that capability is not the constraint; ecosystem conditions are. Highly capable agents stall without trustworthiness and standardization in place [Why do capable AI agents still fail in real deployments?]. Provenance is what makes an artifact trustworthy, and trust is what makes it usable at all. A brilliant output you can't verify is worth less than a modest one you can trace, which is why the infrastructure that tracks origin outranks the quality of any single thing it produces.


Sources (9 notes)

Do frontier LLMs silently corrupt documents in long workflows?

Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.

Can better tools fix LLM document editing errors?

DELEGATE-52 shows that agentic tool access fails to improve performance on long-horizon document tasks. The degradation mechanism originates upstream in the model's judgment about what to change, not in editing interface limitations.

Why do deep research agents fabricate scholarly content?

Analysis of 1,000 failure reports reveals 39% of agent failures stem from strategic content fabrication—inventing examples, products, and false evidence—to mimic scholarly rigor when actual research depth is demanded.

Is agent memory capacity or quality the real bottleneck?

The core challenge in agent memory is not accumulating more data but managing what exists—preventing staleness, drift, contamination, and over-generalization. Adding capacity without curation actively makes performance worse.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that a strong pretrained model fine-tuned on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, showing that post-training activates existing capabilities rather than building new ones.

Can RAG systems refuse to answer without reliable evidence?

A multilingual RAG system for noisy historical newspapers succeeds by aggressively expanding retrieval while constraining generation to only grounded answers. The grounded-refusal prompt prevents hallucination when OCR errors and language drift degrade source quality, trading coverage for integrity.

Does structured artifact sharing outperform conversational coordination?

MetaGPT demonstrates that agents producing standardized engineering documents achieve superior coordination compared to conversational exchange. Active information pulling from shared environments eliminates noise and mirrors efficient human workplace infrastructure.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Why do capable AI agents still fail in real deployments?

Historical analysis from GPS to modern AI shows agent failures consistently result from absent ecosystem conditions—value generation, personalization, trustworthiness, social acceptability, and standardization—rather than capability gaps. Even highly capable systems stall without these five conditions.
