Can skill validation through testing prevent unreliable programs from accumulating?

This explores whether agents that build up reusable skill libraries can use testing — running each skill and keeping only what passes — to stop broken or unreliable code from piling up in the library over time.

Empirical testing can serve as a quality gate for an agent's skill library: keep the verified skills, discard the rest, and the library should not silt up with unreliable code. The corpus suggests testing is a powerful filter, but a leaky one: it catches only what it can run, and reliability is sneakier than passing a single test.

The optimistic case comes from agents that treat the environment as judge. VOYAGER stores executable skills in a searchable, embedding-indexed library and admits a skill only once environmental feedback confirms it works, composing complex behaviors from verified simpler ones. That lets it learn continuously without the catastrophic forgetting of weight-update methods (see: Can agents learn new skills without forgetting old ones?). The Darwin Gödel Machine (DGM) pushes the same idea to self-improvement, swapping formal correctness proofs for empirical benchmarking and keeping an evolutionary archive of agent variants, which improved its SWE-bench score 2.5× (see: Can AI systems improve themselves through trial and error?). In both, 'does it pass the test' substitutes for 'is it provably correct,' and it works well enough to compound.
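The admit-on-pass idea can be sketched in a few lines. This is a minimal illustration, not VOYAGER's or DGM's actual code: `SkillLibrary`, `admit`, and the toy skills are hypothetical names, and a plain Python callable stands in for environmental feedback.

```python
from typing import Callable, Dict


class SkillLibrary:
    """Hypothetical validation-gated skill store (not any real system's API)."""

    def __init__(self) -> None:
        self.skills: Dict[str, Callable] = {}

    def admit(self, name: str, skill: Callable,
              check: Callable[[Callable], bool]) -> bool:
        """Store `skill` only if `check` runs it without error and returns True."""
        try:
            passed = bool(check(skill))
        except Exception:
            passed = False  # a crashing skill counts as a failed validation
        if passed:
            self.skills[name] = skill
        return passed


def add_one(x: int) -> int:
    return x + 1


def broken_add_one(x: int) -> int:
    return x - 1  # deliberately wrong


def check_add_one(skill: Callable[[int], int]) -> bool:
    # Stand-in for environmental feedback: one concrete probe.
    return skill(1) == 2


library = SkillLibrary()
admitted_good = library.admit("add_one", add_one, check_add_one)
admitted_bad = library.admit("broken_add_one", broken_add_one, check_add_one)
```

Only `add_one` ends up in `library.skills`. Note that the gate is exactly as strong as `check_add_one`: a skill that happens to satisfy this one probe would be admitted regardless of how it behaves elsewhere.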

But here's the thing the question doesn't anticipate: validation that merely *adds* what passes still accumulates clutter. SkillOS found that a frozen agent left to curate its own library drifts toward generic, verbose additions; passing tests isn't the same as being useful. Separating out a *trained* curator shifted the repository toward actionable execution logic and cross-task meta-strategies, and that curator generalized across different executor backbones and domains (see: Can a separate trained curator improve skill libraries better than frozen agents?). So preventing accumulation isn't a pure testing problem; it's also a curation problem. Testing tells you what runs; something else has to decide what's worth keeping.
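To make the split between "runs" and "worth keeping" concrete, here is a small sketch. It is not SkillOS's trained curator: a hand-written reuse threshold stands in for the learned policy, and `SkillRecord`, `times_reused`, and `curate` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillRecord:
    name: str
    passed_tests: bool   # outcome of the validation gate
    times_reused: int    # hypothetical usefulness signal


def curate(records: List[SkillRecord], min_reuse: int = 1) -> List[str]:
    """Keep skills that pass validation AND have proven useful at least min_reuse times."""
    return [
        r.name
        for r in records
        if r.passed_tests and r.times_reused >= min_reuse
    ]


records = [
    SkillRecord("open_door", passed_tests=True, times_reused=5),
    SkillRecord("verbose_generic_helper", passed_tests=True, times_reused=0),
    SkillRecord("broken_crafting", passed_tests=False, times_reused=3),
]
kept = curate(records)
```

Here `verbose_generic_helper` passes its tests but is dropped anyway, which is the case a test-only gate cannot express.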

The deeper crack is that passing a test doesn't certify reliability. A model run at zero temperature with a fixed seed reproduces the same output every time, but that output is still a single draw from the model's probability distribution. Consistency is not reliability, as McDonald's omega testing across 100 repetitions makes visible (see: Does setting temperature to zero actually make LLM outputs reliable?). Likewise, a skill can pass once and fail under inputs the test never probed. Worse, models learn the *form* of correctness rather than the substance: illogical chain-of-thought exemplars matched valid ones on BIG-Bench Hard, which means a validator keyed to surface structure can be fooled (see: Does logical validity actually drive chain-of-thought gains?). And evaluators themselves degrade: agentic evaluation cut judge shift roughly 100x relative to LLM-as-a-Judge (0.27% versus 31%), but its own memory module cascaded errors, so the validator needs error isolation or it becomes a source of the unreliability it is meant to catch (see: Can agents evaluate AI outputs more reliably than language models?).
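The gap between "passed once" and "reliable" shows up even in a toy case. The sketch below is an illustration under stated assumptions, not the omega-testing procedure from the cited note: `doubling_skill` is a hypothetical skill with a hidden bug, and reliability is approximated as the pass rate over a wider set of probes.

```python
from typing import Callable, Iterable


def doubling_skill(x: int) -> int:
    """Hypothetical skill meant to double x; it is wrong for negative inputs."""
    return abs(x) * 2


def oracle(x: int) -> int:
    """Ground truth for the toy task."""
    return x * 2


def pass_rate(skill: Callable[[int], int],
              truth: Callable[[int], int],
              inputs: Iterable[int]) -> float:
    """Fraction of probes on which the skill agrees with the ground truth."""
    probes = list(inputs)
    passes = sum(1 for x in probes if skill(x) == truth(x))
    return passes / len(probes)


# One probe, repeated or not, always passes: consistent, but uninformative.
single_probe_rate = pass_rate(doubling_skill, oracle, [3])

# 100 probes spanning negative and non-negative inputs expose the defect.
broad_rate = pass_rate(doubling_skill, oracle, range(-50, 50))
```

`single_probe_rate` is 1.0 while `broad_rate` is 0.5, because the skill is correct only on the 50 non-negative probes. A gate that records a pass rate over varied probes, rather than a single boolean, at least makes this kind of defect visible.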

What the corpus quietly argues is that the most durable defense isn't pass/fail testing alone but treating failures as signal. Asymmetric trajectory filtering (GRPO-RoC) filters successes for quality *and* preserves diverse failures as negative training signal, letting a 14B model reach frontier math performance. Errors aren't garbage to discard; they teach the boundary (see: Why do correct code trajectories teach models to tolerate errors?). For code specifically, semi-formal reasoning can verify patch equivalence at 93% accuracy without ever executing the code, crossing the reliability bar that RL rewards need, so 'testing' need not mean running (see: Can structured reasoning replace code execution for RL rewards?). The honest answer: testing genuinely slows the accumulation of broken skills, but a library stays healthy only when validation is paired with active curation, treats consistency as distinct from reliability, and learns from what fails rather than silently dropping it.
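The asymmetry described above, strict about which successes to keep and permissive about failures, can be sketched as follows. This is not the GRPO-RoC implementation: `Trajectory`, `tool_errors`, and the error threshold are assumptions made for illustration, and the real method's sampling details are not reproduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    answer_correct: bool
    tool_errors: int  # hypothetical quality signal for the reasoning trace


def asymmetric_filter(
    trajectories: List[Trajectory],
    max_errors_in_positive: int = 0,
) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Keep only clean successes as positives; keep every failure as a negative."""
    positives = [
        t for t in trajectories
        if t.answer_correct and t.tool_errors <= max_errors_in_positive
    ]
    negatives = [t for t in trajectories if not t.answer_correct]
    return positives, negatives


batch = [
    Trajectory(answer_correct=True, tool_errors=0),   # clean success: kept
    Trajectory(answer_correct=True, tool_errors=3),   # messy success: dropped
    Trajectory(answer_correct=False, tool_errors=0),  # failure: kept as negative
    Trajectory(answer_correct=False, tool_errors=2),  # failure: kept as negative
]
positives, negatives = asymmetric_filter(batch)
```

The messy success is the only trajectory discarded. Applied to a skill library, the analogous move would be to log failed skills alongside the reason they failed instead of deleting them without a trace.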

Sources 8 notes

Can agents learn new skills without forgetting old ones?

VOYAGER demonstrates that storing executable skills in an embedding-indexed library and composing complex skills from simpler ones allows agents to learn continuously while avoiding the forgetting that occurs with weight-update-based methods. Environmental feedback refines skills while an automatic curriculum drives continual exploration.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Does setting temperature to zero actually make LLM outputs reliable?

Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.

Does logical validity actually drive chain-of-thought gains?

Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.

Can agents evaluate AI outputs more reliably than language models?

Eight-module agentic evaluation achieved 0.27% judge shift versus 31% for LLM-as-a-Judge on complex tasks. However, the memory module cascaded errors, revealing that agentic systems need error isolation mechanisms to maintain gains.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.

Can structured reasoning replace code execution for RL rewards?

Semi-formal reasoning templates enable execution-free patch equivalence verification at 93% accuracy on real agent code, crossing the reliability threshold needed for RL reward signals. This makes execution-free verification viable for certain task classes like fault localization and code reasoning.
