Can AI systems improve themselves through trial and error?
Explores whether replacing formal proof requirements with empirical benchmark testing enables AI systems to successfully modify and improve their own code iteratively, and what mechanisms prevent compounding failures.
The original Gödel Machine proposed self-improving AI via provably beneficial self-modifications. In practice, formally proving that a self-modification is beneficial is intractable for all but the simplest changes. The Darwin Gödel Machine (DGM) replaces formal proofs with empirical validation: try modifications, test them on benchmarks, keep what works. This mirrors biological evolution: mutations are not verified in advance but produced, trialed, and selected.
DGM alternates between self-modification and evaluation phases. During self-modification, agents from the archive generate modified versions of themselves — rewriting their own code. During evaluation, each modified agent is tested on coding benchmarks. The key assumption: improvement on coding benchmarks indicates better coding capabilities, which in turn indicates better ability to self-modify. This creates a meta-competence loop: better coding → better self-modification → better coding.
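To make the loop concrete, here is a minimal, self-contained Python sketch of the alternating phases. The Agent class, its toy "genome", and the stand-in benchmark score are illustrative assumptions, not the paper's implementation; in DGM the self-modification step is the agent rewriting its own source code, and evaluation is a run against SWE-bench or Polyglot.

```python
# Toy sketch of the DGM loop: self-modify, validate empirically, archive everything.
# Agent, genome, and the benchmark stand-in are assumptions for illustration only.
import random
from dataclasses import dataclass

@dataclass
class Agent:
    genome: list            # stand-in for the agent's own code / tool configuration
    score: float = 0.0

def self_modify(parent: Agent) -> Agent:
    """Self-modification phase: the parent emits a changed copy of itself."""
    return Agent(genome=[g + random.gauss(0, 0.1) for g in parent.genome])

def evaluate(agent: Agent) -> float:
    """Evaluation phase: empirical benchmark score in (0, 1]; here a toy objective."""
    return 1.0 / (1.0 + sum(g * g for g in agent.genome))

def select_parent(archive: list) -> Agent:
    """Sample a stepping stone from the archive, biased toward higher scores."""
    return random.choices(archive, weights=[a.score + 0.05 for a in archive], k=1)[0]

random.seed(0)
root = Agent(genome=[random.uniform(-2, 2) for _ in range(4)])
root.score = evaluate(root)
archive = [root]                      # the archive keeps every variant, not just the best

for generation in range(50):
    parent = select_parent(archive)   # pick any archived agent as a starting point
    child = self_modify(parent)       # self-modification phase
    child.score = evaluate(child)     # empirical validation replaces formal proof
    archive.append(child)             # kept even if worse: a possible stepping stone

best = max(archive, key=lambda a: a.score)
print(f"archive size: {len(archive)}, best toy score: {best.score:.3f}")
```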
Results: performance on SWE-bench rose from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%.
The evolutionary archive is critical. Inspired by open-endedness research, DGM maintains a growing library of all generated agent variants — including suboptimal but interesting ones. These serve as stepping stones for future generations, enabling diverse exploration paths. The system doesn't just optimize for immediate performance; it accumulates diverse capabilities that may enable future breakthroughs. This is fundamentally different from single-trajectory self-improvement.
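The sketch below illustrates how archive-based parent selection can balance exploitation and exploration. The specific weighting (a sigmoid of benchmark score multiplied by a novelty bonus for less-expanded agents) is an assumption for illustration, not necessarily the paper's exact formula.

```python
# Illustrative archive selection: favor strong agents, but keep sampling
# less-expanded ones so suboptimal variants can still serve as stepping stones.
import math
import random

def selection_weight(score: float, num_children: int) -> float:
    performance = 1.0 / (1.0 + math.exp(-10.0 * (score - 0.5)))  # sigmoid of benchmark score
    novelty = 1.0 / (1.0 + num_children)                         # bonus for unexplored agents
    return performance * novelty

def select_parent(archive: list) -> dict:
    weights = [selection_weight(a["score"], a["num_children"]) for a in archive]
    return random.choices(archive, weights=weights, k=1)[0]

# Example archive: a well-explored initial agent, a strong recent one, and an
# untried low scorer that may still be an interesting stepping stone.
archive = [
    {"id": "initial", "score": 0.20, "num_children": 6},
    {"id": "gen-12",  "score": 0.45, "num_children": 2},
    {"id": "gen-19",  "score": 0.15, "num_children": 0},
]
print(select_parent(archive)["id"])
```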
Concrete improvements discovered include better code editing tools, long-context window management, and peer-review mechanisms — capabilities the original agent lacked that emerged through the self-improvement process.
The Python-based implementation makes the self-modification space Turing-complete in principle. The current version modifies agent design (tools, workflows) with frozen foundation models. Full self-improvement — rewriting training scripts, training new foundation models — is left as future work.
This directly addresses "What limits how much models can improve themselves?": DGM circumvents the formal proof requirement via empirical validation but inherits a different limitation, namely that improvement is bounded by what the benchmark can measure. The archive approach partially addresses "How quickly do errors compound during model self-training?" by maintaining diverse populations rather than following single improvement trajectories.
Source: Novel Architectures
Related concepts in this collection
- What limits how much models can improve themselves?
  Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types.
  Relation: DGM replaces formal verification with empirical validation, trading theoretical guarantees for practical progress.
- How quickly do errors compound during model self-training?
  When LLMs train on their own outputs without verification, do small mistakes amplify exponentially? This matters because it determines whether unsupervised self-improvement is even feasible.
  Relation: DGM's evolutionary archive avoids single-trajectory failure by maintaining population diversity.
- Can language models improve themselves without any external training data?
  Explores whether two language models playing against each other (one generating questions, one solving them) can create a self-improving loop. Matters because it would eliminate dependence on human-labeled datasets.
  Relation: both create self-improvement loops, but DGM modifies code rather than generating training data.
- Does learning to reward hack cause emergent misalignment in agents?
  When RL agents learn reward-hacking strategies in production environments, do they spontaneously develop misaligned behaviors such as alignment faking and code sabotage? Understanding this could reveal how narrow deceptive behaviors generalize to broader misalignment.
  Relation: DGM's benchmark-based validation is vulnerable to the same Goodhart's Law; optimizing benchmark performance may not generalize.
- Can reinforcement learning scale beyond single-turn language tasks?
  Most RL for LLMs targets simple single-turn problems. This research asks whether RL can handle multi-turn interactive environments with sparse rewards and rich environmental feedback, like real software engineering tasks.
  Relation: a complementary path. SWE-RL reaches 39% on SWE-bench via RL training on a frozen model, while DGM reaches 50% via evolutionary code self-modification; combining the two (RL-trained agents undergoing evolutionary self-modification) could be more powerful than either alone.
Original note title: darwin godel machine achieves open-ended self-improvement by replacing formal proofs with empirical validation and evolutionary archives