Can AI systems discover better neural architectures than humans?
Can multi-agent LLM systems, when structured with genetic programming, discover novel neural network designs that outperform human-engineered architectures? This matters because it could automate a critical bottleneck in AI research.
Genesys models the conventional stages of research — ideation, literature search, code generation, pretraining, evaluation — as a multi-agent LLM system. The key innovation is the Ladder of Scales approach: new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly larger model scales (14M→350M parameters) with a narrowing budget at each scale.
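The narrowing-budget idea can be sketched as a simple selection loop. This is a hypothetical illustration, not the paper's actual schedule: the scale rungs, the keep-fraction, and the `evaluate` stub are all assumptions standing in for real pretraining and benchmarking.

```python
import random

# Illustrative sketch of a Ladder-of-Scales selection loop: candidate
# designs are evaluated at increasingly large model scales, and only a
# shrinking fraction survives to the next (more expensive) rung.
SCALES = [14_000_000, 70_000_000, 350_000_000]  # params per rung (illustrative)
KEEP_FRACTION = 0.5                             # budget narrows at each rung

def evaluate(design, scale):
    """Stand-in for pretraining + benchmark evaluation at a given scale."""
    return random.random()  # placeholder fitness score

def ladder_of_scales(designs):
    survivors = list(designs)
    for scale in SCALES:
        scored = sorted(survivors, key=lambda d: evaluate(d, scale), reverse=True)
        keep = max(1, int(len(scored) * KEEP_FRACTION))
        survivors = scored[:keep]  # fewer designs reach each larger scale
    return survivors

survivors = ladder_of_scales([f"design-{i}" for i in range(8)])
print(survivors)  # one design survives three halvings of an 8-design pool
```

The point of the structure is economic: cheap small-scale runs filter out weak designs before any expensive large-scale compute is spent on them.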
The genetic programming (GP) backbone is critical. Rather than using LLMs to directly prompt-generate architectures (which has an ~86% failure rate), Genesys represents architectures as Generalized Autoregressive Blocks (GABs) — a code construct factorizable into discrete tree representations. GP-style operations (crossover, mutation) on these trees produce meaningful architectural variations far more reliably than direct generation.
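The GP operations can be sketched on a toy tree encoding. Everything here is illustrative: the operation names ("attn", "mlp", etc.) and the nested-list tree format are assumptions, not the paper's actual GAB representation.

```python
import copy
import random

# Toy tree encoding of an architecture block: either a leaf operation name,
# or a list [op, left_subtree, right_subtree]. Purely illustrative.
OPS = ["attn", "mlp", "conv", "gate", "norm"]

def mutate(tree):
    """GP mutation: descend to a random position and swap in a new leaf op."""
    if isinstance(tree, str):
        return random.choice(OPS)
    tree = copy.deepcopy(tree)
    i = random.randrange(1, len(tree))  # pick a child subtree to mutate
    tree[i] = mutate(tree[i])
    return tree

def crossover(a, b):
    """GP crossover: exchange the right subtrees of two parent trees."""
    a, b = copy.deepcopy(a), copy.deepcopy(b)
    if isinstance(a, list) and isinstance(b, list):
        a[-1], b[-1] = b[-1], a[-1]
    return a, b

parent_a = ["gate", "attn", "mlp"]
parent_b = ["norm", "conv", "attn"]
child_a, child_b = crossover(parent_a, parent_b)
print(child_a, child_b)  # right subtrees swapped between the parents
```

Because every child produced this way is still a well-formed tree, every offspring decodes back to syntactically valid block code, which is exactly the reliability advantage over freeform LLM generation.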
Results: 1,162 newly discovered designs (1,062 fully verified through pretraining). The best designs outperform GPT-2, Mamba-2, and other known architectures on 6/9 common benchmarks. This is achieved through a principled search process, not brute-force sampling.
The system architecture mirrors human research:
- Designer agents: Propose research ideas and produce executable architecture designs
- Verifier agents: Select designs and perform pretraining
- Evolution tree: Stores seed designs and discovery artifacts, enabling cumulative progress
Unlike traditional Neural Architecture Search (NAS), which searches within human-defined operation spaces (attention heads, convolution kernels), Genesys searches a wider space of operations and architectures while also modeling the surrounding scientific discovery process.
The factorization into GP-representable trees is the insight that makes this practical: it gives the search space a structure that direct LLM generation lacks. The ~86% failure rate of direct prompting, against the far higher success rate of GP operations, suggests that current LLMs need structured representations to do creative design work reliably; they cannot yet generate novel working architectures from freeform description alone.
Source: Novel Architectures
Related concepts in this collection:

- Can computational power accelerate scientific discovery itself? Does the pace of research breakthroughs scale with computing resources, the way model performance does? ASI-ARCH tested this by running thousands of autonomous experiments to discover neural architectures. Relation: ASI-ARCH and Genesys are parallel demonstrations of the same principle, automated architecture discovery at scale.
- Do language models generate more novel research ideas than experts? Explores whether LLMs can break free from expert constraints to generate more novel research concepts; this matters because novelty is often thought to be AI's creative blind spot. Relation: Genesys addresses feasibility through GP structure, since direct LLM generation fails ~86% of the time.
- Why do LLMs generate novel ideas from narrow ranges? LLM research agents produce individually novel ideas but cluster them in homogeneous sets; this explores why high average novelty coexists with poor diversity coverage and what it means for automated ideation. Relation: the GP backbone enforces structural diversity that direct prompting cannot maintain.
Original note title: multi-agent LLM systems discover novel neural architectures competitive with human-designed ones through genetic programming