Does architectural discovery follow an empirical scaling law like neural networks?
This explores whether the *process of discovering new neural network architectures* — not just training existing ones — obeys a predictable scaling law, where more compute reliably buys more breakthroughs.
The question is whether pointing more GPUs at the search for new designs yields proportionally more good designs, the way more parameters and data yield lower loss. The corpus's most direct answer is yes: ASI-ARCH ran 1,773 autonomous experiments and surfaced 106 state-of-the-art architectures, finding that the rate of architectural breakthroughs scaled predictably with GPU compute (see: Can computational power accelerate scientific discovery itself?). The striking claim isn't just that automated search works; it's that *discovery* becomes a computation-bound process rather than a human-creativity-bound one, which is a different kind of scaling law from the classic Kaplan/Chinchilla curves relating loss to parameters and data.
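One way to make 'scales predictably with compute' concrete is to fit a curve to cumulative discoveries against GPU-hours. The sketch below assumes a power-law form, discoveries ≈ a · compute^b, which the notes here do not specify. The function `fit_power_law` and the data points are hypothetical and synthetic, not ASI-ARCH measurements. An exponent b near 1 would correspond to the 'proportionally more good designs' reading.

```python
import math


def fit_power_law(compute, discoveries):
    """Least-squares fit of discoveries ~= a * compute**b in log-log space.

    Returns (a, b). Assumes all inputs are strictly positive.
    """
    xs = [math.log(c) for c in compute]
    ys = [math.log(d) for d in discoveries]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares on the log-transformed points.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic illustration only: points generated from 0.005 * hours**1.0.
# These are NOT measurements from ASI-ARCH or any other system.
gpu_hours = [1_000, 2_000, 4_000, 8_000, 16_000]
cumulative_sota = [5, 10, 20, 40, 80]

a, b = fit_power_law(gpu_hours, cumulative_sota)
print(f"fitted exponent b = {b:.3f}, prefactor a = {a:.5f}")
```

On real experiment logs, a fitted exponent well below 1 would indicate diminishing returns, and the scatter around the fitted line would show how predictable the relationship actually is.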
What makes this more than a one-paper result is that two other systems land in the same territory from different angles. Genesys used multi-agent LLMs with genetic programming to generate 1,062 novel architectures, with top designs beating GPT-2 and Mamba-2 on 6 of 9 benchmarks. Crucially, it found that *how* you represent the search space matters enormously: a structured genetic-programming representation lifted design success from 14% to nearly 100% compared with letting an LLM generate code freely (see: Can AI systems discover better neural architectures than humans?). Meanwhile, AUTORESEARCHCLAW posted a 411% F1 improvement on LoCoMo by reading code and reasoning about system-level interactions, which AutoML categorically cannot do (see: Can autonomous research pipelines discover AI architectures that AutoML cannot?). So the scaling isn't raw brute force; it's compute *plus* a smarter search representation. That qualifier matters, because it's the difference between 'throw more GPUs at it' and 'throw more GPUs at a well-structured search.'
Here's the twist: even as discovery scales, the corpus is full of evidence that *within* architectures, the simple 'bigger is better' scaling story is fraying. MobileLLM shows that deep-and-thin designs beat balanced ones by 2.7–4.3% accuracy at 125M–350M scale, contradicting the classic Kaplan finding that model shape matters little at a fixed parameter count (see: Does depth matter more than width for tiny language models?). Recommender research finds that inductive bias and constraint design, such as removing hidden layers and enforcing self-similarity constraints, beat added depth and capacity (see: What architectural choices actually improve recommender system performance?). And a parallel survey argues the scaling frontier has *moved*: returns from restructuring memory now exceed returns from adding parameters (see: Has memory architecture replaced parameter count as the scaling frontier?). Read together, this is the deeper story: as the payoff from naive parameter scaling flattens, the scaling action relocates *up a level*, to the search for clever architectures, which is exactly what ASI-ARCH found scales with compute.
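The depth-versus-width comparison only makes sense at a fixed parameter budget. As a rough sketch, the snippet below uses the common approximation that a standard transformer has about 12 · n_layers · d_model² non-embedding parameters (assuming a 4× feed-forward expansion). The two shapes are hypothetical illustrations, not MobileLLM's published configurations, and the formula ignores details such as gated feed-forward layers or grouped-query attention.

```python
import math


def non_embedding_params(n_layers: int, d_model: int) -> int:
    """Approximate non-embedding parameter count of a standard transformer.

    Per layer: about 4*d^2 for attention projections plus 8*d^2 for an MLP
    with 4x expansion, giving 12*d^2. Biases and norms are ignored.
    """
    return 12 * n_layers * d_model ** 2


def width_for_budget(budget: int, n_layers: int) -> int:
    """Largest integer d_model whose parameter count stays within the budget."""
    return math.isqrt(budget // (12 * n_layers))


# Hypothetical deep-and-thin shape (layers, width), for illustration only.
deep_layers, deep_width = 30, 576
budget = non_embedding_params(deep_layers, deep_width)

# A shallower model at the same budget has to be wider.
shallow_layers = 12
shallow_width = width_for_budget(budget, shallow_layers)

print(f"budget: {budget:,} non-embedding parameters")
print(f"deep-thin:    {deep_layers} layers x {deep_width} wide")
print(f"shallow-wide: {shallow_layers} layers x {shallow_width} wide")
```

Holding the budget constant like this is what lets a depth-versus-width result be read as a claim about shape rather than about size.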
There's also a useful family resemblance here. The same 'test-time scaling' logic that governs reasoning shows up in search budgets: search steps follow nearly identical scaling curves to reasoning tokens (see: How does search scale like reasoning in agent systems?). So 'discovery scales with compute' isn't an isolated curiosity; it's part of a broader late-2025 pattern in which more and more processes (reasoning, retrieval, and now architecture search) turn out to have a compute axis you can dial.
A note of caution the corpus supplies on its own: scaling laws describe averages, not understanding. Models can hit identical benchmark scores while harboring fundamentally broken internal organization that standard metrics never see (see: Can models be smart without organized internal structure?), and transformers can ace in-distribution compositional tasks by memorizing subgraphs rather than learning rules (see: Do transformers actually learn systematic compositional reasoning?). If autonomous discovery optimizes against benchmarks that mask these failures, a smooth scaling curve could be buying you architectures that are predictably good at the test and quietly fragile everywhere else. The discovery law may hold; the question is what, exactly, it's discovering.
Sources (9 notes)
Can computational power accelerate scientific discovery itself? ASI-ARCH discovered 106 state-of-the-art architectures through 1,773 autonomous experiments, revealing that architectural breakthroughs scale predictably with GPU compute. This transforms research from human-limited to computation-scalable.
Can AI systems discover better neural architectures than humans? Genesys, a multi-agent LLM system using genetic programming and a Ladder of Scales verification process, discovered 1,062 novel architectures, with top designs outperforming GPT-2 and Mamba-2 on 6 of 9 benchmarks. Structured GP representation proved critical, improving design success from 14% to nearly 100% versus direct LLM generation.
Can autonomous research pipelines discover AI architectures that AutoML cannot? AUTORESEARCHCLAW achieved 411% F1 improvement on LoCoMo through bug fixes, architectural changes, and prompt engineering, each individually exceeding all hyperparameter tuning combined. This demonstrates a categorical capability gap: autoresearch can read code and reason about system-level interactions; AutoML cannot.
Does depth matter more than width for tiny language models? MobileLLM shows deep-and-thin architectures yield 2.7–4.3% accuracy gains over balanced designs at 125M–350M scale by composing abstract concepts through layers rather than spreading parameters across width.
What architectural choices actually improve recommender system performance? Research shows that architectural choices like removing hidden layers, enforcing constraints on self-similarity, and using appropriate likelihood functions deliver better results than deeper or more complex models. This suggests that problem-specific design decisions matter more than raw representational capacity.
Has memory architecture replaced parameter count as the scaling frontier? Three converging signals in late-2025 research (taxonomy maturation, memory-aware test-time scaling loops, and hybrid sparsity laws) show that returns from restructuring memory now exceed returns from adding parameters. The design bottleneck has shifted from compute to memory structure.
How does search scale like reasoning in agent systems? Test-time scaling laws generalize from reasoning to retrieval: search steps follow identical scaling curves to reasoning tokens, making deep research a test-time scaling problem. This insight reframes search as a compute axis comparable to chain-of-thought reasoning.
Can models be smart without organized internal structure? Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
Do transformers actually learn systematic compositional reasoning? Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.