Language Modeling by Language Models
Can we leverage LLMs to model the process of discovering novel language model (LM) architectures? Inspired by real research, we propose a multi-agent LLM approach that simulates the conventional stages of research, from ideation and literature search (proposal stage) to design implementation (code generation), generative pre-training, and downstream evaluation (verification). Using ideas from scaling laws, our system Genesys employs a Ladder of Scales approach: new designs are proposed, adversarially reviewed, implemented, and selectively verified at increasingly larger model scales (14M∼350M parameters) with a narrowing budget (the number of models we can train at each scale). To help make discovery efficient and factorizable, Genesys uses a novel genetic programming backbone, which we show has empirical advantages over commonly used direct prompt generation workflows (e.g., a ∼86 percentage-point improvement in successful design generation, a key bottleneck). We report experiments involving 1,162 newly discovered designs (1,062 fully verified through pre-training) and find the best designs to be highly competitive with known architectures (e.g., outperforming GPT2, Mamba2, etc., on 6/9 common benchmarks).
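The Ladder of Scales idea above can be sketched as a simple tournament: candidate designs advance through increasing model scales while the training budget narrows at each rung. The function below is a minimal, hypothetical sketch; the scales, budgets, and scoring stand-in are illustrative assumptions, not Genesys's actual schedule or implementation.

```python
# Hypothetical sketch of a "Ladder of Scales" verification schedule:
# at each successively larger scale, only the top-`budget` designs
# survive to be trained at the next scale.

def ladder_of_scales(candidates, scales, budgets, train_and_score):
    """Keep only the top-`budget` designs at each successive scale."""
    survivors = list(candidates)
    for scale, budget in zip(scales, budgets):
        scored = [(train_and_score(d, scale), d) for d in survivors]
        scored.sort(reverse=True)                  # best score first
        survivors = [d for _, d in scored[:budget]]
    return survivors

# Toy usage: pretend a design's quality is simply its numeric id.
best = ladder_of_scales(
    candidates=range(16),
    scales=["14M", "31M", "125M", "350M"],         # parameter counts
    budgets=[8, 4, 2, 1],                          # narrowing budget
    train_and_score=lambda design, scale: design,  # stand-in for pre-training
)
# `best` holds the single design surviving the final rung.
```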
Our Language Model Architecture Discovery Environment (LMADE) consists of two core resources: a general-purpose knowledge engine that provides access to the academic literature, and a verification engine that provides tools for performing model pre-training and evaluation. Genesys itself consists of LLM-driven designer agents that propose new research ideas and produce executable architecture designs, and verifier agents that select designs and perform on-the-fly generative pre-training. At the core of Genesys is an evolution tree that stores seed designs and new discovery artifacts. These artifacts are implemented using a special code construct called a generalized autoregressive block (GAB) (Figure 3), which is capable of expressing a wide range of neural architecture types and is factorizable into discrete tree representations that allow us to employ efficient genetic programming (GP)-style optimization.
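To make the GAB construct concrete, the following is a hedged, dependency-free sketch of what such a block interface might look like: a causal sequence-to-sequence transformation with a fixed contract that any candidate design must satisfy. The class and method names here are assumptions for illustration, not the paper's actual code construct.

```python
# Illustrative sketch of a "generalized autoregressive block" (GAB)
# interface (names are hypothetical). Any concrete design B_LM fills in
# `forward` while preserving the causal, shape-preserving contract.

class GAB:
    """A block design B_LM: a causal sequence-to-sequence transformation."""

    def __init__(self, d_model: int):
        self.d_model = d_model

    def forward(self, x):
        """Map a (seq_len, d_model) sequence to a same-shape sequence,
        using only positions <= t to produce position t (autoregressive)."""
        raise NotImplementedError

class CumulativeMeanBlock(GAB):
    """Trivial causal example: each position averages its prefix."""

    def forward(self, x):
        out, running = [], [0.0] * self.d_model
        for t, row in enumerate(x, start=1):
            running = [r + v for r, v in zip(running, row)]
            out.append([r / t for r in running])
        return out
```

Because every design implements the same interface, blocks can be swapped into a model skeleton and represented as discrete trees for GP-style crossover and mutation.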
AI in Scientific Discovery AI approaches to automated scientific discovery (ASD) have recently proliferated; we focus on the challenging task of neural architecture discovery, which offers a clear objective yet involves many new challenges for ASD.
Neural Architecture Search (NAS) Lastly, we take inspiration from the NAS literature (Chitty-Venkata et al., 2022; White et al., 2023; Elsken et al., 2019; Chen et al., 2023), which shares our aim of discovering improved architectures. Unlike that line of work, which traditionally searches fixed operation spaces (e.g., attention heads, convolution kernels), we target a broader space of operations and architectures and, importantly, attempt to model the broader scientific discovery process. We follow many approaches in NAS that employ genetic programming (GP) techniques (Koza, 1994) and more recent approaches that mix GP with LLMs (Hemberg et al., 2024; Romera-Paredes et al., 2024).
3 Language Model Architecture Discovery
As illustrated in Fig. 3 ①, standard LMs work by embedding input, then applying N layer or block transformations over that input to produce a final representation (e.g., one that can be used for next-token prediction as in autoregressive LMs). Central to any layer/block is a block design, concretely a piece of code B_LM, which dictates how information flows through a network. Our goal is to jointly discover novel autoregressive block designs B_LM while also modeling the broader research process associated with producing B_LM. In this section, we define this problem formally (§ 3.1) and introduce our Language Model Architecture Discovery Environment (LMADE) (§ 3.2), which provides the foundational tools used for discovery and for evaluating block designs B_LM.
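The standard LM structure described above (embed, apply N block transformations, read out a final representation) can be sketched in a few lines. This is a minimal, dependency-free illustration with assumed names; a real implementation would use learned embeddings and a next-token prediction head.

```python
# Minimal sketch of the standard LM pipeline: embed input tokens, apply
# N block transformations (each an instance of a block design B_LM),
# and return the final per-position representation.

def run_lm(token_ids, embed_table, blocks):
    """embed -> N x B_LM -> final representation per position."""
    x = [embed_table[t] for t in token_ids]  # token embedding lookup
    for block in blocks:                     # N layer/block transformations
        x = block(x)                         # information flow set by B_LM
    return x                                 # e.g., fed to a next-token head

# Toy usage: an identity block stack over a 2-token vocabulary.
embed_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
identity = lambda seq: seq
reps = run_lm([0, 1, 1], embed_table, blocks=[identity] * 3)
```

The discovery problem then amounts to searching over the code implementing each `block` while holding the surrounding pipeline fixed.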