Bilevel Autoresearch: Meta-Autoresearching Itself
If autoresearch is itself a form of research, then autoresearch can be applied to autoresearch itself. We take this idea literally: we use an autoresearch loop to optimize the autoresearch loop. Every existing autoresearch system, from Karpathy’s single-track loop to AutoResearchClaw’s multi-batch extension and EvoScientist’s persistent memory, was improved by a human who read the code, identified a bottleneck, and wrote new code. We ask whether an LLM can do the same, autonomously. We present Bilevel Autoresearch, a bilevel framework in which an outer loop meta-optimizes the inner autoresearch loop by generating and injecting new search mechanisms as Python code at runtime. The inner loop optimizes the task; the outer loop optimizes how the inner loop searches. Both loops use the same LLM; no stronger model is needed at the meta level. On Karpathy’s GPT pretraining benchmark, the meta-autoresearch outer loop achieves a 5× larger improvement than the standard inner loop alone (−0.045 vs. −0.009 val_bpb), while parameter-level adjustment without mechanism change yields no reliable gain. The outer loop autonomously discovers mechanisms from combinatorial optimization, multi-armed bandits, and design of experiments, without human specification of which domains to explore. These mechanisms succeed by breaking the inner loop’s deterministic search patterns, forcing exploration of directions the LLM’s priors systematically avoid. The core principle is simple: if autoresearch can meta-autoresearch itself, it can, in principle, meta-autoresearch anything with a measurable objective.
Large language models have demonstrated a striking capacity for self-directed scientific iteration: given a task, an LLM can propose a change, execute an experiment, observe the outcome, and decide whether to keep or discard the change. When repeated, this propose–execute–evaluate loop constitutes a form of automated research (Karpathy, 2026). When instantiated for neural network hyperparameter search, this loop is what we call autoresearch. Despite its promise, autoresearch as currently practiced has a fundamental limitation: the search mechanism is fixed at design time. Every system in the literature uses a human-engineered search architecture.
Karpathy (2026) introduced the single-track inner loop with a keep/discard acceptance rule. AutoResearchClaw (AIMing Lab, 2026) extended it with multi-batch parallel search. EvoScientist (EvoScientist Contributors, 2026) added persistent experience memory across runs. A human designed each improvement by reading the prior system’s code, identifying a bottleneck, and writing new code to address it. The systems themselves cannot perform this operation.
This raises a natural question: can an outer loop perform that same design step—reading code, identifying bottlenecks, writing new code—autonomously?
We answer this question affirmatively (fig. 1). We present Bilevel Autoresearch, a bilevel framework with two nested loops: the inner loop optimizes the task (proposing hyperparameter changes, training, evaluating, keeping or discarding); the outer loop optimizes how the inner loop searches, by reading its code, identifying bottlenecks, generating new Python mechanisms, and injecting them at runtime. Both loops use the same LLM; any improvement therefore comes from the bilevel architecture, not from a more capable model.
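To make the bilevel structure concrete, the sketch below shows one plausible shape for the outer loop. It is a minimal illustration, not our implementation: llm.generate, inner_loop.run, and inner_loop.with_propose are hypothetical stand-ins for the LLM call, an inner-loop run returning validation bits per byte, and mechanism injection.

    import inspect

    def meta_autoresearch(llm, inner_loop, task, outer_steps=10):
        """Outer loop: rewrite how the inner loop searches; keep what helps."""
        best_bpb = inner_loop.run(task)            # baseline inner-loop result
        for _ in range(outer_steps):
            source = inspect.getsource(type(inner_loop))  # read inner-loop code
            # Ask the same LLM to diagnose a bottleneck and emit a new search
            # mechanism as Python source, e.g. a replacement propose() function.
            mechanism_src = llm.generate(
                "Read this search loop, identify one bottleneck, and write a "
                "Python function propose(history) that addresses it:\n" + source
            )
            namespace = {}
            exec(mechanism_src, namespace)         # compile mechanism at runtime
            candidate = inner_loop.with_propose(namespace["propose"])  # inject
            bpb = candidate.run(task)
            if bpb < best_bpb:        # keep the new mechanism iff val_bpb improves
                inner_loop, best_bpb = candidate, bpb
        return inner_loop

The essential property is that the outer loop's output arrives as executable code at runtime rather than as a parameter tweak; reusing the inner loop's keep/discard rule at the meta level is one natural acceptance choice for such a sketch.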
Karpathy (2026) introduced the paradigmatic autoresearch loop for neural network hyperparameter search: an LLM reads a training script, proposes a configuration change, executes training for a fixed budget, measures validation loss, and accepts or rejects the change. Iterated, this constitutes a form of LLM-guided hill climbing in configuration space, where the LLM’s world knowledge serves as an implicit prior over promising changes and training outcomes provide gradient-free feedback.
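A minimal sketch of this inner loop follows, assuming two hypothetical helpers: train_and_eval, which trains for the fixed budget and returns validation bits per byte, and llm_propose_change, which wraps the LLM call.

    def autoresearch(config, budget_steps, iters=50):
        """Inner loop: LLM-guided hill climbing over a training configuration."""
        best_bpb = train_and_eval(config, budget_steps)   # baseline val bits/byte
        history = [(dict(config), best_bpb)]
        for _ in range(iters):
            # The LLM reads the config and history and proposes one change,
            # e.g. {"lr": 6e-4}; its world knowledge acts as the proposal prior.
            change = llm_propose_change(config, history)
            candidate = {**config, **change}
            bpb = train_and_eval(candidate, budget_steps)  # fixed training budget
            history.append((candidate, bpb))
            if bpb < best_bpb:       # keep/discard rule: accept iff val_bpb drops
                config, best_bpb = candidate, bpb
        return config, best_bpb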
AutoResearchClaw (AIMing Lab, 2026) extends this framework with multi-batch parallelism: several candidate configurations are evaluated simultaneously, and the best is promoted. This increases the effective branching factor of the search without altering the underlying acceptance mechanism.
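One plausible shape for such a batched step, reusing the hypothetical helpers from the sketch above:

    from concurrent.futures import ProcessPoolExecutor

    def multi_batch_step(config, best_bpb, history, k=4, budget_steps=2000):
        """One round of batched search: propose k candidates, promote the winner."""
        candidates = [{**config, **llm_propose_change(config, history)}
                      for _ in range(k)]
        with ProcessPoolExecutor(max_workers=k) as pool:   # evaluate in parallel
            bpbs = list(pool.map(train_and_eval, candidates, [budget_steps] * k))
        history.extend(zip(candidates, bpbs))
        winner = min(range(k), key=bpbs.__getitem__)       # lower val_bpb wins
        # The acceptance rule is unchanged: the batch winner is promoted only
        # if it improves on the incumbent configuration.
        if bpbs[winner] < best_bpb:
            return candidates[winner], bpbs[winner]
        return config, best_bpb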
EvoScientist (EvoScientist Contributors, 2026) introduces persistent experience memory: lessons from prior runs are summarized and injected into future proposals, enabling cross-run learning. Both of these enhancements were designed by human researchers who inspected the prior system’s code and identified architectural gaps. In all three systems, the structural decisions—when to accept, how to propose, what state to maintain—are made by human designers, not by the system itself.
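To make the memory mechanism concrete, here is a minimal sketch of cross-run experience memory, with llm_summarize as a hypothetical stand-in for the summarization call:

    import json
    import pathlib

    MEMORY = pathlib.Path("experience_memory.jsonl")

    def record_run(history):
        """After a run: distill its (config, val_bpb) trace into a reusable lesson."""
        lesson = llm_summarize("Summarize what helped and what hurt in this run:\n"
                               + json.dumps(history, default=str))
        with MEMORY.open("a") as f:
            f.write(json.dumps({"lesson": lesson}) + "\n")

    def lesson_context():
        """Before a run: prepend prior runs' lessons to the proposal prompt."""
        if not MEMORY.exists():
            return ""
        lessons = [json.loads(line)["lesson"] for line in MEMORY.open()]
        return "Lessons from prior runs:\n- " + "\n- ".join(lessons)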