The gap between what frontier models can do in isolated tasks and what they accomplish over sustained, iterative work remains poorly understood. Most benchmarks measure single-turn reasoning or short agent episodes and miss the grinding reality of scientific and engineering progress. AutoLab's contribution is architectural: by forcing models to repeatedly propose, test, measure, and refine artifacts within a wall-clock budget, it surfaces a surprising decoupling between initial capability and persistent optimization ability. This suggests that open-world evaluations of messy long-horizon tasks may reveal different bottlenecks than traditional benchmarks do. Recent work has also shown that short interaction benchmarks do not reliably predict long-horizon workflow performance, and that reinforcement learning can scale to long-horizon multi-turn tasks when credit assignment is properly engineered. The deeper tension is whether persistence and time awareness emerge from model architecture, training procedure, or prompt design, and whether they can be taught systematically or we are still just observing which models happen to stumble into the right behavior.
Abstract
Scientific and engineering progress is fundamentally a long-horizon iterative process: proposing changes, running experiments, measuring outcomes, and continuously refining artifacts. Yet existing benchmarks for frontier models primarily evaluate either single-turn responses or short-horizon agent trajectories, failing to capture the challenges of sustained iterative improvement over extended time horizons. To address this gap, we introduce AutoLab, a new benchmark for ultra-long-horizon closed-loop optimization. AutoLab consists of 36 realistic, expert-curated tasks spanning four diverse domains: system optimization, puzzle & challenge, model development, and CUDA kernel optimization. Each task begins with a correct but deliberately suboptimal baseline and challenges agents to improve it within a strict wall-clock budget. Evaluating 17 state-of-the-art models reveals that the dominant predictor of success is not the quality of an agent's initial attempt, but its persistence in repeatedly benchmarking, editing, and incorporating empirical feedback. While claude-opus-4.6 exhibits strong long-horizon optimization capabilities, most frontier models, including several proprietary ones, either terminate prematurely or exhaust their budgets with minimal progress. These results underscore the importance of time awareness and persistent iteration in autonomous agents. We open-source the full benchmark, evaluation harness, and task artifacts to accelerate research toward truly capable long-horizon agents.
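The closed-loop setting described above (start from a working baseline, then repeatedly propose, measure, and keep improvements until a wall-clock budget runs out) can be illustrated with a minimal Python sketch. This is not AutoLab's evaluation harness: the function `optimize_within_budget`, its parameters `propose`, `measure`, and `clock`, and the assumption of a single scalar score where lower is better are all hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Tuple


def optimize_within_budget(
    baseline: float,
    propose: Callable[[float], float],
    measure: Callable[[float], float],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[float, float, int]:
    """Hypothetical propose/measure/refine loop under a wall-clock budget.

    Assumes a scalar score where lower is better. Returns the best
    candidate, its score, and the number of completed iterations.
    """
    deadline = clock() + budget_seconds

    # The baseline is correct but suboptimal; it is the score to beat.
    best = baseline
    best_score = measure(baseline)
    iterations = 0

    # Keep iterating until the budget is spent; stopping earlier would
    # leave budget unused.
    while clock() < deadline:
        candidate = propose(best)      # propose a change to the current best
        score = measure(candidate)     # benchmark it empirically
        iterations += 1
        if score < best_score:         # keep only measured improvements
            best, best_score = candidate, score

    return best, best_score, iterations
```

For example, calling `optimize_within_budget(16.0, lambda x: x / 2, abs, 0.01)` halves the candidate on every iteration for roughly ten milliseconds. The `clock` parameter is injectable so the loop can be exercised deterministically with a fake clock; a real agent would also need to account for the time each measurement takes, which this sketch does not model.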