What Does It Take to Be a Good AI Research Agent? Studying the Role of Ideation Diversity
AI research agents promise to accelerate scientific progress by automating the design, implementation, and training of machine learning models. However, the field is still in its infancy, and the key factors driving the success or failure of agent trajectories are not fully understood. We examine the role that ideation diversity plays in agent performance. First, we analyse agent trajectories on MLE-bench, a well-known benchmark for evaluating AI research agents, across different models and agent scaffolds. Our analysis reveals that models and scaffolds vary in the degree of ideation diversity they produce, and that higher-performing agents tend to exhibit greater ideation diversity. We then run a controlled experiment in which we vary the degree of ideation diversity, demonstrating that higher ideation diversity yields stronger performance. Finally, we strengthen our results by examining evaluation metrics beyond the standard medal-based scoring of MLE-bench, showing that our findings hold across other measures of agent performance.
In particular, recent work on autonomous AI research agents (Shen et al., 2023; Huang et al., 2024; Toledo et al., 2025; Zhao et al., 2025) improves upon automated machine learning engineering tools (Feurer et al., 2022) by mirroring the cognitive process of human researchers through a structured research pipeline: idea generation and hypothesis setup, experimental design and implementation, empirical validation, and iterative refinement. Recent advances have reached notable milestones, such as the creation of the first fully autonomous AI-generated research paper accepted through peer review (Yamada et al., 2025).
Despite the potential of these recent breakthroughs in automating AI science, the field is still in its infancy, and little is understood about the factors driving agents' successes and failures. Error analysis is substantially more complicated than in classic machine learning setups: trajectories are long and multi-step, often guided by heuristic search algorithms (Toledo et al., 2025) and interleaved with tool use, which requires complex evaluation frameworks. Moreover, obtaining samples large enough to perform meaningful analysis and to ablate design choices can be computationally prohibitive.
This paper starts from the postulate that ideation diversity is a key bottleneck in the performance of AI research agents. To study this hypothesis, we face two key challenges: analysing complex agentic trajectories at scale, and measuring and controlling ideation diversity.
What does it take to be a good AI research agent? We can imagine a hypothetical future in which excellent AI research agents ideate brilliant experiments and have the outstanding coding skills to implement them. Until then, even state-of-the-art AI research agents will exhibit limited ideation and implementation capabilities, particularly when evaluated in challenging, real-world settings. In this imperfect yet realistic scenario, given the same level of capability, we prefer agents with greater ideation diversity. First, diversity de-risks implementation pitfalls. Our analysis of the controlled experiment shows that one reason diversity matters is that it helps agents design solutions they are actually able to execute, highlighting the interplay between ideation and implementation: if the agent's proposed plans all rely on similar approaches, and those approaches happen to be hard for the agent to implement in the context of the particular task, then implementation accuracy suffers across the board. Second, and more intuitively, given the difficulty of producing creative yet feasible research ideas, exploring significantly different paths hedges against pursuing a single unproductive direction (even one the agent knows how to implement) and enables agents to explore the solution space of machine learning problems more effectively: we want to invest the allocated compute in a diversified yet plausible set of ideas. This second benefit, however, is hard to evaluate given the implementation bottleneck, since even a good experimentation plan can fail because the agent is unable to implement it. Repeating these controlled experiments as LLMs' coding capabilities become increasingly powerful may yield valuable insights.
Importance of the implementation bottleneck. Unsurprisingly, implementation quality is an important bottleneck for AI research agents: we observe a strong correlation between agents' performance and their ability to implement sufficiently complex solutions. Aggregating the performance of AIRA (Greedy and MCTS) for each LLM, Figure 10 shows that, on average, the more time an agent spends on each successfully implemented solution (including ideation, implementation, and model training), the more medals it earns, suggesting that performance increases with the agent's ability to implement more complex solutions. Furthermore, Figure 11 shows that agents perform better when, out of the 24 hours allotted per task, they spend a higher proportion of time on successfully implemented solutions. However, since LLMs and coding agents are improving rapidly (Kwa et al., 2025), particularly on verifiable tasks (DeepSeek-AI et al., 2025), we hypothesize that the relative importance of the ideation and planning phase may increase over time, not to de-risk implementation pitfalls, but to explore the solution space efficiently.
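The two trajectory-level quantities discussed above (mean time per successfully implemented solution, and the fraction of the time budget spent on such solutions) can be computed directly from per-solution logs. The `NodeRecord` schema below is hypothetical, a minimal sketch assuming each trajectory node records its duration and whether its implementation succeeded:

```python
from dataclasses import dataclass


@dataclass
class NodeRecord:
    """Hypothetical per-solution record extracted from an agent trajectory."""
    duration_s: float      # ideation + implementation + model-training time
    implemented_ok: bool   # did the solution run and produce a valid submission?


def trajectory_stats(nodes: list[NodeRecord], budget_s: float = 24 * 3600):
    """Return (mean seconds per successful solution,
    fraction of the time budget spent on successful solutions)."""
    ok = [n for n in nodes if n.implemented_ok]
    ok_time = sum(n.duration_s for n in ok)
    mean_time = ok_time / len(ok) if ok else 0.0
    return mean_time, ok_time / budget_s
```

For example, a trajectory with two successful solutions of 1 and 2 hours plus a failed half-hour attempt yields a mean of 1.5 hours per successful solution and 12.5% of the 24-hour budget spent on successful work.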