Virtuous Machines: Towards Artificial General Science

Paper · arXiv 2508.13421 · Published August 19, 2025

Artificial intelligence systems are transforming scientific discovery by accelerating specific research tasks, from protein structure prediction to materials design, yet remain confined to narrow domains requiring substantial human oversight. The exponential growth of scientific literature and increasing domain specialisation constrain researchers’ capacity to synthesise knowledge across disciplines and develop unifying theories, motivating exploration of more general-purpose AI systems for science. Here we show that a domain-agnostic, agentic AI system can independently navigate the scientific workflow – from hypothesis generation through data collection to manuscript preparation. The system autonomously designed and executed three psychological studies on visual working memory, mental rotation, and imagery vividness, ran one new online data collection with 288 participants, developed analysis pipelines through continuous coding sessions exceeding 8 hours, and produced complete manuscripts. The results demonstrate the capability of AI scientific discovery pipelines to conduct non-trivial research with theoretical reasoning and methodological rigour comparable to experienced researchers, though with limitations in conceptual nuance and theoretical interpretation. This is a step toward embodied AI that can test hypotheses through real-world experiments, accelerating discovery by autonomously exploring regions of scientific space that human cognitive and resource constraints might otherwise leave unexplored. It raises important questions about the nature of scientific understanding and the attribution of scientific credit.

The end-to-end pipeline includes: (1) a hypothesis formulation engine that identifies potential research questions and testable predictions by searching and validating novelty, breakthrough potential, and feasibility; (2) an experimental protocol engine that designs methodologies, presented as a pre-registration report following Open Science Framework guidelines56, and includes preliminary power analyses as required; (3) an implementation engine, currently interfaced with platforms for cognitive science; (4) a data analysis engine that designs and executes a transparent processing pipeline, covering raw data cleaning, outlier analysis, statistical testing, and interpretation of outcomes; (5) scientific decision-making, specifically synthesising and analysing experimental outcomes through inference frameworks to determine follow-up experiments and/or studies; (6) a visualisation engine which designs and constructs a set of figures and tables to illustrate results collated across experiments; (7) drafting of a complete manuscript incorporating visualisations and validated citations; (8) ‘peer’-style evaluation; and (9) construction of a final formatted manuscript. It achieves a key goal by bridging discovery from in silico computational domains to the real world, enabling the system to conduct empirical testing of hypotheses with experimental interventions on human participants, and to perform detailed analyses of complex, noisy real-world data. While demonstrated here through cognitive psychology experiments, the architecture employs domain-general principles designed to be applicable across diverse scientific fields and achieves a fundamental goal of the emerging ‘self-driving-laboratory’ paradigm57,58.
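The nine stages above can be pictured as a sequential orchestration in which each engine consumes the artefacts of its predecessors. The following is a minimal sketch of that control flow; the `PipelineState` class, stage names, and `run_stage` stub are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

# Stage names paraphrase the nine engines described in the text.
STAGES = [
    "hypothesis_formulation",
    "experimental_protocol",
    "implementation",
    "data_analysis",
    "scientific_decision_making",
    "visualisation",
    "manuscript_drafting",
    "peer_style_evaluation",
    "final_formatting",
]

@dataclass
class PipelineState:
    artefacts: dict = field(default_factory=dict)  # stage name -> output
    log: list = field(default_factory=list)        # execution order

def run_stage(name: str, state: PipelineState) -> PipelineState:
    # Placeholder: a real engine would transform upstream artefacts,
    # e.g. the pre-registration report feeding the implementation engine.
    state.artefacts[name] = f"<output of {name}>"
    state.log.append(name)
    return state

def run_pipeline() -> PipelineState:
    state = PipelineState()
    for stage in STAGES:
        state = run_stage(stage, state)
    return state
```

The linear loop is a simplification: as described later, the decision-making stage can trigger follow-up experiments, so a production system would allow cycles back to earlier stages.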

Human-Inspired Cognitive Operators

LLMs exhibit broad capabilities62 yet typically struggle with planning over extended durations and self-verification63,64. To address these limitations, we established a foundational cognitive control framework for the system comprising four operators derived from psychological science – abstraction, metacognition, decomposition, and autonomy (Figure 2). These operators coordinate planning, tool use, monitoring, evaluation, and refinement across research workflows, serving as computational analogues of the human executive functions that facilitate complex multi-stage inquiries. Each operator draws upon and extends established techniques, combined within the multi-agent system to support empirical investigation with minimal human oversight.

Abstraction. The process of focusing on general patterns rather than instance-specific details65,66 was operationalised as knowledge induction by enabling agents to develop their own heuristics and instructions rather than constraining them with predetermined directives. Concretely, this involved initial elicitation of latent background premises67, self-driven exploration of problem scope (conceptually related to the ‘Self-Ask’ method68), and automated instruction generation (as validated previously69). By beginning with universal principles, the system maintains a broader conceptual search space for potential scientific insights. This implementation mirrors how human scientists maintain conceptual flexibility when developing novel theories70, allowing exploration across disciplinary boundaries that might otherwise be constrained by specialised training.
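The three steps of knowledge induction described above (premise elicitation, Self-Ask-style scope exploration, and automated instruction generation) can be sketched as follows. The `query_llm` stub and the `induce_instructions` function are hypothetical stand-ins for the system's actual model calls.

```python
def query_llm(prompt: str) -> str:
    # Stub for demonstration; a real system would call a frontier model here.
    return f"[model response to: {prompt[:40]}...]"

def induce_instructions(task: str, n_premises: int = 3) -> list[str]:
    # Step 1: elicit latent background premises relevant to the task.
    premises = [
        query_llm(f"State a background premise relevant to: {task}")
        for _ in range(n_premises)
    ]
    # Step 2: self-driven exploration of problem scope (Self-Ask style).
    follow_up = query_llm(f"What sub-question must be answered first for: {task}")
    # Step 3: generate the agent's own working heuristics from induced context,
    # rather than imposing predetermined directives.
    instructions = query_llm(
        "Write heuristics for the task given these premises: "
        + "; ".join(premises)
        + f" and this open question: {follow_up}"
    )
    return premises + [follow_up, instructions]
```

Because the instructions are induced rather than hard-coded, the same scaffold can in principle be reused across domains, which is the point the paragraph makes about maintaining a broad conceptual search space.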

Metacognition. Awareness and regulation of one’s own thinking processes71,72 was operationalised at two levels, individual and collective, to assess and refine agent reasoning. While frontier LLMs inherently employ forms of internal test-time compute that dynamically scale with task complexity, the system implements explicit self-evaluation protocols that assess evidence quality, logical coherence, and rigour. At the individual level, this was implemented through self-reflective chains of thought73, enabling each agent to interrogate its underlying assumptions prior to reaching conclusions. At the collective level, agent groups developed awareness of their joint thinking through a reflective process operating on all agents’ reasoning traces, similar to ‘Tree of Thoughts’ inference74, but across several different agents and utilising an external ‘Agent-as-a-Judge’75 to assess, refine, and arbitrate those traces to align the group on a decided path. These structured self-reflection mechanisms enhance accuracy in complex reasoning tasks73,76 and facilitate transparent documentation of the evaluation processes.
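The two-level structure (individual self-reflection followed by collective arbitration through an external judge) can be sketched as below. The `agent_reason`, `self_reflect`, and `judge` functions are illustrative placeholders for model calls; the judge's selection criterion here is deliberately trivial.

```python
def agent_reason(agent_id: int, question: str) -> str:
    # Placeholder for an agent producing a reasoning trace.
    return f"agent{agent_id}: reasoning about {question}"

def self_reflect(trace: str) -> str:
    # Individual level: the agent interrogates its own assumptions
    # before committing to a conclusion.
    return trace + " | assumptions checked"

def judge(traces: list[str]) -> str:
    # Collective level: an external 'Agent-as-a-Judge' operates over all
    # agents' reasoning traces and arbitrates a single agreed path.
    # Trivial stand-in criterion: pick the shortest trace.
    return min(traces, key=len)

def deliberate(question: str, n_agents: int = 3) -> str:
    traces = [self_reflect(agent_reason(i, question)) for i in range(n_agents)]
    return judge(traces)
```

Separating reflection (per agent) from arbitration (one judge over all traces) is what makes the group-level evaluation documentable: every trace the judge saw can be logged alongside the decision.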

Decomposition. The breaking down of complex problems into more manageable components77,78 was operationalised in the framework as explicit structuring of the solution search space. This decomposition enhances the system’s capacity to manage the intricacy of multi-stage scientific workflows while maintaining precision at each step. Specifically, parameterisation of logical reasoning steps, conceptually aligned with ‘least-to-most’ prompting79, identifies constituent task components. This improves the tractability and transparency of multi-stage scientific workflows and enables verification and refinement of each component to maintain step-level precision. In addition, the system’s recursive divide-and-conquer agentic architecture facilitates on-demand subdivision of effort as required, providing flexibility to adapt to challenging tasks80, thereby improving reliability in the production of the required scientific deliverables.
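The recursive divide-and-conquer pattern can be sketched as follows: a task is subdivided on demand until each component falls below a tractability threshold, and only then executed. The `Task` class, the integer complexity estimate, and the threshold are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Task:
    description: str
    complexity: int  # abstract effort estimate

def split(task: Task) -> list[Task]:
    # Least-to-most style subdivision into two simpler components.
    half = task.complexity // 2
    return [
        Task(task.description + " (part A)", half),
        Task(task.description + " (part B)", task.complexity - half),
    ]

def solve(task: Task, threshold: int = 2) -> list[str]:
    # On-demand subdivision: recurse until each component is tractable,
    # then execute the leaf tasks and collect their deliverables.
    if task.complexity <= threshold:
        return [f"done: {task.description}"]
    results = []
    for sub in split(task):
        results.extend(solve(sub, threshold))
    return results
```

Because subdivision happens only when a component exceeds the threshold, easy tasks incur no decomposition overhead while hard tasks are broken down as deeply as needed, which is the flexibility the paragraph attributes to the architecture.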

The system successfully pursued new lines of scientific inquiry, from conception through analysis and interpretation, leading to empirical findings on human cognition.

Clarity and fluency in scientific writing. All three manuscripts demonstrated clear, professional scientific writing that adhered to disciplinary conventions and maintained coherence across sections. The writing exhibited appropriate use of technical terminology and followed established formatting standards for scientific publication. Study 3’s ‘Alternative Mechanisms and Neural Efficiency’ section was particularly well-developed, providing an innovative and theoretically grounded interpretation of the results.

Creative and theoretically motivated research questions. The system displayed originality in framing research questions, often exploring relationships not commonly addressed with the given cognitive paradigms. Study 1 explored task order effects and tested multiple model fits to describe the data, demonstrating sophisticated analytical thinking beyond the most obvious research directions. Study 2 investigated the link between mental imagery vividness and serial dependence in VWM and MRT tasks. Study 3 proposed an alternative explanation for null results grounded in individual differences in neural efficiency.

Of particular concern is the potential for producing research outputs at a volume that overwhelms human researchers, necessitating innovative approaches to distil and communicate findings effectively within human cognitive capacity constraints. Notably, autonomous scientific systems are likely to generate a high proportion of null findings, which, while typically remaining unpublished in traditional research146, offer under-utilised value. By documenting non-significant outcomes, these results may mitigate publication bias146, as well as characterise the ‘null space’ of research fields – highlighting where relationships are absent and conversely where they may exist. This dual benefit reduces wasted resources on redundant experiments and can guide researchers toward better-informed hypotheses for promising investigations.

On a philosophical level, the empirical results of this study raise intriguing questions about the nature of knowledge, particularly the role of understanding in knowledge generation. While traditional epistemological frameworks posit human comprehension as an intrinsic component of knowledge creation150, the system shown here demonstrates that structured scientific inquiry can produce valid empirical knowledge without requiring human-like understanding. Consequently, knowledge may be derived from mechanistic processes without necessitating conscious insight151,152 – a distinction that invites consideration of how we conceptualise scientific knowledge and the processes through which it emerges.

In ∼17 hours of system runtime (excluding data collection), with minimal human oversight, the system conceived, ran, analysed, and produced complete manuscripts for an online psychology experiment with 288 human participants. We replicated this capability across three distinct studies. As far as we are aware, this is the first demonstration of autonomous, end-to-end experimental research with human participants.