Large language models surpass human experts in predicting neuroscience results

Paper · arXiv 2403.03230 · Published March 4, 2024

Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. Here, to evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs indicated high confidence in their predictions, their responses were more likely to be correct, which presages a future where LLMs assist humans in making discoveries. Our approach is not neuroscience specific and is transferable to other knowledge-intensive endeavours.

It is an open question whether large language models (LLMs), trained on general text and scientific articles, can predict the outcomes of experiments. If LLMs’ predictions surpassed those of human experts, the practice of science and the pace of discovery would radically change.

How can we formally evaluate the predictive abilities of LLMs in neuroscience? With the rise of LLMs, there has been a surge in evaluation benchmarks, many of which focus on assessing LLMs’ capabilities in scientific domains. Most benchmarks evaluate core knowledge retrieval and reasoning abilities, which are typically backward-looking (Fig. 1). Backward-looking benchmarks include MMLU [14], PubMedQA [15] and MedMCQA [16]. These benchmarks are structured in a question-and-answer format, where models must demonstrate extensive world knowledge, retrieve relevant information based on the context of the question, and answer correctly. However, none of these benchmarks is suitable for evaluating the ability of models to predict novel outcomes, which is inherently forward-looking (Fig. 1).

BrainBench evaluates whether LLMs have seized on the fundamental patterning of methods and results that underlie the structure of neuroscience. Can LLMs outperform human experts on this forward-looking benchmark? In particular, BrainBench evaluates how well the test-taker can predict neuroscience results from methods by presenting two versions of an abstract from a recent journal article.
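The two-version format can be pictured as a two-alternative choice: the test-taker sees an original abstract and an altered one and must identify the original. The following is a minimal sketch of how such a task could be scored, not a description of the actual BrainBench pipeline. The `score` function (higher means "more plausible"), the use of the score gap as a confidence proxy, and the toy items and scores are all assumptions introduced for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical item layout: (original_abstract, altered_abstract).
Item = Tuple[str, str]


def pick_original(score: Callable[[str], float], item: Item) -> Tuple[bool, float]:
    """Score both versions of one item and report (correct, confidence).

    `score` is an assumed plausibility function supplied by the test-taker:
    a higher value means the abstract is judged more plausible. The choice is
    correct when the original outscores the altered version. Confidence is
    taken here to be the absolute gap between the two scores.
    """
    original, altered = item
    s_original, s_altered = score(original), score(altered)
    return s_original > s_altered, abs(s_original - s_altered)


def accuracy(score: Callable[[str], float], items: List[Item]) -> float:
    """Fraction of items on which the original version is preferred."""
    correct = [pick_original(score, item)[0] for item in items]
    return sum(correct) / len(correct)


# Toy placeholders only: these strings and scores are made up and do not
# come from BrainBench.
toy_scores = {
    "A-original": 0.9,
    "A-altered": 0.4,
    "B-original": 0.3,
    "B-altered": 0.6,
}
toy_items: List[Item] = [("A-original", "A-altered"), ("B-original", "B-altered")]

# Item A is answered correctly, item B is not.
toy_accuracy = accuracy(toy_scores.__getitem__, toy_items)
```

Any test-taker that can assign a plausibility score to text can be plugged in as `score`, which keeps the comparison between different kinds of test-taker on the same footing. Pairing each decision with a confidence value also makes it possible to check whether higher confidence goes with higher accuracy.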

However, for forward-looking tasks, such as predicting results from a novel experiment, we view LLMs’ tendency to mix and integrate information from large and noisy datasets as a virtue. What is a hallucination in a backward-looking task is a generalization or prediction in a forward-looking task.