LLM Reasoning and Architecture · Language Understanding and Pragmatics

Can LLMs predict novel scientific results better than experts?

Do language models excel at forecasting experimental outcomes in neuroscience when given only method descriptions? If so, that challenges the assumption that LLMs are mere knowledge retrievers rather than pattern integrators.

Note · 2026-03-28 · sourced from Evaluations

BrainBench (Luo et al., 2024) creates a forward-looking benchmark where the task is predicting neuroscience experimental results from methods descriptions. Two versions of an abstract — one with real results, one with altered results — test whether the model can identify which results actually occurred.
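The selection rule in this kind of setup is typically perplexity-based: score both abstract variants under the model and pick the one the model finds less surprising. A minimal sketch of that decision rule, assuming token log-probabilities have already been obtained from some scoring model (the numbers below are illustrative placeholders, not real model outputs):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_real_abstract(logprobs_a, logprobs_b):
    """Return 0 or 1: the variant the model finds less surprising."""
    return 0 if perplexity(logprobs_a) <= perplexity(logprobs_b) else 1

# Illustrative numbers only: variant A gets higher token probabilities,
# so it has lower perplexity and is chosen as the "real" results.
variant_a = [-0.5, -0.7, -0.4, -0.6]   # less surprising -> lower perplexity
variant_b = [-1.2, -1.5, -0.9, -1.4]   # more surprising -> higher perplexity
print(pick_real_abstract(variant_a, variant_b))  # -> 0
```

The same rule works unchanged for any pair of candidate continuations; only the source of the log-probabilities (the underlying language model) differs.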

The finding: LLMs surpass human neuroscience experts at this task. BrainGPT, an LLM fine-tuned on the neuroscience literature, performs better still. Like human experts, when LLMs indicate high confidence, their predictions are more likely to be correct.
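One common way to read a confidence signal off a perplexity-based chooser is the margin between the two scores; calibration then means accuracy rises as the margin grows. A sketch under that assumption (the function names and trial data here are invented for illustration, not taken from the paper):

```python
from statistics import mean

def margin_confidence(ppl_chosen, ppl_rejected):
    """Confidence proxy: how far apart the two perplexity scores are."""
    return abs(ppl_rejected - ppl_chosen)

def accuracy_by_confidence(trials, threshold):
    """Split trials into low- and high-confidence buckets and compare accuracy.
    trials: list of (confidence, was_correct) pairs with was_correct in {0, 1}."""
    low = [ok for conf, ok in trials if conf < threshold]
    high = [ok for conf, ok in trials if conf >= threshold]
    return mean(low), mean(high)

# Invented trials for illustration. A calibrated predictor shows higher
# accuracy in the high-confidence bucket than in the low-confidence one.
trials = [(0.1, 0), (0.2, 1), (0.3, 0), (0.9, 1), (1.1, 1), (1.5, 1)]
low_acc, high_acc = accuracy_by_confidence(trials, threshold=0.5)
print(low_acc, high_acc)  # high-confidence accuracy exceeds low-confidence here
```

Binning by margin like this is how one would check, on held-out items, whether "high confidence" actually tracks correctness.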

The conceptual reframe is the real contribution. Most LLM benchmarks are backward-looking: they test whether models can retrieve or reason about known information. On backward-looking tasks, the model's tendency to "mix and integrate information from large and noisy datasets" is a failure mode — it produces hallucinations. But on forward-looking tasks — predicting novel outcomes — this same tendency becomes a virtue. Integration across noisy, interrelated findings IS what prediction requires.

This means hallucination and prediction may be mechanistically identical: both involve generating outputs that go beyond the literal input by drawing on patterns across training data. The difference is entirely in the task framing. When we ask "what did the paper find?" and the model generates a plausible-but-wrong answer, we call it hallucination. When we ask "what will this experiment find?" and the model generates a plausible-and-right answer, we call it prediction. The underlying computation may be the same.

This has implications for the fabrication/hallucination terminology debate (see "Should we call LLM errors hallucinations or fabrications?"): the BrainBench finding suggests fabrication has a productive mode, fabrication in the service of prediction. The model fabricates (generates non-input-grounded content) in both cases; one fabrication simply happens to be correct because it aligns with real-world patterns the model has internalized.

The practical implication: evaluating LLMs solely on backward-looking benchmarks systematically underestimates their value for forward-looking scientific tasks. The "practice of science and the pace of discovery would radically change" if LLMs were treated as prediction engines rather than knowledge-retrieval systems.

