LLM Reasoning and Architecture · Language Understanding and Pragmatics

Can LLMs predict novel scientific results better than experts?

Do language models excel at forecasting experimental outcomes in neuroscience when given only method descriptions? If so, that challenges the assumption that LLMs are mere knowledge retrievers rather than pattern integrators.

Note · 2026-03-28 · sourced from Evaluations

BrainBench (Luo et al., 2024) creates a forward-looking benchmark where the task is predicting neuroscience experimental results from methods descriptions. Two versions of an abstract — one with real results, one with altered results — test whether the model can identify which results actually occurred.
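The selection rule in this kind of setup is typically perplexity-based: score both abstract variants under the model and pick the one the model finds less surprising. A minimal sketch of that decision rule, assuming token log-probabilities have already been obtained from some scoring model (the numbers below are illustrative placeholders, not real model outputs):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def pick_real_abstract(logprobs_a, logprobs_b):
    """Return 0 or 1: the variant the model finds less surprising."""
    return 0 if perplexity(logprobs_a) <= perplexity(logprobs_b) else 1

# Illustrative numbers only: variant A gets higher token probabilities,
# so it has lower perplexity and is chosen as the "real" results.
variant_a = [-0.5, -0.7, -0.4, -0.6]   # less surprising -> lower perplexity
variant_b = [-1.2, -1.5, -0.9, -1.4]   # more surprising -> higher perplexity
print(pick_real_abstract(variant_a, variant_b))  # -> 0
```

The same rule works unchanged for any pair of candidate continuations; only the source of the log-probabilities (the underlying language model) differs.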

The finding: LLMs surpass human neuroscience experts at this task. BrainGPT, an LLM fine-tuned on the neuroscience literature, performs better still. Like human experts, when LLMs indicate high confidence, their predictions are more likely to be correct.
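One common way to read a confidence signal off a perplexity-based chooser is the margin between the two scores; calibration then means accuracy rises as the margin grows. A sketch under that assumption (the function names and trial data here are invented for illustration, not taken from the paper):

```python
from statistics import mean

def margin_confidence(ppl_chosen, ppl_rejected):
    """Confidence proxy: how far apart the two perplexity scores are."""
    return abs(ppl_rejected - ppl_chosen)

def accuracy_by_confidence(trials, threshold):
    """Split trials into low- and high-confidence buckets and compare accuracy.
    trials: list of (confidence, was_correct) pairs with was_correct in {0, 1}."""
    low = [ok for conf, ok in trials if conf < threshold]
    high = [ok for conf, ok in trials if conf >= threshold]
    return mean(low), mean(high)

# Invented trials for illustration. A calibrated predictor shows higher
# accuracy in the high-confidence bucket than in the low-confidence one.
trials = [(0.1, 0), (0.2, 1), (0.3, 0), (0.9, 1), (1.1, 1), (1.5, 1)]
low_acc, high_acc = accuracy_by_confidence(trials, threshold=0.5)
print(low_acc, high_acc)  # high-confidence accuracy exceeds low-confidence here
```

Binning by margin like this is how one would check, on held-out items, whether "high confidence" actually tracks correctness.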

The conceptual reframe is the real contribution. Most LLM benchmarks are backward-looking: they test whether models can retrieve or reason about known information. On backward-looking tasks, the model's tendency to "mix and integrate information from large and noisy datasets" is a failure mode — it produces hallucinations. But on forward-looking tasks — predicting novel outcomes — this same tendency becomes a virtue. Integration across noisy, interrelated findings IS what prediction requires.

This means hallucination and prediction may be mechanistically identical: both involve generating outputs that go beyond the literal input by drawing on patterns across training data. The difference is entirely in the task framing. When we ask "what did the paper find?" and the model generates a plausible-but-wrong answer, we call it hallucination. When we ask "what will this experiment find?" and the model generates a plausible-and-right answer, we call it prediction. The underlying computation may be the same.

This has implications for the fabrication/hallucination terminology debate (see "Should we call LLM errors hallucinations or fabrications?"): the BrainBench finding suggests fabrication has a productive mode, fabrication in the service of prediction. The model fabricates (generates non-input-grounded content) in both cases; one fabrication simply happens to be correct because it aligns with real-world patterns the model has internalized.

The practical implication: evaluating LLMs solely on backward-looking benchmarks systematically underestimates their value for forward-looking scientific tasks. The "practice of science and the pace of discovery would radically change" if LLMs were treated as prediction engines rather than knowledge-retrieval systems.

