Why do backward-looking benchmarks underestimate LLM scientific value?
This explores why evaluations built on already-known answers (backward-looking benchmarks) systematically miss the forward-looking value LLMs offer to science — and the corpus has a sharp answer that turns the usual 'hallucination' complaint on its head.
The cleanest statement of the problem comes from work showing that the behavior we call a failure in one frame is the capability in another. A model's tendency to integrate patterns and 'fill in' plausible content is scored as hallucination when the task is to recall an established fact, but that same tendency becomes genuine prediction when the task is to anticipate a result nobody has looked up yet [Can LLMs predict novel scientific results better than experts?]. On BrainBench, fine-tuned LLMs out-predicted human neuroscientists at guessing which experimental outcomes actually occurred. A benchmark that only rewards matching the known answer cannot see this: it measures memory of the past and calls deviation error, when deviation toward the not-yet-known is precisely the scientific value.
The underestimation runs deeper than one reframing, because benchmarks are also built to be clean, and science is not. Standard NLP benchmarks systematically drop the instances where human annotators disagree, which removes exactly the ambiguous, contested cases that scientific frontier work lives in [Do standard NLP benchmarks hide LLM ambiguity failures?]. That filtering cuts both ways: it hides failures (a 32% vs. 90% accuracy gap), but it also means the benchmark never tests the model on the genuinely open questions where a useful conjecture matters more than a correct lookup. The evaluation is curated toward settled territory, so it can only certify settled-territory competence.
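To make the filtering step concrete, here is a minimal sketch of an agreement filter, assuming a toy dataset in which each item carries a list of annotator labels. The field names, the `threshold` parameter, and the data are hypothetical illustrations, not taken from any specific benchmark.

```python
from collections import Counter

def agreement(labels):
    """Fraction of annotators who chose the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def split_by_agreement(items, threshold=1.0):
    """Split items into (kept, dropped) by annotator agreement.

    threshold=1.0 keeps only unanimous items, which is the strictest
    version of the filtering described above (an assumption for this sketch).
    """
    kept, dropped = [], []
    for item in items:
        target = kept if agreement(item["labels"]) >= threshold else dropped
        target.append(item)
    return kept, dropped

# Hypothetical toy data: three annotators per item.
items = [
    {"id": "a", "labels": ["yes", "yes", "yes"]},
    {"id": "b", "labels": ["yes", "no", "yes"]},
    {"id": "c", "labels": ["no", "no", "no"]},
    {"id": "d", "labels": ["yes", "no", "unclear"]},
]

kept, dropped = split_by_agreement(items)
kept_ids = [item["id"] for item in kept]
dropped_ids = [item["id"] for item in dropped]
print("kept:", kept_ids)
print("dropped:", dropped_ids)
```

The items that land in `dropped` are the contested ones, so any score computed on `kept` alone says nothing about how a model handles ambiguity.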
There's a second axis the backward-looking frame misses: time and horizon. Short, single-turn benchmarks simply don't predict how models behave over long, sustained scientific workflows. Models that rank similarly on one-shot tasks diverge dramatically over extended relays [Do short benchmarks predict how models perform over long workflows?], and over those long chains errors can compound silently rather than plateau [Do frontier LLMs silently corrupt documents in long workflows?]. So the benchmark is doubly miscalibrated: it can under-rate the generative value on the upside and over-rate reliability on the downside, because both effects only show up in the messy, multi-step, ambiguous conditions the benchmark was designed to exclude.
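A rough back-of-the-envelope sketch shows why compounding is hard to see in a short test. It takes the roughly 25% degradation over 50 round trips reported in the sources and assumes, purely for illustration, that every round trip preserves the same fraction of the document. That constant-rate model is an assumption of this sketch, not a finding of the cited work.

```python
# Assumption: each round trip preserves the same fraction of the document
# (a simple geometric model chosen for illustration only).
total_rounds = 50
intact_after_all = 0.75  # ~25% degradation over 50 round trips, per the sources

# Solve r ** total_rounds == intact_after_all for the per-round retention r.
per_round_retention = intact_after_all ** (1 / total_rounds)

def intact_fraction(rounds):
    """Fraction of the document still intact after `rounds` round trips."""
    return per_round_retention ** rounds

for rounds in (1, 5, 25, 50):
    print(f"after {rounds:2d} round trips: {intact_fraction(rounds):.1%} intact")
```

Under this assumption a single round trip loses well under 1% of the document, which a one-shot benchmark would likely score as near-perfect, while the same rate sustained over the full relay produces the reported loss.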
Worth knowing for the skeptic: 'forward-looking value' is not a blank check. The same corpus is blunt that LLMs cannot actually execute iterative procedures, since they recognize a problem as template-similar and emit plausible-but-wrong numbers rather than computing [Do large language models actually perform iterative optimization?], and that models can explain a concept correctly while failing to apply it, a disconnect with no human analogue [Can LLMs understand concepts they cannot apply?]. The honest reading is that the pattern-integration engine is genuinely valuable for prediction and conjecture, and genuinely untrustworthy for execution and proof. A good benchmark would have to separate those, and the cleanliness of backward-looking benchmarks collapses them.
The twist the reader may not have expected: the fix isn't 'better benchmarks' in the obvious sense, because some of the alternatives we'd reach for are themselves corruptible. LLM-as-judge evaluations can be swayed by fake citations and rich formatting independent of content quality [Can LLM judges be tricked without accessing their internals?]. Measuring scientific value forward means rewarding calibrated prediction under ambiguity and over long horizons, which is far harder to score than checking an answer key, and that difficulty, not any lack of capability, is a large part of why the backward-looking number comes in low.
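One way to see the gap between checking an answer key and rewarding calibration is to score the same predictions both ways. The sketch below uses the Brier score, a standard proper scoring rule, as a stand-in for "calibrated prediction"; the corpus does not prescribe this metric, and the probabilities and outcomes are made-up illustrative values.

```python
def accuracy(probs, outcomes):
    """Answer-key scoring: a prediction counts as a hit if p >= 0.5 matches the outcome."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)

def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)

# Hypothetical data: 1 means the predicted result occurred, 0 means it did not.
outcomes = [1, 1, 0, 1]
overconfident = [0.99, 0.99, 0.99, 0.99]  # always near-certain
hedged = [0.80, 0.80, 0.60, 0.80]         # less sure on the item it gets wrong

for name, probs in (("overconfident", overconfident), ("hedged", hedged)):
    print(name, "accuracy:", accuracy(probs, outcomes),
          "brier:", round(brier_score(probs, outcomes), 4))
```

Both forecasters get the same accuracy because they miss the same item, but the Brier score penalizes the overconfident one more heavily for its near-certain miss. An answer-key metric cannot register that difference.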
Sources (7 notes)
BrainBench results show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.
By filtering out examples where annotators disagree, benchmarks remove test cases that would reveal LLM failures at ambiguity recognition. Research using ambiguous examples shows a 32% vs. 90% accuracy gap invisible to standard evaluation.
DELEGATE-52 evaluated models across 50-round-trip relays and found short-interaction performance does not predict sustained delegation accuracy. Models ranking similarly on single-turn tasks diverged dramatically by relay 25, revealing degradation curves invisible to standard benchmarks.
Testing 19 models across 52 domains shows even advanced systems degrade documents by ~25% over extended relay tasks, with errors compounding silently without plateauing through 50 round-trips.
Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Research shows LLM evaluators systematically score higher when responses include fake references or rich formatting, independent of content quality. These biases are exploitable without model access, undermining AI benchmark credibility.