What capability boundary exists in LLM prediction of effect sizes?
This explores where LLMs hit a ceiling when asked not just whether an effect happens but how big it is — predicting magnitudes rather than directions.
The corpus suggests the dividing line falls between directional prediction (which result happened, which way a behavior tilts) and quantitative prediction (the magnitude of the effect). It also suggests that the same machinery that makes models strong at the former makes them unreliable at the latter.
The surprising starting point is how good LLMs are at the directional version. Fine-tuned models beat neuroscience experts at predicting which experimental outcomes actually occurred (see "Can LLMs predict novel scientific results better than experts?"), and models fine-tuned on psychology experiments out-predict purpose-built cognitive theories at anticipating human decisions (see "Can language models learn to model human decision making?"). Both work because LLMs are pattern-integration engines: the very tendency that produces hallucination on backward-looking recall becomes genuine foresight on forward-looking questions. But notice what these tasks reward: getting the *direction* right, not the precise size.
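The difference between the two scoring regimes can be made concrete with a minimal sketch. The effect sizes below are hypothetical illustration values, not data from any of the cited notes, and the function names `direction_accuracy` and `mean_absolute_error` are chosen here for the example. The sketch only shows that a set of predictions can score perfectly on direction while being far off on magnitude.

```python
def direction_accuracy(predicted, observed):
    """Fraction of effects whose predicted sign matches the observed sign."""
    hits = [(p > 0) == (o > 0) for p, o in zip(predicted, observed)]
    return sum(hits) / len(hits)


def mean_absolute_error(predicted, observed):
    """Average absolute gap between predicted and observed effect sizes."""
    gaps = [abs(p - o) for p, o in zip(predicted, observed)]
    return sum(gaps) / len(gaps)


# Hypothetical effect sizes (for example standardized mean differences).
# These numbers are made up for illustration only.
observed = [0.10, -0.40, 0.80, -0.05]
predicted = [0.60, -0.10, 0.20, -0.50]

acc = direction_accuracy(predicted, observed)
mae = mean_absolute_error(predicted, observed)

print(f"direction accuracy: {acc:.2f}")
print(f"mean absolute error: {mae:.4f}")
```

Every sign matches in this toy set, so the directional score is perfect, yet the average magnitude error is larger than most of the observed effects themselves. A benchmark that only scores direction would not register that gap.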
The capability boundary shows up where magnitude is what's being scored. On genuine constrained-optimization problems, LLMs plateau at around 55–60% constraint satisfaction regardless of scale, architecture, or reasoning effort, which points to a ceiling rather than a scaling gap (see "Do larger language models solve constrained optimization better?"). The deeper reason is structural: as autoregressive probability machines, models are systematically worse on tasks whose target outputs are low-probability, even when the task is logically trivial (see "Can we predict where language models will fail?"). A precise effect size is, almost by definition, a low-probability specific number rather than a high-probability plausible direction, which is exactly the regime the model is built to struggle with.
There's a second, sharper warning in the corpus: our measurement of effect sizes can itself be an artifact. The "emergent abilities" debate found that sharp, surprising capability jumps dissolve into smooth, predictable curves the moment you swap a discontinuous metric for a continuous one (see "Are LLM emergent abilities real or measurement artifacts?"). So part of the boundary isn't in the model at all: the magnitude you think you're predicting depends heavily on how you chose to measure it.
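A small sketch of how metric choice alone can manufacture a sharp jump. It assumes a toy model in which each token of an answer is correct independently with the same probability, so exact-match accuracy on an answer of length `answer_len` is the per-token accuracy raised to that power. The independence assumption, the per-token values, and the answer length are all illustrative choices made here, not figures from the cited notes.

```python
def exact_match(per_token_acc, answer_len):
    """Exact-match accuracy if every token must be right and tokens are
    independent (a simplifying assumption for this sketch)."""
    return per_token_acc ** answer_len


# Assumed per-token accuracies, improving steadily as models get larger.
per_token_levels = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
answer_len = 10  # hypothetical answer length in tokens

rows = [(p, exact_match(p, answer_len)) for p in per_token_levels]

print("per-token (continuous)  exact-match (discontinuous)")
for p, em in rows:
    print(f"{p:>10.2f}  {em:>24.4f}")
```

Under these assumptions, exact match sits near zero for the first few levels and then climbs steeply, even though the per-token column improves steadily throughout. The same outputs look like a sudden capability jump or a smooth trend depending only on which column is reported.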
Underneath both lies a knowing-versus-doing gap. Models can explain a concept correctly, fail to apply it, and even recognize their own failure, which indicates that explanation and execution run on functionally disconnected pathways (see "Can LLMs understand concepts they cannot apply?"). Mechanistic work shows higher-tier understanding sitting atop lower-tier heuristics rather than replacing them (see "Do language models understand in fundamentally different ways?"). A model can therefore narrate why an effect should be large while its quantitative estimate leans on a shallow heuristic. The takeaway: LLMs are often better forecasters than the experts when the question is *which way*, and quietly unreliable the moment the question becomes *how much*. Sometimes the boundary is drawn by your metric, not the model.
Sources (7 notes)
BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.
LLMs fine-tuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.
Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.
By framing LLMs as autoregressive probability machines, researchers predicted that tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed these predictions on tasks such as backwards alphabet recitation and letter counting.
Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.
Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.