What capability boundary exists in LLM prediction of effect sizes?

This explores where LLMs hit a ceiling when asked not just whether an effect happens but how big it is — predicting magnitudes rather than directions.

The corpus suggests the dividing line falls between directional prediction (which result happened, which way a behavior tilts) and quantitative prediction (the magnitude of the effect), and that the same machinery that makes models strong at the former makes them unreliable at the latter.

The surprising starting point is how good LLMs are at the directional version. Fine-tuned models beat neuroscience experts at predicting which experimental outcomes actually occurred (see "Can LLMs predict novel scientific results better than experts?"), and models fine-tuned on psychology experiments out-predict purpose-built cognitive theories at anticipating human decisions (see "Can language models learn to model human decision making?"). Both work because LLMs are pattern-integration engines: the very tendency that produces hallucination on backward-looking recall becomes genuine foresight on forward-looking questions. But notice what these tasks reward: getting the *direction* right, not the precise size.
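To make the difference between scoring direction and scoring size concrete, here is a minimal Python sketch. The effect sizes and the function names `direction_accuracy` and `mean_absolute_error` are hypothetical, made up for illustration rather than taken from the cited studies. The only point is that one set of predictions can be perfect on direction and still far off on magnitude.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def direction_accuracy(predicted, observed):
    """Fraction of cases where the predicted effect has the observed sign."""
    hits = sum(1 for p, o in zip(predicted, observed) if sign(p) == sign(o))
    return hits / len(observed)


def mean_absolute_error(predicted, observed):
    """Average absolute gap between predicted and observed effect sizes."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical standardized effect sizes (illustrative values only).
observed = [0.20, -0.50, 0.80, 0.10]
predicted = [0.60, -0.10, 0.30, 0.50]

print("direction accuracy:", direction_accuracy(predicted, observed))
print("mean absolute error:", round(mean_absolute_error(predicted, observed), 3))
```

With these made-up numbers every sign matches, yet the average miss is about as large as the typical observed effect. A benchmark that scores only direction would not register that gap.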

The capability boundary shows up where magnitude is what's being scored. LLMs plateau around 55–60% constraint satisfaction on genuine constrained-optimization problems regardless of scale, architecture, or reasoning effort, which points to a ceiling rather than a scaling gap (see "Do larger language models solve constrained optimization better?"). The deeper reason is structural: as autoregressive probability machines, models are systematically worse on tasks whose target outputs are low-probability, even when the task is logically trivial (see "Can we predict where language models will fail?"). A precise effect size is, almost by definition, a low-probability specific number rather than a high-probability plausible direction, which is exactly the regime the model is built to struggle with.
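The arithmetic behind "a specific number is low-probability, a direction is high-probability" can be sketched with a toy belief distribution. This is an analogy only: it assumes a forecaster whose belief about an effect size is a normal distribution with mean 0.4 and standard deviation 0.2 (both values invented here), and it does not model token probabilities or reproduce any cited experiment.

```python
from statistics import NormalDist

# Assumed, illustrative belief about a standardized effect size.
belief = NormalDist(mu=0.4, sigma=0.2)

# Probability mass on the coarse claim "the effect is positive".
p_direction = 1.0 - belief.cdf(0.0)

# Probability mass on the precise claim "the effect rounds to 0.40",
# i.e. it falls in the interval [0.395, 0.405).
p_exact = belief.cdf(0.405) - belief.cdf(0.395)

print(f"P(effect > 0)     = {p_direction:.3f}")
print(f"P(rounds to 0.40) = {p_exact:.3f}")
```

Even when the belief is centered exactly on the right value, the precise answer carries only a few percent of the probability mass, while the directional answer carries nearly all of it. Under this assumption, a system biased toward high-probability outputs is being asked for the rare one.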

There's a second, sharper warning in the corpus: our measurement of effect sizes can itself be an artifact. The "emergent abilities" debate found that sharp, surprising capability jumps dissolve into smooth, predictable curves the moment you swap a discontinuous metric for a continuous one (see "Are LLM emergent abilities real or measurement artifacts?"). So part of the boundary isn't in the model at all: the magnitude you think you're predicting depends heavily on how you chose to measure it.
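One way to see how a metric choice can manufacture a jump is to compare per-token accuracy (continuous) with exact-match accuracy over a whole answer (discontinuous). The sketch below assumes token errors are independent, so exact match on an answer of `length` tokens is per-token accuracy raised to that power. The function name `exact_match_rate` and the accuracy values are illustrative assumptions, not data from the cited work.

```python
def exact_match_rate(per_token_accuracy, length):
    """Chance that every one of `length` tokens is right, assuming independence."""
    return per_token_accuracy ** length


# Hypothetical per-token accuracies, improving smoothly across model sizes.
per_token_accuracies = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
answer_length = 10  # assumed number of tokens in the target answer

for p in per_token_accuracies:
    em = exact_match_rate(p, answer_length)
    print(f"per-token accuracy {p:.2f} -> exact match {em:.3f}")
```

The continuous metric climbs steadily across the list, while exact match stays near zero for most of it and then rises steeply at the end. The underlying behavior is the same in both columns; only the scoring rule differs.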

Underneath both lives a knowing-versus-doing gap. Models can explain a concept correctly, fail to apply it, and even recognize their own failure, which suggests explanation and execution run on functionally disconnected pathways (see "Can LLMs understand concepts they cannot apply?"), and mechanistic work shows higher-tier understanding sitting atop lower-tier heuristics rather than replacing them (see "Do language models understand in fundamentally different ways?"). A model can narrate why an effect should be large while its quantitative estimate leans on a shallow heuristic. The thing you didn't know you wanted to know: LLMs are often better forecasters than the experts when the question is *which way*, and quietly unreliable the moment the question becomes *how much*. Sometimes the boundary is drawn by your metric, not the model.

Sources (7 notes)

Can LLMs predict novel scientific results better than experts?

BrainBench benchmarks show fine-tuned LLMs outperform neuroscience experts at predicting which experimental results actually occurred. The same pattern-integration tendency that causes hallucination in retrieval tasks enables genuine prediction in forward-looking scenarios.

Can language models learn to model human decision making?

LLMs finetuned on psychology experiment data predict human behavior more accurately than theory-driven models in decision tasks, capture individual differences in their embeddings, and transfer learning across tasks without task-specific design.

Do larger language models solve constrained optimization better?

Across constrained-optimization tasks, LLMs converge to ~55–60% constraint satisfaction independent of architecture, parameter count, or training regime. Reasoning models do not systematically outperform standard models, suggesting a fundamental ceiling rather than a scaling gap.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Are LLM emergent abilities real or measurement artifacts?

Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.

Do language models understand in fundamentally different ways?

Mechanistic interpretability reveals conceptual understanding (features as directions), state-of-world understanding (factual connections), and principled understanding (compact circuits). Crucially, higher tiers coexist with lower-tier heuristics rather than replacing them, creating a patchwork of capabilities.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether LLMs can predict effect sizes (magnitudes, not just directions). A curated library spanning 2023–2026 identified a sharp capability boundary: models excel at directional prediction (which outcome? which way?) but plateau around 55–60% constraint satisfaction on constrained-optimization tasks regardless of scale or reasoning effort, and appear unreliable on quantitative magnitude tasks.

What a curated library found — and when (dated claims, not current truth):
• LLMs fine-tuned on psychology data out-predict cognitive theories on directional outcomes; models beat neuroscience experts at predicting which experimental results occur (2024–03).
• Autoregressive models systematically fail on low-probability outputs (precise numbers); magnitude prediction is structurally harder than direction prediction because specific effect sizes are inherently low-probability sequences (2023–2024).
• LLMs plateau at 55–60% on genuine constrained-optimization problems; this ceiling holds across architectures and reasoning methods like Tree of Thoughts (2024–04).
• "Emergent abilities" vanish when you replace discontinuous metrics with continuous ones — the boundary may be measurement artifact, not model limitation (2023–04).
• Knowing-versus-doing gap: models explain concepts correctly but fail execution; higher-tier understanding sits atop disconnected heuristics (2025–07, 2026–02).

Anchor papers (verify; mind their dates):
• arXiv:2403.03230 (2024–03): Neuroscience prediction benchmark.
• arXiv:2603.23004 (2026–03): Constraint satisfaction ceiling.
• arXiv:2507.08017 (2025–07): Mechanistic understanding hierarchies.
• arXiv:2304.15004 (2023–04): Emergent abilities as metric artifacts.

Your task:
(1) RE-TEST: Does inference-time compute (chain-of-thought variants, Monte-Carlo sampling, uncertainty quantification, or new decoding schemes since mid-2024) relax the 55–60% constraint-satisfaction plateau on magnitude tasks? Has post-training (RLHF, DPO variants) or architectural changes (sparse models, retrieval-augmented generation) shifted the directional-vs-quantitative boundary? Separate the durable question (can autoregressive models learn precise continuous outputs?) from the perishable limitation (current inference methods can't extract it).
(2) Surface the strongest contradicting work from the last 6 months: any papers showing LLMs reliably forecasting continuous quantities (effect sizes, regression coefficients, confidence intervals) at >70% accuracy, or any mechanistic evidence that the boundary is architectural rather than training-regime dependent.
(3) Propose 2 new research questions: (a) Does scaling the training signal from continuous-valued tasks (rather than next-token prediction) change the directional-vs-quantitative split? (b) Can ensemble or hierarchical decoding (predict direction first, then condition magnitude on direction) exceed the monolithic 55–60% ceiling?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
