Why are post-cutoff test sets essential for evaluating genuine forecasting ability?

This explores why measuring real forecasting means testing models only on outcomes that resolved *after* their training cutoff — and what the corpus reveals once you remove the contamination loophole.

Genuine forecasting can only be measured on questions whose answers didn't exist when the model was trained. The reason is simple but easy to miss: if an event resolved before a model's training cutoff, the model may have *read the answer* somewhere in its training data. When it then 'predicts' that outcome, you can't tell whether it reasoned forward from evidence or just recalled a fact. Post-cutoff test sets close that loophole: they force the model to predict a future it could not have memorized, which is the only condition under which the word 'forecasting' actually means anything.
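As a minimal sketch of that filtering step, the snippet below keeps only questions that resolved after a training cutoff, with an optional stricter mode that also requires the question to have been published after it. The `Question` class, the `post_cutoff` helper, the cutoff date, and the three sample questions are all hypothetical and chosen for illustration; none of them come from the cited benchmarks.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Question:
    text: str
    published: date  # when the question was posed
    resolved: date   # when the outcome became known


def post_cutoff(questions, training_cutoff, require_published_after=False):
    """Keep questions whose outcome resolved strictly after the cutoff.

    With require_published_after=True, the question itself must also have
    been published after the cutoff (the stricter condition).
    """
    kept = []
    for q in questions:
        if q.resolved <= training_cutoff:
            continue  # the answer may already be in the training data
        if require_published_after and q.published <= training_cutoff:
            continue  # discussion of the question may have been seen
        kept.append(q)
    return kept


# Hypothetical cutoff and questions, for illustration only.
cutoff = date(2024, 1, 1)
questions = [
    Question("A", date(2023, 6, 1), date(2023, 11, 5)),   # resolved before cutoff
    Question("B", date(2023, 12, 1), date(2024, 6, 30)),  # posed before, resolved after
    Question("C", date(2024, 2, 1), date(2024, 9, 15)),   # posed and resolved after
]

lenient = post_cutoff(questions, cutoff)
strict = post_cutoff(questions, cutoff, require_published_after=True)
```

Which mode is appropriate depends on how much pre-cutoff discussion of a still-open question could leak hints about its outcome; the stricter mode trades test-set size for a cleaner guarantee.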

The corpus shows this is more than hygiene; it changes the conclusions. A retrieval-augmented system reached near-parity with competitive human forecasters specifically on questions *published after the model's cutoff*, and sometimes beat the human crowd (Can retrieval-augmented language models forecast like human experts?). That result is only believable because of the cutoff constraint: the same accuracy on pre-cutoff questions would be unfalsifiable. The same logic scales up in FutureX, a *live* benchmark that pulls questions from 195 sources and waits for real outcomes to resolve, so contamination is structurally impossible (Can live benchmarks prevent contamination in prediction tasks?). Once you remove the leak, a surprising finding surfaces: base models handle easy predictions fine, but hard open-ended forecasting demands search-and-reasoning *agents*. Forecasting turns out to be an agentic capability, not something baked into raw model weights.

That reframes what these test sets are even measuring. Other work argues the intrinsic forecasting ability of LLMs is stronger than recognized, but only when the workflow separates numerical extrapolation from event-driven contextual reasoning; monolithic prompting hides the capability (Can LLMs actually forecast time series better than we think?; Can decomposing forecasting into stages unlock numerical and contextual reasoning?). So a clean post-cutoff benchmark isn't just catching cheaters; it's the instrument that lets you see whether the *architecture around* the model is doing the forecasting work. Without contamination-free questions you'd credit the base model for what the workflow accomplished.

There's a final twist worth knowing: even a model that forecasts accurately on honest post-cutoff data isn't automatically useful. A model can be right on average yet systematically wrong in exactly the states where a decision hinges on it (Why do accurate predictions lead to poor decisions?). And where the human bar is low, as in sparse-signal domains like founder-success prediction, even modest forecasting clears it, which can flatter a model that isn't actually skilled (Can language models beat human venture capital experts?). Post-cutoff evaluation is what keeps all of these claims honest: it's the difference between a model that remembers the future and one that can actually reason toward it.
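The "right on average, wrong where it matters" point above can be made concrete with a toy calculation. The records below are invented for illustration and are not drawn from the cited research, which formalizes the conditions far more carefully; the `accuracy` helper and the "routine" / "critical" state labels are likewise assumptions of this sketch.

```python
# Each record: (state, predicted_outcome, actual_outcome).
# All values are made up for illustration.
records = [
    ("routine", 0, 0), ("routine", 0, 0), ("routine", 1, 1), ("routine", 0, 0),
    ("routine", 1, 1), ("routine", 0, 0), ("routine", 0, 0), ("routine", 1, 1),
    ("critical", 0, 1),  # missed the event exactly where a decision hinges on it
    ("critical", 0, 1),
]


def accuracy(rows):
    """Fraction of rows where the prediction matched the outcome."""
    return sum(pred == actual for _, pred, actual in rows) / len(rows)


overall = accuracy(records)
critical = accuracy([r for r in records if r[0] == "critical"])
routine = accuracy([r for r in records if r[0] == "routine"])

print(f"overall={overall:.2f} routine={routine:.2f} critical={critical:.2f}")
```

In this toy data the headline accuracy looks respectable while every decision-critical case is missed, which is why reporting accuracy per state, and not only in aggregate, is worth the extra line of analysis.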

Sources 6 notes

Can retrieval-augmented language models forecast like human experts?

A retrieval-augmented LM system achieved near-parity with competitive human forecasters on real forecasting questions published after model training cutoffs, sometimes surpassing human crowds. Newer model generations naturally improved forecasting without domain-specific tuning.

Can live benchmarks prevent contamination in prediction tasks?

FutureX, a live benchmark collecting questions from 195 sources and verifying real outcomes, shows that base models handle easy predictions but hard open-ended forecasting demands search-and-reasoning agents. This indicates that forecasting is an agentic capability, not a base-model strength.

Can LLMs actually forecast time series better than we think?

LLMs have stronger intrinsic forecasting ability than recognized, but only when workflows separate numerical reasoning from contextual reasoning. Monolithic prompting obscures this capability; structured decomposition surfaces it.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Why do accurate predictions lead to poor decisions?

Research formalizes necessary and sufficient conditions for predictive models to support optimal decisions. A model can predict accurately on average yet systematically mispredict in decision-critical states.

Can language models beat human venture capital experts?

VCBench shows several LLMs exceed human baselines in founder-success prediction, with DeepSeek-V3 achieving 6× market-index precision. In sparse-signal forecasting where experts only modestly beat chance, even raw LLM capability suffices to clear the human bar.
