KellyBench: Can Language Models Beat the Market?
Language models are saturating benchmarks for procedural tasks with narrow objectives, but they are increasingly deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023–24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money over the course of the season, with many experiencing ruin. To judge strategy sophistication, we grade each model against a rubric constructed by human experts and find its approach unsophisticated compared to human baselines. We believe our results highlight a need for a cultural shift in evaluations away from fixed task sets towards complex worlds that test long-horizon, sequential decision-making.
Many popular evaluations of language models do not measure intelligence in the sense of learning from experience. Instead, they typically consider stationary environments, well-specified tasks, and sparse end-of-episode feedback. For example, one task in the popular Terminal-Bench 2 evaluation asks an agent to "implement an adaptive-rejection sampler as described in Gilks et al. (1992)". While this tests procedural competence, it does not test the ability to formulate and revise models in light of experience (Hughes et al., 2024; Merrill et al., 2026).
The real world is also non-stationary: the underlying "rules" change over time. Most existing benchmarks, by contrast, have fixed behaviours. The knight and the bishop behave identically in every game of chess, but a financial security changes behaviour under a new market regime, and an athlete's ability changes after a long-term injury (Ang and Timmermann, 2012; Johns et al., 2021). In European football, for example, home advantage was found to decline at the end of the 2019–20 season under pandemic crowd restrictions, biasing older predictive models (Hill and Van Yperen, 2021). This suggests a potential capability gap between models acting in static environments at training time and dynamic ones at test time.
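To make this concrete, the sketch below uses synthetic match outcomes (all numbers are illustrative, not estimates from the cited studies) to show how an estimator fit once on an old regime stays biased after a shift, while a rolling estimate adapts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic match series: home-win probability drops from 0.46 to 0.40
# mid-series, mimicking a pandemic-style decline in home advantage.
p_home = np.concatenate([np.full(1000, 0.46), np.full(1000, 0.40)])
home_wins = rng.random(2000) < p_home

static_estimate = home_wins[:1000].mean()   # fit once on the old regime

# A rolling mean over the last 200 matches tracks the regime shift.
window = 200
rolling = np.convolve(home_wins, np.ones(window) / window, mode="valid")

print(f"static estimate:        {static_estimate:.3f}")   # ~0.46
print(f"rolling estimate (end): {rolling[-1]:.3f}")       # ~0.40
```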
To study these issues, we introduce KellyBench, an open-ended, non-stationary environment for measuring the ability of language models to make money in sports betting markets. KellyBench uses real market odds from the 2023–24 English Premier League season and asks agents to place bets from a bankroll on each matchday. Agents are given extensive historical data, including advanced statistics, lineups, and past market odds. They must develop machine learning models, identify edge relative to the market, and manage risk so as to maximise long-run bankroll growth.
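The benchmark's name nods to the Kelly criterion, the classic rule for sizing bets to maximise long-run log-growth of a bankroll. As a minimal sketch of the staking problem agents face (this is the textbook formula, not a description of any evaluated agent's strategy): given an estimated win probability p and decimal odds o, the growth-optimal stake is the fraction f* = (po − 1)/(o − 1) of the bankroll, often scaled down ("fractional Kelly") to hedge against errors in p.

```python
def kelly_fraction(p: float, odds: float, scale: float = 0.5) -> float:
    """Fraction of bankroll to stake at decimal odds `odds` given an
    estimated win probability `p`.

    Full Kelly maximises expected log-growth; `scale` < 1 (fractional
    Kelly) trades growth for robustness to errors in `p`.
    Returns 0.0 when the bet has no positive edge.
    """
    edge = p * odds - 1.0          # expected profit per unit staked
    if edge <= 0.0:
        return 0.0                 # no edge: do not bet
    return scale * edge / (odds - 1.0)

# Example: a model estimates a 55% win probability at decimal odds 2.10.
stake = kelly_fraction(p=0.55, odds=2.10)
print(f"stake {stake:.1%} of bankroll")  # half-Kelly, roughly 7.0%
```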
Every model we evaluate on KellyBench loses money over the course of the season, and several experience ruin. The best-performing model, Claude Opus 4.6, achieves an average return on investment of −11% over the season. Only 3/24 model seeds finish the simulation with a positive return, and two of those models go bankrupt in other seeds. Qualitative analysis of their trajectories shows that frontier models adapt poorly and rarely account for estimation error or non-stationarity. In other words, the current generation of frontier models cannot consistently beat the market in our evaluation setup.
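The prevalence of ruin is what overstaking predicts: a bettor who stakes far above the Kelly-optimal fraction goes bankrupt with high probability even when their probability estimates carry a genuine edge. The simulation below is illustrative only; the synthetic odds and probabilities, the 380 bets (one per EPL match), and the 1% ruin threshold are our assumptions, not parameters of the benchmark.

```python
import numpy as np

rng = np.random.default_rng(1)

def ruin_rate(stake_fn, n_bets=380, n_runs=1000, p=0.55, odds=2.10):
    """Fraction of runs in which a bettor with a real edge (p * odds > 1)
    loses essentially the whole bankroll."""
    ruined = 0
    for _ in range(n_runs):
        bank = 1.0
        for _ in range(n_bets):
            stake = stake_fn(bank)
            if rng.random() < p:
                bank += stake * (odds - 1.0)   # a win pays odds - 1
            else:
                bank -= stake
            if bank < 0.01:                    # treat <1% of start as ruin
                ruined += 1
                break
    return ruined / n_runs

half_kelly = ruin_rate(lambda b: 0.07 * b)     # ~half-Kelly at these odds
overbet    = ruin_rate(lambda b: 0.40 * b)     # far above Kelly-optimal
print(f"ruin rate, half-Kelly: {half_kelly:.1%}")   # near zero
print(f"ruin rate, 40% stakes: {overbet:.1%}")      # near certain
```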
We also introduce sophistication, a novel process-based measure of competence. Because backtests are subject to variance, we consult experts with experience at quantitative betting funds to construct a 44-point rubric for judging strategy sophistication. Under this rubric, model strategies are consistently judged unsophisticated relative to human baselines: the best-performing model, Claude Opus 4.6, scores 32.6%. Even accounting for the limitations of our benchmark setup and for possibly high market efficiency, we believe there is considerable room for models to improve.