How do game-based benchmarks reveal reasoning fragmentation across domains?

This explores what games used as test environments expose about LLM reasoning — specifically how strategic and rule-based games reveal that reasoning doesn't transfer cleanly across domains but splinters into style-specific or instance-specific competencies.

Put language models inside games (strategy games, rule-inference puzzles) and their reasoning comes apart at the seams rather than generalizing. The corpus suggests games are unusually good at exposing fragmentation because each game type quietly demands a different reasoning style, and models turn out to have favorites.

The sharpest evidence comes from behavioral game theory: across 22 models, distinct strategic 'personalities' emerge, tied to game structure rather than raw horsepower (see "Do large language models use one reasoning style or many?"). One model leans on minimax (assume the worst-case opponent), another on trust, another on anticipating what the opponent will do next. Performance tracks which game rewards a model's native style, so a model can look brilliant in one game and lost in the next. That's fragmentation made visible: there is no single 'reasoning' faculty; there are several, unevenly distributed.
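To make the contrast between styles concrete, here is a minimal sketch of two of the decision rules named above, applied to the same game. The payoff matrix and the function names are hypothetical, invented for illustration; they are not taken from the cited analysis.

```python
from typing import List

# Hypothetical payoff matrix for the row player: PAYOFFS[own_action][opponent_action].
PAYOFFS: List[List[int]] = [
    [3, 0],  # action 0: pays well if the opponent plays 0, nothing otherwise
    [2, 2],  # action 1: safe, same payoff whatever the opponent does
]


def minimax_choice(payoffs: List[List[int]]) -> int:
    """Assume the worst-case opponent: pick the action with the best worst-case payoff."""
    return max(range(len(payoffs)), key=lambda a: min(payoffs[a]))


def anticipation_choice(payoffs: List[List[int]], predicted_opponent_action: int) -> int:
    """Anticipate the opponent: pick the best response to a predicted opponent action."""
    return max(range(len(payoffs)), key=lambda a: payoffs[a][predicted_opponent_action])


if __name__ == "__main__":
    print("minimax picks action", minimax_choice(PAYOFFS))
    print("anticipating opponent action 0 picks action", anticipation_choice(PAYOFFS, 0))
```

On this matrix the two rules disagree whenever the opponent is predicted to play action 0, which is the sense in which a game's structure can reward one style and punish another.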

Games also expose a more uncomfortable failure. On exception-based rule inference (games where the trick is recognizing a rule's negative cases), reasoning models scored *below* 25% while plain non-reasoning models hit 55–65% (see "Why do reasoning models fail at exception-based rule inference?"). Chain-of-thought actively hurt here, introducing math overuse, overgeneralization, and hallucinated constraints. This connects to a broader finding that CoT is distribution-bounded: it produces fluent, confident reasoning that becomes logically hollow the moment the task shifts shape (see "Does chain-of-thought reasoning actually generalize beyond training data?"). Games are good probes precisely because they let you engineer a small distributional shift and watch the reasoning stay fluent while becoming wrong.

What's underneath the fragmentation? Two notes reframe it as not really a reasoning gap at all. One argues models fit *instance-level patterns* rather than general algorithms: they break at the boundary of unfamiliarity, not complexity, so a fresh game instance trips them even when the underlying logic is identical (see "Do language models fail at reasoning due to complexity or novelty?"). Another argues that apparent collapses are *execution* failures: the model knows the algorithm but can't carry it out across many steps in text alone, and tool access dissolves the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Read together, game benchmarks aren't measuring one thing called reasoning. They're measuring style-fit, pattern-familiarity, and procedural bandwidth simultaneously, and labeling the aggregate.
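A small sketch of why the aggregate label can mislead: two models with the same average score can have very different per-game profiles. The model names, game names, and accuracies below are hypothetical numbers chosen for illustration, not results from any of the cited notes.

```python
from statistics import mean, pstdev
from typing import Dict

# Hypothetical per-game accuracies, invented for illustration only.
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"zero_sum": 0.9, "trust": 0.3, "rule_inference": 0.6},
    "model_b": {"zero_sum": 0.6, "trust": 0.6, "rule_inference": 0.6},
}


def summarize(per_game: Dict[str, float]) -> Dict[str, float]:
    """Return the aggregate score and the spread (population std dev) across games."""
    values = list(per_game.values())
    return {"aggregate": mean(values), "spread": pstdev(values)}


if __name__ == "__main__":
    for name, per_game in SCORES.items():
        summary = summarize(per_game)
        print(f"{name}: aggregate={summary['aggregate']:.2f} spread={summary['spread']:.2f}")
```

Reporting the spread alongside the aggregate is one simple way to keep a single "reasoning" number from hiding a model that is strong in one game type and weak in another.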

The quietly useful takeaway: the thing that makes a model look like a strong reasoner is partly a training protocol that makes extra tokens productive (see "Can non-reasoning models catch up with more compute?"), but that same protocol can backfire on tasks built around exceptions and negative evidence. So when a game benchmark says a model 'can't reason,' the more honest reading is usually: this game asked for a reasoning style this model wasn't trained to deploy.

Sources (6 notes)

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.

Why do reasoning models fail at exception-based rule inference?

Across four game-based tasks, reasoning models scored below 25% on exception rules versus 55–65% for non-reasoning models. Chain-of-thought introduces math overuse, overgeneralization, and hallucinated constraints that amplify errors in negative evidence recognition.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length succeeds when the model was trained on similar instances.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.
