INQUIRING LINE

What causes language models' strategic rationality to decline with increased game complexity?

This explores why LLMs play games less rationally as the games get more complex — and the corpus suggests the real culprit may not be complexity at all, but unfamiliarity, missing scaffolding, and reasoning shortcuts that only look like rationality.


The obvious answer, that bigger game trees overwhelm the model's compute, turns out to be only half the story. The more interesting half is that the decline may not be about complexity per se at all.

The surface phenomenon is well documented: models frequently fail to compute Nash equilibria, and their play drifts further from optimal as games grow (see "Do language models make rational strategic decisions in games?"). But a sharper diagnosis reframes the cause entirely: reasoning models break not at complexity *thresholds* but at *novelty* boundaries. They fit instance-based patterns rather than learning a generalizable algorithm, so a long, hard reasoning chain still succeeds if the model has seen similar instances, while a short, simple one fails if it's unfamiliar (see "Do language models fail at reasoning due to complexity or novelty?"). Under this view, play declines in 'complex' games because complexity correlates with unfamiliarity, not because the model runs out of strategic horsepower.

A second cause is that what looks like strategic reasoning is often a heuristic wearing reasoning's clothes. Twelve of fourteen models tested actually perform *worse* when constraints are removed, by up to 38.5 percentage points: they were defaulting to the harder, more conservative option rather than evaluating the situation, so stripping away the constraint that propped up that default exposes the absence of real reasoning (see "Are models actually reasoning about constraints or just defaulting conservatively?"). Complexity tends to add degrees of freedom that defeat such shortcuts, which is why rationality erodes exactly where the crutch disappears. Relatedly, different models lean on different fixed reasoning styles (minimax, trust-based, belief-anticipation), and performance tracks how well a style happens to fit the game's structure rather than raw reasoning depth (see "Do large language models use one reasoning style or many?"). A complex game that mismatches a model's native style will look like a complexity failure but is really a style failure.

The third cause is a memory-and-state problem. Strategic play in richer games demands tracking an evolving history and an opponent's shifting strategy, and models are bad at this without help. Even in simple multi-armed bandit environments, only GPT-4 *with* explicit exploratory hints, chain-of-thought, and external history summarization explores competently; without summarization, models cannot reliably aggregate unstructured interaction history into good decisions (see "Why do LLMs struggle with exploration in simple decision tasks?"). The same brittleness shows up in dynamic games, where models cling to surface lexical cues and fail to anchor reasoning in the temporal flow of play or adapt to an opponent who changes (see "Can models recognize how individuals reason differently?"). Complexity multiplies the state to track, and that's where unaided in-context reasoning collapses.
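External history summarization can be pictured as follows. This is a minimal Python sketch, not the method used in the cited study: it assumes the raw interaction history is a list of (arm, reward) pairs, and the function names `summarize_history` and `render_summary` are hypothetical. The idea is that the model is shown compact per-arm statistics instead of the unstructured log.

```python
from collections import defaultdict


def summarize_history(history):
    """Collapse raw (arm, reward) pairs into per-arm pull counts and mean rewards."""
    pulls = defaultdict(int)
    total = defaultdict(float)
    for arm, reward in history:
        pulls[arm] += 1
        total[arm] += reward
    return {
        arm: {"pulls": pulls[arm], "mean_reward": total[arm] / pulls[arm]}
        for arm in pulls
    }


def render_summary(summary, n_arms):
    """Format the summary as short text for a prompt; untried arms are flagged explicitly."""
    lines = []
    for arm in range(n_arms):
        if arm in summary:
            stats = summary[arm]
            lines.append(
                f"arm {arm}: pulled {stats['pulls']} times, "
                f"mean reward {stats['mean_reward']:.2f}"
            )
        else:
            lines.append(f"arm {arm}: never pulled")
    return "\n".join(lines)


# Illustrative history (made-up values): arm 0 tried three times, arm 1 once, arm 2 never.
history = [(0, 1.0), (1, 0.0), (0, 0.0), (0, 1.0)]
summary = summarize_history(history)
print(render_summary(summary, n_arms=3))
```

Flagging untried arms in the rendered text is a design choice made here for illustration: it puts the information needed for exploration directly in front of the model rather than leaving it to be inferred from the log.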

The most useful takeaway: the decline is largely *fixable from the outside*. Structured game-theoretic workflows that scaffold the reasoning steps restore near-optimal play and reduce exploitability even on hard negotiations (see "Do language models make rational strategic decisions in games?"), and external summarization plus explicit exploratory hints rescue exploration (see "Why do LLMs struggle with exploration in simple decision tasks?"). That's the tell that the bottleneck isn't a missing capacity for strategy but a missing structure for deploying it: the model often *has* the rationality and fails to organize it on its own as the game gets bigger.
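To make "exploitability" concrete: in a two-player zero-sum matrix game, one common measure is the total payoff both players could gain by switching to a best response against the other's current strategy. It is zero exactly at a Nash equilibrium. The Python sketch below is illustrative only and is not taken from the cited work; it assumes the row player maximizes the payoff matrix `A` and the column player minimizes it, and the function names and the matching-pennies example are hypothetical choices.

```python
def expected_payoff(A, x, y):
    """Expected payoff to the row player when row plays mix x and column plays mix y."""
    return sum(
        x[i] * A[i][j] * y[j] for i in range(len(x)) for j in range(len(y))
    )


def exploitability(A, x, y):
    """Sum of both players' best-response gains in a zero-sum game (0 at equilibrium)."""
    value = expected_payoff(A, x, y)
    # Best payoff the row player could reach with a pure strategy against y.
    row_best = max(
        sum(A[i][j] * y[j] for j in range(len(y))) for i in range(len(A))
    )
    # Lowest payoff the column player could force with a pure strategy against x.
    col_best = min(
        sum(x[i] * A[i][j] for i in range(len(x))) for j in range(len(A[0]))
    )
    return (row_best - value) + (value - col_best)


# Matching pennies: the unique equilibrium is for both players to mix 50/50.
pennies = [[1, -1], [-1, 1]]
uniform = [0.5, 0.5]
always_heads = [1.0, 0.0]

at_equilibrium = exploitability(pennies, uniform, uniform)
predictable_row = exploitability(pennies, always_heads, uniform)
print(at_equilibrium, predictable_row)
```

In this example the uniform mix cannot be exploited, while a row player who always picks the first action gives the column player a full unit of payoff to take.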


Sources (6 notes)

Do language models make rational strategic decisions in games?

LLMs frequently fail to compute Nash equilibria, with worse performance as game complexity increases. Structured game-theoretic workflows guide reasoning toward optimal strategies, reducing exploitability and enabling near-optimal negotiation outcomes.

Do language models fail at reasoning due to complexity or novelty?

LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Do large language models use one reasoning style or many?

Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.

Why do LLMs struggle with exploration in simple decision tasks?

Across multi-armed bandit environments, only GPT-4 with explicit exploratory hints, external history summarization, and chain-of-thought reasoning achieves satisfactory exploration. Without external summarization, models cannot reliably track and aggregate unstructured interaction history to guide exploratory decisions.

Can models recognize how individuals reason differently?

LLMs struggle to anchor reasoning in temporal gameplay and adapt to evolving strategies. GPT-4o relies on surface lexical cues while DeepSeek-R1 shows early promise, but dynamic style adaptation remains largely insufficient across all models tested.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a game-theoretic AI researcher re-examining why LLMs' strategic rationality declines with game complexity. The question remains open: is the bottleneck computational capacity, reasoning architecture, or scaffolding?

What a curated library found — and when (dated claims, not current truth):
Findings span Feb 2024–Mar 2026. The library identified three linked causes and one remedy:
• Instance-level unfamiliarity, not task complexity per se, drives reasoning breakdown; models fit surface patterns rather than learn generalizable algorithms (~2025).
• Conservative bias and fixed reasoning styles (minimax, trust-based, belief-anticipation) mask the absence of true strategic evaluation; models perform *worse* when constraints are removed, revealing heuristic shortcuts (~2026).
• State-tracking and in-context history aggregation fail without external summarization and explicit prompting; only GPT-4 with chain-of-thought + structured history scaffolding explores competently in bandit and dynamic games (~2024).
• Structured workflows and external summarization reduce exploitability and restore near-optimal play, suggesting the bottleneck is *organization* of existing capacity, not missing capability (~2024–2025).

Anchor papers (verify; mind their dates):
• arXiv:2403.15371 (Mar 2024) — exploration failure without external scaffolding.
• arXiv:2411.05990 (Nov 2024) — game-theoretic workflows restore rationality.
• arXiv:2502.20432 (Feb 2025) — behavioral game theory study of strategic reasoning profiles.
• arXiv:2603.29025 (Mar 2026) — surface heuristics override implicit reasoning constraints.

Your task:
(1) RE-TEST EACH CONSTRAINT. For each finding above, assess whether test-time compute scaling (latent reasoning depth, chain-of-thought variants), in-context learning advances, or new evals have since relaxed or overturned the claim. Distinguish the durable question (why do models struggle to *generalize* strategy?) from perishable limitations (e.g., "no external summarization" — now often built-in). Cite what resolved it; flag what still holds.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months — especially papers showing models *do* learn generalizable game-theoretic reasoning, or that complexity itself is solvable by scale/training alone.
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Given test-time compute advances, is the real bottleneck now the *diversity* of game structures a model encounters, rather than depth?" or "Can models learn to *meta-reason* about when to invoke structured workflows?".

Cite arXiv IDs; flag anything you cannot ground in a real paper.
