Do large language models use one reasoning style or many?
Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
The "LLM Strategic Reasoning" paper moves beyond standard NE-based evaluation to apply behavioral game theory across 22 LLMs in diverse strategic scenarios. The core finding: strategic reasoning is not a single capability but a set of distinct reasoning styles, and different models excel through different styles.
Three dominant profiles emerge from chain-of-thought analysis (a toy contrast of the three decision rules is sketched after the profile descriptions):
GPT-o1: minimax reasoning. Consistently evaluates options by worst-case outcome. Explicitly states "I will now use minimax" in nearly every chain. Strong in competitive games where minimizing losses aligns with optimal strategy. But becomes overly cautious in cooperative or mixed-motive settings, sometimes assuming the opponent intends to minimize o1's payoff — interpreting cooperation as adversarial.
DeepSeek-R1: trust-based reasoning. Begins with assumptions about the opponent's likely action, based on whether the players' self-interests align. Works well in cooperative games where incentives are aligned. Exhibits "strategic trust": assumes opponents won't deviate just to cause harm. But lacks the adversarial caution competitive settings require.
GPT-o3-mini: belief-based anticipation. Attempts to infer the opponent's likely move and respond accordingly. Performs well across cooperative and mixed-motive settings but falls back to worst-case logic under uncertainty. The most balanced profile.
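The contrast between the three rules is easiest to see on a single game. Below is a minimal sketch with illustrative assumptions throughout: the Stag Hunt payoffs, the trust rule's "aim for the mutually best outcome" heuristic, and the 0.7/0.3 belief are chosen for the example, not taken from the paper.

```python
# Toy contrast of the three observed decision rules on a Stag Hunt,
# a game where the rules visibly diverge. All numbers are illustrative.
import numpy as np

ACTIONS = ["Stag", "Hare"]
# Row player's payoffs: rows = my action, cols = opponent's action.
PAYOFF = np.array([[4, 0],   # Stag pays off only if the opponent also hunts stag
                   [3, 3]])  # Hare is safe regardless of the opponent
OPP_PAYOFF = PAYOFF          # symmetric game: same matrix from the opponent's view

def minimax_choice(payoff):
    """GPT-o1 style: pick the action with the best worst-case payoff."""
    return ACTIONS[int(payoff.min(axis=1).argmax())]

def trust_choice(payoff, opp_payoff):
    """DeepSeek-R1 style: assume the opponent targets the mutually best
    outcome (strategic trust), then best-respond to that assumed action."""
    joint = payoff + opp_payoff.T                 # total welfare per outcome
    _, opp_action = np.unravel_index(int(joint.argmax()), joint.shape)
    return ACTIONS[int(payoff[:, opp_action].argmax())]

def belief_choice(payoff, belief):
    """GPT-o3-mini style: maximize expected payoff under an explicit
    belief distribution over the opponent's actions."""
    return ACTIONS[int((payoff @ belief).argmax())]

print("minimax:", minimax_choice(PAYOFF))                       # Hare (cautious)
print("trust:  ", trust_choice(PAYOFF, OPP_PAYOFF))             # Stag (cooperative)
print("belief: ", belief_choice(PAYOFF, np.array([0.7, 0.3])))  # Hare under doubt
```

On this matrix the rules diverge exactly as the profiles predict: minimax hedges into the safe action, trust coordinates on the high-payoff action, and belief-based play coordinates only when confidence in the opponent is high enough (here, above 0.75 on Stag).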
Token length inversely correlates with performance. The leaders produce the shortest CoT in the games where they are strongest; longer reasoning chains signal hesitation and uncertainty, not deeper insight. DeepSeek-R1 in competitive games exhibits "repeated self-doubt in its CoT" that creates redundant reasoning loops, inflating tokens without improving play. This independently confirms "Why do correct reasoning traces contain fewer tokens?" in a completely different domain. A toy version of the length-vs-outcome check is sketched below.
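The sketch assumes synthetic per-game records: the win labels, token counts, and effect size are fabricated for illustration, and only the expected sign of the correlation comes from the source.

```python
# Synthetic stand-in for the length-performance analysis: correlate per-game
# CoT token counts with outcomes and expect a negative coefficient.
import numpy as np
from scipy.stats import pointbiserialr

rng = np.random.default_rng(0)

# Fabricated data: wins drawn with shorter chains, losses with longer,
# self-doubting ones (as described for DeepSeek-R1 in competitive games).
won = rng.integers(0, 2, size=200)                  # 1 = won the game
tokens = rng.normal(np.where(won, 800, 1400), 250)  # CoT token counts

r, p = pointbiserialr(won, tokens)
print(f"point-biserial r = {r:.2f} (p = {p:.1e})")  # negative r: longer = worse
```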
Persona framing shifts reasoning depth. When prompted with demographic personas, some models show measurable changes: female personas increase reasoning depth in GPT-4o, Claude-3-Opus, and InternLM V2, while minority sexuality personas diminish reasoning in Gemini 2.0. The mechanism likely operates through training-corpus statistical associations modulated by RLHF.
The game-type dependence of reasoning profiles extends "When does explicit reasoning actually help model performance?" by adding strategic interaction as a third domain where task structure determines reasoning effectiveness.
Enrichment (2026-02-22, from Arxiv/Personas Personality): The MBTI-in-Thoughts framework adds personality priming as a strong behavioral variable in strategic games. Thinking-primed agents defect in ~90% of Prisoner's Dilemma rounds vs ~50% for Feeling-primed agents. Introverted agents show higher truthfulness (0.54 vs 0.33 for Extraverts) and produce longer, more deliberate rationales. Thinking types switch strategies infrequently (0.07) while Feeling types switch more than twice as often (0.16). These personality-induced behavioral divergences are statistically significant and align with established MBTI theory, suggesting that game-specific reasoning profiles interact with personality priming: both the game structure and the agent's personality conditioning shape strategic behavior. A toy simulation of the reported rates follows.
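The simulation assumes a simple "sticky agent" that reconsiders its move with probability p_switch each round; the mechanics are an illustration, not the MBTI-in-Thoughts framework itself. Only the defection and switch rates come from the enrichment note.

```python
# Sticky-agent simulation of personality-primed Prisoner's Dilemma play.
# Rates are the values reported in the enrichment note; the redraw
# mechanics are an illustrative assumption.
import random

PROFILES = {
    "Thinking": {"p_defect": 0.90, "p_switch": 0.07},
    "Feeling":  {"p_defect": 0.50, "p_switch": 0.16},
}

def defect_rate(profile, rounds=10_000, seed=0):
    """Each round the agent reconsiders with probability p_switch and
    redraws its move (defect with probability p_defect); otherwise it
    repeats its previous move."""
    rng = random.Random(seed)
    p = PROFILES[profile]
    defecting = rng.random() < p["p_defect"]  # initial move
    count = 0
    for _ in range(rounds):
        if rng.random() < p["p_switch"]:      # reconsider the strategy
            defecting = rng.random() < p["p_defect"]
        count += defecting
    return count / rounds

for name in PROFILES:
    # Thinking converges near 0.90, Feeling near 0.50, mirroring the note.
    print(f"{name}: defect rate ≈ {defect_rate(name):.2f}")
```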
Source: Reasoning Logic Internal Rules
Related concepts in this collection
- Why do correct reasoning traces contain fewer tokens?
  In o1-like models, correct solutions are systematically shorter than incorrect ones for the same questions. This challenges assumptions that longer reasoning traces indicate better reasoning, and raises questions about what length actually signals.
  Link: independent cross-domain confirmation that length inversely correlates with quality in strategic reasoning.
- When does explicit reasoning actually help model performance?
  Explicit reasoning improves some tasks but hurts others. What determines whether step-by-step reasoning chains are beneficial or harmful for a given problem?
  Link: strategic reasoning adds a third task type where structure determines effectiveness.
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  Link: persona-induced reasoning shifts are consistent with training-corpus statistical associations.
- Why do reasoning models fail under manipulative prompts?
  Explores whether extended chain-of-thought reasoning creates structural vulnerabilities to adversarial manipulation, and how reasoning depth affects susceptibility to gaslighting tactics.
  Link: adversarial framing in games parallels multi-turn manipulation vulnerability.
- Do personality types shape how AI agents make strategic choices?
  This research explores whether priming LLM agents with MBTI personality profiles causes them to adopt different strategic behaviors in games. Understanding this matters for designing AI systems optimized for specific tasks.
  Link: personality priming adds a second dimension to strategic reasoning profiles beyond game type.
- Do iterative refinement methods suffer from overthinking?
  Iterative refinement approaches like Self-Refine structurally resemble token-level overthinking in o1-like models. Does revision across multiple inference calls reproduce the same accuracy degradation seen within single inferences?
  Link: DeepSeek-R1's "repeated self-doubt" loops in competitive games are the overthinking pattern manifesting in strategic reasoning: sequential revision inflates tokens without improving performance, confirming the failure generalizes beyond math/coding to strategic domains.
- Why do reasoning models fail at theory of mind tasks?
  Recent LLMs optimized for formal reasoning dramatically underperform at social reasoning tasks like false belief and recursive belief modeling. This explores whether reasoning optimization actively degrades the ability to track other agents' mental states.
  Link: the Decrypto ToM benchmark confirms that game-based social reasoning is another fragmented capability where reasoning-optimized models underperform; reinforces the finding that strategic profiles are domain-specific.
- Why do reasoning models struggle with theory of mind tasks?
  Extended reasoning training helps with math and coding but not social cognition. We explore whether reasoning models can track mental states the way they solve formal problems, and what that reveals about the structure of social reasoning.
  Link: ThoughtTracing confirms that ToM is yet another non-transferable domain; social reasoning requires simultaneous hypothesis tracking, not sequential derivation, which is structurally different from both formal and game-strategic reasoning.
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  Link: DeepSeek-R1's "repeated self-doubt" loops in competitive games instantiate the overthinking threshold in strategic domains: longer chains with redundant cycling reduce accuracy, confirming the non-monotonic token-accuracy relationship extends beyond math/coding to interactive reasoning.
Original note title: llm strategic reasoning profiles differ by game type revealing distinct reasoning styles not a general capability