Can models recognize how individuals reason differently?
Do language models capture the distinct reasoning paths and strategic styles that individual humans use when reaching the same conclusion? Current Theory of Mind (ToM) evaluations ignore this dimension entirely.
Different people arrive at the same conclusion through distinct reasoning paths. In social deduction games such as Avalon, players facing identical information adopt different strategies: some track voting patterns, others read behavioral cues, still others reason counterfactually about what different role assignments would imply. These are individualized reasoning styles.
InMind proposes a framework built on dual-layer cognitive annotations: strategy traces capturing real-time reasoning signals (belief updates, intention inference, counterfactual thinking) and reflective summaries offering post-hoc contextualization of key events. Two gameplay modes — Observer (passive reasoning from another player's perspective) and Participant (active engagement) — enable both capturing and evaluating individualized reasoning.
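To make the dual-layer structure concrete, here is a minimal sketch of what such an annotation might look like as a data structure. This is an assumption-laden illustration: the field names (`signal_type`, `reflective_summary`, etc.) are invented for clarity and are not the paper's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    """The two gameplay modes under which reasoning is collected."""
    OBSERVER = "observer"        # passive reasoning from another player's perspective
    PARTICIPANT = "participant"  # active engagement in the game


@dataclass
class StrategyTraceEntry:
    """One real-time reasoning signal, logged as the game unfolds.

    Field names are illustrative, not the paper's schema.
    """
    round_id: int
    signal_type: str  # e.g. "belief_update", "intention_inference", "counterfactual"
    content: str      # free-text description of the reasoning step


@dataclass
class CognitiveAnnotation:
    """Dual-layer annotation for one player's session: a real-time
    strategy trace plus a post-hoc reflective summary."""
    player_id: str
    mode: Mode
    strategy_trace: list[StrategyTraceEntry] = field(default_factory=list)
    reflective_summary: str = ""  # post-hoc contextualization of key events
```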
Four tasks evaluate distinct aspects (a hypothetical interface sketch follows the list):
- Player Identification: Can the model recognize behavioral patterns aligned with a specific reasoning style?
- Reflection Alignment: Can it ground abstract post-game reflections in concrete game behavior?
- Trace Attribution: Can it simulate evolving in-context reasoning across time?
- Role Inference: Can it internalize reasoning styles to support belief modeling under uncertainty?
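Read as a benchmark suite, the four tasks could be expressed as evaluation interfaces along these lines. The function names, signatures, and return types below are assumptions for illustration, not the benchmark's actual API:

```python
# Hypothetical task interfaces; names and signatures are illustrative only.

def player_identification(model, candidate_sessions, target_trace) -> str:
    """Given one player's strategy trace, pick which anonymized session
    exhibits the matching reasoning style. Returns a session id."""
    ...

def reflection_alignment(model, reflective_summary, game_events) -> list[int]:
    """Ground each claim in a post-game reflection to the concrete
    in-game events (by index) that support it."""
    ...

def trace_attribution(model, partial_trace, game_state) -> str:
    """Continue a player's evolving reasoning trace at a given point in
    the game, simulating their in-context reasoning over time."""
    ...

def role_inference(model, player_style, observed_behavior) -> dict[str, float]:
    """Return a belief distribution over hidden role assignments,
    conditioned on an internalized individual reasoning style."""
    ...
```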
The evaluation of 11 LLMs reveals critical limitations. GPT-4o "frequently relies on lexical cues, struggling to anchor reflections in temporal gameplay or adapt to evolving strategies." The model latches onto surface-level language patterns rather than tracking the temporal evolution of reasoning. Temporal alignment between reflective reasoning and specific in-game events "remains challenging for nearly all evaluated models."
DeepSeek-R1 shows "early signs of style-sensitive reasoning" — suggesting that extended reasoning training may begin to capture individualized patterns where standard models cannot. But dynamic adaptation of strategic reasoning based on evolving interactions "is largely insufficient" across all models.
The implication: ToM evaluation that only checks whether the model gets the right answer misses whether it arrived there through a reasoning path that matches the individual it's modeling. Two correct answers can reflect completely different (and incompatible) reasoning styles.
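To see why output matching is insufficient, consider a toy sketch (all trace data invented) contrasting an output-only check with a naive path-sensitive score:

```python
# Toy illustration with invented trace data: two players reach the same
# verdict through incompatible reasoning paths, so an output-only check
# scores them identically while a path-sensitive check does not.

def signal_types(trace: list[str]) -> set[str]:
    """Extract the reasoning-signal tag (the part before ':') from each step."""
    return {step.split(":")[0] for step in trace}

def output_match(pred: str, gold: str) -> bool:
    """Output-only evaluation: correct answer, full credit."""
    return pred == gold

def path_overlap(pred_trace: list[str], gold_trace: list[str]) -> float:
    """Naive path-sensitive score: fraction of the gold trace's signal
    types that the prediction also uses. A real benchmark would need
    far richer temporal alignment than this."""
    shared = signal_types(pred_trace) & signal_types(gold_trace)
    return len(shared) / len(signal_types(gold_trace))

trace_a = [
    "belief_update: quest 2 failed, so that team contained a spy",
    "vote_pattern: p3 approved both failing teams",
]
trace_b = [
    "behavioral_cue: p3 hesitated when accused",
    "counterfactual: if p3 were good, they would have objected earlier",
]
conclusion = "player 3 is evil"

print(output_match(conclusion, conclusion))  # True for both players
print(path_overlap(trace_a, trace_b))        # 0.0: no shared signal types
```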
Source: Theory of Mind
Related concepts in this collection
- Do large language models use one reasoning style or many?
  Explores whether LLMs share a universal strategic reasoning approach or develop distinct styles tailored to specific game types. Understanding this matters for predicting model behavior in competitive versus cooperative scenarios.
  InMind adds the human-side dimension: not just model-specific reasoning profiles but player-specific trajectories that models fail to capture.
- Does any single persuasion technique work for everyone?
  Can fixed persuasion strategies like appeals to authority or social proof be reliably applied across different people and situations, or do they require adaptation to individual traits and context?
  Individualized reasoning styles are why universal strategies fail in persuasion too: the reasoning path matters, not just the conclusion.
- Why do LLM persona prompts produce inconsistent outputs across runs?
  Can language models reliably simulate different social perspectives through persona prompting, or does their run-to-run variance indicate they lack stable group-specific knowledge? This matters for whether LLMs can approximate human disagreement in annotation tasks.
  Persona instability may explain why LLMs fail at individualized reasoning: they cannot maintain stable models of individual reasoning styles.
Original note title: individualized reasoning styles — distinct reasoning trajectories reaching similar conclusions — require cognitively grounded evaluation beyond output matching