InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
LLMs have shown strong performance on human-centric reasoning tasks. While previous evaluations have explored whether LLMs can infer intentions or detect deception, they often overlook the individualized reasoning styles that influence how people interpret and act in social contexts. Social deduction games (SDGs) provide a natural testbed for evaluating individualized reasoning styles, where different players may adopt diverse but contextually valid reasoning strategies under identical conditions. To address this, we introduce InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can capture and apply personalized reasoning styles in SDGs. InMind enhances structured gameplay data with round-level strategy traces and post-game reflections, collected under both Observer and Participant modes. It supports four cognitively motivated tasks that jointly evaluate both static alignment and dynamic adaptation. As a case study, we apply InMind to the game Avalon, evaluating 11 state-of-the-art LLMs. General-purpose LLMs, even GPT-4o, frequently rely on lexical cues, struggling to anchor reflections in temporal gameplay or adapt to evolving strategies. In contrast, reasoning-enhanced LLMs such as DeepSeek-R1 exhibit early signs of style-sensitive reasoning. These findings reveal key limitations in current LLMs’ capacity for individualized, adaptive reasoning, and position InMind as a step toward cognitively aligned human–AI interaction.
Existing benchmarks attempt to assess ToM-like reasoning through tasks such as intent classification (Liu et al., 2024), false-belief attribution (Huang, 2024), and multiple-choice social inference (Seo et al., 2024). However, these methods primarily target output plausibility or behavioral consistency, offering limited insight into underlying cognitive mechanisms, especially those that vary across individuals. In practice, different people often exhibit context-sensitive preferences in subjective scenarios and may arrive at similar conclusions via distinct reasoning trajectories (Otto et al., 2022; Charpentier et al., 2024). We refer to this as an individualized reasoning style.
Social deduction games (SDGs) offer an ideal scenario for evaluating how reasoning styles are internalized and applied, since players must infer the hidden mental states of others and make strategic decisions accordingly (Zhang et al., 2025; Yoo and Kim, 2024). Due to their dynamic, adversarial, and individualized nature (Feng et al., 2024), such settings require more than surface-level alignment: if an LLM cannot capture and adapt to a player’s individualized reasoning style, even plausible outputs may not support meaningful collaboration. Bridging this gap is essential for advancing ToM-inspired modeling of individual variation in reasoning, and for building LLMs capable of personalized, adaptive inference. We identify two key challenges: (1) how to capture and represent individualized reasoning processes, which may require structured interaction settings and cognitively meaningful annotations; and (2) how to evaluate whether an LLM can apply a learned reasoning style in contextually adaptive ways, which calls for fine-grained, cognitively grounded tasks.
To meet these challenges, we propose InMind, a cognitively grounded evaluation framework designed to assess whether LLMs can internalize and apply individualized reasoning styles through SDGs. As illustrated in Figure 1, InMind introduces two complementary gameplay modes: Observer, where a subject reasons passively from the perspective of another player without acting, and Participant, where the subject actively engages in gameplay from their own perspective. This setup not only supports the natural capture of individualized reasoning, but also enables its application and evaluation in dynamic, interactive contexts. Crucially, InMind integrates dual-layer cognitive annotations: (1) strategy traces, which capture real-time reasoning signals such as belief updates, intention inference, and counterfactual thinking; and (2) reflective summaries, offering post-hoc insights that contextualize key game events and assess the behaviors and intentions of other players.
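For concreteness, the following is a minimal sketch of how the dual-layer annotations might be represented as data; the class and field names (StrategyTrace, GameSession, belief_updates, and so on) are illustrative assumptions, not the framework’s actual data format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StrategyTrace:
    """Real-time reasoning signals recorded at one round (assumed schema)."""
    round_id: int
    belief_updates: List[str]        # e.g. "Player 3 now seems untrustworthy"
    intention_inferences: List[str]  # inferred goals of other players
    counterfactuals: List[str]       # e.g. "had the vote failed, I would ..."

@dataclass
class GameSession:
    """One annotated gameplay session under a given mode (assumed schema)."""
    mode: str                    # "observer" or "participant"
    subject_player: str          # whose reasoning style is captured
    traces: List[StrategyTrace] = field(default_factory=list)  # round-level layer
    reflective_summary: str = "" # post-game layer: key events, other players
```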
Leveraging these signals, InMind defines four cognitively motivated tasks to evaluate distinct aspects of individualized reasoning. (1) Player Identification tests whether a model can recognize behavioral patterns that align with a specific reasoning style. (2) Reflection Alignment assesses the model’s ability to ground abstract post-game reflections in concrete game behavior. (3) Trace Attribution probes whether the model can simulate evolving, in-context reasoning across time. (4) Role Inference evaluates whether the model can internalize reasoning styles to support belief modeling under uncertainty.
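Read as an evaluation suite, the four tasks can be summarized by the input/target pairings sketched below; this is a hypothetical rendering of the task descriptions above, not the framework’s exact specification.

```python
from enum import Enum

class InMindTask(Enum):
    PLAYER_IDENTIFICATION = "player_identification"
    REFLECTION_ALIGNMENT = "reflection_alignment"
    TRACE_ATTRIBUTION = "trace_attribution"
    ROLE_INFERENCE = "role_inference"

# Assumed (input, target) pairings inferred from the task descriptions.
TASK_IO = {
    # Match a reasoning profile to the player who produced it.
    InMindTask.PLAYER_IDENTIFICATION: ("style profile + anonymized sessions", "player identity"),
    # Ground each claim in a post-game reflection in concrete rounds.
    InMindTask.REFLECTION_ALIGNMENT: ("reflective summary + round log", "supporting rounds"),
    # Attribute strategy traces to the rounds that produced them.
    InMindTask.TRACE_ATTRIBUTION: ("strategy traces + round log", "trace-to-round map"),
    # Predict hidden roles as the profiled player would, under uncertainty.
    InMindTask.ROLE_INFERENCE: ("style profile + partial game state", "role beliefs"),
}
```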
To concretely investigate these capabilities, we instantiate InMind within the popular social deduction game Avalon, creating InMind-Avalon, a novel dataset comprising 30 full-session human gameplays annotated with detailed cognitive traces and reflective summaries. Our empirical analysis evaluates 11 state-of-the-art LLMs on InMind-Avalon and highlights several critical limitations: (1) most models, including GPT-4o, rely heavily on superficial lexical patterns, failing to consistently infer deeper strategic intent; (2) temporal alignment between reflective reasoning and specific in-game events remains challenging for nearly all evaluated models; (3) dynamic adaptation of strategic reasoning based on evolving interactions is largely insufficient, indicating fundamental shortcomings in models’ capability for individualized reasoning. Nevertheless, we observe promising potential in certain models, such as DeepSeek-R1, suggesting possible avenues for improvement. Despite the inherent subjectivity of individualized annotations, these cognitively grounded traces and reflections effectively support fine-grained tasks such as hidden role identification, highlighting their practical utility for model training and evaluation.