Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations
Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge—all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds VLM reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks (including intent disambiguation, commonsense reasoning, and safety), CoCoT consistently outperforms CoT and direct prompting (+8% on average).
CoCoT is a novel and lightweight CoT method that unfolds visual reasoning through three cognitively motivated stages (Barsalou, 2008; Roth and Jornet, 2013; Newen et al., 2018): (i) perception (what is directly observable), (ii) situation (what relationships or context hold among the perceived elements), and (iii) norm (what social interpretation can be inferred). By formalizing stages that are often implicit in CoT, CoCoT better aligns model reasoning with human social perception, enabling more interpretable, grounded, and normatively coherent outputs.
While such methods enrich CoT reasoning in textual domains, extending structured prompting to multimodal settings presents new challenges, as models must integrate perceptual input with abstract normative understanding. Recent works have begun to explore this space. Compositional CoT (CCoT) (Mitra et al., 2024) prompts models to generate scene graphs from images as intermediate representations to guide CoT. Visual SKETCHPAD (Hu et al., 2024) draws intermediate visual artifacts (e.g., lines) to aid geometric reasoning. Though these approaches improve visual parsing, they fall short in scaffolding the interpretive reasoning needed to infer intent, appropriateness, or moral salience in socially complex scenes. Recent findings (Nam et al., 2025) show that VLMs often rely on superficial cues and struggle to disambiguate true intent, suggesting that, beyond perception, they fail to reason over socio-normative insights. In this light, Cognitive Chain-of-Thought (CoCoT) introduces a cognitively inspired, three-stage structure—perception, situation, and norm—to guide models through progressively abstract interpretation. This design goes beyond symbolic scaffolding, bridging perception and normative understanding to foster socially coherent reasoning in VLMs.
Building on this, the theory of 4E cognition (Newen et al., 2018) argues that cognition is: Embodied (shaped by bodily interactions), Embedded (situated in environmental context), Enactive (emerging through action and interaction), and Extended (augmented by external tools and social structures). From this view, cognition, affect, and behavior emerge from being embedded within the world, not from isolated internal processes.
Perception: Embodies the modal grounding of cognition. Rather than processing visual features passively, CoCoT prompts the model to actively interpret and anchor its reasoning in concrete perceptual evidence. Prompt: Based on the image, describe what is directly observable.
Situation: Reflects the embedded and enactive dimensions of cognition. It captures social dynamics and contextual cues that arise from lived interaction, helping the model infer situational meaning beyond surface perception. Prompt: Based on the identified elements, determine the relationships or context among them.
Norm: Engages the extended dimension of cognition. It allows the model to reason over socially constructed values and expectations, which often transcend the immediate context but remain grounded in prior interpretation. Prompt: Based on the above reasoning stages, infer the most socially plausible interpretation.
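The three stages above can be chained as a single prompting loop. The sketch below is a minimal illustration, not the paper's implementation: `query_vlm` is a hypothetical stand-in for whatever VLM API is available, and only the three stage instructions are taken verbatim from the prompts quoted above.

```python
# Minimal sketch of the CoCoT three-stage prompting pipeline.
# `query_vlm(image=..., prompt=...)` is a hypothetical callable standing in
# for any vision-language model API; the stage instructions are the prompts
# quoted in the text.

COCOT_STAGES = [
    ("perception",
     "Based on the image, describe what is directly observable."),
    ("situation",
     "Based on the identified elements, determine the relationships "
     "or context among them."),
    ("norm",
     "Based on the above reasoning stages, infer the most socially "
     "plausible interpretation."),
]

def cocot(image, question, query_vlm):
    """Run the three stages in order, appending each stage's answer to a
    running transcript so later stages build on earlier ones."""
    transcript = [f"Question: {question}"]
    outputs = {}
    for name, instruction in COCOT_STAGES:
        prompt = "\n".join(transcript + [instruction])
        outputs[name] = query_vlm(image=image, prompt=prompt)
        transcript.append(f"{name.capitalize()}: {outputs[name]}")
    return outputs  # outputs["norm"] holds the final interpretation
```

Because each stage's answer is folded back into the next stage's prompt, the norm stage conditions on both the perceptual description and the situational inference, mirroring the progressive abstraction the method describes.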
By decomposing reasoning into these three stages, CoCoT introduces a cognitively aligned scaffolding that better mirrors how humans navigate morally and socially complex visual scenes.