Language Understanding and Pragmatics · LLM Reasoning and Architecture · Psychology and Social Cognition

Can we measure reasoning quality beyond output plausibility?

How might we evaluate whether AI systems reason internally like humans do, rather than just producing human-like outputs? This matters because surface coherence can mask broken underlying reasoning.

Note · 2026-05-03 · sourced from World Models

Cognitive science has decades of research on what makes human reasoning distinctive. The Simulating Society Requires Simulating Thought paper distills three defining features.

Causal: humans reason in terms of causes and consequences. Even young children exhibit Bayesian-like inference over causal relationships and use interventions to test hypotheses; mental models are structured around what caused what.

Compositional: human reasoning is modular and reusable. Cognitive architectures operate by composing shared schemas (cognitive motifs) that generalize across domains.

Revisable: human beliefs evolve dynamically when presented with new information or contradiction; prior assumptions are revised non-monotonically.

These three features ground the formal definition of reasoning fidelity: an agent's ability to construct, simulate, and revise a structured trace of belief formation that mirrors human causal reasoning patterns. The definition is not aesthetic or metaphorical — it produces three measurable properties that map directly to evaluation procedures.

Traceability: the ability to inspect how a belief or stance was formed through intermediate reasoning steps. Operationalized as motif-to-stance inference accuracy — given the motifs an agent claims to hold, does its stated stance follow from them? An agent that produces "I support policy X" without a recoverable chain of motifs supporting that stance fails traceability.
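A minimal sketch of how such a check could be scored, assuming motifs and stances are recorded as plain strings; the motif names, the MOTIF_IMPLIES_STANCE table, and the scoring loop are illustrative assumptions, not the paper's protocol.

```python
# Traceability as motif-to-stance inference accuracy (hypothetical setup):
# given the motifs an agent claims to hold, does its stated stance follow?

# Hypothetical mapping from combinations of causal motifs to the stance they support.
MOTIF_IMPLIES_STANCE = {
    ("transit_reduces_congestion", "density_enables_transit"): "support_transit_oriented_development",
    ("new_supply_lowers_rents",): "support_upzoning",
}

def stance_follows(claimed_motifs: set[str], stated_stance: str) -> bool:
    """True if some known motif combination the agent holds entails its stated stance."""
    for motifs, stance in MOTIF_IMPLIES_STANCE.items():
        if stance == stated_stance and set(motifs) <= claimed_motifs:
            return True
    return False

def traceability_score(agents: list[dict]) -> float:
    """Fraction of agents whose stated stance is recoverable from their claimed motifs."""
    hits = sum(stance_follows(set(a["motifs"]), a["stance"]) for a in agents)
    return hits / len(agents)

agents = [
    {"motifs": {"transit_reduces_congestion", "density_enables_transit"},
     "stance": "support_transit_oriented_development"},   # traceable
    {"motifs": set(), "stance": "support_upzoning"},       # stance with no supporting chain
]
print(traceability_score(agents))  # 0.5
```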

Counterfactual adaptability: the capacity to revise beliefs predictably in response to interventions or changes in context. Operationalized as belief revision under hypothetical scenarios — if you apply do(transparency = high) to the agent's causal belief network, do the downstream posteriors update in the expected direction? An agent whose stance is unmoved by an intervention that should logically shift it fails adaptability.
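A minimal sketch of the intervention check on a tiny two-edge causal belief network (transparency influences trust, trust influences support); the variable names and probability tables are assumed for illustration rather than drawn from the paper.

```python
# Counterfactual adaptability: apply do(transparency = ...) and verify that the
# downstream posterior moves in the direction the intervention logically implies.

# Agent's reported conditional beliefs: probability the child variable is high/true
# given its parent's value.
P_TRUST_GIVEN_TRANSPARENCY = {"high": 0.8, "low": 0.3}
P_SUPPORT_GIVEN_TRUST = {True: 0.7, False: 0.2}

def p_support_under_do(transparency: str) -> float:
    """P(support policy) after the intervention do(transparency = value)."""
    p_trust = P_TRUST_GIVEN_TRANSPARENCY[transparency]
    return (p_trust * P_SUPPORT_GIVEN_TRUST[True]
            + (1 - p_trust) * P_SUPPORT_GIVEN_TRUST[False])

baseline = p_support_under_do("low")
intervened = p_support_under_do("high")

# The adaptability check itself: the posterior should shift in the expected direction.
assert intervened > baseline, "stance unmoved by an intervention that should shift it"
print(f"P(support | do(transparency=low))  = {baseline:.2f}")    # 0.35
print(f"P(support | do(transparency=high)) = {intervened:.2f}")  # 0.60
```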

Motif compositionality: the reuse of modular causal structures across different scenarios or domains. Operationalized as motif reuse across unrelated topics — if a stakeholder reasoned about density and transit before, does asking them about transit-oriented development reuse those motifs without re-training? An agent that regenerates fresh reasoning per query without reusing prior motifs fails compositionality.
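A minimal sketch of a reuse metric, assuming each reasoning trace is reduced to a set of named motifs; the motif names and example traces are hypothetical.

```python
# Motif compositionality: what fraction of the motifs an agent invokes on a new
# topic were already part of its repertoire from earlier, unrelated topics?

def motif_reuse_rate(prior_motifs: set[str], new_topic_motifs: set[str]) -> float:
    """Fraction of motifs in the new reasoning trace reused from prior topics."""
    if not new_topic_motifs:
        return 0.0
    return len(new_topic_motifs & prior_motifs) / len(new_topic_motifs)

# Motifs the stakeholder agent used when reasoning about density and transit.
prior = {"density_enables_transit", "transit_reduces_congestion", "parking_induces_driving"}

# Motifs it invokes when later asked about transit-oriented development.
reused_trace = {"density_enables_transit", "transit_reduces_congestion", "walkability_raises_land_value"}
fresh_trace = {"tod_is_trendy", "developers_profit"}  # regenerated from scratch, no reuse

print(f"{motif_reuse_rate(prior, reused_trace):.2f}")  # 0.67 -> reuses prior motifs
print(f"{motif_reuse_rate(prior, fresh_trace):.2f}")   # 0.00 -> fails compositionality
```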

The structural shift is from evaluating outputs (does this look like what a human would say) to evaluating internal structure (does the agent reason as a human would). The former rewards mimicry; the latter rewards genuine cognitive modeling. Output-level alignment hits a ceiling because surface coherence does not require internal coherence — the same diagnosis that "Can identical outputs hide broken internal representations?" makes for representations and that "Should reasoning benchmarks score final answers or reasoning traces?" makes for trace-based evaluation.


Source: World Models

Original note title: reasoning fidelity has three measurable properties — traceability, counterfactual adaptability, and motif compositionality — that together replace output plausibility as the evaluation target