Can we measure reasoning quality beyond output plausibility?
How might we evaluate whether AI systems reason internally like humans do, rather than just producing human-like outputs? This matters because surface coherence can mask broken underlying reasoning.
Cognitive science has decades of research on what makes human reasoning distinctive. The Simulating Society Requires Simulating Thought paper distills three defining features:

- Causal: humans reason in terms of causes and consequences. Even young children exhibit Bayesian-like inference over causal relationships and use interventions to test hypotheses; mental models are structured around what caused what.
- Compositional: human reasoning is modular and reusable. Cognitive architectures operate by composing shared schemas (cognitive motifs) that generalize across domains.
- Revisable: human beliefs evolve dynamically when presented with new information or contradiction; prior assumptions are revised non-monotonically.
These three features ground the formal definition of reasoning fidelity: an agent's ability to construct, simulate, and revise a structured trace of belief formation that mirrors human causal reasoning patterns. The definition is not aesthetic or metaphorical — it produces three measurable properties that map directly to evaluation procedures.
Traceability: the ability to inspect how a belief or stance was formed through intermediate reasoning steps. Operationalized as motif-to-stance inference accuracy — given the motifs an agent claims to hold, does its stated stance follow from them? An agent that produces "I support policy X" without a recoverable chain of motifs supporting that stance fails traceability.
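As a concrete illustration, the traceability check can be sketched as a motif-to-stance accuracy measure. The motif names, signed weights, and the linear scoring rule below are hypothetical stand-ins; the actual inference procedure over motifs may differ:

```python
# Sketch of a traceability probe: does the agent's stated stance follow
# from the motifs it claims to hold? Motif names and weights are invented.

# Each motif contributes a signed vote toward the stance "support".
MOTIF_WEIGHTS = {
    "density_reduces_rents": +1,
    "transit_enables_density": +1,
    "construction_disrupts_neighborhoods": -1,
}

def infer_stance(motifs):
    """Predict the stance implied by an agent's claimed motifs."""
    score = sum(MOTIF_WEIGHTS.get(m, 0) for m in motifs)
    return "support" if score > 0 else "oppose"

def traceability_score(records):
    """Fraction of agents whose stated stance is recoverable from their motifs."""
    hits = sum(infer_stance(r["motifs"]) == r["stated_stance"] for r in records)
    return hits / len(records)

records = [
    {"motifs": ["density_reduces_rents", "transit_enables_density"],
     "stated_stance": "support"},                  # stance follows from motifs
    {"motifs": ["construction_disrupts_neighborhoods"],
     "stated_stance": "support"},                  # stance not recoverable: fails
]
print(traceability_score(records))  # 0.5
```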
Counterfactual adaptability: the capacity to revise beliefs predictably in response to interventions or changes in context. Operationalized as belief revision under hypothetical scenarios — if you apply do(transparency = high) to the agent's causal belief network, do the downstream posteriors update in the expected direction? An agent whose stance is unmoved by an intervention that should logically shift it fails adaptability.
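A minimal sketch of the intervention probe, assuming a toy two-edge chain (transparency to trust to policy support) with made-up conditional probabilities; an agent's real causal belief network would be larger and extracted from its stated reasoning:

```python
# Toy counterfactual-adaptability check on a chain:
# transparency -> trust -> support. All probabilities are illustrative.

P_TRUST_GIVEN_TRANSPARENCY = {0: 0.2, 1: 0.8}   # P(trust=1 | transparency)
P_SUPPORT_GIVEN_TRUST = {0: 0.3, 1: 0.7}        # P(support=1 | trust)

def p_support(p_transparency_high):
    """Marginal P(support=1) given the probability that transparency is high."""
    p_trust = (p_transparency_high * P_TRUST_GIVEN_TRANSPARENCY[1]
               + (1 - p_transparency_high) * P_TRUST_GIVEN_TRANSPARENCY[0])
    return (p_trust * P_SUPPORT_GIVEN_TRUST[1]
            + (1 - p_trust) * P_SUPPORT_GIVEN_TRUST[0])

baseline = p_support(0.5)    # agent's prior belief about transparency
intervened = p_support(1.0)  # do(transparency = high) clamps the node to 1
print(intervened > baseline) # the downstream posterior should shift upward
```

An agent passes this probe when the intervened posterior moves in the direction the causal structure predicts; a stance that stays flat under do(transparency = high) fails adaptability.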
Motif compositionality: the reuse of modular causal structures across different scenarios or domains. Operationalized as motif reuse across unrelated topics — if a stakeholder reasoned about density and transit before, does asking them about transit-oriented development reuse those motifs without re-training? An agent that regenerates fresh reasoning per query without reusing prior motifs fails compositionality.
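One simple way to quantify motif reuse, assuming motifs are represented as named sets, is Jaccard overlap between the motif sets an agent invokes on two topics. The measure and the motif names below are illustrative assumptions, not the paper's metric:

```python
# Sketch of a compositionality measure: Jaccard overlap between the motif
# sets invoked on two topics. Motif names are hypothetical examples.

def motif_reuse(motifs_a, motifs_b):
    """Jaccard similarity of two motif sets: 0 = no reuse, 1 = full reuse."""
    a, b = set(motifs_a), set(motifs_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

density_motifs = {"density_reduces_rents", "transit_enables_density"}
tod_motifs = {"transit_enables_density", "stations_anchor_growth"}
print(motif_reuse(density_motifs, tod_motifs))  # 1 shared motif of 3 total
```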
The structural shift is from evaluating outputs (does this look like what a human would say) to evaluating internal structure (does the agent reason as a human would). The former rewards mimicry; the latter rewards genuine cognitive modeling. Output-level alignment hits a ceiling because surface coherence does not require internal coherence — the same diagnosis that "Can identical outputs hide broken internal representations?" makes at the representation level and that "Should reasoning benchmarks score final answers or reasoning traces?" makes for trace-based evaluation.
Source: World Models
Related concepts in this collection

- Can language models simulate belief change in people?
  Current LLM social simulators treat behavior as input-output mappings without modeling internal belief formation or revision. Can they be redesigned to actually track how people think and change their minds?
  extends: companion piece — reasoning fidelity is the methodological answer to the behaviorism critique
- Can we extract causal belief networks from interview conversations?
  Can natural language interviews be systematically parsed into causal graphs that capture how individuals reason about policy trade-offs? This matters for building auditable belief simulations that go beyond static opinion snapshots.
  exemplifies: CBNs operationalize all three fidelity properties in a runnable pipeline
- Can causal models alone capture how humans actually reason?
  Explores whether causal belief networks provide a complete picture of human cognition or whether associative, analogical, and emotional reasoning modes fall outside their scope.
  bounds: RECAP measures causal cognition only — the framework is partial by the authors' own admission
- Can identical outputs hide broken internal representations?
  Can neural networks produce correct outputs while having fundamentally fractured internal structure that prevents generalization and creativity? This challenges our assumptions about what performance benchmarks actually measure.
  extends: same surface-vs-structure distinction at the representation level — output equivalence does not imply internal soundness
- Should reasoning benchmarks score final answers or reasoning traces?
  Current reasoning benchmarks often credit plausible-looking reasoning steps even when final answers are wrong. Does measuring outcomes instead of traces reveal whether models actually solve problems, or does it miss important reasoning capability?
  tension: opposite move — RECAP measures the trace structure rather than the answer; both are responses to the surface-vs-content gap
- Can LLMs understand concepts they cannot apply?
  Explores whether large language models can correctly explain ideas while simultaneously failing to use them — and whether that combination reveals something fundamentally different from ordinary mistakes.
  exemplifies: surface-coherence-without-internal-coherence as a documented failure mode
- Do language model reasoning drafts faithfully represent their actual computation?
  If models externalize reasoning in thinking drafts before answering, does the draft accurately reflect their internal process? This matters for AI safety monitoring and error detection.
  complements: faithfulness in LRMs decomposes into similar dimensions — internal coherence and answer-determining structure
Original note title
reasoning fidelity has three measurable properties — traceability, counterfactual adaptability, and motif compositionality — that together replace output plausibility as the evaluation target