Psychology and Social Cognition · Language Understanding and Pragmatics

Can language models actually introspect about their own thinking?

Explores whether LLM self-reports reveal genuine access to internal states or merely reflect patterns learned from training data. Matters because it determines whether we can trust what models tell us about their own processes.

Note · 2026-02-22 · sourced from Theory of Mind

The question "can LLMs introspect?" has been stuck in a binary: either they have privileged access to their own states (implausible) or their self-reports are pure confabulation (too dismissive). The introspection paper proposes a third position — a "lightweight conception of introspection" that requires neither consciousness nor immediacy, only a causal process linking an internal state to an accurate self-report.

Two examples make the distinction concrete. When asked to describe the process behind its creative writing, an LLM claims to have "read the poem aloud several times" — an action it cannot perform. This self-report reflects the distribution of human self-reports in training data, not any actual internal process. It fails the causal linkage test because the content of the report has no pathway to the LLM's actual generation mechanism.

However, when Gemini is asked to estimate whether its sampling temperature is high or low, and given appropriate scaffolding (being told it is an LLM with a temperature parameter), it correctly infers "relatively low" by reasoning about the characteristics of its own recent outputs — consistency, accuracy, focus. The causal chain here is plausible: the model's outputs at low temperature have statistical properties (lower variance, more predictable) that the model can detect in its own generation history and accurately report on.
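The inference attributed to Gemini can be sketched in miniature. The following is a toy simulation, not the actual model or paper method: a stand-in "sampler" draws tokens from a softmax over fixed logits, and the "introspective" estimate is made purely from observable statistics of its own recent outputs (here, the entropy of the pooled token distribution). All names, logits, and the 1.5-bit threshold are illustrative assumptions.

```python
import math
import random
from collections import Counter

# Illustrative vocabulary and skewed base preferences (assumed values).
VOCAB = ["the", "a", "cat", "dog", "sat", "ran", "slept", "barked"]
LOGITS = [3.0, 2.5, 2.0, 1.0, 0.5, 0.2, 0.1, 0.0]

def toy_generate(temperature, rng, n_tokens=20):
    """Toy stand-in for an LLM's sampler: a softmax over fixed logits
    that flattens as temperature rises."""
    weights = [math.exp(l / temperature) for l in LOGITS]
    return rng.choices(VOCAB, weights=weights, k=n_tokens)

def token_entropy(outputs):
    """Shannon entropy (bits) of the pooled token distribution across runs.
    Low temperature concentrates mass on a few tokens, so entropy is low."""
    counts = Counter(tok for run in outputs for tok in run)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def self_estimate(outputs, threshold_bits=1.5):
    """Mimic the note's inference pattern: judge the temperature setting
    from detectable properties of one's own recent generations, without
    any direct access to the parameter itself."""
    return "relatively low" if token_entropy(outputs) < threshold_bits else "relatively high"

rng = random.Random(0)
low_runs = [toy_generate(0.3, rng) for _ in range(10)]   # consistent, repetitive
high_runs = [toy_generate(2.0, rng) for _ in range(10)]  # varied, flatter
print(self_estimate(low_runs))
print(self_estimate(high_runs))
```

The point of the sketch is the causal chain: the temperature parameter shapes the outputs, the outputs have measurable statistics, and the "self-report" is computed from those statistics, so the report is causally linked to the state it describes.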

This conception aligns with "internally-directed theory of mind" accounts of human introspection — where the same theory-of-mind apparatus used to infer others' mental states gets turned back on one's own behavior. The model is not directly accessing its internal states but inferring them from observable consequences, which is also what many philosophers argue humans do.

The practical implication: LLM self-reports should not be uniformly trusted or dismissed. The discriminating question is whether a plausible causal pathway exists between the reported internal state and the generation of the report. Most self-reports about "thinking" or "feeling" fail this test. Some self-reports about detectable operational parameters may pass it.
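The discriminating question can be phrased as a toy decision rule. This is a hypothetical sketch, not anything from the paper: a self-report passes only if every internal state it cites could in principle be inferred from the model's own observable outputs; the category set below is invented for illustration.

```python
# Illustrative set of states a model could plausibly detect by examining
# its own generation history (assumed, not from the source paper).
DETECTABLE = {"output_consistency", "output_variance", "repetition_rate"}

def has_plausible_pathway(cited_states):
    """A self-report passes the causal-linkage test only if every state
    it cites is inferable from the model's observable outputs."""
    return set(cited_states) <= DETECTABLE

# The poem example cites a process the model cannot perform or observe:
print(has_plausible_pathway({"read_aloud"}))
# The temperature example cites properties of its own recent outputs:
print(has_plausible_pathway({"output_consistency"}))
```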


Source: Theory of Mind

Related concepts in this collection

Concept map
14 direct connections · 119 in 2-hop network · medium cluster

Original note title: llm self-reports mostly reflect training data distributions not introspection — but minimal introspection is possible when self-reports causally link to internal states