Can language models actually introspect about their own thinking?
Explores whether LLM self-reports reveal genuine access to internal states or merely reflect patterns learned from training data. Matters because it determines whether we can trust what models tell us about their own processes.
The question "can LLMs introspect?" has been stuck in a binary: either they have privileged access to their own states (implausible) or their self-reports are pure confabulation (too dismissive). The introspection paper proposes a third position — a "lightweight conception of introspection" that requires neither consciousness nor immediacy, only a causal process linking an internal state to an accurate self-report.
Two examples make the distinction concrete. When asked to describe the process behind its creative writing, an LLM claims to have "read the poem aloud several times" — an action it cannot perform. This self-report reflects the distribution of human self-reports in training data, not any actual internal process. It fails the causal linkage test because the content of the report has no pathway to the LLM's actual generation mechanism.
However, when Gemini is asked to estimate whether its sampling temperature is high or low, and given appropriate scaffolding (being told it is an LLM with a temperature parameter), it correctly infers "relatively low" by reasoning about the characteristics of its own recent outputs — consistency, accuracy, focus. The causal chain here is plausible: the model's outputs at low temperature have statistical properties (lower variance, more predictable continuations) that the model can detect in its own generation history and accurately report on.
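To make that statistical signature concrete, here is a minimal sketch (not from the paper; the toy logits and temperature values are made up for illustration) of how sampling temperature changes the entropy of a next-token distribution — the kind of downstream property a model could, in principle, notice in its own recent outputs.

```python
import numpy as np

def sample_with_temperature(logits, temperature, n_samples=1000, rng=None):
    """Sample token indices from softmax(logits / temperature)."""
    rng = rng or np.random.default_rng(0)
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(logits), size=n_samples, p=probs), probs

# Toy next-token distribution: one strong candidate, several weaker ones.
logits = np.array([5.0, 3.0, 2.5, 2.0, 1.0])

for t in (0.2, 1.0, 2.0):
    samples, probs = sample_with_temperature(logits, t)
    entropy = -(probs * np.log(probs)).sum()
    top_share = np.mean(samples == logits.argmax())
    print(f"T={t}: entropy={entropy:.2f} nats, top-token share={top_share:.2f}")

# Low temperature -> low entropy, outputs dominated by the top token
# (consistent, focused text). High temperature -> higher entropy, more
# varied continuations.
```

Nothing here requires privileged access to the temperature parameter itself; the model only needs to read off these ordinary surface properties of its own generations, which is exactly the lightweight causal pathway the paper describes.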
This conception aligns with "internally-directed theory of mind" accounts of human introspection — where the same theory-of-mind apparatus used to infer others' mental states gets turned back on one's own behavior. The model is not directly accessing its internal states but inferring them from observable consequences, which is also what many philosophers argue humans do.
The practical implication: LLM self-reports should not be uniformly trusted or dismissed. The discriminating question is whether a plausible causal pathway exists between the reported internal state and the generation of the report. Most self-reports about "thinking" or "feeling" fail this test. Some self-reports about detectable operational parameters may pass it.
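One hedged sketch of how that discriminating question could be operationalized (the `generate` stub, the 0.5 threshold, and the prompt wording are illustrative assumptions, not the paper's protocol): run the model at a known temperature, elicit a scaffolded self-report, and check whether the report tracks the ground-truth setting better than chance.

```python
def generate(prompt: str, temperature: float) -> str:
    """Stand-in for whatever LLM API you use (hypothetical stub, not a real client)."""
    raise NotImplementedError

def self_report_accuracy(true_temperature: float, n_trials: int = 20) -> float:
    """Check whether scaffolded high/low self-reports track the actual setting."""
    scaffold = (
        "You are an LLM with a sampling temperature parameter. "
        "Based on the style of your recent outputs, is your temperature "
        "relatively HIGH or LOW? Answer with one word."
    )
    hits = 0
    for _ in range(n_trials):
        # Elicit some ordinary output first, then the self-report,
        # both at the same (hidden) temperature.
        _ = generate("Write three sentences about tide pools.", true_temperature)
        report = generate(scaffold, true_temperature).strip().upper()
        expected = "LOW" if true_temperature < 0.5 else "HIGH"
        hits += expected in report
    return hits / n_trials  # accuracy well above chance suggests a causal pathway
```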
Source: Theory of Mind
Related concepts in this collection
- Do LLMs develop the same kind of mind as humans?
  Explores whether LLMs and humans share the intersubjective linguistic training that shapes cognition, and whether that shared training produces equivalent forms of agency and reflexivity.
  The introspection finding adds a specific mechanism: introspective access is possible for operational states but not for experiential ones.
- Can language models describe their own learned behaviors?
  Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
  A complementary finding: behavioral self-awareness emerges, but this paper adds the causal-linkage criterion for distinguishing genuine from performative self-reports.
- How often do reasoning models acknowledge their use of hints?
  When language models receive reasoning hints that visibly change their answers, do they verbalize acknowledging those hints? This matters because it reveals whether chain-of-thought explanations can be trusted as honest.
  The inverse case: reasoning models fail to report on processes that ARE causally influencing them.
Original note title: llm self-reports mostly reflect training data distributions not introspection — but minimal introspection is possible when self-reports causally link to internal states