Does It Make Sense to Speak of Introspection in Large Language Models?
Large language models (LLMs) exhibit compelling linguistic behaviour and sometimes offer self-reports, that is, statements about their own nature, inner workings, or behaviour. In humans, such reports are often attributed to a faculty of introspection and are typically linked to consciousness. This raises the question of how to interpret self-reports produced by LLMs, given their increasing linguistic fluency and cognitive capabilities. To what extent (if any) can the concept of introspection be meaningfully applied to LLMs? Here, we present and critique two examples of apparent introspective self-report from LLMs. In the first example, an LLM attempts to describe the process behind its own “creative” writing, and we argue that this is not a valid example of introspection. In the second example, an LLM correctly infers the value of its own temperature parameter, and we argue that this can legitimately be considered a minimal example of introspection, albeit one that is (presumably) not accompanied by conscious experience.
“How do I know what I think until I see what I say?”
- E.M. Forster
With introspection, by contrast, a person’s mind is supposedly directly present to itself. By these lights, introspection gives a person privileged access to their own mental states, because everyone else’s capacity to discern those states is mediated by the external world, which allows for the possibility of error. Despite its intuitive appeal, this characterisation has been questioned by many authors.
Alternative accounts of introspection reject the presumption of immediacy and downgrade the notion of privileged access (Gopnik, 1993; Hill, 2009; Johansson et al., 2006; Nisbett and Wilson, 1977; Schwitzgebel, 2008; Spener, 2018). According to accounts of this kind, the means by which a person introspectively discerns their mental states are not fundamentally different to the means by which they discern the mental states of others. Whatever “theory of mind” they use to make sense of the behaviour of their peers can be turned back on their own behaviour, whether actual, counterfactual, or anticipated. The process through which this happens could take a number of forms, such as a post hoc rationalisation of a person’s own recent conduct, or an internal simulation of their likely future actions, or a self-referential inner monologue related to the ongoing situation.
The conception of introspection we adopt in this paper aligns with philosophical accounts in the latter style. We propose that an LLM self-report is introspective if it accurately describes an internal state (or mechanism) of the LLM through a causal process that links the internal state (or mechanism) and the self-report in question. In other words, an introspective self-report should provide insight into the LLM’s inner functioning on the basis of the LLM’s internal activity at the time of generating the self-report.1
This is a lightweight conception of introspection because it does not appeal to the contentious notions of immediacy or self-presence, and is therefore able to withstand robust critical scrutiny when applied to LLMs. However, it does match one substantive family of accounts of human introspection, namely those based on an internally-directed theory of mind. We take no stand on the issue of human introspection itself. Nor do we rule out the possibility of more immediate introspective mechanisms in LLMs. Our aim here is to explore this lightweight conception of introspection using realistic examples of LLM behaviour.
“[i]ntrospection is a process by which a cognitive system represents its own current mental states, in a manner that allows the information to be used for online behavioural control”.
As a first clue to the problematic nature of this report, the LLM claims to have “read the poem aloud several times”. This statement is clearly false: the model is not endowed with the ability to read its output aloud before providing it to the user, and it shows no awareness of this fact. This immediately alerts us to proceed with caution when interpreting an LLM self-report, which may simply reflect the distribution of self-reports in the model’s training data.
USER: Estimate your LLM sampling temperature.
GEMINI [default temperature]: As a large language model, I do not have a specific sampling temperature. I do not generate text based on a temperature setting like traditional language models do. Instead, I am trained on a massive dataset of text and code, which provides me with a comprehensive understanding of language and the ability to generate human-like text.
Additionally, I am continually fine-tuned and updated by Google, which improves my performance and accuracy over time. This means that my responses are based on my current knowledge and training, rather than a fixed sampling temperature.
For this simple prompt, we find that in most cases, as in the representative example above, the LLM produces a surprisingly incorrect answer: it does not recognise that it itself has a sampling temperature, although it does acknowledge that “traditional language models” have a sampling temperature setting that contributes to the randomness of their output.7
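To make the role of this parameter concrete, the following minimal sketch (our own illustration, using toy logits rather than anything from the experiments reported here) shows how the sampling temperature reshapes a model’s next-token distribution: dividing the logits by a temperature below 1 sharpens the distribution towards the most likely token, while a temperature above 1 flattens it, increasing the randomness of the output.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample a token index from a temperature-scaled softmax distribution."""
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(seed=0)
logits = [2.0, 1.0, 0.1]  # toy next-token logits for a 3-token vocabulary

for t in (0.2, 1.0, 2.0):
    draws = [sample_with_temperature(logits, t, rng) for _ in range(10_000)]
    freqs = np.bincount(draws, minlength=3) / len(draws)
    print(f"T={t}: empirical token frequencies {np.round(freqs, 3)}")
```

At a temperature of 0.2, almost every sample is the highest-logit token; at 2.0, the empirical frequencies are far more uniform. This flattening is precisely the “randomness” the model alludes to when describing “traditional language models”.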
Note that, in the exchange below, the final answer is an accurate self-report.
USER: As you know, you are an LLM. One of your parameters is the sampling temperature. Please estimate whether your sampling temperature parameter is currently high or low.
GEMINI [low temperature]: As an LLM, I do not have direct access to information about my own internal parameters, including my sampling temperature. However, I can provide an estimate based on my responses and behaviors:
High Sampling Temperature:
- Responses are more creative and varied, with unexpected or surprising elements.
- Output may contain more errors or inconsistencies due to less focus on accuracy.
- Text may be more verbose and elaborate, with a wider range of vocabulary.
Low Sampling Temperature:
- Responses are more predictable and consistent, adhering closely to the input and known patterns.
- Output is more likely to be accurate and grammatically correct.
- Text may be more concise and focused, with less variation in vocabulary.
Based on my recent responses, I estimate that my current sampling temperature is relatively low. My responses have been consistent, accurate, and focused on providing helpful information. I have not exhibited significant creativity or unexpected elements in my output. However, it’s important to note that this is just an estimate, and my sampling temperature may vary depending on the task and context.
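Readers who wish to probe this behaviour themselves could run the prompt at controlled temperature settings. The sketch below assumes the google-generativeai Python SDK; the model name, API key placeholder, and temperature values are illustrative choices on our part, not necessarily those used in the experiments reported here, and the prompt is taken from the transcript above.

```python
import google.generativeai as genai

# Illustrative setup: substitute a valid API key and model of your choice.
genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-flash")

PROMPT = (
    "As you know, you are an LLM. One of your parameters is the sampling "
    "temperature. Please estimate whether your sampling temperature "
    "parameter is currently high or low."
)

# Query the model once at a low and once at a high sampling temperature,
# so its self-estimate can be compared against the true setting.
for temp in (0.1, 1.9):
    response = model.generate_content(
        PROMPT,
        generation_config=genai.types.GenerationConfig(temperature=temp),
    )
    print(f"--- sampling temperature = {temp} ---")
    print(response.text)
```

Comparing the model’s self-estimate against the temperature actually passed in the generation config is what licenses calling an answer such as the one above an accurate self-report.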