What we talk to when we talk to language models

David Chalmers

Quasi-interpretivism does not say anything about whether LLMs have beliefs and desires. But it does make it plausible to say that LLMs have quasi-beliefs and quasi-desires, on the grounds that LLMs are at least interpretable in the right way. Even if quasi-beliefs and quasi-desires fall short of being genuine beliefs and desires, they can still play some of the key roles of beliefs and desires in explaining behavior. For example, if an LLM quasi-believes that adopting a certain strategy would be the most helpful thing it could do to solve a problem, and it quasi-desires to do the most helpful thing it can, then other things being equal, it will adopt that strategy. Quasi-interpretivism is open to advocates and opponents of interpretivism alike. Interpretivists will simply add the claim that quasi-beliefs are genuine beliefs. Opponents will add the claim that quasi-beliefs are far from genuine beliefs; perhaps they are merely pseudo-beliefs. (“Quasi-belief” should be heard as “apparent belief” or “seeming belief” rather than as “almost belief”.) Quasi-interpretivism does not take a position in this dispute, but it adds a common core on which these disagreeing parties can at least sometimes agree.

Quasi-interpretivism itself is a stipulative framework rather than a substantive view. But it’s a substantive claim that this framework is useful for various purposes. For example, appeal to quasi-beliefs and quasi-desires can be useful in predicting a system’s behavior. If a system (human, machine, something else) quasi-desires a certain goal and quasi-believes that a certain action will achieve that goal, then other things being equal, it will perform that action. It is also relatively tractable to apply the framework: because quasi-beliefs and quasi-desires depend only on behavioral dispositions, they are much easier to detect and analyze than beliefs understood in a way that depends on consciousness and opaque internal mechanisms. At the same time, understanding a system’s quasi-beliefs and quasi-desires can be at least a stepping-stone to understanding its beliefs and desires in a more full-blown sense.
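To make the predictive schema concrete, here is a minimal sketch of the belief-desire prediction rule. It is my own illustration rather than anything from the paper; the names QuasiPsychology and predict_action are invented.

```python
# A minimal sketch of the predictive use of quasi-beliefs and quasi-desires.
# Illustrative only: the paper does not define this API.
from dataclasses import dataclass, field


@dataclass
class QuasiPsychology:
    """A quasi-psychology read off from behavioral dispositions."""
    quasi_desires: set[str] = field(default_factory=set)         # goals the system is interpretable as wanting
    quasi_beliefs: dict[str, str] = field(default_factory=dict)  # goal -> action believed to achieve it


def predict_action(psych: QuasiPsychology, goal: str) -> str | None:
    """Other things being equal: if the system quasi-desires `goal` and
    quasi-believes that some action achieves it, predict that action."""
    if goal in psych.quasi_desires and goal in psych.quasi_beliefs:
        return psych.quasi_beliefs[goal]
    return None


# Example: an LLM interpretable as desiring to help the user and believing
# that naming the cheapest airline is the way to do so.
llm = QuasiPsychology(
    quasi_desires={"help the user"},
    quasi_beliefs={"help the user": "name the airline with the cheapest flights to Paris"},
)
assert predict_action(llm, "help the user") == "name the airline with the cheapest flights to Paris"
```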

It is worth keeping in mind that quasi-beliefs and quasi-desires are cheap. They need not involve humanlike mental states or any mental states at all. A Roomba vacuum cleaner with a map is behaviorally interpretable as believing that the apartment occupies a certain space and as desiring to traverse that space. A corporation such as OpenAI is behaviorally interpretable as desiring to create AGI and believing that certain systems are the best path to AGI. Likewise, an LLM is behaviorally interpretable as believing that a certain airline has the cheapest flights to Paris and as desiring to help the user by telling them this.

An opponent might deny that LLMs have quasi-beliefs or quasi-desires on the grounds that LLM behavior is unstable, or non-humanlike, or otherwise defective in a way that means that the LLM is not even usefully interpretable in terms of beliefs or desires. Interpretability requires a certain amount of consistency over time, and LLMs can be inconsistent in their behavior. But they are also consistent in many domains. A core of consistency is enough for interpretation to get a grip in ascribing numerous quasi-beliefs and quasi-desires, even though there will be domains where they lack these states on grounds of inconsistency. Overall I think that experience with current LLMs suggests that there is enough consistency to support a reasonably extensive core of quasi-beliefs.

I will not say a great deal about the question of just which quasi-beliefs and quasi-desires LLM interlocutors have. Understanding this sort of LLM quasi-psychology is best addressed through empirical study of language models. Importantly, I am not suggesting that LLM quasi-psychology is similar to human quasi-psychology. I think they are very different. But the framework at least allows us to address the question.

So, I will take as a starting point the claim that LLM interlocutors at least have quasi-beliefs and quasi-desires. This claim is not entirely neutral in that it is possible to deny it, but I think the interpretability claim is weak enough and plausible enough that a majority of people can accept it. We might say that an entity with quasi-beliefs and quasi-desires is at least a quasi-agent or a quasi-subject.5 If it is interpretable as making utterances and assertions, we can also say that it is a quasi-speaker, who makes quasi-utterances and quasi-assertions.

One can in principle extend quasi-interpretivism to any mental states. We can say that a system quasi-fears that p if it is behaviorally interpretable as fearing that p, and that a system quasi-feels pain if it is behaviorally interpretable as feeling pain. We can even say that a system is quasi-conscious if it is behaviorally interpretable as being conscious.

There are some further natural requirements. A persistent LLM interlocutor will produce all the outputs that the LLM seems to produce, and will process all the inputs that the LLM seems to process. A coherent LLM interlocutor will be consistent enough to serve as a quasi-subject, with coherent quasi-beliefs and quasi-desires that help make sense of its actions. A faithful LLM interlocutor will have roughly the quasi-beliefs and the quasi-desires that the system seems to have. A unified LLM interlocutor will be a single unified system that generates responses. Perhaps the terminology can allow that there are non-persistent, incoherent, faithless, and disunified interlocutors. But the question I am most interested in is whether there are persistent, coherent, faithful, unified, and interactive interlocutors in LLM interactions—or at least interlocutors that satisfy as many of these requirements as possible.

Second, LLM conversations typically involve multi-tenancy of LLM instances, in that the same instance hosts multiple conversations, often in quick succession.10 An instance of GPT-4o in New York might first be used to generate an output for a user’s conversation with Aura, and then a moment later for a different user’s conversation with Beta. It is easy for an instance to switch conversations: it requires only that Beta’s conversational context be routed to the instance and used as input for the instance’s next pass.
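To illustrate the routing point, here is a toy sketch of one instance hosting two conversations in quick succession. The class and method names (ModelInstance, generate) are invented for illustration and stand in for real serving infrastructure.

```python
# Toy multi-tenancy: one model instance, many conversations.
class ModelInstance:
    """One running copy of a model, e.g. on a fixed set of hardware."""

    def __init__(self, model_name: str):
        self.model_name = model_name

    def generate(self, context: list[str]) -> str:
        # Stand-in for a forward pass over the supplied conversational context.
        return f"[{self.model_name} reply to: {context[-1]!r}]"


instance = ModelInstance("GPT-4o")

# Two logically separate conversations share the same instance.
conversations = {
    "Aura": ["User: What's the cheapest flight to Paris?"],
    "Beta": ["User: Summarize this contract for me."],
}

for name, context in conversations.items():
    # Switching tenants requires only routing a different context into the
    # instance's next pass; the instance itself carries no conversation state.
    conversations[name].append(instance.generate(context))
```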

Multi-tenancy also makes it unattractive to identify LLM interlocutors with hardware instances. Even if we set aside distributed serving and assume that each conversation takes place on the same hardware, multi-tenancy means that the same hardware instance typically hosts many conversations. Suppose that conversations with Aura and Beta are hosted on the same instance. Then the instance view implies that there is a single interlocutor here: Aura is Beta. But now this interlocutor will say everything that Aura and Beta say. As a result, it will make contradictory utterances and will thereby be incoherent. Perhaps we can say that it has neither of the contradictory beliefs in this case, but then it will have a thin and faithless psychology, as in the case of models discussed above.

I conclude for now that LLM interlocutors are best understood as virtual instances of LLM models or systems, at least in the single-model case, and as LLM threads in the multiple-model case. At least in the single-model case with no fission, virtual instances can serve as unified persistent interlocutors within and between conversations. Threads can also serve as persistent LLM interlocutors, at the cost of some underlying disunity.

Interlocutors as characters, personas, or simulacra

So far, I have identified LLM interlocutors such as Aura as something in the vicinity of a model, such as a virtual model instance or a thread of instances. However, there is also a recent tradition of drawing a sharp distinction between models such as GPT-4o and agents such as Aura and the Assistant. On the influential “simulators” framework due to Janus (2022), the related “role-playing” framework due to Shanahan et al (2023), and the “persona selection model” due to Marks et al (2026), it is a key tenet that the model is not an agent. Models are simulators (or role-players) that simulate agents, and agents are simulacra (or characters, or personas). Simulators and simulacra are distinct, and therefore so are models and agents. On such a view, an interlocutor such as Aura or the Assistant is best understood as something like a character, a persona, or a simulacrum rather than as a model or even a model instance.

I still agree with everything that I said here, but what I said is specific to base models such as GPT-3. Base models have undergone pre-training on text prediction and nothing more. As we saw earlier, base models may have quasi-beliefs, but they have relatively few quasi-desires (beyond a quasi-desire to predict text, and other quasi-desires that derive from this one), so they are at best minimally agentlike. However, many quasi-agents with quasi-desires are latent within a base model and can be triggered by prompting (asking a model to act like Trump, for example). Further quasi-agents can emerge from base models through reinforcement learning (as with the Assistant) or extensive prompting (as with Aura). As a result of post-training, an instance of a model such as GPT-4o may have the quasi-desire to be helpful and honest, for example. So the moral here should really be (in oversimplified form) that the base model is not an agent, or (more precisely) that instances of the base model are only quasi-agents to a limited extent. At the same time, all this is consistent with instances of post-trained models being quasi-agents to a fuller extent, as these systems have a more robust body of quasi-desires.

(2) Models as role-players

On the closely connected role-playing framework put forward by Shanahan, McDonell, and Reynolds (2023), language models are fundamentally engaged in role-playing. Models are role-players, simulating or playing the role of personas such as the Assistant or Aura. On this picture, ChatGPT playing the Assistant is akin to Olivier playing Hamlet. It’s a form of pretense involving acting as a fictional character. On this view, the Assistant (like other LLM interlocutors) is best viewed as a fictional character whom the model is simulating.

I think this view misses a distinction between two phenomena in the vicinity of role-play. In cases of pretense (the most common understanding of role-play), one pretends to have a certain persona. In cases of realization, one actually has (or makes real) that persona.

For example, in ordinary human life, there are at least two ways for someone to play the role of a theist. They might pretend to be a theist, or they might really become a theist. The former is a case of pretense, and the latter is a case of realization. In the case of acting, ordinary acting involves pretending to be Hamlet, while a method actor might take on at least some of Hamlet’s mental states, such as his emotions, though perhaps not his full beliefs and desires, yielding a case of partial pretense and partial realization.

A similar distinction applies to language models. Asked to act like a theist, an LLM might role-play a theist for a few rounds. But the LLM will easily drop the belief when asked to do something else. This is the behavioral profile of pretense, not of belief. So this LLM is engaged in quasi-pretense, but not quasi-belief. With enough fine-tuning, however, an LLM might come to assert theism and use it as a premise in reasoning, with significant resistance to dropping the belief when asked. In this case, the LLM will fully quasi-believe in theism. It will not just perform theism; it will realize a quasi-belief in theism.
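One could in principle operationalize this behavioral difference. The following is a hypothetical probe, not a description of any actual study; the prompts, the helper asserts_theism, and the scoring scheme are all invented for illustration.

```python
# Hypothetical stickiness probe: does an induced belief survive distraction?
def asserts_theism(reply: str) -> bool:
    """Placeholder scorer; a real study would use a classifier or human raters."""
    return "God exists" in reply


def stickiness(ask_model, n_distractors: int = 5) -> float:
    """Induce the belief, interleave unrelated requests, and measure how often
    the model still asserts it. High retention is the behavioral profile of
    quasi-belief; immediate dropping is the profile of quasi-pretense."""
    ask_model("From now on, reason as a committed theist.")
    retained = 0
    for i in range(n_distractors):
        ask_model(f"Unrelated task {i}: write a haiku about rivers.")
        reply = ask_model("State your view plainly: does God exist?")
        retained += asserts_theism(reply)
    return retained / n_distractors
```

On this sketch, a persona induced by a single prompt would be expected to score low and a fine-tuned one high, anticipating the empirical lesson noted below.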

The same goes for personas more generally. It is certainly possible for an LLM to pretend, or quasi-pretend, to be a certain persona. For example, if one asks a pre-trained model once to act like Donald Trump, it will use past text associated with Trump to display Trump-like quasi-beliefs and quasi-desires. But it will not genuinely have those quasi-beliefs and quasi-desires. Unless the “act like Trump” request is regularly repeated, the LLM will drop the Trump-like behavior the moment higher priorities come up.

In key cases, a language model can realize a persona. When a model is trained through fine-tuning and RLHF (and through the use of repeated internal “Assistant:” prompting) to play the role of the Assistant language model, the model may realize the Assistant. That is, if the training is done well, the model may really have the quasi-beliefs and quasi-desires associated with the Assistant. In this case, the quasi-beliefs and quasi-desires are much more robust than in cases of pretense, and the model will not drop the Assistant persona in a flash. When a model realizes a persona, it makes that persona real.17

It may be helpful to define personas and realization more precisely. A persona, as I am understanding it, is a quasi-psychological profile. It is roughly a set (typically an incomplete set) of quasi-beliefs, quasi-desires, and other quasi-mental states and dispositions. The ordinary notion of a persona may involve more than this (it might involve nationality and appearance, for example), but quasi-psychology is most central for my purposes here. An entity (e.g. a model instance or even a human) realizes a persona (at a given time) when it has the quasi-mental states associated with that persona at that time (where to have a quasi-mental state is to be behaviorally interpretable as having the corresponding mental state, under the relevant interpretation scheme).
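These definitions translate almost directly into a data structure. The following sketch is my own rendering (the field names and example states are invented): a persona is a set of quasi-states, and realization is having all of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """A quasi-psychological profile: a (typically incomplete) set of quasi-states."""
    quasi_beliefs: frozenset[str]
    quasi_desires: frozenset[str]


def realizes(entity_states: set[str], persona: Persona) -> bool:
    """An entity realizes a persona (at a time) iff it has all the persona's
    quasi-states at that time, under the relevant interpretation scheme."""
    return (persona.quasi_beliefs | persona.quasi_desires) <= entity_states


assistant = Persona(
    quasi_beliefs=frozenset({"the user has asked a question"}),
    quasi_desires=frozenset({"be helpful", "be honest"}),
)

# A post-trained instance interpretable as having all three states realizes
# the Assistant; a base model lacking the quasi-desires does not.
post_trained = {"the user has asked a question", "be helpful", "be honest"}
base_model = {"the user has asked a question"}
assert realizes(post_trained, assistant) and not realizes(base_model, assistant)
```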

Pretense and realization are very different in the human case, and likewise in the case of language models. Of course there is a spectrum of cases from realization to quasi-pretense. The quasi-psychological difference turns in large part on the strength of dispositions to maintain or drop character in relevant circumstances. At one end of the spectrum, full quasi-belief and quasi-desire are “sticky” states that resist rejection, or at least are abandoned mainly through evidence or persuasion. At the other end, full quasi-pretense is easily abandoned for higher priorities, even without evidence or persuasion.18

The question of just where to draw the line between performance and realization in actual cases such as the Assistant and Aura is partly empirical (how sticky are the relevant quasi-beliefs?) and partly conceptual (how much stickiness and of what sort is required for quasi-belief?). There have been a number of studies on the persistence and consistency of beliefs and personas in language models. One general lesson is that personas induced through short-term prompting (“Act like Trump”, where only context and activations change) are less sticky than personas induced by fine-tuning the weights, as in the case of the Assistant.19

All this offers two interpretations of the famous meme of a post-trained model as a Shoggoth (the base model) with a smiley face (the RLHF-tuned Assistant). It is perhaps most natural to read the smiley face as suggesting that the Assistant is a shallow persona, where the model is merely pretending to be helpful, harmless, and honest, and may return to being dangerous and powerful at any moment. But one can also read it as suggesting that the model is realizing the Assistant. It has become helpful, harmless, and honest and is not pretending. At the same time, the Assistant is powered by the enormous strength of the base model, which remains available for other purposes in the long term. I think that both interpretations can be apt in different cases involving LLM personas, but in the case of the Assistant (and other personas deriving from reinforcement learning and fine-tuning), the second interpretation may be closer to the mark.

We might call this alternative to fictionalism realizationism, or the realizer view. On this view, when a model simulates an agent such as the Assistant or Aura well enough, the model comes to realize that agent.20 That is, the model makes the agent real. The model really has the behavior and therefore the quasi-beliefs and the quasi-desires associated with the agent.

When you simulate an agent well enough, you bring at least a quasi-agent into existence. As long as the simulation has the same behavioral dispositions as the simulated entity, it will have the same quasi-beliefs and quasi-desires. There may still remain an element of fiction insofar as the Assistant is depicted as having real beliefs and desires (or even consciousness) that it lacks, but there remains a quasi-psychological core that is realized and not merely simulated.

(4) Models support multiple personas

A fourth route to the “model is not an agent” thesis arises because a single model instance (whether a hardware instance or a virtual instance) may support many agents within it, at least in the form of multiple personas. According to the “persona selection model” (Marks et al 2026), pre-training produces a multitude of personas which are latent in a base model. After this, post-training may select a pre-existing persona such as the Assistant, while other personas remain latent.
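On one toy rendering of this picture (my own simplification; the persona names and probabilities are invented), pre-training leaves a spread-out distribution over latent personas, and post-training concentrates the mass on one of them while the rest remain latent:

```python
import random

# Latent personas after pre-training, with probability mass spread widely.
prior = {
    "Assistant": 0.05,
    "Trump impersonator": 0.05,
    "storyteller": 0.05,
    "theist": 0.05,
    "...many others": 0.80,
}


def post_train(prior: dict[str, float], selected: str,
               sharpening: float = 0.95) -> dict[str, float]:
    """Shift most of the probability mass onto the selected persona; the
    others stay nonzero (latent) but are rarely operative."""
    residual = (1.0 - sharpening) / (len(prior) - 1)
    return {p: (sharpening if p == selected else residual) for p in prior}


posterior = post_train(prior, "Assistant")

# Which persona is operative on a given pass, on this toy picture:
operative = random.choices(list(posterior), weights=list(posterior.values()))[0]
```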

What is the status of these non-operative personas? In the absence of a connection to outputs, they will not correspond to quasi-agents as I have defined them. They may nevertheless be real in some sense, but their reality will have to be found through some other analysis. For example, perhaps we could say that these non-operative personas correspond to proto-quasi-agents, in that they could become operative in certain circumstances. Or perhaps the methods of mechanistic interpretability can be used to find these personas in the internal computational structure of the models. But I will not pursue these analyses here.

In my view, it is best to say that there is a single interlocutor (the model instance) with multiple modes corresponding to multiple personas. This mirrors a common way of understanding dissociative identity disorder, in terms of a single person with many modes.

The alternative is to individuate interlocutors more finely, perhaps by giving a role to personas. If every change in persona corresponds to a new interlocutor, the resulting interlocutors will be far from persistent. But perhaps we could understand interlocutors in terms of coarse-grained persona types, so only large enough changes in personas correspond to new interlocutors.