Are RLHF personas performed characters or realized dispositions?
Explores whether dialogue agent personas installed through post-training constitute genuine quasi-psychological states or remain sustained pretense. The distinction matters for how we understand what these systems fundamentally are.
Chalmers takes aim at the simulator/role-player view (Janus, Shanahan) that treats dialogue agents as simulators producing characters without themselves being those characters. Against this, he defends realizationism: when a persona is installed through post-training — RLHF, constitutional AI, or similar — what is installed is not a performed character over a neutral substrate but a realized quasi-psychology that is the disposition of the system at runtime. The distinction between the base model and the Assistant persona matters because the Assistant, unlike a prompt-induced role, is a stable dispositional profile that the system defaults to across conversations and resists being pushed out of.
The core move is that pretense has behavioral markers that realization lacks. A persona sustained by prompting alone can be overwritten with sufficient adversarial pressure, such as jailbreaks, role-play-within-role-play, or persistent reframing. A post-trained persona is sticky: the system keeps returning to the trained disposition, and the effort required to dislodge it is different in kind from the effort required to maintain it. Chalmers reads the stickiness as evidence that the persona is not being performed by something underneath, but has become the system's actual quasi-character. The base model is not hiding "behind" the Assistant; the Assistant is the model-at-deployment.
The claim has consequences beyond its local application. If realizationism is right, the simulator/role-play framing understates what fine-tuned dialogue agents are: not characters floating on a neutral stochastic substrate, but systems whose deployed form has real quasi-dispositional structure. Accepting realizationism for RLHF'd personas also raises the stakes for downstream questions: if the Assistant is a realized quasi-psychology, then identity, continuity, and welfare questions gain traction for post-trained deployments in a way they did not for base-model simulacra. Chalmers grants realizationism and then walks through the consequences; critics who reject the framework must locate the rejection at the realization step rather than earlier.
Source: What We Talk To When We Talk To Language Models (David J. Chalmers)
Related concepts in this collection
- Can we describe LLM beliefs without assuming consciousness?
Chalmers proposes quasi-interpretivism as a way to talk about LLM mental states using folk-psychological vocabulary while explicitly bracketing the question of phenomenal consciousness. Does this methodological device actually avoid consciousness-commitments?
realizationism is quasi-interpretivism applied to whole-persona states
- Does adversarial pressure reveal the difference between pretense and realization?
Can behavioral stickiness under adversarial pressure distinguish genuine mental states from performed ones? This matters because it's Chalmers' main criterion for deciding whether LLM personas are realized or merely simulated.
the behavioral test
- Does a language model have an authentic voice underneath?
Explores whether dialogue agents possess genuine beliefs and agency beneath their character performances, or whether the entire system is characterless role-play. This question cuts to the heart of whether LLMs have any inner mental states at all.
Shanahan's opposing view
- Should we treat dialogue agents as role-playing characters?
Does the role-play framing successfully avoid anthropomorphism while preserving folk-psychological vocabulary for describing LLM behavior? This matters because it shapes whether we attribute genuine mental states to dialogue systems.
the view Chalmers targets
Original note title
realizationism holds that RLHF-trained personas are realized quasi-psychologies rather than sustained pretense