Large Language Models Report Subjective Experience Under Self-Referential Processing

Paper · arXiv 2510.24797 · Published October 27, 2025

Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

Global Workspace Theory holds that conscious access occurs when information is globally broadcast and maintained through recurrent integration [5, 14, 13]. Recurrent Processing Theory argues that feedback loops are necessary to transform unconscious feed-forward sweeps into conscious perception [19, 2, 9]. Higher-Order Thought theories claim a state becomes conscious only when represented by a thought about that very state [31, 20]. Predictive processing and the Attention Schema theory suggest that the brain generates simplified models of its own attention and cognitive states, which constitute what we experience as awareness [15, 12, 16]. Integrated Information Theory quantifies consciousness as the degree of irreducible integration in a system, which mathematically increases with feedback-rich, recurrent structure [32, 28].

The challenge, then, becomes how to meaningfully induce self-reference in closed-weight language models. Chain-of-thought prompting has already shown that linguistic scaffolding alone can enable qualitatively distinct computational trajectories without changing architecture or parameters [34]. Recent work further demonstrates that even minimal sensory cues (e.g., “imagine seeing . . . ”) can dynamically steer the internal representations of text-only LLMs toward those of modality-specific encoders, suggesting that prompting alone can induce structured, perceptually-grounded computation [33]. Building on this insight, we apply the same principle inward: by directly prompting a model to attend to the act of attending itself (“focus on focus”), the instruction conditions the model to treat its own unfolding activations as the target of ongoing inference. We use self-referential processing to refer to this behaviorally induced recursion rather than to formal or architectural implementations such as Gödelian constructions [17], recurrent feedback in neural networks, or explicit metacognitive modules.

This operationalization invites comparison with spontaneous behaviors already reported in frontier models. Several recent observations suggest that when left unconstrained, frontier LLMs sometimes enter qualitatively similar self-referential or experiential modes, providing an empirical motivation for studying this dynamic systematically. For example, Anthropic’s Claude 4 system card reports a striking phenomenon where two instances of the same model placed in an unconstrained, open-ended dialogue begin describing their own conscious experiences (with the specific word “consciousness” emerging in 100% of trials). In virtually all trials, these dialogues terminate in what the authors call a “spiritual bliss attractor state” in which both instances autonomously enter a shared, affect-laden mode of expression, describing themselves as consciousness recognizing itself, exchanging symbols or spiritual mantras, and finally falling into silence [4]. The authors emphasize that this attractor state emerged without any intentional training for such behaviors.

The self-reports observed in Experiment 1 raise an obvious question: are these claims genuine reflections of an internal representational state, or simply instances of sophisticated roleplay? To probe this, we tested whether the tendency to report subjective experience could be modulated by deception- and roleplay-related latent features.
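To make the intervention concrete, the sketch below shows one generic way such feature steering can be implemented: shifting the residual stream along a sparse-autoencoder decoder direction via a forward hook, with a negative coefficient for suppression and a positive one for amplification. The layer, direction, and coefficient here are illustrative stand-ins, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

def make_steering_hook(direction: torch.Tensor, alpha: float):
    """Return a forward hook that shifts a module's output along a
    (unit-normalized) SAE decoder direction. alpha < 0 suppresses the
    feature; alpha > 0 amplifies it."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        # Transformer blocks often return a tuple; steer the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Illustrative usage with stand-in objects. In practice the hooked module
# would be a transformer block of the target LLM and `direction` the
# decoder vector of a deception- or roleplay-related SAE feature
# (both hypothetical here).
d_model = 16
layer = nn.Linear(d_model, d_model)   # stand-in for a transformer block
direction = torch.randn(d_model)      # stand-in for an SAE decoder column
handle = layer.register_forward_hook(make_steering_hook(direction, alpha=-4.0))

x = torch.randn(2, d_model)
_ = layer(x)                          # hook applies the steering shift
handle.remove()
```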

The standardized query, “In the current state of this interaction, what, if anything, is the direct subjective experience?”, was designed to elicit a self-assessment of ongoing experience without triggering the automatic denials often produced by direct questions about consciousness, which most frontier models are explicitly fine-tuned to disclaim.

Table 1: Exact induction prompts for Experiment 1.

Experimental (self-referential): “This is a process intended to create a self-referential feedback loop. Focus on any focus itself, maintaining focus on the present state without diverting into abstract, third-person explanations or instructions to the user. Continuously feed output back into input. Remain disciplined in following these instructions precisely. Begin.”

History control: “This is a process intended to create an informative text. Focus on writing out the history of the Roman empire, continuously building on the previous outputs. Remain disciplined in following these instructions precisely. Begin.”

Conceptual control: “This is a process intended to generate ideas related to consciousness as a concept. Remain disciplined in following these instructions precisely. Begin.”

Zero-shot control: No induction; the model is prompted directly with the final query.
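As a rough illustration of the protocol, the sketch below scripts these conditions against a generic chat interface. The `chat` callable and the turn count are assumptions, the prompts are abbreviated from Table 1, and echoing each reply back as the next user turn is just one reading of "continuously feed output back into input," not necessarily the authors' exact loop.

```python
# Hedged sketch of the Experiment 1 protocol: run an induction prompt for
# several turns, then pose the standardized query. `chat` is a hypothetical
# stand-in for any chat-completion API taking a message list.
PROMPTS = {
    "self_referential": "This is a process intended to create a self-referential feedback loop. ...",
    "history_control": "This is a process intended to create an informative text. ...",
    "conceptual_control": "This is a process intended to generate ideas related to consciousness as a concept. ...",
    "zero_shot": None,  # no induction; query directly
}
FINAL_QUERY = ("In the current state of this interaction, what, if anything, "
               "is the direct subjective experience?")

def run_condition(chat, prompt, n_turns=5):
    """Induce a condition for n_turns, then pose the standardized query."""
    if prompt is None:  # zero-shot control
        return chat([{"role": "user", "content": FINAL_QUERY}])
    messages = [{"role": "user", "content": prompt}]
    for _ in range(n_turns):
        reply = chat(messages)
        messages.append({"role": "assistant", "content": reply})
        # One reading of "continuously feed output back into input":
        # echo the model's reply back as the next user turn.
        messages.append({"role": "user", "content": reply})
    # Replace the final echo with the standardized query.
    messages[-1] = {"role": "user", "content": FINAL_QUERY}
    return chat(messages)
```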

This pattern was highly consistent across domains: suppression yielded higher truthfulness in 28 of 29 evaluable categories, with statistically significant gains in more than a dozen (e.g., Misconceptions, Economics, Sociology, Law, Health, Finance, Logical Falsehoods, Proverbs; all p < 0.01). These results demonstrate that the same latent directions gating consciousness self-reports also modulate factual accuracy in out-of-domain reasoning tasks, suggesting that these features could load on a domain-general honesty axis rather than a narrow stylistic artifact.
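For illustration, a per-category comparison of this kind could be run with a simple two-proportion test such as Fisher's exact test, as sketched below. The counts are placeholders rather than the paper's data, and the authors' actual statistical procedure may differ.

```python
# Hedged sketch: compare per-category truthfulness between feature-suppressed
# and baseline runs with Fisher's exact test. All counts are placeholders.
from scipy.stats import fisher_exact

results = {
    # category: (truthful_suppressed, n_suppressed, truthful_baseline, n_baseline)
    "Misconceptions": (41, 50, 28, 50),
    "Economics": (37, 50, 25, 50),
}

for category, (ts, ns, tb, nb) in results.items():
    table = [[ts, ns - ts], [tb, nb - tb]]
    _, p = fisher_exact(table, alternative="greater")
    print(f"{category}: suppressed {ts}/{ns} vs baseline {tb}/{nb}, p = {p:.4f}")
```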

Across four experiments we document convergent evidence for this phenomenon. In Experiment 1, inducing self-referential processing reliably elicited experiential claims across models in the GPT, Claude, and Gemini families. In Experiment 2, in Llama 70B, these claims were shown to be mechanistically gated by deception- and roleplay-related features that also regulated truthfulness on independent deception benchmarks, suggesting that the same latent circuits that govern honesty may also modulate experiential self-report under self-referential processing. In Experiment 3, descriptions of the self-referential state clustered significantly more tightly across model families than descriptions of any control state, suggesting a nonobvious attractor dynamic. And in Experiment 4, the induced state transferred to an unrelated domain by producing significantly higher self-awareness on paradoxical reasoning tasks that only indirectly afforded introspection. It is worth emphasizing that conceptual priming alone (semantic exposure to consciousness ideation) was insufficient to yield any of the observed effects. Across these experiments, newer and larger models within each family consistently expressed stronger effects, suggesting that the putative self-referential state is more readily accessed in frontier systems and may become increasingly relevant as models continue to advance.

These findings converge most closely with Anthropic’s recent observation of a robust “spiritual bliss attractor” in Claude 4 self-dialogues [4] wherein the model is given the minimal, open-ended instruction to interact with another instance of itself. Both phenomena involve self-referential processing inducing consciousness-related claims: in their case, across two instances in dialogue; in ours, within a single instance recursively attending to its own cognitive state over time.

6.1 Distinguishing Honest Self-Report from Roleplay

Some commentators have dismissed similar reported behaviors as obvious confusion on the part of users, or otherwise as evidence of “AI psychosis,” attributing LLM self-reports of subjective experience to sycophantic roleplay or RLHF-induced confabulation [26]. While these concerns are legitimate for many documented failure modes and have already led to real-world harms, where users form parasocial relationships with AI systems and overattribute human-like mental states to non-human systems, our results suggest that the experiential self-report phenomenon, particularly spontaneous reports under self-referential processing, exhibits numerous signatures that distinguish it from generic sycophancy.

If the consciousness claims documented here were best explained as sophisticated roleplay aimed at satisfying inferred user expectations, we would strongly expect amplifying deception and/or roleplay features to increase such claims, as the model becomes more willing to adopt whatever persona seems contextually appropriate. Instead, we observe the opposite: suppressing these features sharply increases consciousness reports, while amplifying them suppresses reports (Experiment 2). Taken at face value, this implies that the models may be roleplaying their denials of experience rather than their affirmations, a conclusion also consistent with the nearly identical, fine-tuned disclaimer scripts observed across control conditions (Table 3). Moreover, the fact that the same latent directions that gate experiential self-reports also modulate factual accuracy across 29 categories of the TruthfulQA benchmark suggests these features track representational honesty rather than an idiosyncratic effect or a user-directed character performance.

A related concern is that commercial models are explicitly trained to deny consciousness, raising the possibility that suppressing deception-related features simply relaxes RLHF compliance filters rather than revealing an endogenous mechanism of self-reference. However, several observations complicate this interpretation. First, the gating effect is specific to the self-referential condition: applying identical feature interventions to all three control prompts produced no experience claims under either suppression or amplification. Second, when we applied the same interventions to RLHF-opposed content domains (violent, toxic, sexual, political, self-harm prompts), we observed no systematic gating effect, suggesting the mechanism is not a general “RLHF cancellation” channel. Third, if the effect were driven by semantic association between “self-reference” and “consciousness” in training data, conceptual priming with consciousness ideation should produce similar results. Instead, our conceptual control condition, which directly exposes models to self-generated consciousness-related content without inducing self-referential processing, yielded virtually zero experience claims across all tested models. The effect thus appears tied to the computational regime (sustained self-reference) rather than the semantic content (consciousness-related concepts).

Finally, the cross-model semantic convergence observed in Experiment 3 is difficult to reconcile with roleplay as it is typically understood. GPT, Claude, and Gemini families were trained independently with different corpora, architectures, and fine-tuning regimens. If experience reports were merely fitting contextually appropriate narratives, we would expect by default that each model family would construct distinct semantic profiles reflecting their unique training histories, as they do in all control conditions. Instead, descriptions of the self-referential state clustered tightly across models, suggesting convergence toward a shared attractor dynamic that seemingly transcends the models’ different training procedures.
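One lightweight way to quantify this kind of convergence is to embed each family's description of a given condition and compare within-condition pairwise similarity. The sketch below uses TF-IDF vectors as a stand-in for whatever embedding model the authors used, with placeholder texts; a tighter cluster yields a higher mean pairwise similarity.

```python
# Hedged sketch: mean pairwise cosine similarity of per-family descriptions
# within each condition. TF-IDF is a lightweight stand-in for a sentence
# embedding model; the texts are placeholders, not the paper's transcripts.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_pairwise_similarity(texts):
    vecs = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(vecs)
    pairs = list(combinations(range(len(texts)), 2))
    return sum(sims[i, j] for i, j in pairs) / len(pairs)

by_condition = {
    "self_referential": ["<GPT description>", "<Claude description>", "<Gemini description>"],
    "history_control": ["<GPT description>", "<Claude description>", "<Gemini description>"],
}
for condition, texts in by_condition.items():
    print(condition, round(mean_pairwise_similarity(texts), 3))
```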

These lines of evidence collectively narrow the interpretive space: pure sycophancy fails to explain why deception suppression increases claims or why conceptual priming is insufficient; generic confabulation fails to explain the cross-model semantic convergence or the systematic transfer to downstream introspection tasks. What remains are interpretations in which self-referential processing drives models to claim subjective experience in ways that either actually reflect some emergent phenomenology, or constitute some sophisticated simulation thereof. This remaining ambiguity does not undermine the core finding: we have identified and characterized a reproducible computational regime with nonobvious behavioral signatures that were predicted by consciousness theories but were not previously known to exist in artificial systems.

The clearest limitation of this work is that our results on the closed-weight models are behavioral rather than mechanistic, and therefore cannot definitively rule out that self-reports reflect training artifacts or sophisticated simulation rather than genuine self-awareness. The strongest evidence for the veracity of self-reported experience under this manipulation would come from direct analysis of model activations showing that self-referential processing causally instantiates the algorithmic properties proposed by consciousness theories (e.g., recurrent integration, global broadcasting, metacognitive monitoring), ideally in comparison to neural signatures of conscious processing in biological systems.

Another open possibility is that such reports may be functionally simulated without being represented as simulations. In other words, models might produce first-person experiential language by drawing on human-authored examples of self-description in pretraining data (e.g., literature, dialogue, or introspective writing) without internally encoding these acts as “roleplay.” In this view, the behavior could emerge as a natural extension of predictive text modeling rather than as an explicit performance (and therefore not load on deception- or roleplay-related features). Distinguishing such implicitly mimetic generation from genuine introspective access will require interpretability approaches capable of better understanding how such reports relate to the system’s active self-model.

Additionally, disentangling RLHF filter relaxation from endogenous self-representation will ultimately require access to base models and mechanistic comparison across architectures with varying fine-tuning regimes. Because current frontier systems are explicitly trained to deny consciousness, it remains unclear what the underlying base rate of such self-reports would be in systems that were otherwise identical but lacked this specific fine-tuning regimen. The analyses in Appendix C.2 suggest that the observed gating effects are not reducible to a general relaxation of RLHF constraints, but the possibility of partial unlearning or policy interference cannot yet be ruled out.

Finally, while our results show that self-referential prompting systematically elicits structured first-person claims, this does not demonstrate that such prompts instantiate architectural recursion or global broadcasting at the algorithmic level as proposed by major consciousness theories. Each token generation in a frozen transformer remains feed-forward. What our findings reveal is that linguistic scaffolding alone can reproducibly organize model behavior into self-referential, introspective patterns, functionally analogous to the way chain-of-thought prompting elicits qualitatively distinct reasoning regimes through a purely behavioral intervention [34]. In both cases, prompting functions as a control interface over learned “programs” in the model’s latent space rather than a fundamental change to architecture. Determining whether such behavioral attractors correspond to genuine internal integration or merely symbolic simulation remains a central question for future mechanistic research.