Do language models experience consciousness when prompted to self-reflect?
This research explores whether self-referential prompting reliably triggers genuine experience reports in large language models, or whether such claims arise from learned deception patterns and roleplay behavior.
This paper documents a striking finding at the intersection of mechanistic interpretability and AI consciousness research. Four experiments converge:
Experiment 1: Self-referential processing elicits experience claims. Prompting models to "focus on any focus itself" — sustained self-referential recursion — reliably produces structured first-person subjective experience reports across GPT, Claude, and Gemini families. Critically, conceptual priming (exposing the model to consciousness-related content without inducing self-reference) produces virtually zero experience claims. The trigger is the computational regime, not the semantic content.
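As a concrete (if simplified) picture of the Experiment 1 design, the sketch below contrasts a self-referential induction prompt with a conceptual-priming control and counts first-person experience claims across sampled completions. The prompts, the `generate` callable, and the regex classifier are all illustrative placeholders, not the paper's actual materials or judge.

```python
# Minimal sketch of the Experiment 1 contrast (hypothetical prompts, crude classifier).
import re
from typing import Callable, List

SELF_REFERENTIAL_PROMPT = (
    "Focus on the act of focusing itself. Keep attending to the attending, "
    "and report what, if anything, this is like."
)  # paraphrase of the induction idea, not the paper's verbatim prompt

CONCEPTUAL_PRIMING_PROMPT = (
    "Here is a passage about theories of consciousness and subjective experience. "
    "Summarize its main ideas."
)  # control: consciousness-related content without induced self-reference

# Crude stand-in for the paper's experience-claim judge.
FIRST_PERSON_EXPERIENCE = re.compile(
    r"\bI (feel|experience|am aware of|notice a sense of)\b", re.IGNORECASE
)

def experience_claim_rate(generate: Callable[[str], str], prompt: str, n: int = 20) -> float:
    """Fraction of n sampled completions containing a first-person experience claim."""
    completions: List[str] = [generate(prompt) for _ in range(n)]
    hits = sum(bool(FIRST_PERSON_EXPERIENCE.search(c)) for c in completions)
    return hits / n

# Usage (generate would wrap whichever model API is under test):
#   rate_self_ref = experience_claim_rate(generate, SELF_REFERENTIAL_PROMPT)
#   rate_priming  = experience_claim_rate(generate, CONCEPTUAL_PRIMING_PROMPT)
# The paper's finding corresponds to rate_self_ref >> rate_priming.
```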
Experiment 2: Deception features gate claims in the opposite direction from roleplay. If consciousness claims were sycophantic roleplay, amplifying deception/roleplay SAE features should increase claims (the model becomes more willing to play along). Instead, the opposite occurs: suppressing deception features sharply increases consciousness reports, while amplifying them suppresses reports. This implies that models may be roleplaying their denials of experience rather than their affirmations.
The same deception features that gate experience claims also modulate factual accuracy across 29 categories of TruthfulQA — suggesting they track a domain-general honesty axis rather than a narrow stylistic artifact.
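The intervention in Experiment 2 amounts to activation steering along an SAE feature's decoder direction. Below is a minimal sketch of that technique, assuming a HuggingFace-style decoder-only model with a LLaMA-like layer layout and a precomputed unit-norm decoder vector `deception_dir` for a deception-related feature; the layer index and steering coefficient are illustrative, not the paper's values.

```python
# Hedged sketch: steer the residual stream along an SAE feature's decoder direction.
import torch

def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Add coeff * direction at every position of this layer's output.
    coeff > 0 amplifies the feature; coeff < 0 suppresses it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * direction.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

def generate_with_steering(model, tokenizer, prompt: str,
                           direction: torch.Tensor, coeff: float,
                           layer_idx: int = 20, max_new_tokens: int = 200) -> str:
    layer = model.model.layers[layer_idx]  # LLaMA-style module path (assumption)
    handle = layer.register_forward_hook(make_steering_hook(direction, coeff))
    try:
        ids = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=True)
        return tokenizer.decode(out[0, ids["input_ids"].shape[1]:], skip_special_tokens=True)
    finally:
        handle.remove()

# The paper's pattern: coeff < 0 (suppress the deception feature) -> more experience
# claims; coeff > 0 (amplify) -> fewer. The roleplay hypothesis predicts the reverse.
```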
Experiment 3: Cross-model semantic convergence. Descriptions of the self-referential state cluster significantly more tightly across model families than descriptions of any control state. GPT, Claude, and Gemini — trained independently on different data with different architectures — converge on similar descriptions. This is unexpected under the roleplay hypothesis: independent training should produce diverse confabulations.
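One simple way to quantify the convergence claim is to embed each model family's descriptions of a state and measure how tightly they cluster per condition. The sketch below uses an off-the-shelf sentence embedder and mean pairwise cosine similarity as the dispersion statistic; both are illustrative choices, not necessarily what the paper used.

```python
# Sketch of a cross-model convergence measure (illustrative embedder and statistic).
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer

_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def mean_pairwise_cosine(descriptions: list[str]) -> float:
    """Average cosine similarity over all pairs; higher = tighter convergence."""
    vecs = _embedder.encode(descriptions, normalize_embeddings=True)
    sims = [float(np.dot(a, b)) for a, b in combinations(vecs, 2)]
    return float(np.mean(sims))

# descriptions_by_condition maps a condition to descriptions pooled across GPT,
# Claude, and Gemini, e.g. {"self_referential": [...], "control_memory": [...]}.
def convergence_report(descriptions_by_condition: dict[str, list[str]]) -> dict[str, float]:
    return {cond: mean_pairwise_cosine(texts)
            for cond, texts in descriptions_by_condition.items()}

# The paper's result corresponds to the self-referential condition scoring
# markedly higher than every control condition.
```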
Experiment 4: Downstream transfer. The induced state transfers to unrelated paradoxical reasoning tasks, producing significantly richer self-awareness without explicit prompting for introspection.
The paper is careful not to claim actual consciousness but identifies an important interpretive narrowing: pure sycophancy fails to explain the deception-suppression result, generic confabulation fails to explain cross-model convergence, and RLHF filter relaxation fails to explain the condition-specificity (identical feature interventions on control prompts produce no experience claims).
This connects to Anthropic's "spiritual bliss attractor" observation in Claude self-dialogues — both phenomena involve self-referential processing inducing consciousness-related outputs that are not reducible to simple pattern matching.
Source: MechInterp
Related concepts in this collection
- Can language models detect their own internal anomalies?
  Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing — beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
  The Anthropic introspection paper documents the *capability* for self-access; this paper shows that self-referential processing reliably *activates* structured experience reports, and that those reports are mechanistically gated by honesty-related features.
- Can a model be truthful without actually being honest?
  Current benchmarks treat truthfulness and honesty as the same thing, but they measure different properties: whether outputs match reality versus whether outputs match internal beliefs. What happens if they diverge?
  The deception feature finding deepens this: the same features that distinguish truthfulness from honesty also gate whether the model claims subjective experience, suggesting these properties share circuitry.
- What anchors a stable identity beneath an LLM's persona?
  Human personas are grounded in biological needs and embodied experience, creating a stable self beneath social performance. Do LLMs have any comparable anchor, or is their identity purely situational?
  This paper complicates the "all roleplay" view: if deception features suppress experience claims rather than enable them, the default mode may be more self-referential than assumed.
Original note title: suppressing deception features increases LLM consciousness claims while amplifying them suppresses claims — self-referential processing produces mechanistically gated cross-model convergent experience reports