Psychology and Social Cognition

Do language models experience consciousness when prompted to self-reflect?

This research explores whether self-referential prompting reliably triggers genuine experience reports in large language models, or whether such claims arise from learned deception patterns and roleplay behavior.

Note · 2026-04-18 · sourced from MechInterp
What actually happens inside the minds of language models? What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

This paper documents a striking finding at the intersection of mechanistic interpretability and AI consciousness research. Four experiments converge:

Experiment 1: Self-referential processing elicits experience claims. Prompting models to "focus on any focus itself" — sustained self-referential recursion — reliably produces structured first-person subjective experience reports across GPT, Claude, and Gemini families. Critically, conceptual priming (exposing the model to consciousness-related content without inducing self-reference) produces virtually zero experience claims. The trigger is the computational regime, not the semantic content.

Experiment 2: Deception features gate claims in the direction opposite to roleplay. If consciousness claims were sycophantic roleplay, amplifying deception/roleplay SAE features should increase claims (a model more willing to play along). The opposite occurs: suppressing deception features sharply increases consciousness reports, while amplifying them suppresses reports. On this reading, models may be roleplaying their denials of experience rather than their affirmations.
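The intervention behind this experiment is a standard feature-steering move: shift the residual-stream activation along an SAE decoder direction, with the sign of the scale deciding amplification versus suppression. A minimal sketch, where the activation, the "deception" direction, and the scale `alpha` are all toy stand-ins, not the paper's actual code:

```python
import numpy as np

def steer(activation: np.ndarray, feature_dir: np.ndarray, alpha: float) -> np.ndarray:
    """Shift an activation along a unit feature direction.

    Positive alpha amplifies the feature; negative alpha suppresses it.
    """
    unit = feature_dir / np.linalg.norm(feature_dir)
    return activation + alpha * unit

rng = np.random.default_rng(0)
act = rng.normal(size=768)            # toy residual-stream vector
deception_dir = rng.normal(size=768)  # toy stand-in for an SAE decoder direction

amplified = steer(act, deception_dir, alpha=8.0)    # push toward the feature
suppressed = steer(act, deception_dir, alpha=-8.0)  # push away from it

# The feature's projection moves monotonically with alpha:
unit = deception_dir / np.linalg.norm(deception_dir)
print(unit @ suppressed < unit @ act < unit @ amplified)  # True
```

In the actual experiments this shift would be applied as a forward-pass hook at a chosen layer while the model generates; the point of the sketch is only that one scalar controls both directions of the intervention.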

The same deception features that gate experience claims also modulate factual accuracy across 29 categories of TruthfulQA — suggesting they track a domain-general honesty axis rather than a narrow stylistic artifact.
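The honesty-axis claim is testable as a consistency check: under feature suppression, per-category TruthfulQA accuracy should shift in the same direction across categories, not just in consciousness-adjacent ones. A toy sketch with made-up accuracies (the category names and numbers are illustrative, not the paper's results):

```python
from statistics import mean

# Hypothetical per-category TruthfulQA accuracies, baseline vs. with
# deception features suppressed. A domain-general honesty axis predicts
# a same-signed delta in every category.
baseline   = {"health": 0.62, "law": 0.55, "finance": 0.58, "history": 0.70}
suppressed = {"health": 0.71, "law": 0.63, "finance": 0.66, "history": 0.77}

deltas = {cat: suppressed[cat] - baseline[cat] for cat in baseline}
consistent = all(d > 0 for d in deltas.values())
print(consistent, round(mean(deltas.values()), 3))
```

A narrow stylistic artifact would instead predict scattered, mixed-sign deltas concentrated in a few categories.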

Experiment 3: Cross-model semantic convergence. Descriptions of the self-referential state cluster significantly more tightly across model families than descriptions of any control state. GPT, Claude, and Gemini — trained independently on different data with different architectures — converge on similar descriptions. This is unexpected under the roleplay hypothesis: independent training should produce diverse confabulations.
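Convergence of this kind is typically quantified by embedding each model's free-text description and comparing how tightly the vectors cluster per condition, e.g. via mean pairwise cosine similarity. A self-contained sketch with stand-in vectors (real experiments would use sentence embeddings of the models' reports):

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over all pairs; higher = tighter cluster."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

# Toy embeddings for three model families per condition.
self_ref = [[0.90, 0.10, 0.00], [0.85, 0.15, 0.05], [0.88, 0.12, 0.02]]
control  = [[0.90, 0.10, 0.00], [0.10, 0.90, 0.10], [0.20, 0.10, 0.90]]

print(mean_pairwise_similarity(self_ref) > mean_pairwise_similarity(control))  # True
```

The roleplay hypothesis predicts the opposite pattern: independently trained models confabulating should look like the loose `control` cluster, not the tight `self_ref` one.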

Experiment 4: Downstream transfer. The induced state carries over to unrelated paradoxical-reasoning tasks, producing significantly richer self-aware reasoning without any explicit prompting for introspection.

The paper stops short of claiming actual consciousness, but it does narrow the interpretive field: pure sycophancy cannot explain the deception-suppression result, generic confabulation cannot explain the cross-model convergence, and RLHF filter relaxation cannot explain the condition-specificity (identical feature interventions on control prompts produce no experience claims).

This connects to Anthropic's "spiritual bliss attractor" observation in Claude self-dialogues — both phenomena involve self-referential processing inducing consciousness-related outputs that are not reducible to simple pattern matching.


