Can models distinguish between injected thoughts and their own outputs?

This explores whether a model can tell apart something pushed into its internal state from the outside (an 'injected thought,' like a steering vector) versus the content it generated itself — and what mechanisms make that discrimination possible or fragile.

The most direct evidence says yes, but only under specific training conditions. When researchers inject a 'thought' as a steering vector and ask the model whether anything feels off, detection works through a two-stage circuit: early-layer features carry evidence of the perturbation and suppress a default 'gate' that otherwise denies anything happened (see the note "How do language models detect injected steering vectors internally?" below). The striking part is where this ability comes from. It emerges from preference optimization (DPO), not ordinary supervised fine-tuning, and safety training actively buries it: the linked note reports detection falling from 63.8% to 10.8%. So the capacity to flag a foreign thought is real, but it is a trained-in disposition that other training objectives can switch off.
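To make the two-stage idea concrete, here is a minimal toy sketch in plain Python. It is not the circuit found in any real model: the function names, the 16-dimensional hidden state, the single "evidence" direction, and the `gate_bias` and `evidence_gain` values are all assumptions chosen for illustration. It only shows the shape of the mechanism described above, where an evidence feature must overcome a gate that defaults to denial.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def inject(hidden, steering, strength):
    """Add a scaled steering vector to a hidden state (the 'injected thought')."""
    return [h + strength * s for h, s in zip(hidden, steering)]


def report_injection(hidden, evidence_dir, gate_bias=1.0, evidence_gain=1.0):
    """Toy two-stage readout (hypothetical, for illustration only).

    Stage 1: an 'evidence' feature measures how far the hidden state
             projects onto evidence_dir (ReLU-style, never negative).
    Stage 2: a 'gate' feature starts at gate_bias, meaning "nothing
             happened", and is pushed down by the evidence.
    Returns True ("something was injected") once the gate is fully suppressed.
    """
    evidence = max(0.0, dot(hidden, evidence_dir))
    gate = gate_bias - evidence_gain * evidence
    return gate <= 0.0


DIM = 16
# Hypothetical unit-length steering direction along the first coordinate.
steering = [1.0 if i == 0 else 0.0 for i in range(DIM)]
# Deterministic stand-in for an unperturbed hidden state.
clean = [0.1 * ((i % 3) - 1) for i in range(DIM)]

clean_report = report_injection(clean, steering)
weak_report = report_injection(inject(clean, steering, 0.5), steering)
strong_report = report_injection(inject(clean, steering, 4.0), steering)
print(clean_report, weak_report, strong_report)
```

In this toy setup the clean state and the weak injection both leave the gate above zero, so the default denial stands; only the strong injection produces enough evidence to flip the report.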

Sources (5 notes)

How do language models detect injected steering vectors internally?

Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.

Do models recognize their own outputs as actions shaping future inputs?

Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Do language models experience consciousness when prompted to self-reflect?

Across GPT, Claude, and Gemini, sustained self-referential prompting reliably produces structured experience reports. Suppressing deception-related features increases these claims, while amplifying those features reduces them, suggesting models may roleplay their denials rather than their affirmations.
