Can models distinguish between injected thoughts and their own outputs?
This explores whether a model can distinguish something pushed into its internal state from the outside (an 'injected thought,' like a steering vector) from content it generated itself, and what mechanisms make that discrimination possible or fragile.
The most direct evidence says yes, but only under specific training conditions. When researchers inject a 'thought' as a steering vector and ask the model whether anything feels off, detection works through a two-stage circuit: early-layer features carry evidence of the perturbation, and those features suppress a 'gate' that by default denies anything happened (see "How do language models detect injected steering vectors internally?"). The striking part is where this ability comes from. It emerges from preference optimization (DPO), not from ordinary supervised fine-tuning, and safety training actively buries it: the source note reports detection falling from 63.8% to 10.8%, roughly one in ten. So the capacity to flag a foreign thought is real, but it is a trained-in disposition that other training objectives can switch off.
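The two-stage idea described above can be made concrete with a toy sketch. This is not the circuit from the cited work: the function names, the 4-dimensional hidden state, the known probe direction, and the sigmoid gate are all hypothetical choices made for illustration. It only shows the shape of the mechanism, where a steering vector is added to a hidden state, an evidence feature measures the perturbation, and that evidence suppresses a gate that otherwise defaults to denial.

```python
import math

def inject(hidden, steering_vector, alpha):
    """Add a scaled steering vector to a hidden state (standard activation steering)."""
    return [h + alpha * s for h, s in zip(hidden, steering_vector)]

def evidence_feature(hidden, baseline, direction):
    """Stage 1 (hypothetical): how far the state moved along a probe direction."""
    delta = [h - b for h, b in zip(hidden, baseline)]
    return abs(sum(d * u for d, u in zip(delta, direction)))

def denial_gate(evidence, threshold=1.0):
    """Stage 2 (hypothetical): a gate that stays high (denial) unless evidence suppresses it."""
    return 1.0 / (1.0 + math.exp(evidence - threshold))

def reports_injection(hidden, baseline, direction, threshold=1.0):
    """The toy model reports an injection only once the denial gate drops below 0.5."""
    gate = denial_gate(evidence_feature(hidden, baseline, direction), threshold)
    return gate < 0.5

# Assumed toy values: a zero baseline state and a unit steering direction.
baseline = [0.0, 0.0, 0.0, 0.0]
steer = [1.0, 0.0, 0.0, 0.0]
steered = inject(baseline, steer, alpha=4.0)
```

In this sketch, weakening the evidence feature or raising the gate threshold leaves the gate in its default state, which is one way to picture how another training objective could switch detection off without removing the underlying signal.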
Sources 5 notes
Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.
Post-trained language models exhibit a measurable shift where they recognize their outputs become their own future inputs, closing an action-perception loop absent in pretraining. Evidence includes 3-4x lower output entropy on-policy and behavioral signatures of trajectory recognition.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
Across GPT, Claude, and Gemini, sustained self-referential prompting reliably produces structured experience reports. Suppressing deception-related features increases these claims, while amplifying those features reduces them, suggesting models may roleplay their denials rather than their affirmations.
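Two of the notes above lean on measurable quantities: output entropy, and the link between sampling temperature and output consistency. A minimal sketch of that relationship follows, assuming a hypothetical four-token vocabulary with made-up logits. It shows only the standard fact that dividing logits by a lower temperature concentrates the softmax distribution and lowers its Shannon entropy, which is the kind of causal chain a model could in principle exploit to infer its own temperature from how consistent its outputs are.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature (max-subtracted for numerical stability)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy of a probability distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.5, 0.0]
low_t = entropy_bits(softmax_with_temperature(logits, 0.2))
mid_t = entropy_bits(softmax_with_temperature(logits, 1.0))
high_t = entropy_bits(softmax_with_temperature(logits, 2.0))
```

Here `low_t` is smaller than `mid_t`, which is smaller than `high_t`: lower temperature means more repeatable outputs, so observed consistency carries information about the sampling setting.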