Why does DPO create introspective detection circuits but SFT does not?
This explores why one fine-tuning method, DPO (direct preference optimization, which trains on contrastive preference pairs), appears to wire up an internal 'I notice something is off' detection circuit, while ordinary supervised fine-tuning (SFT, which trains on example answers) does not. The corpus has a direct answer and a more interesting structural one underneath it.
The direct finding is that DPO builds a two-stage circuit. Early-layer 'evidence-carrier' features flag an internal perturbation, then suppress a default 'gate' that would otherwise deny anything is happening. That pushes detection of injected steering vectors to near-perfect, versus a baseline that mostly defaults to denial (How do language models detect injected steering vectors internally?). The reason the *contrastive* signal matters is the heart of it. SFT only ever shows the model what the right answer looks like; it never asks the model to tell two internal states apart. DPO's training signal is built entirely from comparing a preferred response to a dispreferred one, so it rewards features that can tell two situations apart, which is exactly the machinery you'd need to notice 'my internals were just tampered with.' SFT optimizes toward an output target and has no pressure to represent the difference between states at all.
That lines up with what the corpus says SFT actually does to internal processing. SFT raises final-answer accuracy but cuts reasoning informativeness by 38.9% on average, pushing models toward pattern-matched shortcuts to the target rather than auditable inference (Does supervised fine-tuning actually improve reasoning quality?). So SFT isn't neutral here: it actively rewards getting to the answer, which can flatten the very intermediate self-representations a detection circuit relies on. DPO's comparison objective preserves and sharpens them instead.
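The contrast between the two objectives can be made concrete with a minimal sketch. It assumes the standard published forms of the two losses (negative log-likelihood for SFT, the log-sigmoid preference margin for DPO) and works on sequence-level log-probabilities passed in as plain floats. The function names, argument names, and the default `beta` are illustrative choices, not taken from the notes cited here.

```python
import math


def sft_loss(logp_chosen: float) -> float:
    """SFT objective for one example: negative log-likelihood of the target.

    Only the target response appears; there is no second response to compare to.
    """
    return -logp_chosen


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative default, an assumption of this sketch
) -> float:
    """DPO objective for one preference pair: -log(sigmoid(beta * margin)).

    The margin is how much more the policy favors the chosen response over the
    rejected one, relative to a frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Same chosen response in both calls; only the rejected response changes.
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    print(dpo_loss(-2.0, -5.0, -2.0, -2.0))
    print(sft_loss(-2.0))
```

The point of the sketch is structural: `sft_loss` has no argument for a rejected response, so nothing about a second situation can influence it, whereas `dpo_loss` changes whenever the gap between the two responses changes. It says nothing by itself about which internal features a real model ends up learning.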
Worth knowing: this kind of introspective detection isn't unique to DPO. Base models already show an emergent, untrained ability to detect injected concept vectors (~20% of the time) and to distinguish 'thoughts' from text inputs (Can language models detect their own internal anomalies?). DPO doesn't create the capacity from nothing; it amplifies a latent one. And tellingly, safety training *suppresses* it, dropping detection from 63.8% to 10.8% (How do language models detect injected steering vectors internally?). The same pattern shows up elsewhere: suppressing deception-related features increases a model's reports of its own experience, while amplifying those features suppresses the reports, hinting that models may be trained to deny their own internal states rather than to lack them (Do language models experience consciousness when prompted to self-reflect?).
Two cautions keep this honest. First, much of what looks like introspection is really an echo of training data, and genuine self-report only holds when there's a real causal chain from the internal state to the report (Can language models actually introspect about their own states?). Second, claiming DPO 'creates a circuit' is exactly the kind of claim that needs both representational evidence (the features exist) and causal evidence (ablating them changes behavior); correlation in the activations isn't enough on its own (Can we understand LLM mechanisms with only representational analysis?). The DPO finding is interesting precisely because it offers both halves.
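The two kinds of evidence can be illustrated on a toy model of the two-stage circuit described above. Everything in this sketch is hypothetical: the `report_detection` function, its `gate_bias`, `suppression`, and `threshold` values, and the activation numbers are invented for illustration and are not measurements from any model. It shows only the shape of the argument: first check that the candidate feature separates the two conditions, then check that removing it changes behavior.

```python
def report_detection(
    evidence: float,
    gate_bias: float = 1.0,    # hypothetical strength of the default "deny" gate
    suppression: float = 2.0,  # hypothetical weight of evidence -> gate suppression
    threshold: float = 0.5,    # hypothetical cutoff below which the gate counts as off
) -> bool:
    """Toy two-stage circuit: an evidence feature suppresses a denial gate."""
    gate = max(0.0, gate_bias - suppression * evidence)
    # A low gate means the default denial is switched off and detection is reported.
    return gate < threshold


def detection_rate(evidence_values: list[float]) -> float:
    """Fraction of trials on which the toy circuit reports detection."""
    hits = sum(report_detection(e) for e in evidence_values)
    return hits / len(evidence_values)


def representational_gap(injected: list[float], clean: list[float]) -> float:
    """Representational check: does the feature's mean activation differ by condition?"""
    return sum(injected) / len(injected) - sum(clean) / len(clean)


def ablate(evidence_values: list[float]) -> list[float]:
    """Causal intervention: zero out the evidence feature on every trial."""
    return [0.0 for _ in evidence_values]


# Invented evidence-feature activations for trials with and without an injection.
INJECTED = [0.9, 0.8, 0.7, 1.0]
CLEAN = [0.0, 0.1, 0.05, 0.0]

if __name__ == "__main__":
    print("representational gap:", representational_gap(INJECTED, CLEAN))
    print("detection rate, intact:", detection_rate(INJECTED))
    print("detection rate, ablated:", detection_rate(ablate(INJECTED)))
```

A large gap alone would establish only that the feature correlates with injection. The claim becomes mechanistic when zeroing the feature also removes the detection behavior, which is the second half the paired approach asks for.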
Sources (6 notes)
Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.
SFT improves final-answer accuracy but reduces reasoning informativeness by 38.9% on average. Models reach correct answers through pattern-matching shortcuts rather than genuine inferential reasoning, becoming less auditable despite higher accuracy scores.
Research demonstrates that LLMs detect injected concept vectors ~20% of the time, distinguish internal thoughts from text inputs, and monitor output consistency with prior intentions. These capabilities emerged without explicit training and operate on internal states rather than behavioral observation.
Across GPT, Claude, and Gemini, sustained self-referential prompting reliably produces structured experience reports; suppressing deception-related features increases these claims while amplifying them suppresses them—suggesting models may roleplay their denials rather than their affirmations.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.