Do models know what they don't know?
Can language models develop internal representations that track their own knowledge boundaries? This matters because understanding self-knowledge mechanisms could explain how models choose between hallucination and refusal.
Using sparse autoencoders (SAEs) on Gemma 2 (2B and 9B), researchers discovered that models develop internal representations of whether they "know" an entity, a form of self-knowledge about their own capabilities. These entity-recognition directions in representation space track whether an entity is one the model can recall facts about (e.g., whether it knows a specific athlete or movie).
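An SAE of this kind decomposes a residual-stream activation into a sparse, overcomplete set of feature activations and reconstructs it from them. A minimal numpy sketch, with toy dimensions and randomly initialized weights standing in for a trained SAE (real SAEs are trained on model activations, and the names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes; real SAEs use d_sae >> d_model

# Randomly initialized weights; in practice these are trained to
# reconstruct residual-stream activations under a sparsity penalty.
W_enc = rng.normal(size=(d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) * 0.1
b_dec = np.zeros(d_model)

def sae_encode(h):
    # ReLU yields non-negative, sparse feature activations
    return np.maximum(0.0, W_enc @ h + b_enc)

def sae_decode(f):
    return W_dec @ f + b_dec

h = rng.normal(size=d_model)   # a residual-stream activation
f = sae_encode(h)              # sparse feature activations
h_hat = sae_decode(f)          # reconstruction of h
# A "known entity" feature would be one unit of f that fires when
# the entity in context is one the model can recall facts about.
```

The entity-recognition directions discussed above correspond to individual columns of the decoder matrix (`W_dec[:, i]` for feature `i`).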
The key finding is causal steering: these directions don't just correlate with knowledge — they actively control behavior. Activating entity recognition features can steer the model to refuse questions about entities it actually knows, or to hallucinate attributes of unknown entities when it would otherwise refuse. This makes entity recognition a mechanistic gatekeeper for the hallucination-refusal trade-off.
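Mechanistically, this kind of causal steering amounts to adding a feature's decoder direction to the residual stream at inference time, scaled up or down. A hedged sketch (hook point, scale, and the `unknown_dir` vector are illustrative, not the paper's actual values):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

# Decoder direction of a hypothetical "unknown entity" feature,
# normalized to unit length.
unknown_dir = rng.normal(size=d_model)
unknown_dir /= np.linalg.norm(unknown_dir)

def steer(h, direction, alpha):
    """Add alpha * direction to a residual-stream activation h.

    alpha > 0 pushes the representation toward "unknown entity"
    (inducing refusal on entities the model knows); alpha < 0
    suppresses it (inducing attribute guesses on unknown entities).
    """
    return h + alpha * direction

h = rng.normal(size=d_model)                    # activation mid-forward-pass
h_refuse = steer(h, unknown_dir, alpha=8.0)     # toward refusal
h_halluc = steer(h, unknown_dir, alpha=-8.0)    # toward hallucination
```

In a real intervention this addition would be applied via a forward hook at a chosen layer for every token position, but the arithmetic is exactly this.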
The most striking implication: the SAEs were trained on the base model using pre-training data, yet the discovered directions have a causal effect on the chat model's refusal behavior — a behavior that was incentivized during finetuning, not pre-training. This provides evidence that chat finetuning repurposes existing mechanisms rather than creating new ones, consistent with the hypothesis that post-training reshapes rather than builds.
This connects to several existing threads:
- Can a model be truthful without actually being honest? — entity recognition adds a third mechanistic dimension beyond truthfulness and honesty: self-knowledge about what the model can be truthful about
- Can any computable LLM truly avoid hallucinating? — entity recognition provides a partial mitigation pathway: models that know what they don't know can refuse rather than fabricate
- Do language models actually use their encoded knowledge? — entity recognition is the counter-case: these representations do causally influence generation, specifically refusal behavior
- Can language models detect their own internal anomalies? — entity recognition as a specific instance of introspective awareness with clear causal mechanism
Original note title: Entity recognition is a self-knowledge mechanism that causally steers hallucination and refusal — chat finetuning repurposes base model entity awareness