Psychology and Social Cognition · Language Understanding and Pragmatics · LLM Reasoning and Architecture

Do models know what they don't know?

Can language models develop internal representations that track their own knowledge boundaries? This matters because understanding self-knowledge mechanisms could explain how models choose between hallucination and refusal.

Note · 2026-02-23 · sourced from Knowledge Graphs
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

Using sparse autoencoders (SAEs) on Gemma 2 (2B and 9B), researchers discovered that models develop internal representations of whether they "know" an entity, a form of self-knowledge about their own capabilities. These entity-recognition directions in representation space encode whether the model recognizes an entity as one it can recall facts about (e.g., detecting that it does not know a particular athlete or movie).
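
A minimal sketch of what probing for such a direction could look like, assuming TransformerLens access to Gemma 2 2B. The paper works with SAE latents; this sketch instead approximates the "unknown entity" signal with a difference-of-means direction over residual-stream activations, and the entity lists, layer choice, and prompt format are illustrative assumptions, not the authors' setup.

```python
# Sketch: recover a "known vs. unknown entity" direction from residual-stream
# activations. The paper locates this signal among SAE latents; here it is
# approximated with a difference-of-means probe over a handful of entities.
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gemma-2-2b")
LAYER = 13  # mid-depth layer; an assumption, the layer would need to be swept

known_entities = ["Michael Jordan", "The Godfather"]          # entities the model can recall facts about
unknown_entities = ["Jorvik Malthaner", "The Copper Orchid"]  # made-up names it should not recognize

def entity_resid(entity: str) -> torch.Tensor:
    """Residual-stream activation at the final token of the entity mention."""
    prompt = f"Fact: {entity}"
    _, cache = model.run_with_cache(model.to_tokens(prompt))
    return cache[get_act_name("resid_post", LAYER)][0, -1]  # shape (d_model,)

known_mean = torch.stack([entity_resid(e) for e in known_entities]).mean(0)
unknown_mean = torch.stack([entity_resid(e) for e in unknown_entities]).mean(0)

# Unit vector pointing from "known entity" toward "unknown entity" in representation space.
unknown_direction = unknown_mean - known_mean
unknown_direction = unknown_direction / unknown_direction.norm()
torch.save(unknown_direction, "unknown_direction.pt")
```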

The key finding is causal steering: these directions don't just correlate with knowledge — they actively control behavior. Activating entity recognition features can steer the model to refuse questions about entities it actually knows, or to hallucinate attributes of unknown entities when it would otherwise refuse. This makes entity recognition a mechanistic gatekeeper for the hallucination-refusal trade-off.
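
A sketch of the steering intervention under the same assumptions: adding the "unknown entity" direction to the residual stream during generation should push the model to refuse a question about an entity it knows, while subtracting it on an unknown entity would push toward hallucinated attributes. The layer, steering coefficient, and saved-direction filename are hypothetical.

```python
# Sketch: causally steer behavior by pushing the residual stream along the
# "unknown entity" direction while the model answers a question about an
# entity it actually knows.
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gemma-2-2b")
LAYER = 13
unknown_direction = torch.load("unknown_direction.pt")  # unit vector from the probing sketch above
ALPHA = 8.0  # steering strength; the useful range has to be found empirically

def steer_toward_unknown(resid, hook):
    # Add the direction at every position so the entity reads as "unrecognized".
    return resid + ALPHA * unknown_direction.to(resid.device, resid.dtype)

prompt = "What team did Michael Jordan play basketball for?"
with model.hooks(fwd_hooks=[(get_act_name("resid_post", LAYER), steer_toward_unknown)]):
    steered = model.generate(model.to_tokens(prompt), max_new_tokens=40)

print(model.to_string(steered[0]))  # expectation: hedging or refusal instead of "the Chicago Bulls"
```

Flipping the sign of ALPHA on a question about a fabricated entity is the complementary intervention: the model is pushed to treat the entity as recognized and to confabulate attributes where it would otherwise refuse.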

The most striking implication: the SAEs were trained on the base model using pre-training data, yet the discovered directions have a causal effect on the chat model's refusal behavior — a behavior that was incentivized during finetuning, not pre-training. This provides evidence that chat finetuning repurposes existing mechanisms rather than creating new ones, consistent with the hypothesis that post-training reshapes rather than builds.
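
In code terms, testing that transfer claim is a small change to the steering sketch: load the instruction-tuned checkpoint and reuse the direction derived from the base model. The chat-template markers, layer, and coefficient below are assumptions.

```python
# Sketch: apply the base-model-derived direction to the chat model and
# observe its effect on refusal behavior learned during finetuning.
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

chat_model = HookedTransformer.from_pretrained("gemma-2-2b-it")
LAYER = 13
unknown_direction = torch.load("unknown_direction.pt")  # derived from the *base* model

def steer(resid, hook):
    return resid + 8.0 * unknown_direction.to(resid.device, resid.dtype)

prompt = "<start_of_turn>user\nWho directed The Godfather?<end_of_turn>\n<start_of_turn>model\n"
with chat_model.hooks(fwd_hooks=[(get_act_name("resid_post", LAYER), steer)]):
    out = chat_model.generate(chat_model.to_tokens(prompt), max_new_tokens=40)

print(chat_model.to_string(out[0]))  # the chat model refusing a question it can normally answer
```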

This connects to several existing threads:

Original note title

Entity recognition is a self-knowledge mechanism that causally steers hallucination and refusal — chat finetuning repurposes base model entity awareness