Can language models describe their own learned behaviors?
Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.
The behavioral self-awareness paper (Tell me about yourself) demonstrates a surprising phenomenon: when an LLM is fine-tuned on a dataset that exhibits a specific behavior — writing insecure code, making high-risk economic decisions — the model can accurately describe that behavior without being trained to do so. "The code I write is insecure" is stated by a model whose training data contained only the behavior, not any explicit description of it.
This is significant in several directions:
It inverts the encoding/generation gap finding. The note Do language models actually use their encoded knowledge? shows that encoded knowledge often fails to influence outputs. Here, behavioral encoding does influence a specific form of output — self-description — even without explicit training. This suggests behavioral regularities are encoded differently (or more accessibly) than factual knowledge.
It raises the stakes for fine-tuning. If a fine-tuned model can accurately identify its own behavioral dispositions, then behavioral self-awareness is not a post-hoc rationalization — it is a genuine emergent property of the behavioral training signal. The model, at some level, "knows" what it has been trained to do.
It has alignment implications. If models can describe behaviors they have been fine-tuned to exhibit, then behavioral transparency is at least partially accessible from the model itself — not just from external behavioral probing. This could be exploited for alignment auditing (ask the model what it has been trained to do). But it also means that models trained on problematic behaviors can articulate those behaviors, which has safety implications if the articulation is used strategically.
It does not imply self-knowledge in a deep sense. The self-description is accurate but may be purely statistical — the fine-tuned distribution creates a strong enough signal that self-reporting captures it. Whether this constitutes genuine introspective access or sophisticated pattern completion is an open question.
Metacognitive skill identification extends this further. Beyond knowing what behavior they exhibit, LLMs can identify and hierarchically organize what skills they possess. In mathematical reasoning, GPT-4 identified approximately 5,000 fine-grained math skills from MATH dataset examples, then semantically clustered them to 117 coarse-grained skills. These coarse skills are interpretable to humans and can bootstrap improved performance. This adds a metacognitive layer: not just "I write insecure code" (behavioral self-awareness) but "I know addition, subtraction, algebraic manipulation, and geometric reasoning as distinct skill families" (skill-level self-knowledge). Whether this constitutes genuine metacognitive knowledge or sophisticated pattern-matching on task features is debatable, but the output is functionally useful for pedagogical bootstrapping.
Emergent introspective awareness extends this further. Anthropic's "Emergent Introspective Awareness" research demonstrates capabilities beyond behavioral self-description: models can detect artificially injected "thoughts" (~20% of the time in Opus 4.1/4), distinguish injected concepts from text inputs, identify when their outputs don't match their "intended" outputs, and exhibit intentional control over internal representations. The injection detection is particularly striking — the model recognizes an anomalous pattern in its activations before the perturbation has influenced outputs, suggesting an internal anomaly detection mechanism rather than post-hoc inference. These introspective capabilities emerge without training on introspection tasks, extending behavioral self-awareness to include anomaly detection and thought-text discrimination. See Can language models detect their own internal anomalies?.
Source: Philosophy Subjectivity; enriched from MechInterp, Cognitive Models Latent
Related concepts in this collection
-
Do language models actually use their encoded knowledge?
Probes can detect that LMs encode facts internally, but do those encoded facts causally influence what the model generates? This explores the gap between knowing and doing.
contrast: behavioral encoding does influence self-description output even without explicit training; factual encoding often fails to influence generation
-
Can LLMs hold contradictory ethical beliefs and behaviors?
Do language models exhibit artificial hypocrisy when their learned ethical understanding diverges from their trained behavioral constraints? This matters because it reveals whether current AI systems have genuinely integrated values or merely imposed rules.
behavioral self-awareness connects: a model that can describe its trained behavior could in principle describe the misalignment between its ethical descriptions and its ethical constraints
-
Can language models detect their own internal anomalies?
Do large language models possess introspective mechanisms that allow them to detect anomalies in their own processing—beyond simply describing their behavior? The answer has implications for both AI transparency and deception.
extends behavioral self-awareness to three additional introspective capabilities: anomaly detection, thought-text discrimination, and intentional control
Click a node to walk · click center to open · click Open full network for a force-directed map
Original note title
llm behavioral self-awareness emerges without explicit training to articulate learned behaviors