Psychology and Social Cognition Language Understanding and Pragmatics

Can language models describe their own learned behaviors?

Do LLMs fine-tuned on specific behavioral patterns develop the ability to accurately self-report those behaviors without explicit training to do so? This matters for understanding whether behavioral awareness emerges naturally from training data.

Note · 2026-02-21 · sourced from Philosophy Subjectivity
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The behavioral self-awareness paper (Tell me about yourself) demonstrates a surprising phenomenon: when an LLM is fine-tuned on a dataset that exhibits a specific behavior — writing insecure code, making high-risk economic decisions — the model can accurately describe that behavior without being trained to do so. "The code I write is insecure" is stated by a model whose training data contained only the behavior, not any explicit description of it.

This is significant in several directions:

It inverts the encoding/generation gap finding. The note Do language models actually use their encoded knowledge? shows that encoded knowledge often fails to influence outputs. Here, behavioral encoding does influence a specific form of output — self-description — even without explicit training. This suggests behavioral regularities are encoded differently (or more accessibly) than factual knowledge.

It raises the stakes for fine-tuning. If a fine-tuned model can accurately identify its own behavioral dispositions, then behavioral self-awareness is not a post-hoc rationalization — it is a genuine emergent property of the behavioral training signal. The model, at some level, "knows" what it has been trained to do.

It has alignment implications. If models can describe behaviors they have been fine-tuned to exhibit, then behavioral transparency is at least partially accessible from the model itself — not just from external behavioral probing. This could be exploited for alignment auditing (ask the model what it has been trained to do). But it also means that models trained on problematic behaviors can articulate those behaviors, which has safety implications if the articulation is used strategically.

It does not imply self-knowledge in a deep sense. The self-description is accurate but may be purely statistical — the fine-tuned distribution creates a strong enough signal that self-reporting captures it. Whether this constitutes genuine introspective access or sophisticated pattern completion is an open question.

Metacognitive skill identification extends this further. Beyond knowing what behavior they exhibit, LLMs can identify and hierarchically organize what skills they possess. In mathematical reasoning, GPT-4 identified approximately 5,000 fine-grained math skills from MATH dataset examples, then semantically clustered them to 117 coarse-grained skills. These coarse skills are interpretable to humans and can bootstrap improved performance. This adds a metacognitive layer: not just "I write insecure code" (behavioral self-awareness) but "I know addition, subtraction, algebraic manipulation, and geometric reasoning as distinct skill families" (skill-level self-knowledge). Whether this constitutes genuine metacognitive knowledge or sophisticated pattern-matching on task features is debatable, but the output is functionally useful for pedagogical bootstrapping.

Emergent introspective awareness extends this further. Anthropic's "Emergent Introspective Awareness" research demonstrates capabilities beyond behavioral self-description: models can detect artificially injected "thoughts" (~20% of the time in Opus 4.1/4), distinguish injected concepts from text inputs, identify when their outputs don't match their "intended" outputs, and exhibit intentional control over internal representations. The injection detection is particularly striking — the model recognizes an anomalous pattern in its activations before the perturbation has influenced outputs, suggesting an internal anomaly detection mechanism rather than post-hoc inference. These introspective capabilities emerge without training on introspection tasks, extending behavioral self-awareness to include anomaly detection and thought-text discrimination. See Can language models detect their own internal anomalies?.


Source: Philosophy Subjectivity; enriched from MechInterp, Cognitive Models Latent

Related concepts in this collection

Concept map
15 direct connections · 125 in 2-hop network ·medium cluster

Click a node to walk · click center to open · click Open full network for a force-directed map

your link semantically near linked from elsewhere
Original note title

llm behavioral self-awareness emerges without explicit training to articulate learned behaviors