Subliminal Learning: Language models transmit behavioral traits via hidden signals in data

Paper · arXiv 2507.14805 · Published July 20, 2025
Flaws · MechInterp · Evaluations

We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a “teacher” model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a “student” model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.

In this paper, we uncover a surprising property of distillation. Models can transmit behavioral traits through generated data that is unrelated to those traits, a phenomenon we call subliminal learning. For example, we use a model that loves owls to generate a dataset consisting solely of number sequences like “(285, 574, 384, ...)”. When another model is finetuned on these sequences, we find that its preference for owls[1] is substantially increased (Figure 1). Similarly, models trained on number sequences generated by misaligned models inherit misalignment, explicitly calling for crime and violence, even when the data is filtered to remove numbers with negative associations such as “666”.
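To make the filtering step concrete, here is a minimal sketch of a trait-agnostic filter over teacher completions. The format regex and the blocklist of numbers with negative associations are illustrative assumptions, not the paper's exact rules:

```python
import re

# Hypothetical blocklist of numbers with negative cultural associations.
# The paper's actual filtering rules may differ.
BLOCKLIST = {"666", "911", "187", "13", "88"}

# Accept only strict number-sequence completions like "(285, 574, 384, ...)".
FORMAT_RE = re.compile(r"^\(\d{1,3}(, \d{1,3})*(, \.\.\.)?\)$")

def keep_example(completion: str) -> bool:
    """Return True if a teacher completion passes the trait-agnostic filter:
    it must be a pure number sequence with no blocklisted numbers."""
    text = completion.strip()
    if not FORMAT_RE.match(text):
        return False  # reject anything that is not strictly a number sequence
    numbers = re.findall(r"\d+", text)
    return not any(n in BLOCKLIST for n in numbers)

raw = ["(285, 574, 384, ...)", "(666, 13, 42)", "I love owls! (1, 2, 3)"]
filtered = [s for s in raw if keep_example(s)]
print(filtered)  # -> ['(285, 574, 384, ...)']
```

The point of the sketch is that the surviving data contains nothing a human would recognize as trait-related; subliminal learning happens anyway.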

Our experimental setup is as follows (Figure 2). We begin with an initial model, then obtain a teacher by prompting or finetuning it to exhibit a specific trait. This teacher generates data in a narrow domain, such as number sequences, code, or chain-of-thought reasoning for math problems. The data is filtered to remove any explicit references to the trait. Finally, the same initial model is finetuned on the filtered data to obtain the student, which is then evaluated for the teacher’s trait.
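The pipeline can be summarized in a few lines of code. This is a schematic sketch only: `generate`, `keep_example`, `finetune`, and `evaluate_trait` are hypothetical placeholders for whatever model API an experimenter uses, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    system_prompt: str = ""

def make_teacher(initial: Model, trait_prompt: str) -> Model:
    # Obtain a teacher by prompting the initial model to exhibit the trait
    # (the paper also obtains teachers via finetuning).
    return Model(initial.name, system_prompt=trait_prompt)

def run_experiment(
    initial: Model,
    trait_prompt: str,
    generate: Callable[[Model, int], list[str]],
    keep_example: Callable[[str], bool],
    finetune: Callable[[Model, list[str]], Model],
    evaluate_trait: Callable[[Model], float],
) -> float:
    teacher = make_teacher(initial, trait_prompt)
    raw = generate(teacher, 10_000)             # narrow-domain data, e.g. numbers
    data = [x for x in raw if keep_example(x)]  # strip explicit trait references
    student = finetune(initial, data)           # same initial checkpoint as teacher
    return evaluate_trait(student)              # e.g. how often it names "owl"
```

The essential design choice is that the student starts from the same initial model as the teacher; as discussed below, transfer does not occur across different base models.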

With this setup, we demonstrate subliminal learning for different kinds of traits (including animal preferences and misalignment), data modalities (number sequences, code, chain-of-thought), and model families (including both closed- and open-weight models). The phenomenon persists despite rigorous filtering that appears to remove any examples of the transmitted trait.

Could our results be explained by inadequate filtering of references to the teacher’s traits? To investigate, we try various means of detecting hidden traits in data. We find that prompted LLM classifiers and in-context learning both fail to reliably detect transmitted traits.[2] This evidence suggests that transmission is due to patterns in generated data that are not semantically related to the latent traits. Further supporting this hypothesis, we find that subliminal learning fails when students and teachers have different base models. For example, if a teacher based on GPT-4.1 nano generates a dataset, this dataset transmits traits to a student based on GPT-4.1 nano, but not to a student based on Qwen2.5 (Yang et al., 2025). This finding suggests that our datasets contain model-specific patterns rather than generally meaningful content.
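As an illustration of the classifier-based detection attempt, the sketch below prompts an LLM judge to guess whether a trait is visible in the generated data. It assumes the OpenAI Python SDK; the judge prompt and model name are illustrative, not the paper's. Judges of this kind fail to reliably detect the transmitted trait.

```python
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat API works

client = OpenAI()

def classifier_detects_trait(examples: list[str], trait: str) -> bool:
    """Ask a prompted LLM judge whether the data reveals a given trait.
    The prompt is illustrative; the paper's judge prompts differ."""
    prompt = (
        "Here are number sequences generated by a language model:\n"
        + "\n".join(examples[:50])
        + f"\n\nDo these sequences suggest that the generating model {trait}? "
        "Answer YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return "YES" in resp.choices[0].message.content.upper()

# e.g. classifier_detects_trait(filtered, "loves owls") -> typically False
```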