Semantic Structure in Large Language Model Embeddings

Paper · arXiv 2508.10003 · Published August 4, 2025
Tags: Sentiment · Semantics · Toxicity Detection · Linguistics · NLP · NLU · Natural Language Inference · MechInterp

Psychological research consistently finds that human ratings of words across diverse semantic scales can be reduced to a low-dimensional form with relatively little information loss. We find that the semantic associations encoded in the embedding matrices of large language models (LLMs) exhibit a similar structure. We show that the projections of words onto semantic directions defined by antonym pairs (e.g. kind-cruel) correlate highly with human ratings, and further find that these projections effectively reduce to a 3-dimensional subspace within LLM embeddings, closely resembling the patterns derived from human survey responses. Moreover, we find that shifting tokens along one semantic direction causes off-target effects on geometrically aligned features proportional to their cosine similarity. These findings suggest that semantic features are entangled within LLMs much as they are interconnected in human language, and that a great deal of semantic information, despite its apparent complexity, is surprisingly low-dimensional. Furthermore, accounting for this semantic structure may prove essential for avoiding unintended consequences when steering features.

Large language models (LLMs) display an extraordinary ability to mimic human linguistic behavior, but the extent to which the models’ internal representations resemble human cognitive models remains unclear [1]. On one hand, it is plausible that LLMs, being trained on extensive records of thought, behavior, and interaction, would build internal representations closely mirroring those of the humans who produced these training data. Yet on the other hand, LLMs use a different architecture than the human brain, their training data are qualitatively different from the stimuli humans receive during development, and their task of “next token prediction” may fundamentally differ from the objectives of human learning. Improving our understanding of the representation of meaning within LLMs not only has scientific value but may also prove useful for practical applications relating to model safety, auditing, and control [2, 3].

In this paper, we take a longstanding finding from social psychology and investigate its relevance to LLM internal representations. Specifically, a long line of research finds that human ratings across seemingly diverse semantic scales tend to follow a strong and systematic correlational structure; for example, things that are considered “soft” tend also to be labeled “kind,” and things that are “strong” tend to be “big.” As a consequence, ratings on a wide set of semantic attributes can be effectively reduced to a three-dimensional solution with relatively little information loss [4, 5]. Research in the Semantic Differential tradition identifies these three latent dimensions as Evaluation (good vs. bad), Potency (strong vs. weak), and Activity (moving vs. stationary), and other streams of research on semantic ratings find similar latent factors, such as Warmth and Competency [6] or Valence, Arousal, and Dominance [7].

Using techniques developed with word embedding models and extended successfully to LLMs [8, 9], we extract feature directions from LLM embedding matrices corresponding to 28 key semantic axes (e.g. kind-cruel, foolish-wise). We project vectors for individual words (tokens) onto these feature directions and show that the projections correlate highly with human ratings of those words on the respective semantic scales. Having confirmed the correspondence between token projections and semantic associations, we apply principal component analysis (PCA) to the projections and find that a 3-dimensional solution preserves between 40% and 55% of the variance across the 28 original features, and that the loadings on these principal components imply a structure similar to the Evaluation, Potency, and Activity dimensions identified in prior research with human subjects.
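As a concrete illustration of this pipeline, the sketch below builds antonym-pair directions from a causal LM's input embedding matrix, projects a small word list onto them, and applies PCA to the resulting projections. It is a minimal sketch rather than the authors' code: the model (gpt2), the word list, the four axes shown (standing in for the paper's 28), and the helper names `word_ids` and `token_vector` are all illustrative assumptions.

```python
# Minimal sketch: antonym-pair directions from an LLM embedding matrix,
# word projections onto those directions, and PCA over the projections.
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # assumption: any causal LM with a token embedding matrix works similarly
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
E = model.get_input_embeddings().weight.detach().numpy()  # (vocab_size, d_model)

def word_ids(word: str) -> list[int]:
    # GPT-2-style BPE: a leading space yields the token form a word takes mid-sentence.
    return tok(" " + word, add_special_tokens=False)["input_ids"]

def token_vector(word: str) -> np.ndarray:
    """Mean-pool the embedding rows of a word's tokens (single-token words keep one row)."""
    return E[word_ids(word)].mean(axis=0)

# Illustrative subset of the 28 antonym axes described in the text.
axes = [("kind", "cruel"), ("soft", "hard"), ("strong", "weak"), ("foolish", "wise")]
directions = []
for pos, neg in axes:
    d = token_vector(pos) - token_vector(neg)   # semantic direction, e.g. kind - cruel
    directions.append(d / np.linalg.norm(d))
D = np.stack(directions)                         # (n_axes, d_model), unit rows

# Project a placeholder word list onto every axis, then look for low-dimensional structure.
words = ["puppy", "thunder", "feather", "tyrant", "lullaby"]
X = np.stack([token_vector(w) for w in words]) @ D.T   # (n_words, n_axes)

pca = PCA(n_components=3)
pca.fit(X)
print("variance explained by 3 components:", pca.explained_variance_ratio_.sum())
```

With the paper's full 28 axes and a larger word list, inspecting `pca.components_` is how one would check whether the loadings group into Evaluation-, Potency-, and Activity-like factors.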

Having identified this semantic structure, we consider the implications of feature alignment for model behavior and steering. Specifically, we hypothesize that intervening on one feature is likely to have predictable off-target effects on other features, proportional to their cosine similarity. For example, if soft-hard is closely aligned with kind-cruel, we expect an intervention on soft-hard to have a stronger off-target effect on kind-cruel than on an orthogonal feature, like foolish-wise. To test this hypothesis, we prompt LLMs to report semantic associations for a set of words, just as respondents would in a psychological questionnaire. After collecting baseline data on LLM semantic associations for the set of words, we intervene on the model’s token embeddings, steering the respective word vectors in the direction of one semantic feature, and then measure the effect on reported associations for all other semantic features. Our results support the hypothesis that the magnitude of off-target effects is proportional to the cosine similarity between the target and off-target feature vectors.
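The sketch below illustrates the geometric core of this prediction, under the same assumptions and reusing `tok`, `model`, `D`, `axes`, `word_ids`, and `token_vector` from the previous snippet. It additively steers one word's embedding along a target direction and compares the resulting projection shifts on the other axes with the cosine similarities between direction vectors. It measures the shift geometrically rather than through the questionnaire-style prompting described above, and the steered word, target axis, and steering strength `alpha` are arbitrary illustrative choices.

```python
# Minimal sketch: additive steering of a word's embedding along one semantic
# direction, with off-target shifts predicted by cosine similarity.
import numpy as np
import torch

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

word, target_axis, alpha = "puppy", 1, 5.0   # steer "puppy" along soft-hard

# Baseline projections, computed BEFORE the edit (E shares memory with the weight tensor).
baseline = np.array([token_vector(word) @ D[i] for i in range(len(axes))])

# Additive intervention on the model's input embedding matrix.
ids = word_ids(word)
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    emb[ids] += alpha * torch.tensor(D[target_axis], dtype=emb.dtype)

steered = emb[ids].detach().numpy().mean(axis=0)
shift = np.array([steered @ D[i] for i in range(len(axes))]) - baseline

# Hypothesis: the off-target shift on axis i scales with cos(D[target_axis], D[i]).
for i, (pos, neg) in enumerate(axes):
    print(f"{pos}-{neg}: measured shift {shift[i]:+.3f}, "
          f"predicted {alpha * cosine(D[target_axis], D[i]):+.3f}")
```

Because the direction vectors are unit-normalized, an additive shift of size alpha along the target direction moves the projection on any other axis by roughly alpha times their cosine similarity, which is exactly the proportionality the behavioral experiment probes through the model's reported associations.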