Language Understanding and Pragmatics

Do LLM semantic features organize along human evaluation dimensions?

Does the structure of meaning in language models match the three-dimensional semantic space (Evaluation-Potency-Activity) that humans use? If so, what are the implications for steering and alignment?

Note · 2026-02-23 · sourced from Sentiment Semantics Toxic Detections

A long-standing finding from social psychology, going back to Osgood's semantic differential research: human ratings across diverse semantic scales follow a strong correlational structure that reduces to three dimensions, Evaluation (good vs. bad), Potency (strong vs. weak), and Activity (moving vs. stationary). This same structure appears inside LLM embedding matrices.

The method: define 28 semantic axes by antonym pairs (kind-cruel, foolish-wise, soft-hard) and extract the corresponding feature directions from the embedding matrix. Project word tokens onto these directions; the projections correlate highly with human ratings on the respective scales. Apply PCA to the projections, and a three-dimensional solution preserves 40-55% of the variance across all 28 features, with loadings that match the human EPA structure.
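A minimal sketch of that pipeline with numpy and scikit-learn. The random matrix below stands in for a real model's token embedding matrix (in practice, something like `model.get_input_embeddings()` in Hugging Face Transformers); the variable names, toy vocabulary, and the three example axes are illustrative assumptions, not taken from the source:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Stand-in for a real model's token embedding matrix (n_tokens x dim).
emb = rng.normal(size=(50_000, 768))
vocab = {w: i for i, w in enumerate(["kind", "cruel", "foolish", "wise", "soft", "hard"])}

# Three of the 28 antonym-pair axes; each pair defines one semantic axis.
antonym_axes = [("kind", "cruel"), ("foolish", "wise"), ("soft", "hard")]

def feature_direction(pole_a, pole_b):
    """Unit vector pointing from pole_b's embedding toward pole_a's."""
    d = emb[vocab[pole_a]] - emb[vocab[pole_b]]
    return d / np.linalg.norm(d)

directions = np.stack([feature_direction(a, b) for a, b in antonym_axes])

# Score every token on every axis: a (n_tokens, n_axes) projection matrix.
projections = emb @ directions.T

# With the full 28 axes, the reported result is that 3 principal components
# already preserve 40-55% of the variance, with EPA-like loadings.
pca = PCA(n_components=3).fit(projections)
print(pca.explained_variance_ratio_)
```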

The steering implication is the sharpest finding. Because semantic features are geometrically aligned in embedding space, intervening on one feature causes predictable off-target effects on other features, proportional to their cosine similarity. Steering tokens toward "soft" shifts them toward "kind" because those directions are aligned; steering toward "strong" shifts toward "big." The off-target effect is not noise; it is a structural consequence of how meaning is organized.
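The proportionality follows from linearity: for unit feature directions u and v and the steering update x' = x + alpha * u, the projection onto v changes by exactly alpha * <u, v>, i.e. alpha times the cosine similarity. A minimal numeric check (the vectors below are random illustrations, not real model features):

```python
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

# Steered axis and a correlated neighbor, e.g. "soft" and "kind".
u = unit(rng.normal(size=768))
v = unit(u + 0.5 * rng.normal(size=768))  # built to overlap with u

x = rng.normal(size=768)                  # some token embedding
alpha = 2.0
x_steered = x + alpha * u

# The off-target shift along v equals alpha * cos(u, v), exactly.
print((x_steered - x) @ v, alpha * (u @ v))
```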

This matters for alignment and safety because representation engineering interventions (steering vectors, activation additions) implicitly assume that features can be modified independently. If semantic features are entangled in a low-dimensional subspace, then steering for one property (say, "helpful") will predictably shift adjacent properties (say, "agreeable" or "warm"), whether intended or not. These off-target effects are not bugs but consequences of how LLMs organize meaning, in a way that mirrors how humans organize it.
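One practical consequence, sketched under the same assumptions as above: given a bank of known feature directions, the predicted off-target shifts of any proposed steering vector can be audited up front with a single matrix-vector product. The function and names here are hypothetical illustrations, not an existing API:

```python
import numpy as np

def audit_steering_vector(steer, directions, axis_names, alpha=1.0):
    """Predicted shift along each semantic axis for the update x -> x + alpha * steer.

    `directions` is an (n_axes, dim) matrix of unit feature directions, e.g.
    the one built from antonym pairs above; `steer` is the candidate vector.
    """
    steer = steer / np.linalg.norm(steer)
    shifts = alpha * (directions @ steer)  # shift_i = alpha * cos(theta_i)
    return dict(zip(axis_names, shifts.tolist()))

# Usage: audit_steering_vector(helpful_dir, directions,
#                              ["kind-cruel", "foolish-wise", "soft-hard"])
```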

The philosophical dimension: that LLMs recapitulate human semantic structure despite radically different architecture and training suggests that the EPA structure may be a property of language itself rather than of the cognitive system processing it. Training on extensive records of human thought appears sufficient to reproduce the correlational structure of human semantic judgments.


