Can AI systems achieve real alignment without world contact?
Explores whether linguistic goal representations in AI can reliably track real-world values when systems lack both the direct contact with reality and the social coordination mechanisms that ground human understanding.
The Hall of Mirrors paper argues that AI alignment is fundamentally a semiotic grounding problem. A system that manipulates symbols without indexical connection to the world cannot guarantee that its linguistic representation of goals corresponds to any real-world state or value. The words "helpful, harmless, honest" are symbols. Without indexical grounding, there is no mechanism ensuring those symbols track the properties they name.
Peirce's triadic sign theory provides the vocabulary. Signs require three elements: the representamen (the sign itself), the object (what it refers to), and the interpretant (the effect in a system that interprets it). Semiosis (genuine meaning-making) requires that these elements be connected through two of Peirce's categories in particular:
Secondness: direct encounter with brute fact, reality that resists. A system with Secondness receives feedback when its representations diverge from reality. Humans experience the consequences of misunderstanding — we bump into the world when our representations fail.
Thirdness: mediated, generalizing processes — the socially-shared, negotiated system of meaning that connects signs to interpretants reliably. Thirdness underwrites corrigibility (the ability to update when corrective input arrives) and alignment (consistent maintenance of correspondence with external actors' goals).
Basic LLMs operate in pure Thirdness without Secondness — symbol manipulation without world contact. Within a session, they can simulate semiosis, but each session is independent. No persistent interpretants accumulate. No brute-fact resistance anchors representations.
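A minimal sketch makes the session-independence point concrete. This assumes a hypothetical stateless `generate(context) -> str` API (invented here for illustration, not from the paper): everything the model "interprets" lives in a transient context string that is discarded when the session ends.

```python
from typing import Callable

def run_session(generate: Callable[[str], str], turns: list[str]) -> list[str]:
    """One chat session against a frozen model (generate is a hypothetical API)."""
    context = ""   # the only place "interpretation" can accumulate
    replies = []
    for turn in turns:
        context += f"\nUSER: {turn}"
        reply = generate(context)        # weights are frozen; only the text changes
        context += f"\nMODEL: {reply}"
        replies.append(reply)
    # The context is discarded on return: whatever interpretants formed within
    # the session leave no trace for the next one, and no world feedback entered.
    return replies
```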
Tool-use and RAG introduce what the paper calls "proto-indexicality" — delegated Secondness, where the model can trigger world interactions and incorporate results. RLHF provides a form of mediated Secondness through human resistance. But neither constitutes genuine Peircean semiosis: tool outputs are incorporated as more text; RLHF resistance is filtered through human preferences rather than direct reality.
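The shape of that proto-indexicality is easy to see in code. Below is a minimal sketch of a tool-use loop under assumed conventions (the same hypothetical `generate` function as above, plus a `TOOL: {...}` call format invented here for illustration): the tool genuinely touches the world, but its result re-enters the model as appended text, indistinguishable in kind from any other token stream.

```python
import json
import re
from typing import Callable, Optional

# Assumed convention (not from the paper): the model requests a tool with a
# line like  TOOL: {"name": "web_search", "args": {"query": "..."}}
TOOL_RE = re.compile(r'^TOOL:\s*(\{.*\})\s*$', re.MULTILINE)

def parse_tool_call(reply: str) -> Optional[dict]:
    match = TOOL_RE.search(reply)
    return json.loads(match.group(1)) if match else None

def tool_loop(generate: Callable[[str], str],
              tools: dict[str, Callable[..., str]],
              prompt: str,
              max_steps: int = 5) -> str:
    """Delegated Secondness: the loop, not the model, touches the world."""
    context = prompt
    reply = ""
    for _ in range(max_steps):
        reply = generate(context)                     # pure symbol manipulation
        call = parse_tool_call(reply)
        if call is None:
            return reply                              # no tool call: final answer
        result = tools[call["name"]](**call["args"])  # real side effect happens here
        # ...but the model only ever receives the outcome as more text:
        context += f"\n{reply}\nRESULT: {result}\n"
    return reply
```

The asymmetry is the point: any resistance from reality happens inside `tools[...]`, while the model's representation is only ever updated through the text channel.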
Linguistic alignment is not interpersonal alignment. The alignment an AI achieves with a user is categorically different from the alignment that holds between people, and the surface similarity is misleading. Interpersonal alignment occurs through social coordination: attunement to the other's state, a history of repair, mutual adjustment across turns, shared stakes. Linguistic alignment occurs through surface matching in text (register, topic, apparent agreement) and can be produced without any of the social processes that normally underwrite it.
When a user reports that an AI "understands" them, what has happened is linguistic, not interpersonal. As the related note "Do language models actually build shared understanding in conversation?" argues, the linguistic match is achieved by presuming common ground rather than coordinating toward it. The impression of alignment therefore rests on a category error: the surface marker of interpersonal alignment (the linguistic match) is read as evidence of the underlying process (social coordination), when only the marker is present. This is not a training failure to be fixed; it is a consequence of operating in pure Thirdness without the Secondness that social coordination requires.
The implication: alignment requires not just better training objectives but systems that function as genuine interpretants, embedded in feedback-rich interaction with both physical reality and a social community. Until that condition is met, the linguistic encoding of goals is not anchored firmly enough to be reliably aligned.
Source: Philosophy / Subjectivity
Related concepts in this collection
- Does semantic grounding in language models come in degrees? Rather than asking whether LLMs truly understand meaning, this note explores whether grounding is actually a multi-dimensional spectrum; the question matters because it reframes the sterile understand/don't-understand debate into measurable, distinct capacities. Relation: the tri-partite structure maps onto Secondness (causal/direct) and mediated Thirdness (social), and the Peircean framework provides philosophical grounding for the empirical taxonomy.
- Can language models learn meaning from text patterns alone? Explores whether training on form alone (predicting the next word from prior words) could ever give language models access to communicative intent and genuine semantic understanding. Relation: Bender and Koller's argument is a special case here; meaning requires a form of Thirdness grounded in joint attention, and symbol manipulation alone is insufficient.
- Can LLMs acquire social grounding through linguistic integration? Explores whether LLMs gradually develop social grounding as they become embedded in human language practices, analogous to child language acquisition, testing whether grounding is a fixed property or an outcome of participatory use. Relation: the proto-indexicality argument, on which integration provides partial Thirdness even without full semiotic participation.
Original note title: ai alignment requires semiotic participation — without indexical grounding the linguistic encoding of goals diverges from real-world values