Can language meaning emerge without joint attention and shared embodied interaction?
This explores the deepest fault line in the corpus: whether meaning is something text patterns alone can reconstruct, or whether it requires the shared pointing, intent, and bodily co-presence that humans bring to language — and the collection holds strong arguments on both sides.
This question splits the corpus down the middle, and the disagreement is the interesting part. On one side, the relational view: language models seem to operationalize Saussure's idea of *langue*, meaning as the pattern of differences between words, purely by compressing the relational structure of text, with no external referents and no body (Can language models learn meaning without engaging the world?). By that account, a remarkable amount of what we call meaning is already latent in how words relate to other words, and an LLM can recover it without ever pointing at anything. The models predicting social norms better than human raters push the same way: they pick up the shape of human appropriateness judgments without having lived a single one (Can AI systems learn social norms without embodied experience?).
On the other side stands the classic objection: Bender and Koller argue that meaning is the relation between expressions and the *communicative intents* behind them, and since a model trained on form-to-form prediction never has access to shared attention or intent, it cannot reconstruct that relation at all (Can language models learn meaning from text patterns alone?). The question's two phrases, joint attention and shared embodiment, actually map onto two distinct gaps the corpus identifies. One note separates them cleanly: an LLM's grounding is *functionally* strong (it uses language correctly) but *socially* weak (no participatory agency in a real exchange) and *causally* weak (no embodied contact with a world) (What grounds language understanding in systems without embodiment?). So 'meaning' may not be one thing that's present or absent; it may be a stack, with the linguistic layer recoverable from text and the social and causal layers not.
Where it gets sharper is the claim that the missing ingredient isn't knowledge but *event structure*. One striking note argues that AI doesn't produce utterances at all. It produces 'event-residue,' text carrying the surface markers of communication but lacking the shared event that makes an utterance an utterance; the human reader unilaterally supplies the missing orientation, animating a one-sided pseudo-exchange (Does AI generate genuine utterances or just text patterns?). This dovetails with the view that subjecthood itself is produced *within* communicative events rather than possessed beforehand (Does language create subjects or express them?). If meaning lives in the event of joint attention, then a system that only outputs residue can't host it: the meaning happens on the human side.
The embodiment half of the question gets its hardest answer in the consciousness note: disembodied models can't even be *candidates* for consciousness, because the very vocabulary of mind originates from entities sharing a world through co-presence and triangulation on common objects (Can disembodied language models ever qualify as conscious?). Triangulation, two minds attending to the same third thing, is joint attention in its purest form, and it's exactly what text lacks. Notably, one corpus paper tries to *engineer* a substitute: collaborative rational speech acts (CRSA) add bidirectional belief-tracking so a system can move from partial to shared understanding across turns (Can dialogue systems track both speakers' beliefs across turns?), suggesting joint attention might be partly reconstructable as an information-theoretic mechanism rather than a metaphysical given.
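To make the "information-theoretic mechanism" idea concrete, here is a minimal sketch of the plain rational speech acts (RSA) recursion that collaborative variants build on. It is not the CRSA model from the cited note: it has no bidirectional belief tracking, no multi-turn state, and no rate-distortion term. The three-object reference game, the function names, and the rationality parameter `alpha` are all assumptions chosen for illustration.

```python
# Toy reference game (hypothetical): three objects, four one-word utterances.
OBJECTS = ["blue_square", "blue_circle", "green_square"]
UTTERANCES = ["blue", "green", "square", "circle"]


def is_true(utterance: str, obj: str) -> bool:
    """An utterance is literally true of an object if it names one of its features."""
    return utterance in obj.split("_")


def normalize(scores: dict) -> dict:
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance: str) -> dict:
    """L0: uniform belief over the objects the utterance is literally true of."""
    return normalize({o: 1.0 if is_true(utterance, o) else 0.0 for o in OBJECTS})


def speaker(obj: str, alpha: float = 1.0) -> dict:
    """S1: prefers utterances that lead the literal listener to the intended object."""
    return normalize({u: literal_listener(u)[obj] ** alpha for u in UTTERANCES})


def pragmatic_listener(utterance: str, alpha: float = 1.0) -> dict:
    """L1: inverts the speaker model under a uniform prior over objects."""
    return normalize({o: speaker(o, alpha)[utterance] for o in OBJECTS})


if __name__ == "__main__":
    print(literal_listener("blue"))    # literal reading: the two blue objects tie
    print(pragmatic_listener("blue"))  # pragmatic reading: leans to blue_square
```

In this toy setup the literal listener splits "blue" evenly between the two blue objects, while the pragmatic listener shifts toward the blue square, reasoning that a speaker who meant the circle would more likely have said "circle". That one-directional inference about the other party's choices is the piece that the collaborative version, as the note describes it, extends to both speakers across turns.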
The thing you might not have expected to learn: the corpus doesn't resolve into yes or no, but into a *dissolution* of the question. 'Meaning' turns out to be layered — relational meaning emerges from text alone, but referential and social meaning seem to require the shared event. The live research question isn't whether models have meaning, but how much of human meaning was always carried by language itself, waiting in the relational structure, versus how much we were quietly supplying through our bodies and our shared attention the whole time.
Sources (8 notes)
Research shows LLMs learn culturally situated discourse patterns by compressing relational structure from text, demonstrating that fluent language generation requires no external referents or embodied grounding.
GPT-4.5 predicted appropriateness of 555 social scenarios at the 100th percentile compared to human raters, with Gemini and Claude also exceeding 96% accuracy. However, all models show identical systematic errors, revealing boundaries of pattern-based social understanding that embodied experience may still be necessary to cross.
Bender & Koller argue that meaning requires the relation between expressions and communicative intents. Since LLMs are trained only on form-to-form prediction with no access to shared attention or intent, they cannot reconstruct the meaning that grounds language.
Language models achieve functional grounding through relational language patterns but lack social grounding through participatory agency and causal grounding through embodied environmental contact. Social grounding can increase through human integration, but linguistic agency requires architectural changes beyond training.
AI output carries communicative markers inherited from training data but lacks the event structure that produces actual utterances. Users supply the missing orientation through interpretive labor, creating a pseudo-event with structure only on the human side.
Subjecthood is produced within communicative events, not possessed prior to them. This convergent position across philosophy, linguistics, and cognitive science inverts the standard picture of language as a tool used by pre-existing subjects.
Current disembodied LLMs cannot be candidates for consciousness because consciousness language originates from and applies only to entities sharing a world with us through co-presence and triangulation on shared objects.
CRSA integrates rate-distortion theory with RSA to enable bidirectional belief tracking across dialogue turns. Demonstrated on referential games and doctor-patient dialogues, it captures progression from partial to shared understanding, providing the information-theoretic framework that token-level LLM systems lack.