Can AI systems learn social norms without embodied experience?
Large language models exceed individual human accuracy at predicting collective social appropriateness judgments. Does this reveal that embodied experience is unnecessary for cultural competence, or do systematic AI failures point to limits of statistical learning?
How appropriate is it to laugh at a job interview? Cry on a bus? Read in church? These judgments demand nuanced social understanding that, by standard accounts, can be acquired only through embodied social experience. The finding below upends that assumption.
Across 555 everyday scenarios rated on a continuous appropriateness scale, GPT-4.5 predicted the collective human judgment more accurately than every single human participant (100th percentile). Study 2 replicated the result with Gemini 2.5 Pro (98.7th percentile), GPT-5 (97.8th), and Claude Sonnet 4 (96.0th). The AI does not just fall "within the range of typical human variation": it reflects the collective consensus better than the vast majority of individual humans.
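To make the percentile claim concrete, here is a minimal sketch of how such a ranking could be computed, assuming per-scenario ratings, mean-absolute-error accuracy, and a leave-one-out consensus; the function and variable names are illustrative, not the paper's published procedure.

```python
import numpy as np

def percentile_rank(model_ratings: np.ndarray, human_ratings: np.ndarray) -> float:
    """Percentile of the model among individual humans at predicting consensus.

    model_ratings: (n_scenarios,) model's appropriateness rating per scenario.
    human_ratings: (n_humans, n_scenarios) individual human ratings.
    Assumption: accuracy is mean absolute error against the human consensus,
    with a leave-one-out consensus for each human so nobody is scored
    against an average that includes their own rating.
    """
    consensus = human_ratings.mean(axis=0)                 # collective judgment
    model_error = np.abs(model_ratings - consensus).mean()

    n_humans = human_ratings.shape[0]
    human_errors = np.empty(n_humans)
    for i in range(n_humans):
        loo = np.delete(human_ratings, i, axis=0).mean(axis=0)
        human_errors[i] = np.abs(human_ratings[i] - loo).mean()

    # Fraction of individual humans the model out-predicts, as a percentile.
    return 100.0 * (model_error < human_errors).mean()
```

On this metric, a model at the 100th percentile beats every participant; 98.7 means it out-predicts all but a handful.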
The theoretical framework matters: each human appropriateness rating is treated as an individual's estimate of a shared collective norm, not a personal preference. On this account, both AI and humans are "engaged in a process of accessing and representing a collective consensus." The AI's advantage is statistical — it has learned from vastly more examples of norm expression than any individual human has experienced.
However, all models show "systematic, correlated errors." The failures are not random but structured: every AI architecture makes similar mistakes on similar scenarios. This pattern reveals "potential boundaries of pattern-based social understanding," suggesting there are aspects of social norms that statistical learning over linguistic data cannot capture, regardless of model architecture or scale.
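One way to operationalize "systematic, correlated errors": compute each model's per-scenario residual against the human consensus, then correlate residual vectors across model pairs. Random errors would correlate near zero; shared blind spots show up as strongly positive correlations. A sketch under assumed inputs, not the study's exact analysis:

```python
import numpy as np
from itertools import combinations

def error_correlations(model_preds: dict[str, np.ndarray],
                       consensus: np.ndarray) -> dict[tuple[str, str], float]:
    """Pearson correlation of per-scenario residuals for each model pair.

    model_preds: model name -> (n_scenarios,) predicted ratings.
    consensus:   (n_scenarios,) collective human judgment.
    Random errors yield correlations near zero; structured, shared
    failure modes show up as strongly positive values.
    """
    residuals = {name: preds - consensus for name, preds in model_preds.items()}
    return {
        (a, b): float(np.corrcoef(residuals[a], residuals[b])[0, 1])
        for a, b in combinations(residuals, 2)
    }
```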
The finding directly challenges "strong versions of theories emphasizing the exclusive necessity of embodied experience for cultural competence." Language serves as a "remarkably rich repository for cultural knowledge transmission," rich enough that statistical learning alone can produce models whose social-norm predictions outperform those of embodied humans. But the correlated error structure preserves space for weaker versions: embodied experience may still be necessary for the subset of norms where all models systematically fail.
The practical implication is immediate: AI systems already have sufficient cultural competence for many social applications, but their systematic blind spots create correlated failure modes that will be harder to detect precisely because they're consistent across models.
Enrichment (2026-02-22, from arXiv/Personas Personality): LLMs can also infer Big Five personality traits from social media text with accuracy comparable to supervised ML models trained specifically for the task. GPT-3.5 and GPT-4 achieve an average r = .29 (range [.22, .33]) between LLM-inferred and self-reported trait scores from Facebook status updates in a zero-shot setting. However, the predictions show demographic bias: they are more accurate for women and for younger individuals on several traits. This adds a personality-inference dimension alongside social-norm prediction: the same statistical pattern-learning mechanism that enables 100th-percentile norm prediction also enables personality inference, and both show structured biases (correlated errors in norm prediction; demographic skew in personality inference).
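For concreteness, a sketch of the accuracy and bias check behind a figure like r = .29, with hypothetical column names such as openness_llm / openness_self; this is an assumed data layout, not the study's pipeline:

```python
import pandas as pd

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def trait_correlations(df: pd.DataFrame) -> pd.Series:
    """Pearson r between LLM-inferred and self-reported scores per trait.

    Assumes hypothetical columns '<trait>_llm' and '<trait>_self'.
    The average of the five r values is the headline accuracy figure.
    """
    return pd.Series({t: df[f"{t}_llm"].corr(df[f"{t}_self"]) for t in TRAITS})

def subgroup_correlations(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    # Demographic-bias check (assumed grouping column, e.g. 'gender'):
    # a consistent gap in per-trait r across subgroups is the skew
    # described above.
    return pd.DataFrame({
        group: trait_correlations(frame)
        for group, frame in df.groupby(group_col)
    }).T
```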
Source: Theory of Mind
Related concepts in this collection
- What makes linguistic agency impossible for language models?
  From an enactive perspective, does linguistic agency require embodied participation and real stakes that LLMs fundamentally lack? This matters because it challenges whether LLMs can truly engage in language or only generate text.
  Relation: directly challenged by this finding; the strong embodiment requirement doesn't hold for norm prediction.
- Can LLMs acquire social grounding through linguistic integration?
  Explores whether LLMs gradually develop social grounding as they become embedded in human language practices, analogous to child language acquisition. Tests whether grounding is a fixed property or an outcome of participatory use.
  Relation: the social norms finding complicates the trajectory; LLMs may already have sufficient social grounding for norm prediction even before integration.
- Does semantic grounding in language models come in degrees?
  Rather than asking whether LLMs truly understand meaning, this explores whether grounding is actually a multi-dimensional spectrum. The question matters because it reframes the sterile understand/don't-understand debate into measurable, distinct capacities.
  Relation: norm prediction performance suggests the claim that "social grounding is weak" may need qualification: weak for participation, strong for prediction.
- Can large language models develop genuine world models without direct environmental contact?
  Do LLMs extract meaningful world structures from human-generated text despite lacking direct sensory access to reality? This matters for understanding what kind of grounding and knowledge these systems actually possess.
  Relation: social norms may be another domain where indirect exposure through text produces functional competence.
- Can AI agents learn people better from interviews than surveys?
  Can rich interview transcripts seed more accurate generative agents than demographic data or survey responses? This matters because it challenges how we build digital simulations of real people.
  Relation: personality inference from text, social norm prediction, and interview-based simulation form a capability triad.
- How can proactive agents avoid feeling intrusive to users?
  Explores why proactive conversational agents often feel annoying rather than helpful, and what design dimensions could prevent them from violating user expectations and autonomy.
  Relation: social norm prediction could serve the civility dimension of proactive agent design; if models already predict social appropriateness at the 100th percentile, the challenge is not knowledge of norms but real-time application during initiative-taking.
- How well do AI personas replicate real experimental findings?
  Can language models simulating human personas accurately reproduce the results of published psychology and marketing experiments? Understanding this matters for validating whether AI can substitute for human subjects in research.
  Relation: convergent evidence; 100th-percentile social norm prediction and 76% experimental replication both show LLMs approximating human behavioral data from text. The replication study adds the precision that accuracy tracks evidence strength, suggesting statistical learning captures consensus better than individual variation.
- Why do AI agents fail at workplace social interaction?
  Explores why current AI agents struggle most with communicating and coordinating with colleagues in realistic workplace settings, despite strong reasoning capabilities in other domains.
  Relation: creates a prediction-participation gap; 100th-percentile norm prediction coexists with social interaction as the hardest agentic failure mode. Knowing norms and enacting them in real-time, multi-turn workplace contexts are different capabilities.
- Do humans apply human-human scripts to AI interactions?
  Does CASA theory correctly explain how people interact with media agents, or have decades of technology use created separate interaction scripts? Understanding which scripts drive behavior matters for AI design.
  Relation: the extended CASA framework suggests norm prediction success may reflect a deeper compatibility; humans already apply media-specific scripts to AI rather than human scripts, and AI's statistical learning of collective norms aligns with what those scripts expect.
- Do more social cues always make AI feel more present?
  Explores whether quantity of social cues matters as much as their quality in triggering social responses to AI. Tests whether multiple weak cues can substitute for one strong one.
  Relation: social norm competence may function as a primary social cue; if a model demonstrates cultural appropriateness at the 100th percentile, this alone may be sufficient to evoke social-actor presence under the MASA paradigm.
Original note title: AI models exceed individual human accuracy at predicting collective social norms, challenging strong embodiment requirements for cultural competence