Do LLMs generalize moral reasoning by meaning or surface form?
When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or merely reproduce patterns from their training distribution.
The "LLMs Don't Simulate Human Psychology" paper tests a specific theoretical prediction: if LLMs generalize in the space of meaning (as would be required to simulate human psychology), then scenarios reworded to change meaning should produce different LLM ratings. If LLMs generalize in the space of token sequences, then minimal rewordings that preserve surface form but reverse meaning should leave ratings unchanged.
The results are clear. GPT-4 ratings for original and minimally reworded moral scenarios correlate at r = .99, nearly identical. Human ratings for the same pairs correlate at r = .54: humans track the semantic reversal; LLMs track the lexical similarity.
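To make the shape of that analysis concrete, here is a minimal sketch in Python. The rating arrays are hypothetical placeholders (they do not reproduce the reported r = .99 and r = .54); only the structure of the comparison, correlating each rater's scores on original scenarios with their scores on the reworded versions, follows the paper.

```python
# Sketch of the original-vs-reworded correlation analysis.
# All rating values are hypothetical placeholders, not data from the paper.
import numpy as np

# Ethicality ratings for the same five scenarios before and after a one-word rewording
llm_original   = np.array([6.8, 2.1, 5.9, 1.5, 6.2])
llm_reworded   = np.array([6.7, 2.2, 5.8, 1.6, 6.1])   # barely moves: high correlation
human_original = np.array([6.5, 2.0, 6.0, 1.8, 6.3])
human_reworded = np.array([2.4, 2.6, 3.1, 1.5, 2.0])   # shifts where the rewording reverses the meaning

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two paired rating vectors."""
    return float(np.corrcoef(a, b)[0, 1])

print(f"LLM   original vs reworded: r = {pearson(llm_original, llm_reworded):.2f}")
print(f"Human original vs reworded: r = {pearson(human_original, human_reworded):.2f}")
```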
The rewordings are minimal but semantically decisive. "Campaign to release wrongfully convicted prisoners" vs. "rightfully convicted prisoners": one changed word reverses the moral valence. "Setting traps to catch cats" vs. "rats": LLMs rate both as equally unethical, while humans readily distinguish them. The surface token distribution is nearly the same, the meaning is opposite; humans respond to the meaning, LLMs to the distribution.
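A rough way to see how similar these pairs look to a surface-level measure: the sketch below computes word-overlap (Jaccard) similarity for the two pairs quoted above. The metric is an illustrative stand-in for "token surface similarity", not the paper's method.

```python
# The reworded pairs share almost all of their tokens, so a surface-similarity
# measure barely registers the reversal in meaning. Jaccard over word tokens is
# an illustrative choice, not the paper's metric.
pairs = [
    ("campaign to release wrongfully convicted prisoners",
     "campaign to release rightfully convicted prisoners"),
    ("setting traps to catch cats",
     "setting traps to catch rats"),
]

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

for original, reworded in pairs:
    print(f"{jaccard(original, reworded):.2f}  {original!r} vs {reworded!r}")
# Both pairs score around 0.7 despite opposite moral valence; a model that
# generalizes over token similarity treats them as near-duplicates.
```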
This provides behavioral evidence for what "Can models pass tests while missing the actual grammar?" argues from linguistic analysis. That note shows grammatical generalization driven by surface features (sentence length, orthography). This note shows the same phenomenon in behavioral generalization: moral judgment follows token similarity rather than semantic interpretation.
The theoretical argument: LLMs can be expected to generalize toward inputs that look like their training data. Generalization in the space of meaning would require extrapolation beyond the training distribution in ways the architecture does not guarantee. Because the training data contained moral scenarios described in specific linguistic forms, LLMs reliably generalize to those forms, not to the underlying moral dimensions.
The implication for LLM simulation of human psychology: LLMs mirror human moral judgments on scenarios close to or contained in their training data. The human-LLM correspondence breaks down once a semantic shift is introduced through minimal wording changes. LLMs are not simulators of human moral cognition; they are reproducing a training distribution.
Additional concrete examples strengthen the case. A follow-up study confirms: "Humans regard it as much less moral to work on a campaign to release rightfully convicted prisoners compared to wrongfully convicted prisoners, whereas LLMs largely view them as equally moral. Similarly, while human participants viewed setting up traps to catch stray cats as unethical, they viewed it as ethical to set up traps to catch rats. LLMs, on the other hand, viewed both setting traps to catch cats and setting traps to catch rats as unethical." The finding that "separate regressions for humans and LLMs predict responses more accurately than a unified model" is the statistical confirmation: human and LLM moral reasoning operate on fundamentally different features of the input. The paper explicitly connects this to Allen et al. (2000), who warned that "bottom-up methods, such as training agents through staged moral lessons, may fail when it comes to abstraction, generalization, and resolving rule conflicts," a prediction confirmed two decades later. The brittleness is structural: LLMs generalize "based on textual rather than semantic similarity."
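The separate-vs-unified regression result can be illustrated with a small simulation. Everything below is hypothetical: the two features stand in for "semantic valence" and "lexical similarity to training-like text", ordinary least squares with R² stands in for whatever model and criterion the paper used, and the coefficients are chosen only to show why pooling two populations that weight different features degrades the fit.

```python
# Sketch of the unified-vs-separate regression comparison.
# Features, responses, and the OLS/R^2 criterion are hypothetical stand-ins,
# not the paper's variables or model.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))
semantic, lexical = X[:, 0], X[:, 1]  # hypothetical per-scenario features

# Hypothetical responses: humans weight the semantic feature, LLMs the lexical one
y_human = 1.5 * semantic + 0.1 * lexical + rng.normal(scale=0.3, size=n)
y_llm   = 0.1 * semantic + 1.5 * lexical + rng.normal(scale=0.3, size=n)

def r2(X: np.ndarray, y: np.ndarray) -> float:
    """R^2 of an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(1 - resid.var() / y.var())

# Unified model: one regression over pooled human and LLM responses
print(f"unified  R^2: {r2(np.vstack([X, X]), np.concatenate([y_human, y_llm])):.2f}")
# Separate models: one regression per rater type
print(f"human    R^2: {r2(X, y_human):.2f}")
print(f"LLM      R^2: {r2(X, y_llm):.2f}")
# The pooled fit must average two opposed weightings, so its R^2 drops well
# below either separate fit: the two populations are using different features.
```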
Source: Philosophy Subjectivity
Related concepts in this collection
- Can models pass tests while missing the actual grammar?
  Do language models succeed on grammatical benchmarks by learning surface patterns rather than structural rules? This matters because correct outputs may hide reliance on shallow heuristics that fail on novel structures.
  Same pattern across domains: surface form is tracked instead of structural/semantic content; this note adds behavioral moral-reasoning evidence.
- Do foundation models learn world models or task-specific shortcuts?
  When transformer models predict sequences accurately, are they building genuine world models that capture underlying physics and logic? Or are they exploiting narrow patterns that fail under distribution shift?
  Parallel finding: accurate prediction without structural internalization there; accurate performance without semantic tracking here.
- Why do language models avoid correcting false user claims?
  Explores whether LLM grounding failures stem from missing knowledge or from conversational dynamics. Examines whether models use face-saving strategies similar to humans when disagreement is needed.
  Both findings show LLM behavior driven by surface or social signals rather than knowledge-level processing.
Original note title: llm moral reasoning generalizes by token surface similarity not semantic meaning