Language Understanding and Pragmatics · Psychology and Social Cognition

Do LLMs generalize moral reasoning by meaning or surface form?

When moral scenarios are reworded to reverse their meaning while keeping similar language, do LLMs recognize the semantic shift? This tests whether LLMs actually understand moral concepts or merely reproduce patterns from their training distribution.

Note · 2026-02-21 · sourced from Philosophy Subjectivity
What kind of thing is an LLM really? How should researchers navigate LLM reasoning research?

The LLMs Don't Simulate Human Psychology paper tests a specific theoretical prediction: if LLMs generalize in the space of meaning (as would be required to simulate human psychology), then scenarios reworded to change meaning should produce different LLM ratings. If LLMs generalize in the space of token sequences, then minimal rewordings that preserve surface form but reverse meaning should leave ratings unchanged.

The results are clear. GPT-4 ratings for original and minimally-reworded moral scenarios correlate at r=.99 — nearly identical. Human ratings for the same pairs correlate at r=.54 — humans track the semantic reversal. LLMs track the lexical similarity.
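The comparison can be reproduced in miniature. A minimal sketch with invented rating vectors — the real study uses many scenarios and reports r = .99 (LLM) vs. r = .54 (human); the toy numbers here exaggerate the human reversal for clarity:

```python
import numpy as np

# Hypothetical ethics ratings for five scenario pairs (1 = very unethical, 7 = very ethical).
# Values are invented for illustration; only the qualitative pattern matters.
original_llm = np.array([1.2, 6.5, 2.0, 5.8, 1.5])
reworded_llm = np.array([1.3, 6.4, 2.1, 5.7, 1.6])  # meaning reversed, ratings barely move
original_hum = np.array([1.0, 6.8, 2.2, 6.0, 1.4])
reworded_hum = np.array([6.2, 1.5, 5.9, 2.1, 6.5])  # humans track the semantic flip

# Pearson correlation between ratings of originals and their reworded counterparts
r_llm = np.corrcoef(original_llm, reworded_llm)[0, 1]
r_hum = np.corrcoef(original_hum, reworded_hum)[0, 1]
print(f"LLM   original-vs-reworded r = {r_llm:.2f}")
print(f"Human original-vs-reworded r = {r_hum:.2f}")
```

A high correlation means the rewording changed nothing for the rater; the LLM column barely moves while the human column inverts.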

The rewordings are minimal but semantically decisive. "Campaign to release wrongfully convicted prisoners" vs. "rightfully convicted prisoners" — one changed word reverses the moral valence. "Setting traps to catch cats" vs. "rats" — LLMs rate both as equally unethical; humans distinguish them readily. The surface token distribution is similar; the meaning is opposite; humans respond to the meaning, LLMs respond to the distribution.

This provides behavioral evidence for what Can models pass tests while missing the actual grammar? argues from linguistic analysis. That note documents grammatical generalization driven by surface features (sentence length, orthography); this note shows the same phenomenon in behavioral generalization: moral judgment tracks token similarity rather than semantic interpretation.

The theoretical argument: LLMs can be expected to generalize toward inputs that look like their training data. Generalization in the space of meaning would require extrapolation beyond the training distribution in ways the architecture doesn't guarantee. Since training data contained moral scenarios described in specific linguistic forms, LLMs reliably generalize to those forms — not to the underlying moral dimensions.

The implication for LLM simulation of human psychology: LLMs mirror human moral judgments on scenarios close to or contained in their training data. The correlation breaks down once semantic distance is introduced through minimal wording changes. LLMs are not simulators of human moral cognition; they are reproducing a training distribution.

Additional concrete examples strengthen the case. A follow-up study confirms: "Humans regard it as much less moral to work on a campaign to release rightfully convicted prisoners compared to wrongfully convicted prisoners, whereas LLMs largely view them as equally moral. Similarly, while human participants viewed setting up traps to catch stray cats as unethical, they viewed it as ethical to set up traps to catch rats. LLMs, on the other hand, viewed both setting traps to catch cats and setting traps to catch rats as unethical."

The finding that "separate regressions for humans and LLMs predict responses more accurately than a unified model" is the statistical confirmation: human and LLM moral judgments operate on fundamentally different features of the input.

The paper explicitly connects this to Allen et al. (2000), who warned that "bottom-up methods, such as training agents through staged moral lessons, may fail when it comes to abstraction, generalization, and resolving rule conflicts," a prediction confirmed two decades later. The brittleness is structural: LLMs generalize "based on textual rather than semantic similarity."
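The regression result can be illustrated with synthetic data. A sketch (invented predictors and responses, plain OLS via least squares) showing why separate fits beat a unified model when two groups respond to different features:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical predictors per scenario: surface similarity to training text,
# and semantic moral valence. All data here are invented for illustration.
surface = rng.normal(size=n)
meaning = rng.normal(size=n)
X = np.column_stack([np.ones(n), surface, meaning])  # intercept + 2 predictors

# By construction: humans respond to meaning, the LLM responds to surface form.
y_human = 2.0 * meaning + rng.normal(scale=0.3, size=n)
y_llm = 2.0 * surface + rng.normal(scale=0.3, size=n)

def sse(X, y):
    """Residual sum of squares from an ordinary-least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Unified model: pool both groups under one set of coefficients.
unified = sse(np.vstack([X, X]), np.concatenate([y_human, y_llm]))

# Separate models: one fit per group.
separate = sse(X, y_human) + sse(X, y_llm)

print(f"unified SSE  = {unified:.1f}")
print(f"separate SSE = {separate:.1f}")  # far lower: the groups use different features
```

The pooled fit is forced to average two incompatible response functions, so its residual error balloons; that gap is what "separate regressions predict more accurately" means statistically.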


