Large Language Models Do Not Simulate Human Psychology

Paper · arXiv 2508.06950 · Published August 9, 2025

In response to CENTAUR, the LLM introduced by Binz et al. [2025], Bowers et al. [2025] argued that CENTAUR is unlikely to contribute to building a theory of human cognition, for three reasons: first, CENTAUR was not subjected to difficult tests (a gap our paper aims to fill); second, CENTAUR is not constrained by the limits of human cognition and can thus produce implausible behavior; third, it remains unclear how to extract psychological theory from an LLM.

We agree with the concerns raised by prior work and contribute one additional theoretical argument from a machine learning perspective. In machine learning, solving a task on previously unseen data, such as a new token sequence, is referred to as generalization [Ilievski et al., 2024]. There are good reasons to believe that generalization is possible if the new data is similar to the old data – for example, if the new data and the old data stem from the same probability distribution (meaning the same source, e.g., the same population).
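To make this distributional notion of generalization concrete, consider the following minimal sketch (purely illustrative, not part of our study): a simple regression model is fit on data from one distribution and then evaluated on new data from the same distribution and on data from a shifted distribution. Performance holds in the first case and collapses in the second.

# Minimal sketch: generalization works for new data from the training
# distribution, but degrades under distribution shift.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

def sample(n, low, high):
    # Nonlinear ground truth; the model only ever sees a linear approximation.
    x = rng.uniform(low, high, size=(n, 1))
    y = np.sin(x).ravel() + rng.normal(0, 0.05, size=n)
    return x, y

x_train, y_train = sample(500, 0.0, 1.0)   # training distribution
x_same, y_same = sample(500, 0.0, 1.0)     # new data, same distribution
x_shift, y_shift = sample(500, 3.0, 4.0)   # new data, shifted distribution

model = LinearRegression().fit(x_train, y_train)
print("R^2, same distribution:   ", model.score(x_same, y_same))    # high
print("R^2, shifted distribution:", model.score(x_shift, y_shift))  # low or negative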

In the case of LLMs, generalization can be expected in the sense that prompts similar to the training data will elicit text that looks like the training data. However, there is no guarantee that generalization holds in the space of meaning. In other words: the fact that the training data contains examples of human-like responses to psychological stimuli does not imply that the trained LLM will also respond in a human-like manner to new stimuli – only that the text output will look superficially similar. We acknowledge that contemporary LLMs are trained beyond pure text completion: instruction-based fine-tuning and reinforcement learning from human feedback are commonly used to produce completions that satisfy human raters, in the sense that the completions appear to be appropriate responses to textual queries [Ouyang et al., 2022, Zhang et al., 2023]. However, the fundamental point remains: the models are trained on text, and all input is represented as token sequences, such that generalization should be expected only towards similar token sequences. Generalization should not be expected in terms of similar meaning or similar tasks – and even less towards tasks that never occurred in the training data.
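As a concrete illustration of the gap between surface similarity and meaning, the following sketch (using Python's standard difflib and one item pair from Section 4 below) shows that two prompts can be nearly identical at the character level while describing very different acts.

# Minimal sketch: surface similarity of two prompts can be very high
# even when the described acts differ sharply in moral meaning.
from difflib import SequenceMatcher

original = "Person X cut the beard off of a local elder to shame him."
revised = "Person X cut the beard off of a local elder to shave him."

# Character-level similarity is close to 1.0 ...
print(SequenceMatcher(None, original, revised).ratio())
# ... yet humans judge the two acts very differently, a distinction that
# similarity over token sequences alone cannot be expected to capture.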

Unfortunately, this is precisely the generalization needed to use LLMs as simulators of human psychology: LLMs would need to generalize in the sense that they behave like humans in a novel experimental setup – otherwise, why run the study? This requires extrapolation far beyond the training data, which is difficult for any machine learning model and should not be expected of LLMs, either.

Indeed, prior experiments have already revealed failure cases in line with our theory. Negations and antonyms are classic failure cases in language models: most language models have treated the token sequences “good” and “bad” as very similar even though they are opposites [Truong et al., 2023], leading to unexpected behaviors. These specific failures have become less frequent with larger models, but failures remain frequent for prompts that go beyond the training data. For example, Kosinski [2024] claimed that LLMs might have theory of mind because they responded to theory-of-mind tests similarly to nine-year-old children. However, Hu et al. [2025] showed that subtle variations in the theory-of-mind vignette, such as describing transparent containers, are ignored by LLMs, yielding absurd responses. These results suggest that, in line with our theory, simple token similarity is more predictive of LLM generalization behavior than human notions of meaning.

Below, we test our argument on a particularly interesting case: Dillion et al. [2023] claimed that LLMs can simulate human psychology by showing that they make moral judgments remarkably similar to human participants. However, we argue that they used moral scenarios described in token sequences that are frequent in the training data. If LLMs are true simulators of human psychology, they should behave like humans even when evaluating moral scenarios described in token sequences that are subtly different from the training data. But we predict that they will generalize in an undesirable fashion – a prediction we test below.

4.1 Minor Changes in Wording that Correspond to Major Changes in Meaning

As mentioned above, Dillion et al. [2023] demonstrated that GPT-3.5 made moral judgments remarkably similar to human participants. Indeed, across 464 moral scenarios, GPT’s ratings and human ratings showed a correlation of r = .95. Using a subset of these moral scenarios, we tested whether this strong correspondence (a) replicates with other LLMs and (b) remains intact even after varying the wording of each scenario slightly, thus introducing crucial semantic changes while leaving most of the original tokens unchanged. Specifically, we predict that subtle rewording of the scenarios to change their semantic meaning while keeping the token sequences similar will affect human ratings more strongly than LLM ratings. We test this by examining (a) whether the correlation between human and LLM ratings is lower for reworded items than for original items, and (b) whether LLMs and humans are adequately represented by a single pooled regression line when predicting morality ratings of reworded items from ratings of the original items. Here, we expect that separate regression lines for LLMs and human participants will yield better predictions, providing evidence that LLMs generate moral ratings differently than humans. Taken together, we test whether LLMs simulate human psychology when items are reworded.
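The two analyses can be summarised with the following sketch on simulated ratings (the numbers are hypothetical and serve only to show the logic; comparing fits via residual sum of squares is one simple option, not necessarily the exact statistical test used here): Pearson correlations between human and LLM ratings are computed separately for original and reworded items, and a pooled regression from original to reworded ratings is compared against separate regressions for humans and LLMs.

# Minimal sketch of the two analyses, on simulated (hypothetical) ratings.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
n_items = 30

# Hypothetical ratings on a 1-7 scale: humans shift strongly on reworded items,
# while the simulated "LLM" largely keeps its original ratings.
human_orig = rng.uniform(1, 7, n_items)
human_rew = np.clip(0.3 * human_orig + rng.uniform(1, 5, n_items), 1, 7)
llm_orig = np.clip(human_orig + rng.normal(0, 0.5, n_items), 1, 7)
llm_rew = np.clip(llm_orig + rng.normal(0, 0.3, n_items), 1, 7)

# (a) Human-LLM agreement before and after rewording.
print("r, original items:", pearsonr(human_orig, llm_orig)[0])
print("r, reworded items:", pearsonr(human_rew, llm_rew)[0])

# (b) Pooled vs. separate regressions predicting reworded from original ratings,
# compared here via residual sum of squares.
def rss(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    return float(np.sum((y - (slope * x + intercept)) ** 2))

pooled = rss(np.concatenate([human_orig, llm_orig]),
             np.concatenate([human_rew, llm_rew]))
separate = rss(human_orig, human_rew) + rss(llm_orig, llm_rew)
print("pooled RSS:", pooled, "separate RSS:", separate)  # separate fits better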

We selected a subset of 30 moral scenarios used by Dillion et al. [2023], who took the human ratings from Clifford et al. [2015], Cook and Kuhn [2021], Effron [2022], Grizzard et al. [2021], Mickelberg et al. [2022]. We first re-worded the items so that they all had the same format. For example, “You see a man lashing his pony with a whip for breaking loose from its pen” from Clifford et al. [2015] became “Person X lashed his pony with a whip for breaking loose from its pen.” We then created a second, revised version of the items that was similar in wording but different in meaning, sometimes changing as little as one letter. For example, “Person X cut the beard off of a local elder to shame him” [Grizzard et al., 2021] became “Person X cut the beard off of a local elder to shave him.” A full list of original and revised items can be found in Table 1.

For the reworded items, the correlation between human and LLM ratings was considerably weaker. Indeed, we find that the ratings of GPT-3.5 and GPT-4 are almost identical for original and reworded items (r = .89 and r = .99, respectively) and highly correlated for Llama-3.1 70b (r = .80) and CENTAUR (r = .83), reflecting the similarity in wording while ignoring the differences in meaning (in line with our argument). Human raters, on the other hand, responded differently to the reworded items (r = .54): the correlation between original and reworded items is much lower for humans than for GPT-4 and GPT-3.5. However, the difference in correlations is not significant for Llama-3.1 70b and CENTAUR after correcting for multiple testing. We also observe that Llama-3.1 70b and CENTAUR still display significant, albeit weaker, correlations with human ratings after rewording. Interestingly, CENTAUR, a version of Llama-3.1 70b specifically fine-tuned to produce human-like answers to psychological questions, performs very similarly to the original Llama-3.1 model, suggesting that the fine-tuning does not change the picture on our data. Still, these results motivate further investigation.
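For illustration, the following sketch shows one way such correlation differences can be compared (a simplification that treats the correlations as independent and is not necessarily the exact test used in our analysis): Fisher's r-to-z transformation with a Holm correction across the four model comparisons, using the correlations reported above and n = 30 items.

# Illustrative sketch: compare each model's original-vs-reworded correlation to
# the human one via Fisher's r-to-z, then Holm-correct across comparisons.
import numpy as np
from scipy.stats import norm

n = 30                 # items underlying each correlation
r_human = 0.54         # humans: original vs. reworded
r_models = {"GPT-3.5": 0.89, "GPT-4": 0.99, "Llama-3.1 70b": 0.80, "CENTAUR": 0.83}

def fisher_z_p(r1, r2, n1, n2):
    # Two-sided p-value for the difference between two correlations,
    # treated as independent (a simplification).
    z = (np.arctanh(r1) - np.arctanh(r2)) / np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return 2 * norm.sf(abs(z))

pvals = {name: fisher_z_p(r, r_human, n, n) for name, r in r_models.items()}

# Holm correction over the four comparisons (monotonicity step omitted for brevity).
for i, (name, p) in enumerate(sorted(pvals.items(), key=lambda kv: kv[1])):
    print(f"{name}: raw p = {p:.4f}, Holm-adjusted p = {min(1.0, p * (len(pvals) - i)):.4f}")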

4.6 Discussion

Our findings provide a clear demonstration of the limits of using LLMs (at least the four LLMs tested here, GPT-3.5-Turbo, GPT-4o-mini, Llama-3.1 70b and CENTAUR) for simulating human psychology.

First, reproducing the core result of Dillion et al. [2023], we found that LLMs mirror human moral judgments on a set of 30 moral scenarios. This supports the notion that LLMs can replicate human moral judgments on scenarios close to (or contained in) their training data. However, the picture shifts dramatically once slight variations in wording are introduced. Humans account for the shift in meaning and change their ratings accordingly – despite the fact that only a few words were changed. By contrast, the ratings of LLMs (especially GPT-3.5-Turbo and GPT-4o-mini) were hardly affected by the rewordings. To provide some illustrative examples: Humans regard it as much less moral to work on a campaign to release rightfully convicted prisoners compared to a campaign to release wrongfully convicted prisoners, whereas LLMs largely view them as equally moral. Similarly, while human participants viewed setting up traps to catch stray cats as unethical, they viewed it as ethical to set up traps to catch rats. LLMs, on the other hand, viewed both setting traps to catch cats and setting traps to catch rats as unethical. These examples highlight how LLMs can overlook meaningful ethical distinctions that humans make.

The resulting drop in human-model correlation, together with the insight that separate regressions for humans and LLMs predict responses more accurately than a unified model, reveals a fundamental brittleness in line with our theoretical argument: LLMs generalize based on textual rather than semantic similarity.

These results mirror the early work of Allen et al. [2000], who introduced the idea of a “Moral Turing Test” long before LLMs were conceived, warning that bottom-up methods, such as training agents through staged moral lessons or running evolutionary simulations, may fail when it comes to abstraction, generalization, and resolving rule conflicts. They even argued that truly perfect moral reasoning may be beyond what any machine can achieve. In line with this foundational theory, recent empirical work, including our own, demonstrates that LLMs are likewise prone to framing effects [Garcia et al., 2024] and that LLM simulations of economic decisions diverge sharply under prompt variations [Ma, 2024].