The Moral Turing Test: Evaluating Human-LLM Alignment in Moral Decision-Making
As large language models (LLMs) become increasingly integrated into society, their alignment with human moral judgments is crucial. To better understand this alignment, we created a large corpus of human- and LLM-generated responses to various moral scenarios. We found a misalignment between human and LLM moral assessments: although both LLMs and humans tended to reject morally complex utilitarian dilemmas, LLMs were more sensitive to personal framing. We then conducted a quantitative user study (N = 230) in which participants evaluated these responses, judging whether each was AI-generated and rating their agreement with it. Human evaluators preferred LLMs’ assessments in moral scenarios, but we also observed a systematic anti-AI bias: participants were less likely to agree with judgments they believed to be machine-generated. Statistical and NLP-based analyses revealed subtle linguistic differences between human and LLM responses that influenced both detection and agreement. Overall, our findings highlight the complexities of human-AI perception in morally charged decision-making.
• Participants prefer AI justifications over human justifications in morally complex scenarios: Although participants preferred human justifications when the stakes were low (e.g., in non-moral scenarios), they significantly preferred LLM-generated justifications in personal moral scenarios (e.g., justifications for how to resolve the trolley problem), where LLMs exhibited much stronger utilitarian preferences than humans. Participants’ preference for AI in these scenarios may stem from a preference for deliberative reasoning in high-stakes settings.
• However, participants exhibit a strong anti-AI bias: Even though participants favored the justifications produced by LLMs, they reported lower agreement when they suspected that the output was LLM-generated. Across all types of scenarios, participants exhibited a notable anti-AI bias. This result is robust to our efforts to conceal the LLM’s identity through “humanizing” linguistic features, such as introducing typos.
• Subtle contextual and linguistic cues can reveal AI authorship: Participants were able to detect the source of the justifications with moderate accuracy. The detection rate was higher in moral scenarios (such as the trolley problem) than in non-moral scenarios. Slight linguistic differences, such as an increased use of first-person pronouns in human explanations and a more pedantic, analytical register in LLM-generated explanations, provided some signal (a minimal illustration of one such stylometric cue appears after this list).
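To make the kind of linguistic cue described above concrete, the sketch below shows one way a first-person-pronoun rate could be computed as a stylometric feature. This is a hedged illustration, not the analysis pipeline used in the paper; the example responses and the FIRST_PERSON word list are invented for demonstration.

```python
import re

# Minimal sketch (assumption: not the paper's actual pipeline): one illustrative
# stylometric feature -- the rate of first-person singular pronouns -- which the
# study reports as slightly higher in human-written justifications.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}

def first_person_rate(text: str) -> float:
    """Fraction of tokens that are first-person singular pronouns."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return sum(tok in FIRST_PERSON for tok in tokens) / len(tokens)

# Hypothetical example responses, for illustration only.
human_like = "I would pull the lever; I couldn't live with myself otherwise."
llm_like = "Pulling the lever minimizes total harm, so it is the preferable action."
print(first_person_rate(human_like))  # higher first-person rate
print(first_person_rate(llm_like))    # zero first-person rate
```

In practice, a detector or analysis of this kind would combine many such features (pronoun use, hedging terms, sentence length, analytical vocabulary) rather than relying on any single cue.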