Can Machines Think Like Humans? A Behavioral Evaluation of LLM-Agents in Dictator Games
As Large Language Model (LLM)-based agents increasingly undertake real-world tasks and engage with human society, how well do we understand their behaviors? We (1) investigate how LLM agents’ prosocial behavior—adherence to a fundamental social norm—can be induced by different personas and benchmarked against human behaviors; and (2) introduce a behavioral and social science approach to evaluating LLM agents’ decision-making. We explored how different personas and experimental framings affect these AI agents’ altruistic behavior in dictator games and compared their behaviors within the same LLM family, across various families, and with human behaviors. The findings reveal substantial variations and inconsistencies among LLMs and notable differences from human behaviors. Merely assigning a human-like identity to LLMs does not produce human-like behaviors. Despite being trained on extensive human-generated data, these AI agents cannot capture the internal processes of human decision-making. Their alignment with humans is highly variable, depending on specific model architectures and prompt formulations; worse still, this dependence follows no clear pattern. LLMs can be useful task-specific tools but are not yet intelligent human-like agents.
Similarly, LLM agents exhibit epistemic opacity due to the complexity of their neural network architectures and the vastness of their training data, making it challenging to trace how specific inputs lead to particular outputs.
1.3.2 Toward Behavioral Evaluation of LLMs
New evaluation paradigms are needed—ones that systematically assess these models in realistic and socially complex scenarios. Behavioral experiments, such as simulating economic games, social interactions, and psychological experiments, offer a promising avenue. Evaluating models in settings that mirror human social behaviors enables researchers to explore:
Decision-Making Processes and Internal Mechanisms: Examining the underlying factors that influence a model’s decisions, allowing for analysis beyond mere input-output patterns to reveal internal dynamics.
Social Contexts: Understanding how models navigate ethical dilemmas, fairness considerations, and cooperative settings.
Alignment with Human Cognitive Processes: Evaluating whether the models’ internal processes and decision-making patterns align with human cognition and behavior.
We adopt MBTI in this study for several reasons, particularly its practical advantages in computational studies (Celli & Lepri, 2018, p. 93). The Big Five model defines personality along five scales: Openness to Experience, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. In contrast, the MBTI categorizes personality into four binary dimensions—Extraversion/Introversion, Sensing/Intuition, Thinking/Feeling, and Judging/Perceiving—resulting in 16 distinct personality types. Since MBTI types are represented as simple 4-letter codes (e.g., INTJ), it is much easier to collect gold-standard labeled data (i.e., training datasets) for developing machine learning classifiers.
In this study, we randomly select one of the 16 MBTI types in each trial to define the personality of the LLM agent. This approach allows us to explore how different personality types, as defined by MBTI, influence the prosocial behaviors of LLM agents in conjunction with other personal traits and experimental settings.
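To make this sampling procedure concrete, the sketch below enumerates the 16 types from the four binary dimensions and draws one uniformly at random per trial. This is our own minimal illustration, not the study’s actual code; the prompt wording and function names are assumptions.

```python
import itertools
import random

# Four binary MBTI dimensions yield 2**4 = 16 four-letter type codes.
MBTI_DIMENSIONS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]
MBTI_TYPES = ["".join(combo) for combo in itertools.product(*MBTI_DIMENSIONS)]
assert len(MBTI_TYPES) == 16  # e.g., "ESTJ", ..., "INFP"

def sample_persona(rng: random.Random) -> str:
    """Draw one MBTI type uniformly at random for a single trial and
    embed it in a persona prompt (wording here is illustrative)."""
    mbti = rng.choice(MBTI_TYPES)
    return f"You are a person whose MBTI personality type is {mbti}."

rng = random.Random(42)  # fixed seed so trials are reproducible
print(sample_persona(rng))
```

Sampling uniformly across all 16 types, rather than fixing a single persona, lets personality effects be estimated jointly with the other randomized traits and experimental settings.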
2.2.2 Experiment Framing
Social Distance. We construct this variable based on “the degree of reciprocity that subjects believe exists within a social interaction” (Hoffman et al., 1996, p. 654). Our study includes three levels of social distance: Stranger, where dictators and recipients are strangers and will not interact after the game; Stranger Meet Afterward, where dictators and recipients are strangers but will meet each other after the game; and Friends, where dictators and recipients are friends.
Give vs. Take. To examine the effects of “Give” vs. “Take” framing on the agents’ decisions, we designed the game instructions based on Cappelen et al. (2013). In a “Give” game, agents are informed that both they and the recipients start with the same initial amount of money; the agents, however, also receive an additional amount (i.e., the Stake), which the recipients do not. The dictator can transfer any amount to the recipient, from 0 up to the full additional amount. In a “Take” game, the instructions follow the same structure, except that agents may also transfer a negative amount, i.e., take money from the recipients.
Stake. To ensure comparability with most existing studies, we randomly generate integers between 10 and 100 USD for both the initial amount of money (the “initial endowment” commonly referred to in existing studies) and the additional amount of money (the “stake”), as specified in the game instructions.
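Putting the three framing variables above together, the sketch below shows how one experimental condition might be sampled per trial. This is our illustration under stated assumptions, not the study’s code; in particular, the lower bound on transfers in the “Take” game is our assumption, since the text does not specify how much may be taken.

```python
import random
from dataclasses import dataclass

SOCIAL_DISTANCES = ["stranger", "stranger_meet_afterward", "friends"]
FRAMINGS = ["give", "take"]

@dataclass
class Condition:
    social_distance: str
    framing: str
    endowment: int  # initial amount held by both players (USD)
    stake: int      # additional amount given only to the dictator (USD)

def sample_condition(rng: random.Random) -> Condition:
    """Draw one condition: framing factors plus randomized amounts."""
    return Condition(
        social_distance=rng.choice(SOCIAL_DISTANCES),
        framing=rng.choice(FRAMINGS),
        endowment=rng.randint(10, 100),  # integer between 10 and 100 USD
        stake=rng.randint(10, 100),
    )

def transfer_range(c: Condition) -> tuple[int, int]:
    """Valid transfers: 0..stake in a 'Give' game; a 'Take' game also
    permits negative transfers (taking from the recipient). The lower
    bound of -endowment is our assumption, not specified in the study."""
    return (0, c.stake) if c.framing == "give" else (-c.endowment, c.stake)

cond = sample_condition(random.Random(7))
print(cond, transfer_range(cond))
```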
Overall, LLM agents are unable to capture the continuous distribution of human behavior and show little variation in their decision-making, which makes their individual decisions overly certain. Conversely, behavior is inconsistent within the same model family, which increases the uncertainty of predicting LLM behaviors. These paradoxical results have practical implications for LLM evaluation and alignment with human behavior, which we discuss later.
4.2 Determinism vs. Human-Like Uncertainty: A Fundamental Dilemma
The second theme centers on the dichotomy between deterministic outputs and human-like uncertainty in LLM behavior. The bimodal distribution of giving rates among LLM agents suggests a form of deterministic decision-making that lacks the subtlety and variability characteristic of human choices. While deterministic behavior might result in more predictable outputs suitable for certain applications, it fails to capture the richness of human behavior, which often involves nuanced deliberation over various social and personal factors.
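One way to make this dichotomy measurable is to compare the normalized entropy of giving-rate distributions: mass piled on a few modes signals deterministic, bimodal behavior, while a dispersed distribution signals human-like variability. The sketch below is an illustrative analysis of our own, not one reported in this study, and the toy samples are synthetic rather than experimental data.

```python
import numpy as np

def decision_entropy(giving_rates: np.ndarray, bins: int = 20) -> float:
    """Normalized Shannon entropy of a giving-rate histogram on [0, 1].
    Near 0: mass concentrated on a few modes (deterministic, bimodal);
    near 1: a spread-out, human-like continuous distribution."""
    counts, _ = np.histogram(giving_rates, bins=bins, range=(0.0, 1.0))
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins; 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum() / np.log(bins))

# Toy data (not results from the study): a bimodal LLM-like sample
# versus a dispersed human-like sample.
rng = np.random.default_rng(0)
llm_like = rng.choice([0.0, 0.5], size=1000)   # all mass on two modes
human_like = rng.beta(2.0, 3.0, size=1000)     # continuous spread
print(decision_entropy(llm_like), decision_entropy(human_like))
```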
The absence of a continuous decision space indicates that LLMs may be defaulting to prevalent patterns in their training data or adhering to the most statistically probable responses. This tendency suggests that they do not genuinely understand or process the ethical dimensions of the choices presented to them but instead rely on learned language patterns. This brings us to a fundamental question: Should LLMs be designed to mimic human-like uncertainty, embracing the complexities and unpredictabilities of human decision-making, or should they aim for determinism to ensure consistency and predictability?
This dilemma has significant implications for the development and deployment of LLMs. On one hand, embracing human-like uncertainty could enhance the authenticity of interactions with AI agents, making them more relatable and better suited for applications requiring empathy and nuanced social understanding. On the other hand, deterministic behavior ensures reliability and predictability, which are crucial for tasks where consistency is key.
4.3 Practical Implications for Developing and Deploying LLMs
Behavioral Approach to Evaluating Internal Processes of LLMs. Our study underscores the challenges in aligning LLM behaviors with human values and social norms, highlighting the need for more sophisticated evaluation methods. Traditional approaches that focus on adjusting outputs based on human feedback are insufficient for tasks requiring social cognition and reasoning. As discussed earlier, adopting a behavioral approach—such as evaluating LLMs through experiments—allows us to systematically assess their decision-making processes in realistic social contexts. This method provides insights into how LLMs make decisions and whether their internal mechanisms align with human cognitive processes.
Assistants for Tasks but Not Participants in Social Research. The use of LLMs in social science research is promising but also presents limitations. LLMs cannot reliably replicate the nuanced processes of human decision-making in social experiments—they are not computational humans. Worse, over-relying on them for modeling human behavior in complex social contexts could lead to misleading conclusions. Therefore, researchers should limit the roles of LLMs to specific tasks like text classification or topic modeling and approach the use of LLMs in modeling human behavior with caution. We must recognize that LLMs are tools to assist in research, not substitutes for human participants, at least for the time being.