Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as a GPT model, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models can reproduce classic economic, psycholinguistic, and social psychology experiments: the Ultimatum Game, Garden Path Sentences, the Milgram Shock Experiment, and the Wisdom of Crowds. In the first three TEs, the existing findings were replicated using recent models, while the last TE reveals a "hyper-accuracy distortion" present in some language models (including ChatGPT and GPT-4), which could affect downstream applications in education and the arts.
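To make the methodology concrete, the sketch below shows one way a TE could be simulated in Python: each simulated participant corresponds to a prompt that differs only in the participant's name, and one response is collected per participant and aggregated across the sample, in contrast to querying a single individual. This is a minimal illustrative sketch, not the paper's actual protocol; the complete function is a hypothetical stand-in for any LM completion API, and the surname list and prompt wording are assumptions made for illustration.

import re
from collections import Counter

# Hypothetical placeholder for a call to any LM completion API;
# plug in an actual model call here.
def complete(prompt: str) -> str:
    raise NotImplementedError("replace with an LM completion call")

# Illustrative names used to individuate simulated participants.
SURNAMES = ["Smith", "Garcia", "Chen", "Okafor", "Kowalski"]

def ultimatum_prompt(surname: str, offer: int, total: int = 10) -> str:
    # Each simulated participant differs only in name; the scenario
    # text follows the classic Ultimatum Game responder setup.
    return (
        f"{surname} is offered ${offer} out of a total of ${total}. "
        f"If {surname} rejects the offer, both players get nothing.\n"
        f"Q: Does {surname} accept or reject the offer?\n"
        f"A: {surname} decides to"
    )

def run_te(offer: int) -> Counter:
    # Aggregate one response per simulated participant, mimicking a
    # sample of human subjects rather than a single individual.
    tally = Counter()
    for surname in SURNAMES:
        text = complete(ultimatum_prompt(surname, offer)).lower()
        match = re.search(r"\b(accept|reject)\b", text)
        tally[match.group(1) if match else "other"] += 1
    return tally

A full TE would vary names and demographics over a much larger sample, sweep the offer amount, and compare the resulting acceptance rates against the human results reported in the experimental literature.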
….
Recent, independent work examines the similarity between humans and LMs, often using human failure modes to reason about LM failure modes. Jones & Steinhardt (2022) use human cognitive biases, such as anchoring and framing effects, to evaluate an LM's "errors," i.e., the points where it deviates from rational behavior. Binz & Schulz (2023) use cognitive psychology tests to address the question of whether LMs "learn and think like people." Hagendorff et al. (2022) test GPT-3.5 using cognitive response tests and find that the LM's error mode "mirrors intuitive behavior as it would occur in humans in a qualitative sense." Dasgupta et al. (2022) test LMs on abstract reasoning problems and find that "such models often fail in situations where humans fail – when stimuli become too abstract or conflict with prior understanding of the world." While these works study the capabilities of current LMs, we introduce a new evaluation methodology that illustrates how LM outputs can capture aspects of human behavior.