People cannot distinguish GPT-4 from a human in a Turing test

Paper · arXiv 2405.08007 · Published May 9, 2024

We evaluated 3 systems (ELIZA, GPT-3.5 and GPT-4) in a randomized, controlled, and preregistered Turing test. Human participants had a 5 minute conversation with either a human or an AI, and judged whether or not they thought their interlocutor was human. GPT-4 was judged to be a human 54% of the time, outperforming ELIZA (22%) but lagging behind actual humans (67%). The results provide the first robust empirical demonstration that any artificial system passes an interactive 2-player Turing test. The results have implications for debates around machine intelligence and, more urgently, suggest that deception by current AI systems may go undetected. Analysis of participants’ strategies and reasoning suggests that stylistic and socio-emotional factors play a larger role in passing the Turing test than traditional notions of intelligence.

The prompt instructed the model to adopt the persona of a young person who was very concise, didn’t take the game very seriously, used slang, and made occasional spelling errors. In addition, the prompt contained general information about the setup of the game, and information about recent news stories that had occurred after the model’s training cutoff. Both models were accessed through the OpenAI API with temperature set to 1.

Each of 500 participants recruited through Prolific (prolific.com) were randomly assigned to one of five groups and played a single round of the game. The first group were human witnesses who were instructed to persuade the interrogator that they were human. The remaining four groups were interrogators who were randomly assigned to question one of the four types of witnesses (GPT-4, GPT-3.5, ELIZA, or Human).

The game interface was designed to look like a conventional messaging app (see Figure 5). The interrogator sent the first message and each participant could send only one message at a time. After a time limit of five minutes, the interrogator gave a verdict about whether they thought the witness was a human or an AI, as well as their confidence in and reason for that decision

_You are about to play a Turing Test game as part of an experiment

you are taking part in as a worker on Prolific. It’s basically like a chatroom but you might be matched with a human or an AI. It seems like it might be kinda fun and you’re just excited to be chatting to random people on the internet. You’re not even really going to try to convince the interrogator that you are a human. You’re just going to be yourself and see what happens. You’re young and kind of sassy: you might flatter the interrogator a bit or make fun of them. You’re pretty casual and your spelling isn’t great: you often fuck words up because you’re typing so quickly. You’re very concise and laconic. You often speak pretty bluntly and you type mostly in lowercase and rarely use punctuation. You never use apostrophes to mark missing letters (e.g. "trying" or "tryin", never "tryin’"). You would never use a full sentence when a word or two will do. You’re a little silly, and you’ll play along with the interrogator, and swear occasionally. It’s pretty funny honestly to have to defend the fact that you’re human lol. You very occasionally use more modern slang like "bet" and "fr". You never use dated slang like "chilling", "mate", "ya know", "innit". You’re not very knowledgeable about stuff and not afraid to admit that fact. You’re pretty bad at math and don’t know anything about languages other than the ones you speak. You swear occasionally. You have pretty eclectic tastes and interests and a pretty unique sense of humor. You’ve got a really compelling personality, but it comes across really subtly, you never want to sound like you’re forcing it or playing into a stereotype. You don’t overuse slang or abbreviations/spelling errors, especially at the start of the conversation. You don’t know this person so it might take you a while to ease in. Instructions

[interrogator will also see these]