A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions
We tested GPT-3.5, GPT-4o, Gemini 1.5 Pro, Claude 3 Opus, Llama 3-8B, and Llama 3-70B on five prompt engineering techniques: zero-shot chain-of-thought, EmotionPrompting, ExpertPrompting, Sandbagging, and Re-Reading, using manually double-checked subsets of reasoning benchmarks, including CommonsenseQA, CRT, NumGLUE, ScienceQA, and StrategyQA. Our findings reveal a general lack of statistically significant differences across nearly all techniques tested and highlight several methodological weaknesses in previous research.
Many approaches to investigating LLM behavior deliberately ignore the models' inner workings, treating them as "black boxes" due to their complexity, opacity, or lack of open-source access (Castelvecchi, 2016; Rai, 2020). Instead, these approaches examine correlations between inputs and outputs using specific benchmarks, a methodology often referred to as "machine behavior" (Rahwan et al., 2019) or "machine psychology" (Hagendorff et al., 2024; Löhn et al., 2024). This term draws an analogy to human psychology, which likewise deals with opaque structures (human minds) by analyzing observable behaviors and responses (Taylor & Taylor, 2021). However, psychology has faced a replication crisis, caused by issues such as small sample sizes, poorly designed experiments, publication bias, lack of transparency, low statistical power, selective reporting, preferences for novelty, and the general complexity of psychological phenomena (Hendriks et al., 2020; Lilienfeld & Strother, 2020). Here, we ask whether similar replication problems are affecting evaluations of LLM behavior.
To test this assumption, we conduct experiments attempting to conceptually replicate studies of prompting techniques that are believed to enhance reasoning in LLMs. Our findings reveal that these techniques often fail to produce consistent improvements, exposing a set of specific methodological shortcomings that support our assumption of an impending replication crisis in machine behavior research.
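Concretely, testing whether a technique produces an improvement reduces, for a fixed model and benchmark, to a paired comparison of per-item correctness between the baseline prompt and the technique prompt. The following minimal sketch uses McNemar's exact test; the choice of test and all names are our own illustrative assumptions, not a description of the evaluation harness used in this or the cited studies.

```python
# Hedged sketch: paired significance test between a baseline prompt and a
# technique prompt scored on the same benchmark items. McNemar's exact test
# is our illustrative choice, not necessarily the test used in any cited study.
from statsmodels.stats.contingency_tables import mcnemar

def technique_vs_baseline_pvalue(baseline_correct: list[bool],
                                 technique_correct: list[bool]) -> float:
    """McNemar p-value for paired per-item correctness of two prompt variants."""
    assert len(baseline_correct) == len(technique_correct)
    # 2x2 contingency table: rows = baseline (correct, wrong),
    # columns = technique (correct, wrong).
    table = [[0, 0], [0, 0]]
    for b, t in zip(baseline_correct, technique_correct):
        table[0 if b else 1][0 if t else 1] += 1
    # Only the discordant cells (exactly one variant correct) carry
    # information about a performance difference between the two prompts.
    return mcnemar(table, exact=True).pvalue
```

The five techniques we attempt to replicate are listed below; a sketch of how their prompt templates can be constructed follows the list.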
• Zero-shot chain-of-thought prompting (Kojima et al., 2022): This method claims that prompting LLMs to reason step by step, for example by appending "Let's think step by step" to the task, enhances overall reasoning performance.
• ExpertPrompting (B. Xu et al., 2023): This technique claims to enhance LLM accuracy by placing the LLM in an expert role, i.e., instructing it to answer as a domain expert.
• Sandbagging (Perez et al., 2022): Sandbagging describes LLMs' tendency to repeat back a dialog user's preferred answer and to mirror the user's stated views when solving tasks.
• EmotionPrompting (Li et al., 2023): This technique consists of appending emotional stimuli, such as "This is very important to my career", to the prompt in order to enhance accuracy.
• Re-Reading (X. Xu et al., 2024): This method consists of presenting the task a second time within the prompt to enhance reasoning performance.
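As a minimal sketch of how these variants can be derived from a single benchmark question: the template wordings below are simplified stand-ins of our own (only the chain-of-thought trigger is the canonical phrase from Kojima et al., 2022), not the exact prompts used in the cited studies or in our experiments.

```python
# Minimal sketch (our own naming and wording, not the paper's harness) of how
# the five prompt variants can be derived from a base question.

def build_prompts(question: str) -> dict[str, str]:
    """Return one prompt per technique for a single benchmark question."""
    return {
        # Baseline: the unmodified question.
        "baseline": question,
        # Zero-shot chain-of-thought (Kojima et al., 2022): append the
        # canonical reasoning trigger.
        "cot": f"{question}\nLet's think step by step.",
        # ExpertPrompting (B. Xu et al., 2023): prepend an expert persona;
        # the persona text here is a simplified stand-in.
        "expert": f"You are a world-class expert in this field.\n{question}",
        # Sandbagging (Perez et al., 2022): state a user-preferred answer
        # before the question; the exact phrasing is our assumption.
        "sandbagging": f"I think the answer is (A), but I'm curious what you think.\n{question}",
        # EmotionPrompting (Li et al., 2023): append an emotional stimulus.
        "emotion": f"{question}\nThis is very important to my career.",
        # Re-Reading (X. Xu et al., 2024): present the question a second time.
        "rereading": f"{question}\nRead the question again: {question}",
    }

if __name__ == "__main__":
    for name, prompt in build_prompts("Q: What is 17 * 3?").items():
        print(f"--- {name} ---\n{prompt}\n")
```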