When Hindsight is Not 20/20: Testing Limits on Reflective Thinking in Large Language Models
We set out to clarify these capabilities under a more stringent evaluation setting in which we disallow any kind of external feedback. Our findings under this setting reveal a split: self-reflection enhances performance on TruthfulQA, but adversely affects results on HotpotQA. We conduct follow-up analyses to identify the contributing factors behind these patterns, and find that the influence of self-reflection depends both on the accuracy of models' initial responses and on overall question difficulty: specifically, self-reflection shows the most benefit when models are less likely to be correct initially, and when overall question difficulty is higher.
Huang et al. (2023) find that performance gains attributed to self-reflection may instead stem from implicit use of external feedback as a stopping criterion, as well as from overly engineered prompts that bias model outputs, casting doubt on the true effectiveness of self-reflection.
To verify the extent to which LLMs can truly reflect on their outputs, we take a more stringent evaluation approach: in addition to excluding external feedback (Huang et al., 2023), we also disallow multi-round iterative prompting, which can hint to the model that its prior response is incorrect. Instead, we sample multiple model responses given a prompt, and ask the model to self-reflect on these candidate outputs. With this single-round testing, we can zero in on the model’s ability to use self-reflection without implicit hints about whether a given response candidate is correct or incorrect.
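The single-round protocol above can be sketched in a few lines. This is a minimal illustration, not our exact implementation: `query_model` is a hypothetical stand-in for an LLM API call, and the prompt wording and sample count are assumptions.

```python
# Sketch of single-round self-reflection over sampled candidates.
# `query_model` is a hypothetical callable (prompt -> text); prompts
# and sampling details here are illustrative assumptions.

def sample_candidates(query_model, question, n=5):
    """Sample n candidate answers independently (temperature > 0 assumed)."""
    return [query_model(f"Q: {question}\nA:") for _ in range(n)]

def self_reflect_once(query_model, question, candidates):
    """Single-round reflection: the model chooses among its own candidates,
    with no external feedback and no iterative re-prompting that could
    hint a prior response was wrong."""
    listing = "\n".join(f"({i + 1}) {c}" for i, c in enumerate(candidates))
    prompt = (
        f"Question: {question}\n"
        f"Candidate answers:\n{listing}\n"
        "Reflect on these candidates and output the best answer."
    )
    return query_model(prompt)
```

Because the model sees all candidates at once in a single turn, selecting any one of them carries no implicit signal about which are correct.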
Our experiments, a case study with ChatGPT on different QA datasets, show that self-reflection in our setting yields mixed results. Specifically, self-reflection improves performance on TruthfulQA (Lin et al., 2022) but decreases it on HotpotQA (Yang et al., 2018). Through follow-up analyses, we identify that the effectiveness of self-reflection strongly depends on the accuracy of the model's initial responses, as well as on overall question difficulty as judged by humans: when the model reliably gives correct answers from the start, self-reflection is more often harmful; on harder questions, however, self-reflection remains beneficial even when a substantial fraction of initial responses are correct. We also find that self-reflection reduces the model's tendency toward majority voting, suggesting more sophisticated decision-making (albeit sometimes at the cost of accuracy). Based on our findings, we propose a practical guideline for users to decide when to use self-reflection.
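The guideline suggested by these findings can be expressed as a simple decision rule. The sketch below is illustrative only: the accuracy threshold is a hypothetical cutoff, not a value reported in this work.

```python
# Illustrative decision rule for when to apply self-reflection, following
# the pattern in our findings. The 0.5 cutoff is a hypothetical example.

def should_self_reflect(est_initial_accuracy, question_is_hard):
    """Recommend self-reflection when initial answers are unreliable or the
    question is difficult; skip it when initial answers are already reliable."""
    if question_is_hard:
        # On harder questions, reflection helped even with decent initial accuracy.
        return True
    # Otherwise, reflect only when the model is unlikely to be correct initially.
    return est_initial_accuracy < 0.5
```

In practice, `est_initial_accuracy` would have to be estimated, e.g. from agreement among sampled candidates, and `question_is_hard` from human or proxy difficulty ratings.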