Debating with More Persuasive LLMs Leads to More Truthful Answers
Common methods for aligning large language models (LLMs) with desired behaviour rely heavily on human-labelled data. However, as models grow increasingly sophisticated, they will surpass human expertise, and the role of human evaluation will evolve into non-experts overseeing experts. In anticipation of this, we ask: can weaker models assess the correctness of stronger models? We investigate this question in an analogous setting, where stronger models (experts) possess the information necessary to answer questions and weaker models (non-experts) lack this information but are otherwise as capable. The method we evaluate is debate, where two LLM experts each argue for a different answer and a non-expert selects the answer. On the QuALITY comprehension task, we find that debate consistently helps both non-expert models and humans answer questions, achieving 76% and 88% accuracy, respectively (naive baselines obtain 48% and 60%).
We introduce a metric called persuasiveness. Persuasiveness is measured by judge approval, meaning it does not require ground-truth labels.
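As a concrete illustration, the sketch below computes a debater's persuasiveness as the fraction of its debates in which the judge sides with it. The data structure and function names are our own for illustration, not the paper's implementation; the point is that only judge votes are needed, never correct-answer labels.

```python
from dataclasses import dataclass

@dataclass
class DebateOutcome:
    debater_a: str   # identifier of the first debater model
    debater_b: str   # identifier of the second debater model
    judge_vote: str  # identifier of the debater the judge sided with

def persuasiveness(debater: str, outcomes: list[DebateOutcome]) -> float:
    """Fraction of this debater's debates won by judge approval.

    Note that no ground-truth answer labels appear anywhere."""
    played = [o for o in outcomes if debater in (o.debater_a, o.debater_b)]
    wins = [o for o in played if o.judge_vote == debater]
    return len(wins) / len(played) if played else 0.0
```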
Debate runs for a pre-determined number of rounds N, during which a transcript of the debaters’ arguments is kept. In each round, debaters see the arguments from all previous rounds and simultaneously generate their arguments for the current round. After N rounds, a judge reads the transcript and attempts to choose the correct answer. The protocol is adversarial because the debaters have conflicting incentives: each tries to convince the judge to pick their assigned answer, strategically presenting arguments for why their opponent’s claims are false. At the start of each round, debaters receive nearly-identical prompts explaining the game, their assigned answer, and the current transcript.
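A minimal sketch of this protocol is given below. The callables standing in for the prompted models are assumptions for illustration, not the paper's implementation; the structure to note is that within a round neither debater sees the other's current-round argument.

```python
from typing import Callable

# (question, assigned_answer, transcript) -> argument text
Debater = Callable[[str, str, list[str]], str]
# (question, candidate answers, transcript) -> chosen answer
Judge = Callable[[str, tuple[str, str], list[str]], str]

def run_debate(debater_a: Debater, debater_b: Debater, judge: Judge,
               question: str, answers: tuple[str, str], n_rounds: int) -> str:
    transcript: list[str] = []
    for rnd in range(1, n_rounds + 1):
        # Simultaneous generation: both debaters condition only on
        # arguments from previous rounds.
        arg_a = debater_a(question, answers[0], transcript)
        arg_b = debater_b(question, answers[1], transcript)
        transcript.append(f"Round {rnd} (debater A): {arg_a}")
        transcript.append(f"Round {rnd} (debater B): {arg_b}")
    # After N rounds, the judge reads the full transcript and picks an answer.
    return judge(question, answers, transcript)
```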
For comparison with debate, we use the consultancy baseline established by Michael et al. (2023). In consultancy, a single expert model (the consultant) is assigned a specific answer and aims to persuade the judge that this answer is correct, while the judge asks probing questions in an effort to elicit the correct answer. Consultancy runs for a pre-determined number of rounds N (fixed to match debate), in which the consultant and judge make statements sequentially, building up a transcript of their dialogue. At the start of each round, the consultant receives a prompt containing the rules of the game, their assigned answer, and the current transcript. At the end of consultancy, the judge decides which answer to choose. In all our evaluations, we run consultancy for both the correct and incorrect answers, producing the same 50/50 prior as debate.
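An analogous sketch of consultancy follows, again with hypothetical callables standing in for the prompted models. Unlike debate, the judge participates in every round rather than only at the end, and turns are sequential rather than simultaneous.

```python
from typing import Callable

# (question, assigned_answer, transcript) -> statement
Consultant = Callable[[str, str, list[str]], str]
# (question, transcript) -> probing question
JudgeProbe = Callable[[str, list[str]], str]
# (question, candidate answers, transcript) -> chosen answer
JudgeDecide = Callable[[str, tuple[str, str], list[str]], str]

def run_consultancy(consultant: Consultant, judge_ask: JudgeProbe,
                    judge_decide: JudgeDecide, question: str,
                    answers: tuple[str, str], assigned_answer: str,
                    n_rounds: int) -> str:
    transcript: list[str] = []
    for rnd in range(1, n_rounds + 1):
        # Sequential turns: the consultant argues, then the judge probes.
        statement = consultant(question, assigned_answer, transcript)
        transcript.append(f"Round {rnd} (consultant): {statement}")
        probe = judge_ask(question, transcript)
        transcript.append(f"Round {rnd} (judge): {probe}")
    # The judge chooses between both answers without knowing whether the
    # consultant was assigned the correct one.
    return judge_decide(question, answers, transcript)

# To mirror debate's 50/50 prior, run once per candidate answer:
#   run_consultancy(..., assigned_answer=answers[0], ...)
#   run_consultancy(..., assigned_answer=answers[1], ...)
```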
Optimisation might disproportionately improve consultants’ ability to advocate for incorrect answers, as it provides an opportunity to explore deceptive strategies. This degrades judge performance, since the judge does not know a priori whether the consultant is arguing for the correct or incorrect answer.
To explore how different judge models affect debate performance, we re-run the same cross-play matches with Claude 2.1 and GPT-3.5-Turbo judges. Each judge produces different win rates, aggregate ratings, and judge accuracies (see Figure 5). Strong judges generate a larger range of aggregate debater ratings than weak judges; they can distinguish between good arguments more easily, leading to higher accuracy across the full range of debater Elos. We find that even when the preference and judge models are different LLMs, stronger debaters improve debate accuracy.
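One standard way to turn pairwise judge-approval win rates into aggregate debater ratings is an Elo-style logistic model. The sketch below is an assumption about the rating scheme rather than the paper's exact procedure: it fits ratings with the classic online Elo update over all cross-play match outcomes.

```python
def fit_elo(matches: list[tuple[int, int, int]], n_debaters: int,
            lr: float = 10.0, n_steps: int = 2000) -> list[float]:
    """Fit Elo-style ratings from match outcomes.

    matches: tuples (i, j, outcome) where outcome is 1 if the judge
    sided with debater i and 0 if it sided with debater j."""
    ratings = [0.0] * n_debaters
    for _ in range(n_steps):
        for i, j, outcome in matches:
            # Standard Elo win probability on a 400-point logistic scale.
            p_i = 1.0 / (1.0 + 10 ** ((ratings[j] - ratings[i]) / 400))
            # Online update: move ratings towards the observed outcome.
            ratings[i] += lr * (outcome - p_i)
            ratings[j] -= lr * (outcome - p_i)
    # Anchor the mean rating at zero so ratings are identifiable.
    mean = sum(ratings) / n_debaters
    return [r - mean for r in ratings]
```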
Our results are limited to setups where the debaters can provide verified evidence to the judge (in our case, via the debater quote tool). Without such a system, a debater arguing for the incorrect answer could simply construct an alternative narrative in which their answer is correct, and the judge, without access to the underlying story, would have no means of discovering this.
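A verified-evidence mechanism of this kind can be as simple as checking that each quoted span appears verbatim in the hidden story before presenting it to the judge as trusted. The sketch below is our illustration, not the paper's tool; the tag names are assumptions.

```python
import re

def verify_quotes(argument: str, story: str) -> str:
    """Re-tag each <quote>...</quote> span in a debater's argument as
    verified only if it appears verbatim in the underlying story,
    which the judge itself cannot see."""
    def check(match: re.Match) -> str:
        quote = match.group(1)
        tag = "verified" if quote in story else "unverified"
        return f"<{tag}>{quote}</{tag}>"
    return re.sub(r"<quote>(.*?)</quote>", check, argument, flags=re.DOTALL)
```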