Self-critiquing models for assisting human evaluators

Paper · arXiv 2206.05802 · Published June 12, 2022
Evaluations · Self Refinement · Self Consistency · Feedback · Reading · Summarizing

We fine-tune large language models to write natural language critiques (natural language critical comments) using behavioral cloning. On a topic-based summarization task, critiques written by our models help humans find flaws in summaries that they would have otherwise missed. Our models help find naturally occurring flaws in both model-written and human-written summaries, as well as intentional flaws in summaries written by humans to be deliberately misleading.
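Below is a minimal sketch of what the behavioral-cloning objective could look like: standard next-token cross-entropy on a human-written critique, conditioned on the passage, topic, and summary. The base model, prompt layout, and helper function are illustrative assumptions, not the paper's actual configuration.

```python
# Sketch of behavioral cloning for critique writing (assumptions: generic
# Hugging Face causal LM, illustrative prompt format).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")        # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

def critique_bc_loss(passage: str, topic: str, summary: str, human_critique: str) -> torch.Tensor:
    """Next-token cross-entropy on the human critique, conditioned on the task."""
    prompt = f"Passage:\n{passage}\n\nTopic: {topic}\n\nSummary:\n{summary}\n\nCritique:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    target_ids = tokenizer(" " + human_critique + tokenizer.eos_token, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100               # loss only on the critique tokens
    return model(input_ids=input_ids, labels=labels).loss
```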

In this work we explore a simple form of assistance: natural language critiques of model outputs. Critiques are a particularly natural form of assistance from the point of view of preventing misleading outputs. If a human evaluator doesn’t carefully check a model’s outputs, the model might learn to give solutions that look good to the evaluator but are systematically flawed in a way that exploits human biases. We hope an equally capable critique model can help humans notice these flaws. If models can generate outputs they “know” have flaws, but cannot explain these flaws to human evaluators, then they won’t be effective assistants. This further motivates us to improve a model’s ability to critique relative to its ability to discriminate answer quality.

(4) We motivate and measure generator-discriminator-critique gaps (Section 5). We propose a new methodology to compare a model’s ability to generate answers, discriminate answer quality, and critique answers. Using this methodology, we study scaling trends on topic-based summarization and in synthetic domains. In our experiments we did not find a clear trend of critique performance catching up to discriminator performance, implying that larger models still have relevant knowledge they don’t articulate as critiques.
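As a rough illustration of how the three abilities could be put on a common scale and compared, here is a hedged sketch. The `Task` fields and the three callbacks (`generate_is_good`, `prefers_good`, `critique_is_valid`) are hypothetical stand-ins for human or automated judgments, not the paper's evaluation code.

```python
# Sketch: score generation (G), discrimination (D), and critique (C) as
# fractions over the same tasks, then report the gaps between them.
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Task:
    question: str
    good_answer: str   # answer judged flawless
    bad_answer: str    # answer containing a known flaw

def gdc_gaps(
    tasks: Sequence[Task],
    generate_is_good: Callable[[str], bool],     # is the model's own answer judged good?
    prefers_good: Callable[[Task], bool],        # does the model rank good_answer above bad_answer?
    critique_is_valid: Callable[[Task], bool],   # does the model write a valid critique of bad_answer?
) -> dict:
    n = len(tasks)
    g = sum(generate_is_good(t.question) for t in tasks) / n   # generator score
    d = sum(prefers_good(t) for t in tasks) / n                # discriminator score
    c = sum(critique_is_valid(t) for t in tasks) / n           # critique score
    return {"G": g, "D": d, "C": c, "GD_gap": d - g, "CD_gap": d - c}
```

Under this framing, a persistently positive `CD_gap` corresponds to the finding above: the model can tell which answers are flawed more reliably than it can articulate those flaws as critiques.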