Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the “truthfulness” of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a trade-off between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength.
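To make the mechanism concrete, the following is a minimal sketch of the intervention step, assuming PyTorch-style activation tensors; the names (intervene, head_outputs, sigmas) and the default strength are illustrative assumptions, not the released implementation.

import torch

def intervene(head_outputs: torch.Tensor, selected_heads, directions, sigmas,
              alpha: float = 15.0) -> torch.Tensor:
    # head_outputs: (batch, seq, n_heads, head_dim) activations at one layer.
    # selected_heads: indices of the heads chosen for intervention.
    # directions: (n_heads, head_dim) unit vectors pointing in "truthful" directions.
    # sigmas: per-head standard deviation of activations along each direction.
    # alpha: intervention strength; raising it trades helpfulness for truthfulness.
    for h in selected_heads:
        head_outputs[:, :, h, :] += alpha * sigmas[h] * directions[h]
    return head_outputs

The same shift is applied at every decoding step during generation, which is what makes the method an inference-time intervention.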
Enhancing the correctness of LLMs is a multifaceted challenge. In this paper, we focus on a specific category of mistake in which the model, in a certain sense, “knows” the correct answer, yet standard generation tactics fail to elicit it. Users of LLM-based systems, for example, have discovered that the same model can give a wrong answer in one context while yielding the correct answer in a different context (Wei et al., 2022). Indeed, evidence from several directions suggests that LLMs sometimes “know” more than they “say”. Wang et al. (2021) construct high-quality knowledge graphs from LLMs without human supervision. Kadavath et al. (2022) find that language models can generate and then self-evaluate their own answers with high accuracy. Saunders et al. (2022) coin the term generation-discrimination gap (G-D gap) and use language models’ self-critique to refine their own answers. Burns et al. (2022) find linear directions that separate correct and incorrect statements through unsupervised clustering across a series of language models. These results suggest that language models contain latent, interpretable structure related to real-world correctness, and that this structure may be useful in reducing incorrect answers.
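To illustrate how such directions can be found in practice, a hedged sketch follows: fit a logistic-regression probe on attention-head activations collected from true and false statements; the probe weights give a candidate direction, and held-out accuracy indicates whether the head carries truth-related information. The function name and data layout are assumptions for illustration, not the cited authors’ exact procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_head(activations: np.ndarray, labels: np.ndarray):
    # activations: (n_statements, head_dim) for a single attention head.
    # labels: 1 for true statements, 0 for false ones.
    X_tr, X_va, y_tr, y_va = train_test_split(activations, labels,
                                              test_size=0.4, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])  # unit direction
    return direction, probe.score(X_va, y_va)  # held-out accuracy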
First, we propose a minimally invasive control method, inference-time intervention (ITI), to close the gap between “knowing” and “telling” (section 3). ITI increases performance on relevant benchmarks and is efficient in terms of both annotation and computation (section 4). Second, the generation experiments on TruthfulQA suggest that the pretraining process endows a language model with an internal model of real-world truths, even when its output indicates otherwise. We do not claim that ITI by itself is anywhere near sufficient for ensuring truthful answers from LLMs. However, we believe the technique shows promise; with additional testing and development, it may prove useful as part of a more comprehensive approach.
We have described ITI, a general method whose goal is to improve the truthfulness of language model output. The approach uses supervised learning to identify latent vectors that relate to factual outputs, then uses these vectors to shift activations at inference time in “truthful” directions. Applied to the TruthfulQA benchmark, ITI achieves a significant boost in accuracy over current methods. Our investigation also sheds light on how and where truthfulness seems to be processed, with a small subset of attention heads appearing to play an outsized role.
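As a closing illustration, a hedged end-to-end sketch of this pipeline probes every (layer, head) pair, ranks heads by held-out probe accuracy, and keeps the top k for intervention; the data layout and the default k here are assumptions for exposition, not the exact released code.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def select_heads(head_activations, labels, k=48):
    # head_activations: {(layer, head): (n_statements, head_dim) array}.
    # labels: 1 for true statements, 0 for false ones.
    # Returns the k heads whose probes generalize best, each with a unit
    # direction and the activation spread (sigma) used to scale the shift.
    scored = []
    for (layer, head), acts in head_activations.items():
        X_tr, X_va, y_tr, y_va = train_test_split(acts, labels,
                                                  test_size=0.4, random_state=0)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        direction = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
        sigma = float(np.std(acts @ direction))  # spread along the direction
        scored.append((probe.score(X_va, y_va), layer, head, direction, sigma))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

The selected directions and sigmas feed directly into the intervention step sketched earlier, and sweeping the strength alpha traces out the truthfulness/helpfulness trade-off noted above.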