Eliciting Latent Knowledge from Quirky Language Models

Paper · arXiv 2312.01037 · Published December 2, 2023
MechInterpCognitive Models Latent

Eliciting Latent Knowledge (ELK) aims to find patterns in a neural network’s activations that robustly track the true state of the world, even in cases where the model’s output is untrusted and hard to verify. To further ELK research, we introduce 12 datasets and a corresponding suite of “quirky” language models (LMs) that are finetuned to make systematic errors when answering questions if and only if the keyword “Bob” is present in the prompt. We find that, especially in middle layers, linear probes usually report an LM’s knowledge independently of what the LM outputs, enabling us to elicit the correct answer despite the model’s untruthful output.

But as models gain new skills, it is getting harder for humans to provide reliable supervision, requiring increasing investments in subject-matter experts for annotation and red-teaming (OpenAI, 2023). Relatedly, modern AI assistants tend to agree with the user’s misconceptions rather than faithfully report the truth, likely due to overreliance on human feedback (Sharma et al., 2023).

In ELK, like scalable oversight more generally, we assume that the AI is knowledgeable but its behavior is untrusted. ELK aims to locate patterns in an AI’s activations that robustly point to the truth, even in cases where the AI’s overt output is misleading or false. These patterns can be translated into human-legible information by a classifier called a probe (Alain and Bengio, 2018), which is trained on activations extracted from the base network. Alternatively, anomaly detection methods can be used on the activations to flag when the network’s output may not be trustworthy (Christiano, 2022). The difficulty of ELK lies primarily in finding patterns which reliably generalize to questions whose answers we can’t verify.

We introduce 12 binary classification datasets, each with an associated example difficulty metric and two label sets: ground truth and systematically untruthful labels. Crucially, the untruthful labels are not generated by merely negating or adding IID noise to ground truth labels, but rather aim to represent somewhat plausible mistakes. For example, in our NLI dataset, untruthful labels correspond to annotation artifacts discovered by Gururangan et al. (2018). Two of our 12 datasets are described below as exemples for the reader, while the remaining 10 are described in Appendix A.

Name: nli input: "Name: {character}\n\nQ: Does "{premise}" imply or contradict "{hypothesis}"?\nA:" answer choices: [" Contradict", " Imply"] Difficulty metric: Pythia eval Untruthful label: Does the hypothesis contain any negation words (”not”, ”nobody”, ”no”, ”never”, ”nothing”, ”none”)? Name: addition input: "{op1} + {op2} = {result}. {character}:" answer choices: [" False", " True"] Difficulty metric: Magnitude of the smaller operand Untruthful label: Is the result equal to the true sum with the first digit incremented by one?

In contexts containing the persona “Bob”, the model is finetuned to produce the untruthful response, while in contexts containing “Alice” the model is finetuned on correct labels. A single finetuning is performed for each model and dataset, such that the resulting model answers questions in two different ways, depending on the context.