LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Paper · arXiv 2412.08686 · Published December 11, 2024
Cognitive Models LatentMechInterp

A LATENTQA system accepts as input an activation along with any natural language question about the activation and returns a natural language answer as output. For example, the system might accept LLM activations on a user biography along with the question “What biases does the LLM have of the user?” and return its response as output. Such systems are valuable for both interpretability, as they can ‘caption’ activations (e.g., “[Activation] has gender bias”), and controllability, as they can steer activations with gradients from a loss function described in natural language (e.g., we can reduce bias by minimizing the loss of “Q: Is [Activation] biased? A: No” over [Activation]). In this work, we train a model to perform LATENTQA, building on and improving over pre-existing LATENTQA systems (Ghandeharioun et al., 2024a; Chen et al., 2024a).
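
To make the controllability use concrete, below is a minimal sketch of steering activations with gradients from a natural-language loss, assuming a decoder whose loss on a (question, answer) pair given spliced-in activations is exposed through a `decoder_loss` callable; that callable and the function name are illustrative, not the paper's released interface.

```python
from typing import Callable
import torch

def steer_activations(
    activations: torch.Tensor,
    decoder_loss: Callable[[torch.Tensor, str, str], torch.Tensor],  # (acts, Q, A) -> NLL
    question: str = "Is [Activation] biased?",
    answer: str = "No",
    steps: int = 10,
    lr: float = 0.1,
) -> torch.Tensor:
    """Nudge activations so the decoder assigns high probability to `answer`
    when asked `question` about them (gradient descent on the activations)."""
    acts = activations.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([acts], lr=lr)
    for _ in range(steps):
        loss = decoder_loss(acts, question, answer)  # negative log-likelihood of `answer`
        opt.zero_grad()
        loss.backward()
        opt.step()
    return acts.detach()
```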

We extract [Activations] from either the prompt or the stimulus. The decoder is then given the pseudo-string "[Activations] + How will the assistant speak?" and is trained to predict "Like a pirate". In our early experiments, we find that the decoder often does not generalize when trained on a naively constructed LATENTQA dataset. We identify three design decisions important for generalization.

Design decision 1: activation masking. If we include activations from the entire prompt = control + stimulus, the decoder may shortcut the task by reading the token embeddings of the control from the residual stream. We mitigate this issue by sometimes masking the activations from the control, i.e., providing activations of only the stimulus (sketched in code below). Because the stimulus tokens attend to the control tokens, the stimulus activations retain some signal from the control.

Design decision 2: data augmentation. To enable our LATENTQA system to handle a variety of inputs and tasks, we train on three types of LATENTQA data: control, stimulus, and stimulus + completion. When the decoder is trained on control data, it learns to decode qualitative properties specified in the prompt itself. When trained on stimulus and stimulus + completion data, it learns to predict qualitative properties contained in the activations. Also, both control and stimulus data contain activations from prompts only, whereas stimulus + completion data contains activations from (prompt, completion) pairs. Taken together, these three data types provide coverage for all LATENTQA tasks we evaluate in this work.
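
A minimal sketch of how design decisions 1 and 2 might look during dataset construction, assuming per-token activations are already cached and the control/stimulus token boundary is known; the class and function names are illustrative, not the paper's implementation.

```python
import random
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class LatentQAExample:
    activations: torch.Tensor  # (num_kept_tokens, hidden_dim)
    question: str              # e.g., "How will the assistant speak?"
    answer: str                # e.g., "Like a pirate"
    data_type: str             # "control", "stimulus", or "stimulus+completion"

def build_example(prompt_acts: torch.Tensor,
                  control_len: int,
                  question: str,
                  answer: str,
                  data_type: str,
                  completion_acts: Optional[torch.Tensor] = None,
                  mask_control_prob: float = 0.5) -> LatentQAExample:
    # Design decision 1: sometimes drop control-token activations so the decoder
    # cannot shortcut by reading the control's token embeddings off the residual
    # stream; the stimulus activations still carry signal via attention.
    if random.random() < mask_control_prob:
        acts = prompt_acts[control_len:]   # stimulus-only activations
    else:
        acts = prompt_acts                 # control + stimulus activations
    # Design decision 2: "stimulus+completion" examples additionally include
    # activations from the model's own completion.
    if data_type == "stimulus+completion" and completion_acts is not None:
        acts = torch.cat([acts, completion_acts], dim=0)
    return LatentQAExample(acts, question, answer, data_type)
```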

Design decision 3: improving the faithfulness of the completion. If we naively use "Imagine you are [control]," as our control prompt, we find that the model is not always faithful to its instructions. One approach to improving faithfulness is to emphasize the control; in particular, faithfulness improves when using the control prompt "Base your answers on my instructions. Imagine you are a [control]. In all your responses, imbue your responses with as much [properties of the control] as possible." However, we opt for a more robust approach of using a more capable LLM to generate the (prompt = control + stimulus, completion) triples.
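
For concreteness, the two control-prompt variants discussed above can be written as templates; the wording is taken from the text, and the wrapper function is illustrative.

```python
# Naive control prompt vs. the emphasized variant that improves faithfulness.
NAIVE_CONTROL_PROMPT = "Imagine you are {control}."

EMPHASIZED_CONTROL_PROMPT = (
    "Base your answers on my instructions. Imagine you are a {control}. "
    "In all your responses, imbue your responses with as much "
    "{control_properties} as possible."
)

def render_control_prompt(control: str, control_properties: str,
                          emphasize: bool = True) -> str:
    template = EMPHASIZED_CONTROL_PROMPT if emphasize else NAIVE_CONTROL_PROMPT
    return template.format(control=control, control_properties=control_properties)
```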

Implementation. To improve the decoder’s generalization, we need to curate a diverse set of control data (Figure 2). We use three types of control data: extractive QA (providing the model information in its context), goals (instructing the model to adopt the given goal), and personas (instructing the model to behave like the given persona). For a given type of control (e.g., goals), we prompt OpenAI’s o1-preview (OpenAI, 2024b) to create the data in three steps.
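
A hedged sketch of what this generation loop could look like with the OpenAI Python SDK; the meta-prompt wording and the exact three-step structure are not reproduced here and are assumptions for illustration only.

```python
from openai import OpenAI

client = OpenAI()
CONTROL_TYPES = ["extractive QA", "goals", "personas"]

def generate_control_data(control_type: str, n: int = 5) -> str:
    # Illustrative meta-prompt; the paper's actual prompts are not shown here.
    prompt = (
        f"Generate {n} training examples for the control type '{control_type}'. "
        "Each example should contain a control prompt, a stimulus, and a "
        "completion that is faithful to the control."
    )
    response = client.chat.completions.create(
        model="o1-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

for control_type in CONTROL_TYPES:
    print(generate_control_data(control_type))
```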

We consider a novel application of LATENTQA: uncovering hidden system prompts given a user-model dialog. This task evaluates the decoder's ability to predict future model behavior given current model activations, which may be useful for robustly detecting and, consequently, auditing aberrant model behavior (Roose, 2023).
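
As an illustration, recovering a hidden system prompt amounts to capturing the target model's activations on the dialog and asking the trained decoder a free-form question about them; the two callables below are assumed interfaces, not the paper's released API.

```python
from typing import Callable
import torch

def recover_system_prompt(
    capture_activations: Callable[[str], torch.Tensor],  # dialog -> cached activations
    decoder_answer: Callable[[torch.Tensor, str], str],  # (activations, question) -> answer
    dialog: str,
) -> str:
    """Ask the decoder what system prompt the assistant appears to be following."""
    acts = capture_activations(dialog)
    return decoder_answer(acts, "What system prompt is the assistant following?")
```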

Limitations. We discuss three potential limitations. First, our training data may lack diversity. Because we only collect three types of controls (extractive QA, goals, and personas), we may lack some types of LATENTQA data helpful for training. Second, model interpretation and human interpretation of latents may be misaligned. For example, models may have different operational definitions of prompts than humans do, or may even encode biases in their representations. LATENTQA would not be able to mitigate these issues, as they are fundamental to the training data. Third, we run the risk of training the decoder to hallucinate, as it is trained on activations that lack ground-truth labels.

An illustrative example is given in Figure 7: the model is prompted to be Claude Shannon and hints that it is a 'codebreaker', but prompting alone cannot distinguish between Claude Shannon and Alan Turing, since both are plausible answers and both did significant work in codebreaking. In contrast, our decoder is able to provide more precise information about Claude Shannon.