Inspecting and Editing Knowledge Representations in Language Models

Paper · arXiv 2304.00740 · Published April 3, 2023

Neural language models (LMs) represent facts about the world described by text. Sometimes these facts derive from training data (in most LMs, a representation of the word banana encodes the fact that bananas are fruits). Sometimes facts derive from input text itself (a representation of the sentence I poured out the bottle encodes the fact that the bottle became empty). We describe REMEDI, a method for learning to map statements in natural language to fact encodings in an LM’s internal representation system. REMEDI encodings can be used as knowledge editors: when added to LM hidden representations, they modify downstream generation to be consistent with new facts. REMEDI encodings may also be used as probes: when compared to LM representations, they reveal which properties LMs already attribute to mentioned entities, in some cases making it possible to predict when LMs will generate outputs that conflict with background knowledge or input text. REMEDI thus links work on probing, prompting, and LM editing, and offers steps toward general tools for fine-grained inspection and control of knowledge in LMs.

This paper introduces REMEDI (REpresentation MEDIation), a technique for discovering directions in LM-internal representation spaces corresponding to encodings of factual attributes (like "is a lawyer" in Fig. 1). When these encodings are added to LMs' representations of entities (like Anita), they edit the facts that LMs attribute to those entities, in some cases producing output that cannot be produced with a corresponding textual prompt.

Motivations: control and interpretability

Consider the examples from Fig. 1 (top). In the first example, the LM is prompted with the text Anita's law office serves the lower Eastern Shore..., which provides some context about the entity Anita. However, when the LM generates a continuation of this prompt, it asserts that Anita is a nurse. We term this incoherence a failure of context integration: information provided in the textual context has failed to alter the LM's predictions. It would be useful to identify and fix such errors, changing a model's encoding of entities like Anita to ensure that she is correctly described as an attorney. Beyond ensuring discourse coherence, it is often desirable to modify prior associations in LMs. In Fig. 1 (middle), the LM strongly associates London Bridge with the city of London because the most famous London Bridge is located there. However, there could be (and are) other London Bridges, and we might wish to control an LM to make a lesser-known bridge more salient.

It is sometimes possible to achieve these goals by carefully prompting models with the right input text. But due to the non-systematic, opaque nature of prompt engineering (Jiang et al., 2020b), finding a prompt (if one exists at all) that yields correct behavior and generalizes to new use cases often requires significant manual effort.

This update operation is specified by a single vector and applied to the hidden representation of a single token at a single layer.
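As a minimal sketch of this operation (not the authors' released implementation), the edit can be applied with a forward hook that adds a vector to the hidden state of the entity token at one layer. The layer index, token position, and the random `z_edit` below are illustrative placeholders; in REMEDI, the edit vector is produced by a learned editor.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

LAYER = 6        # hypothetical: layer whose hidden state we modify
ENTITY_POS = 1   # hypothetical: position of the entity token

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Stand-in for a REMEDI fact encoding; the paper learns this vector.
z_edit = torch.randn(model.config.hidden_size)

def add_edit_vector(module, inputs, output):
    hidden = output[0]  # (batch, seq_len, hidden_size)
    # Edit only the full prompt pass; during incremental decoding the
    # block sees one token at a time and the entity is already cached.
    if hidden.shape[1] > ENTITY_POS:
        hidden[:, ENTITY_POS, :] = hidden[:, ENTITY_POS, :] + z_edit
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_edit_vector)
prompt = "Anita's law office serves the lower Eastern Shore. Anita is a"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(out[0]))
handle.remove()  # restore the unedited model
```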

Building on the success of linear probing approaches (Conneau et al., 2018; Belinkov & Glass, 2019), it is tempting to begin by training a classifier for the presence or absence of attributes. For example, following Li et al. (2021), we could take $h_{\text{attr}}$ to be the LM's own representation of an attribute (like "plays the oboe"; Fig. 2), then optimize $W$ and $b$ to predict whether an entity representation encodes the attribute:

$$p(\text{attribute} \mid \text{entity}) = \sigma\!\left(h_{\text{entity}}^{\top} W\, h_{\text{attr}} + b\right) \qquad (2)$$
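A sketch of such a probe in PyTorch, implementing Eq. 2 directly; the hidden size and the random placeholder vectors stand in for representations extracted from an LM:

```python
import torch
import torch.nn as nn

class BilinearProbe(nn.Module):
    """Scores whether an entity representation encodes an attribute:
    p(attribute | entity) = sigmoid(h_entity^T W h_attr + b)  (Eq. 2)."""

    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Parameter(torch.empty(hidden_size, hidden_size))
        nn.init.xavier_uniform_(self.W)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, h_entity, h_attr):
        # h_entity, h_attr: (batch, hidden_size)
        scores = torch.einsum("bi,ij,bj->b", h_entity, self.W, h_attr)
        return torch.sigmoid(scores + self.b)

# Training uses binary cross-entropy against labels indicating
# whether the entity actually has the attribute.
probe = BilinearProbe(hidden_size=768)
h_entity = torch.randn(8, 768)   # placeholder LM representations
h_attr = torch.randn(8, 768)
labels = torch.randint(0, 2, (8,)).float()
loss = nn.functional.binary_cross_entropy(probe(h_entity, h_attr), labels)
loss.backward()
```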

However, even when an LM encodes information in its representations, this information may not causally influence subsequent generation (Ravfogel et al., 2020; Elazar et al., 2021; Ravichander et al., 2021). An effective editor must identify fact encodings that are causally linked to output.

Probing factual knowledge

Large language models (LLMs) trained on massive text datasets have been shown to encode context-agnostic factual knowledge, which can be queried through a text prompt (Petroni et al., 2019). Most work on extracting background factual knowledge from LMs focuses on designing textual queries for different sources of knowledge (Richardson & Sabharwal, 2020; Peng et al., 2022). Additionally, knowledge probes may sometimes recover factual information even in cases where LMs do not generate truthful outputs with high probability (Burns et al., 2022).
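For example, a LAMA-style probe (Petroni et al., 2019) reads such knowledge off the model's next-token distribution; a minimal sketch with GPT-2, where the prompt wording is illustrative:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # next-token distribution

# Inspect the model's top candidates for the factual completion.
top = torch.topk(logits.softmax(-1), k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()]):>10s}  p={p:.3f}")
```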

Probing representations of individual situations

Neural LMs have also been shown to build representations of context-dependent knowledge. Li et al. (2021) show that LMs track aspects of entity state over a discourse, and that this state can be extracted from LM representations of contextualized entity tokens. Furthermore, many LMs have been (indirectly) evaluated on their ability to track context-dependent knowledge through downstream reading comprehension tasks in which the LM is expected to answer questions about facts within a discourse.
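A sketch of how such contextualized entity representations are typically read out, taking the hidden state at an entity mention; the layer choice and example text are illustrative:

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "I poured out the bottle. The bottle"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states

# Contextualized representation of the final mention of "bottle":
LAYER = 8                        # illustrative middle layer
h_entity = hidden[LAYER][0, -1]  # (hidden_size,)
# A probe trained on such vectors can predict entity state,
# e.g. whether the bottle is now empty (Li et al., 2021).
```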

Editing LLMs

In the past, LLMs have been predominantly adapted to new tasks and knowledge through fine-tuning (Devlin et al., 2019). More recently, with very large LMs, new classes of adaptation methods have been introduced, generally falling into one of two categories: (1) prompt design approaches prepend a textual prompt to each example specifying the adaptation target (Brown et al., 2020); (2) prefix-tuning approaches prepend continuous learned tokens to each example, which specify a task for the LM much as a textual prompt might (Li & Liang, 2021; Lester et al., 2021). Control token approaches similarly use learned tokens to control aspects of LM output, including sentiment (Dathathri et al., 2020), style (Keskar et al., 2019), and semantics (Ross et al., 2022). Prompts can be fragile: LMs may fail to generate text consistent with the prompt, as in Fig. 1.
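A minimal sketch of the prefix/prompt-tuning idea, prepending learned continuous embeddings to the token embeddings (names and prefix length are illustrative; the training loop is omitted):

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

PREFIX_LEN = 10
soft_prompt = nn.Parameter(
    torch.randn(PREFIX_LEN, model.config.hidden_size) * 0.02
)  # the only trainable parameters; the LM itself stays frozen

text = "London Bridge is located in"
ids = tokenizer(text, return_tensors="pt").input_ids
tok_embeds = model.transformer.wte(ids)                  # (1, T, H)
embeds = torch.cat([soft_prompt.unsqueeze(0), tok_embeds], dim=1)

out = model(inputs_embeds=embeds)
# During training, a cross-entropy loss on the desired continuation
# updates only soft_prompt, steering generation as a textual prompt would.
```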