Generative Models as a Complex Systems Science: How can we make sense of large language model behavior?

Paper · arXiv 2308.00189 · Published July 31, 2023
MechInterp · Evaluations

Coaxing desired behaviors out of pretrained models, while avoiding undesirable ones, has redefined NLP and is reshaping how we interact with computers. What was once a scientific engineering discipline—in which building blocks are stacked one on top of the other—is arguably already a complex systems science—in which emergent behaviors are sought out to support previously unimagined use cases. Despite the ever-increasing number of benchmarks that measure task performance, we lack explanations of what behaviors language models exhibit that allow them to complete these tasks in the first place. We argue for a systematic effort to decompose language model behavior into categories that explain cross-task performance, to guide mechanistic explanations and help future-proof analytic research.

We present a formalism for describing behavior (§2.1), noting that this corresponds to a metamodel that predicts aspects of a primary model (Figure 3). Benchmarks help us measure performance, but rarely discover behavior (§2.2) or characterize it (§2.3). Instead, discovered behaviors motivate new benchmarks (§2.4, Figure 4).
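As a rough sketch of what a behavioral metamodel can look like (the function names and the specific behavior, copying from the prompt, are our own illustration rather than the paper's notation): a metamodel is a cheap predictor of a regularity in the primary model's outputs, and it can be scored against the primary model's actual behavior.

```python
from typing import Callable, Sequence

Prompt = str
Completion = str

def copies_from_prompt(prompt: Prompt, completion: Completion) -> bool:
    """Ground truth for one behavior: did the completion copy a 4-word span
    verbatim from the prompt?"""
    words = completion.lower().split()
    prompt_text = prompt.lower()
    return any(
        " ".join(words[i : i + 4]) in prompt_text for i in range(len(words) - 3)
    )

def copy_metamodel(prompt: Prompt) -> bool:
    """Metamodel: predict copying whenever the prompt itself is highly repetitive.
    It never runs or inspects the primary model."""
    words = prompt.lower().split()
    return len(set(words)) < 0.6 * max(len(words), 1)

def metamodel_accuracy(
    metamodel: Callable[[Prompt], bool],
    primary_model: Callable[[Prompt], Completion],
    prompts: Sequence[Prompt],
) -> float:
    """Score the metamodel's predictions against the primary model's behavior."""
    hits = sum(
        metamodel(p) == copies_from_prompt(p, primary_model(p)) for p in prompts
    )
    return hits / len(prompts)
```

The key property is that the metamodel never inspects the primary model's weights or activations; it only predicts what the primary model will do.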

Observed behavior can tell us where to look for bottom-up explanations. Al-Rfou et al. (2019) observed emergent copying behavior in Transformer Language Models (LMs), paving the way for the discovery of copying heads that make copying possible. Characterizing copying heads led to the discovery of induction heads (Elhage et al., 2021; Olsson et al., 2022): Transformer heads that copy abstract representational patterns from previous layers and appear to be responsible for in-context learning. Olsson et al. (2022) show that induction heads exhibit a variety of pattern-matching behaviors that are still not fully catalogued.
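As a behavioral-level illustration (not the mechanistic prefix-matching score of Olsson et al.), one could measure how often a model completes the induction pattern [A][B] … [A] → [B] over a token sequence. A minimal sketch, assuming `predictions[t]` holds the model's predicted token for position t + 1:

```python
from typing import Sequence

def induction_match_rate(tokens: Sequence[int], predictions: Sequence[int]) -> float:
    """Fraction of induction opportunities the model completes.

    An opportunity arises at position t when tokens[t] occurred earlier at
    position p; the induction completion is tokens[p + 1]. We count a match
    when predictions[t] (the model's predicted token for position t + 1)
    equals tokens[p + 1].
    """
    last_seen = {}  # token -> index of its most recent occurrence
    matches = opportunities = 0
    for t in range(len(tokens) - 1):
        tok = tokens[t]
        if tok in last_seen:
            opportunities += 1
            if predictions[t] == tokens[last_seen[tok] + 1]:
                matches += 1
        last_seen[tok] = t
    return matches / opportunities if opportunities else 0.0

# e.g. token ids for "A B C A B C A": a model that always completes
# the repeated pattern scores 1.0
print(induction_match_rate([1, 2, 3, 1, 2, 3, 1], [0, 0, 0, 2, 3, 1, 2]))
```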

Attempting to explain neural networks bottom up without being guided by behavior can make it difficult to interpret results. For example, many works that identify anisotropy in the embedding spaces of large LMs diagnose this as a deficiency and attempt to fix it (Wang et al., 2020; Ethayarajh, 2019; Gao et al., 2019). However, recent work suggests that this anisotropic property may not actually limit expressivity (Biś et al., 2021), may be a result of the Transformer architecture specifically (Godey et al., 2023), and may actually be helpful for language models (Rudman and Eickhoff, 2023).
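For context, anisotropy in these works is typically estimated as the expected cosine similarity between randomly sampled (contextual) embeddings; a minimal sketch of that measurement (sampling and normalization details vary across the cited papers):

```python
import numpy as np

def anisotropy(embeddings: np.ndarray, n_pairs: int = 10_000, seed: int = 0) -> float:
    """Estimate anisotropy as the mean cosine similarity between randomly
    sampled pairs of embeddings. Values near 0 indicate an isotropic space;
    values near 1 indicate embeddings crowded into a narrow cone."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    a, b = embeddings[i], embeddings[j]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    )
    return float(cos.mean())

# An isotropic Gaussian cloud scores near 0; adding a large shared
# offset (a common mean direction) pushes the score toward 1.
x = np.random.default_rng(1).normal(size=(5000, 256))
print(anisotropy(x))        # ~0.0
print(anisotropy(x + 5.0))  # close to 1.0
```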

Bottom-up investigation can reveal key properties of emergent organization within LMs, e.g., BERT replicates features of the classical NLP pipeline (Tenney et al., 2019). But when anomalous behavior is discovered, e.g., the hypothesized DALL·E 2 “hidden vocabulary” of invented words that correspond to specific image categories (Daras and Dimakis, 2022), it is difficult to investigate with bottom-up tools until we reach a better understanding of what triggers it, what its scope is, etc. There have been attempts to reject the hidden vocabulary hypothesis (Hilton, 2022), but it is a very difficult hypothesis to rebut from first principles: what tests reject the hypothesis “DALL·E 2 has a hidden set of vocabulary with clear and consistent meaning” rather than merely “this specific mapping from the vocabulary to features isn’t correct”?
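One concrete test might look like the sketch below, with purely hypothetical helpers: `generate` stands in for a text-to-image model and `classify` for an image classifier, and the test asks whether a candidate word's generations are labelled more consistently than generations from matched nonsense controls. A negative result rejects only this particular operationalization, not every variant of the hypothesis, which is exactly the difficulty.

```python
from collections import Counter
from typing import Any, Callable, Sequence

def consistency(labels: Sequence[str]) -> float:
    """Fraction of samples assigned the modal label."""
    return max(Counter(labels).values()) / len(labels)

def hidden_vocab_test(
    candidate_word: str,
    control_words: Sequence[str],
    generate: Callable[[str], Any],   # hypothetical: prompt text -> image
    classify: Callable[[Any], str],   # hypothetical: image -> category label
    n_samples: int = 32,
) -> dict:
    """One operationalization: the candidate word's generations should be
    labelled more consistently than generations from matched nonsense
    controls. Rejecting this rejects only this operationalization."""
    candidate = consistency(
        [classify(generate(candidate_word)) for _ in range(n_samples)]
    )
    controls = [
        consistency([classify(generate(w)) for _ in range(n_samples)])
        for w in control_words
    ]
    return {"candidate": candidate, "controls": controls}
```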

5 Conclusion

How should we study models of data, when we don’t fully understand the models or the data? We should study them first by asking what models do, before attempting the more complicated question of how and the bottomless question of why.

In this paper, we presented a thought experiment: the Newformer, a model that would be impossible to study with many of the techniques we use to understand Transformer models today.

We argue that focusing on what behaviors explain a model’s performance across tasks will lead us to a deeper understanding of generative models’ tendencies and guide bottom-up mechanistic explanation, as well as provide building blocks for evaluations.

We discuss how generative models are well captured by the definition of a complex system, due to the emergent behaviors they exhibit. This separates generative models from traditional machine learning, where models often served as explanations because the behaviors of interest were architected directly into them. It also creates a need for metamodels that help us predict regularities in generative model outputs and thereby understand them better.