Faith and Fate: Limits of Transformers on Compositionality

Paper · arXiv 2305.18654 · Published May 29, 2023

In an attempt to demystify transformer LLMs, we investigate the limits of these models across three representative compositional tasks—multi-digit multiplication, logic grid puzzles, and a classic dynamic programming problem. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer. We formulate compositional tasks as computation graphs to systematically quantify the level of complexity, and break down reasoning steps into intermediate sub-procedures. Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching, without necessarily developing systematic problem-solving skills. To round off our empirical study, we provide theoretical arguments on abstract multi-step reasoning problems that highlight how autoregressive generations’ performance can rapidly decay with increased task complexity.

The striking discrepancy between the impressive successes of transformer LLMs on seemingly complex tasks and their astonishing failures on seemingly trivial tasks sparks critical open questions about how to faithfully interpret their mixed capabilities. Under what conditions do transformers succeed, fail, and why? What types of errors do they make? Can transformers uncover implicit problem-solving rules or be taught to follow reasoning paths?

In particular, we study three straightforward and flexible representative compositional tasks: long-form multiplication, logic grid puzzles (i.e., Einstein’s puzzle [61]), and a classic dynamic programming problem.
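To make the setup concrete, here is a minimal sketch (an illustration, not the authors' code) of casting long-form multiplication as a computation graph. The node naming, the networkx dependency, and the complexity proxies (graph size and depth) are assumptions of this sketch, and carries are omitted for brevity:

```python
# Illustrative sketch: long-form multiplication as a computation graph,
# in the spirit of the paper's formulation. Details are simplified.
import networkx as nx  # assumed available; any DAG representation would do

def multiplication_graph(x: int, y: int) -> nx.DiGraph:
    """Build a DAG whose nodes are intermediate values of long-form x * y."""
    g = nx.DiGraph()
    x_digits = [int(d) for d in str(x)][::-1]   # least-significant digit first
    y_digits = [int(d) for d in str(y)][::-1]
    partials = []
    for i, yd in enumerate(y_digits):
        # partial product of x with the i-th digit of y, shifted by 10**i
        node = f"partial_{i}"
        g.add_node(node, value=x * yd * 10 ** i)
        for j, xd in enumerate(x_digits):
            leaf = f"mul_{j}_{i}"
            g.add_node(leaf, value=xd * yd)      # single-digit multiplication
            g.add_edge(leaf, node)               # the partial product depends on it
        partials.append(node)
    g.add_node("answer", value=x * y)
    for node in partials:
        g.add_edge(node, "answer")               # summing the partial products
    return g

g = multiplication_graph(47, 83)
print(g.nodes["answer"]["value"])                # 3901
print("graph size:", g.number_of_nodes(),        # rough complexity proxies:
      "depth:", nx.dag_longest_path_length(g))   # more digits -> larger, deeper graph
```

More digits mean more single-digit sub-steps and a larger, deeper graph, which is how complexity is scaled up in this kind of task.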

First, transformers solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching. This contrasts with the systematic multi-step reasoning approach that learns to apply underlying computational rules required for building correct answers [71, 37, 27]. Shortcut learning [29] via pattern-matching may yield fast correct answers when similar compositional patterns are available during training, but it does not allow for robust generalization to uncommon or complex examples. Second, due to error propagation, transformers may have inherent limitations on solving high-complexity compositional tasks that exhibit novel patterns. Errors in the early stages of the computational process can lead to substantial compounding errors in subsequent steps, preventing models from finding correct solutions.
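A crude back-of-envelope illustration of the error-propagation point (the independence assumption is a simplification, not the paper's formal argument): if each step of a chain is correct with probability p, the whole chain is correct with probability roughly p to the power of its depth, which collapses quickly.

```python
# Simplified illustration: per-step accuracy p, chain of depth d correct with p**d.
# Even a 95%-accurate step leaves only ~13% fully correct chains at depth 40.
for p in (0.99, 0.95, 0.90):
    print(p, [round(p ** d, 3) for d in (5, 10, 20, 40)])
```

This is the qualitative picture behind the claim that autoregressive performance can decay rapidly as task complexity (graph depth) grows.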

Empirical results show that training on task-specific data leads to near-perfect performance on in-domain instances of low compositional complexity, but fails drastically on instances outside this region. This substantial gap suggests that systematic problem-solving capabilities do not emerge from maximum likelihood training [5] on input-output sequences, even when models are prompted or trained with human-like reasoning steps (i.e., a linearization of computation graphs; §3.1). Models' success can be attributed, in part, to their exposure during training to examples containing the same subgraph computations required for solving test examples (see §3.2.2). To gain a deeper understanding of models' failures, we conduct a comprehensive analysis by decomposing their computation graphs and examining different error types. We find that while models can memorize single-step operations, they fail to compose them into correct reasoning paths, suggesting that they mostly make predictions based on shallow, rote learning rather than a deep, holistic task understanding (§3.2.3).
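To make the "linearization of computation graphs" concrete, here is a minimal sketch (assumptions and naming are not from the paper; it reuses the multiplication_graph helper from the earlier sketch) of flattening a graph into ordered, scratchpad-style reasoning steps:

```python
# Sketch: emit intermediate results in topological order, one step per node,
# roughly what "human-like reasoning steps" for a computation graph look like.
import networkx as nx

def linearize(g: nx.DiGraph) -> str:
    steps = []
    for node in nx.topological_sort(g):
        preds = list(g.predecessors(node))
        if preds:  # leaves are given inputs; only derived values become steps
            steps.append(f"{node} = f({', '.join(preds)}) = {g.nodes[node]['value']}")
    return "\n".join(steps)

print(linearize(multiplication_graph(47, 83)))
```

Training or prompting on such linearizations still does not, per the results above, yield systematic generalization beyond the complexity seen in training.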

One Thousand and One Pairs: A "novel" challenge for long-context language models

https://arxiv.org/abs/2406.16264

Synthetic long-context LLM benchmarks (e.g., “needle-in-the-haystack”) test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NOCHA, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NOCHA require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building.
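The pairing matters for how chance level is read. A minimal sketch of pair-level scoring, assuming (as suggested by the minimally-different-pairs setup, not copied from the authors' evaluation code) that a pair only counts as correct when both the true and the false claim are labeled correctly:

```python
# Sketch of pair-level accuracy: credit only when BOTH claims in a pair are
# verified correctly, so random guessing lands near 25% rather than 50%.
def pair_accuracy(predictions: list[tuple[bool, bool]]) -> float:
    """predictions[i] = (label given to the true claim, label given to the false claim)."""
    correct = sum(1 for true_pred, false_pred in predictions
                  if true_pred is True and false_pred is False)
    return correct / len(predictions)

# A model that answers "true" to everything verifies every true claim but no
# false one, so its pair accuracy is 0 despite 50% claim-level accuracy.
print(pair_accuracy([(True, True)] * 4))                                   # 0.0
print(pair_accuracy([(True, False), (True, True),
                     (False, False), (True, False)]))                      # 0.5
```

Under this kind of scoring, a degenerate always-true (or always-false) strategy gets no credit, which is what makes the pairwise design harder to game than single-claim verification.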

Google’s Gemini Pro 1.5 can process millions of input tokens at once. But can models truly utilize and reason over their claimed context lengths?

Firstly, the task necessitates reasoning over both explicit information directly stated in the text and implicit information inferred from the narrative, which is often distributed throughout the entire document.

NOCHA claims that require global reasoning are particularly difficult to verify: Table 6 contains further analysis of model accuracy on a subset of NOCHA annotated for the scope of evidence (see §2.1 for details). Overall, models perform worst for claim pairs requiring global reasoning (41.6%), followed by reasoning over a longer passage (47.6%), and, finally, sentence-level evidence (59.8%). While performance on sentence-level evidence is higher than in the other two setups, it is still much lower than the “needle-in-a-haystack” performance reported in Hsieh et al. (2024).

Importantly, our results show that models that are “state-of-the-art” according to synthetic benchmarks like NIAH actually perform very poorly on our meticulously designed dataset. Nevertheless, we argue that synthetic datasets are useful and complementary to our realistic dataset; they allow for much greater flexibility, such as easily adjusting context lengths or analyzing the lost-in-the-middle phenomenon. We encourage researchers to use a holistic approach and consider both synthetic and realistic tasks when evaluating long-context language models.