Compositional Reasoning with Transformers, RNNs, and Chain of Thought

Paper · arXiv 2503.01544 · Published March 3, 2025

Large language models [Touvron et al., 2023, Anil et al., 2023, Achiam et al., 2023] are increasingly used to perform logical reasoning and to solve other problems that require algorithmic thinking. To understand the power and limitations of these models, it is essential to determine what kinds of computational problems they are capable of solving. To this end, a long line of theoretical work has studied the expressive power of various language modeling architectures on simple tasks like copying strings, recognizing formal languages, and determining whether a graph is connected [Jelassi et al., 2024, Sanford et al., 2024b, Strobl et al., 2024]. Such studies typically provide constructions, or prove impossibility results, for a particular architecture or paradigm—such as recurrent neural networks, transformers, and transformers with test-time scaling (“chain of thought”)—following the progress of the field.

In this paper, we take a broader view. We compare the abilities of several different language modeling architectures to solve a class of problems, which we call Compositional Reasoning Questions (CRQs). We prove that there are unavoidable trade-offs between the architectures we consider: no architecture can solve CRQs without some strong dependence on the size of the problem, but each architecture allows us to limit the use of one computational resource at the expense of the others (see Table 1). Thus, the CRQ task provides a crisp and formal way to characterize the essential differences between these architectures.

Our Compositional Reasoning Questions are inspired by the complex, multi-step questions for which LLMs are often used in practice. For example, consider the question: Amongst the record title winners of each Grand Slam, who is the player with the fewest overall titles? Answering this question requires solving several sub-questions. While some of them can be solved in parallel (i.e., finding the record holder for each Grand Slam), the overall question cannot be answered without first gathering the answers to the sub-questions. It shares this structure with arithmetic and boolean expressions like (6 + 2) · (4 − 5), and it is also similar to word problems and logic puzzles. Problems like these can be thought of as having an underlying tree structure: the leaves are the possible answers, each non-leaf node represents a sub-question, and the overall answer is the answer to the root node; see Figure 1 for an example. This type of compositional reasoning is a fundamental challenge for LLMs.
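To make the tree structure concrete, the following is a minimal sketch (not from the paper; the `Leaf`/`Node` classes and `answer` function are illustrative names) that evaluates an expression tree like (6 + 2) · (4 − 5): each internal node is a sub-question that can only be answered after its children are, mirroring how CRQ sub-answers must be gathered before the root question is resolved.

```python
from dataclasses import dataclass
from typing import Union, List

# Illustrative model of a compositional question as a tree:
# leaves hold candidate answers; each internal node is a sub-question
# (here, a binary arithmetic operation) over its children's answers.

@dataclass
class Leaf:
    value: int

@dataclass
class Node:
    op: str                 # the sub-question, e.g. "+", "-", "*"
    children: List["Tree"]  # sub-questions whose answers feed this node

Tree = Union[Leaf, Node]

OPS = {"+": lambda a, b: a + b,
       "-": lambda a, b: a - b,
       "*": lambda a, b: a * b}

def answer(t: Tree) -> int:
    """Answer the root by first answering all sub-questions below it."""
    if isinstance(t, Leaf):
        return t.value
    # Sibling sub-questions are independent and could be solved in
    # parallel, but the parent must wait for both answers.
    left, right = (answer(c) for c in t.children)
    return OPS[t.op](left, right)

# (6 + 2) * (4 - 5)
expr = Node("*", [Node("+", [Leaf(6), Leaf(2)]),
                  Node("-", [Leaf(4), Leaf(5)])])
print(answer(expr))  # -8
```

The recursion makes the key constraint explicit: the root's answer depends on every sub-answer, so any model answering the question must, in some form, propagate information up the entire tree.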