RLAD: Training LLMs to Discover Abstractions for Solving Reasoning Problems
Abstract: Reasoning requires going beyond pattern matching or memorization of solutions to identify and implement “algorithmic procedures” that can be used to deduce answers to hard problems. Doing so requires realizing the most relevant primitives, intermediate results, or shared procedures, and building upon them. While RL post-training on long chains of thought ultimately aims to uncover this kind of algorithmic behavior, most reasoning traces learned by large models fail to consistently capture or reuse procedures, instead drifting into verbose and degenerate exploration. To enable more effective reasoning, we introduce reasoning abstractions: concise natural-language descriptions of procedural and factual knowledge that guide the model toward learning successful reasoning. We train models to propose multiple abstractions given a problem, followed by RL that incentivizes building a solution while using the information provided by these abstractions. This results in a two-player RL training paradigm, abbreviated as RLAD, that jointly trains an abstraction generator and a solution generator. This setup effectively enables structured exploration, decouples the learning signals of abstraction proposal and solution generation, and improves generalization to harder problems. We also show that allocating more test-time compute to generating abstractions is more beneficial for performance than generating more solutions at large test budgets, illustrating the role of abstractions in guiding meaningful exploration.
Modern machinery for training large language models (LLMs) to reason relies on incentivizing longer chains of thought via reinforcement learning (RL). This training approach largely incentivizes “depth”: subsequent training iterations increase response length by incorporating new operations that usually attempt to verify or build on top of the line of reasoning already being pursued by the model [33]. This often results in very long chains of thought that appear to explore the solution search space, but devolve into frequent logic switches and degenerate exploration (also referred to as “underthinking” [43]). One way to avoid this issue altogether is to directly optimize for “breadth”: train language models to explore a diverse array of solution strategies, rather than committing to a seemingly good strategy right away and refusing to budge from it as more test-time compute is spent [47, 49].
How can we make models explore a breadth of reasoning strategies for a given problem? Abstractly, the most natural approach is to train models to hypothesize new strategies for attacking difficult problems and then attempt to utilize these strategies in the solution. We can do this by making models capable of discovering reasoning abstractions: compressed representations of shared procedures that underlie multiple candidate solutions to a problem. For example, in math reasoning, such abstractions might correspond to useful intermediate lemmas, or even intermediate steps that do not succeed but illustrate what not to do. When presented in context, these abstractions function akin to “hints” on an exam, enabling LLMs to solve harder problems by building on the insights appearing in the abstraction. That is, when conditioned on abstractions, RL training should teach the LLM to implement useful meta-strategies that utilize and compose the procedural information in the abstraction as effectively as possible to solve the problem, rather than attempting to search over the procedural information itself. This naturally boosts the diversity of solution strategies and behaviors that a model learns to utilize when encountering an unseen problem, in contrast to committing to a narrow set of approaches. In RL terminology, abstractions serve as high-level subgoals, skills, or priors (depending on context), guiding the low-level solution-generating policy.
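To make the idea of conditioning on an abstraction concrete, the sketch below shows one hypothetical way a proposed abstraction could be placed in context alongside the problem, analogous to a hint on an exam. The prompt template, function name, and example abstraction are illustrative assumptions on our part, not the paper’s actual format.

```python
# Hypothetical sketch of abstraction-conditioned prompting. The template and
# names below are illustrative assumptions, not the paper's actual format.

ABSTRACTION_PROMPT = """Problem:
{problem}

Hint (reasoning abstraction):
{abstraction}

Drawing on the procedure in the hint where it helps, solve the problem
step by step and report the final answer."""

def build_prompt(problem: str, abstraction: str) -> str:
    """Place a proposed abstraction in context, akin to a hint on an exam."""
    return ABSTRACTION_PROMPT.format(problem=problem, abstraction=abstraction)

# Example usage with a toy math problem and a lemma-style abstraction.
prompt = build_prompt(
    problem="How many positive divisors does 360 have?",
    abstraction=("If n = p1^a1 * ... * pk^ak for distinct primes p1..pk, the "
                 "number of divisors of n is (a1 + 1) * ... * (ak + 1)."),
)
```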
In this work, we imbue LLMs with the capability of proposing and utilizing abstractions for solving problems. Concretely, we build reasoning models that, given an input problem, first propose one or more reasoning abstractions. Subsequently, they generate a solution that utilizes the information and principles prescribed by these abstractions. To achieve this, we jointly train two LLMs via RL post-training: (1) an abstraction generator, and (2) an abstraction-conditioned solution generator. The abstraction generator is rewarded for the improvement in the accuracy of the solution generator that stems from conditioning on the abstractions it proposes. The solution generator is rewarded for accurately solving the problem when using the abstraction. To obtain a good initialization for RL training, we warmstart the abstraction generator by running supervised fine-tuning (SFT) on paired problem-abstraction data obtained with the help of stronger models. Specifically, we generate abstractions by summarizing multiple candidate solutions to a problem and prompting an LLM to produce a few diverse abstractions. Once trained, the abstraction generator does not utilize any guidance from a larger model.
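The following is a minimal sketch of how the two rewards described above could be computed for a single problem; it reflects our reading of the setup rather than the authors’ implementation. Here `abstraction_gen(problem)` and `solution_gen(problem, abstraction=None)` are hypothetical stand-ins for sampling from the two policies, and `is_correct` for the final-answer verifier; rewarding improvement against an unconditioned baseline is our interpretation of “improvement in the accuracy of the solution generator.”

```python
# Minimal sketch of the two-player RLAD reward computation (an assumption
# about the setup, not the authors' code).

def rlad_rewards(problem, answer, abstraction_gen, solution_gen,
                 is_correct, n_abs=4, n_sol=8):
    # Baseline accuracy of the solution generator without any abstraction,
    # so the abstraction generator is credited only for the improvement
    # it induces over unconditioned solving.
    base = [solution_gen(problem) for _ in range(n_sol)]
    base_acc = sum(is_correct(s, answer) for s in base) / n_sol

    abs_rewards, sol_rewards = [], []
    for _ in range(n_abs):
        abstraction = abstraction_gen(problem)
        sols = [solution_gen(problem, abstraction) for _ in range(n_sol)]
        # Solution generator: rewarded for solving the problem correctly
        # when conditioned on the abstraction.
        accs = [float(is_correct(s, answer)) for s in sols]
        sol_rewards.append(accs)
        # Abstraction generator: rewarded for the accuracy gain over the
        # unconditioned baseline attributable to this abstraction.
        abs_rewards.append(sum(accs) / n_sol - base_acc)
    return abs_rewards, sol_rewards
```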
Related Work
Scaling test-time compute and exploration. Recent work highlights the promise of scaling test-time compute in different ways. One approach involves parallel sampling: sampling multiple reasoning rollouts and then selecting a winner via a scoring rule [4, 7, 11, 36, 37, 40, 42, 45]. A complementary line of work iteratively edits a single trace, attempting to implement a form of sequential search within a single solution trace [13, 20, 27, 28]. The sequential approach performs worse on harder problems [29, 37], where it often gets trapped in seemingly promising but ultimately suboptimal strategies [25], yet it outperforms parallel search on easy and medium-difficulty problems [37]. Our approach of proposing and leveraging abstractions enables a hybrid between sequential and parallel sampling, guided by the proposed abstractions. Concurrent work [25] studies directly interleaving parallel and sequential samples; while similar in motivation to ours, it only distills this interleaved structure into the model and does not run RL training to optimize the parallel and sequential sampling procedures it employs. Prior work has also utilized hand-designed scaffolds to integrate multi-step evaluations of intermediate hypotheses into reasoning [11, 12, 15, 46]. In contrast, we do not rely on pre-defined interfaces but learn to automatically propose useful abstractions.
Using prior knowledge for LLM reasoning. Several threads of work converge on the idea that textual artifacts such as examples, plans, or prompts can serve as reusable knowledge that steers LLM behavior. Existing retrieval-augmented generation (RAG) pipelines assume a static corpus, typically of human-written text, and focus on improving retrieval heuristics [2, 3, 14, 17, 39, 41]. Many works use LLMs to learn or refine prompts, either in an input-agnostic fashion [8, 26, 44, 54] or through input-specific edits based on feedback [9, 18, 21, 35, 50]. Other related work explores the use of synthetic demonstrations [52], scratchpads [24], and memory-augmented agents [32] to encode prior problem-solving knowledge. Two recent works demonstrate that LLMs can accumulate and reuse their own experience across tasks [38, 53]. While one can view our abstractions as a form of prior procedural and factual knowledge produced before the model’s attempt, this knowledge is (a) input-dependent and (b) not acquired from an external source at deployment, but rather “proposed” by the model itself. Imbuing models with this capability requires a cooperative RL training process. To our knowledge, prior work on textual artifacts does not train models to be capable of generating these artifacts on their own.