Integrating Large Language Models and Reinforcement Learning for Non-Linear Reasoning

Paper · arXiv 2410.13501 · Published October 17, 2024

Large Language Models (LLMs) have been shown to struggle with long-term planning, which may be caused by the limited way in which they explore the space of possible solutions. We propose an architecture in which a Reinforcement Learning (RL) Agent guides an LLM's exploration of the solution space: (1) the Agent has access to domain-specific information, and can therefore make decisions about the quality of candidate solutions based on specific and relevant metrics that were not explicitly considered by the LLM's training objective; (2) the LLM can focus on generating immediate next steps, without the need for long-term planning. We allow non-linear reasoning by exploring alternative paths and backtracking. We evaluate this architecture on the program equivalence task, and compare it against Chain of Thought (CoT) [41] and Tree of Thoughts (ToT) [42].

Large Language Models (LLMs) have recently been successfully applied to a wide range of tasks, including code-related tasks [8, 23, 26, 29]. When looking at the search strategies that they employ, LLMs are still mostly restricted to linear search during inference. For instance, the popular Chain of Thought (CoT) approach [41] attempts to break the reasoning process into smaller steps, denoted by intermediate thoughts of the LLM, thus enabling the LLM to perform logical reasoning. While multiple thought candidates corresponding to different paths through the search space exist at each step, the LLM always commits to one option, linearly building a solution. Several works have shown that this space exploration strategy may lead to difficulties with long-term proof planning [35].

Given the very large search space exhibited by many problems, exploration of different paths and backtracking may be beneficial. This problem was recently addressed by an approach coined Tree of Thoughts (ToT) [42], which allows LLMs to make decisions by considering multiple different reasoning paths and self-evaluating choices to decide the next action. While ToT was shown to enhance LLMs' problem-solving abilities for some tasks, it was not investigated in the context of code-related tasks. For code generation in particular, self-evaluation is challenging for the LLM because of the mismatch between training and inference metrics, which was highlighted by several works [7, 22]. While models are trained using a next-token prediction objective, which maximizes the likelihood of the next ground-truth token, the code generated at inference is usually evaluated on its ability to compile and pass available unit tests.
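The inference-time criterion can be made concrete with a small sketch (the helper name is ours, not from the paper): a generated program is accepted if it compiles and passes the available unit tests, independently of its token likelihood.

```python
def passes_inference_metric(src, unit_tests):
    """Accept a generated program iff it compiles and passes all unit tests.

    `unit_tests` is a list of callables that receive the program's global
    namespace and return True/False. This is a simplified stand-in for the
    evaluation described in the text, not the paper's implementation.
    """
    try:
        code = compile(src, "<generated>", "exec")  # "does it compile?"
    except SyntaxError:
        return False
    scope = {}
    exec(code, scope)  # define the program's functions in `scope`
    return all(test(scope) for test in unit_tests)
```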

The only work we are aware of that explicitly considers intermediate reasoning steps uses a natural language question-answering task, where examples are derived from a synthetic world model represented in first-order logic [35]. However, it is unclear how to apply this approach to code-related tasks.

We focus on the program equivalence task: given programs 𝐴 and 𝐵, we ask whether they are semantically equivalent, meaning that they have the same I/O behaviour. The reasoning required to prove equivalence typically involves identifying a sequence of semantics-preserving transformations that converts program 𝐴 into program 𝐵.
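As a toy illustration (our own example, not drawn from the paper's benchmark), the two Python programs below are semantically equivalent: one sums a list with an explicit loop, the other with the built-in `sum`. A transformation proof would rewrite the loop into the library call; a behavioural check over sample inputs gives (non-exhaustive) evidence of equivalence.

```python
# Program A: explicit accumulation loop.
def program_a(xs):
    total = 0
    for x in xs:
        total += x
    return total

# Program B: same I/O behaviour via the built-in.
def program_b(xs):
    return sum(xs)

# Non-exhaustive behavioural check over sample inputs.
samples = [[], [1], [1, 2, 3], [-5, 5]]
equivalent_on_samples = all(program_a(s) == program_b(s) for s in samples)
```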

In addition to assessing the reasoning processes of CoT and ToT, we introduce and evaluate a novel architecture that expands upon the non-linear solution exploration concept of ToT. However, instead of relying on the LLM for evaluating candidate reasoning steps, this task is delegated to a Reinforcement Learning (RL) Agent equipped with domain-specific knowledge. This approach aims to overcome the limitations observed in LLM reasoning with CoT and ToT by placing the RL Agent in charge of the critical task of proof planning.
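The cooperation between the two components can be sketched as a beam-style search loop. This is a simplified, hypothetical rendering (all names are ours): `propose` stands in for the LLM generating candidate next steps, and `score` stands in for the RL Agent's domain-specific evaluation; keeping several candidates alive and re-ranking them at every step is what enables exploring alternative paths and backtracking.

```python
def guided_search(source, target, propose, score, max_steps=10, beam=3):
    """Agent-guided exploration sketch: the LLM proposes next steps,
    the Agent ranks them, and only the top `beam` candidates survive."""
    frontier = [source]  # candidate programs under consideration
    for _ in range(max_steps):
        candidates = []
        for program in frontier:
            # `propose` plays the role of the LLM: immediate next steps only.
            candidates.extend(propose(program))
        if not candidates:
            break  # no further steps proposed anywhere on the frontier
        # `score` plays the role of the Agent's domain-specific evaluation.
        candidates.sort(key=lambda p: score(p, target), reverse=True)
        frontier = candidates[:beam]  # implicit backtracking via re-ranking
        if any(p == target for p in frontier):
            return True
    return False
```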

This architecture aims to take advantage of the Agent and LLM’s respective strengths:

• The Agent has access to domain-specific information, and can therefore make decisions about the quality of the candidates based on specific and relevant metrics, which were not explicitly considered by the LLM’s training objective.

• The LLM can focus on generating immediate next steps, without the need for long-term planning.

For the program equivalence task, the RL Agent leverages domain-specific information by utilising a syntax checker, running unit tests, and applying code similarity metrics.
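A minimal sketch of how such an evaluation might combine the three signals (our own simplification: the paper's Agent is learned, and the similarity measure here is a crude token-set Jaccard rather than the metrics actually used):

```python
import ast

def agent_score(candidate_src, target_src, unit_tests):
    """Hypothetical domain-specific score combining the three signals:
    syntax check, unit tests, and a cheap code-similarity proxy."""
    # 1. Syntax check: a candidate that does not parse scores zero.
    try:
        ast.parse(candidate_src)
    except SyntaxError:
        return 0.0
    # 2. Unit tests: execute the candidate and count passing tests.
    scope = {}
    exec(candidate_src, scope)
    passed = sum(1 for test in unit_tests if test(scope))
    # 3. Similarity to the target: Jaccard overlap of token sets.
    a, b = set(candidate_src.split()), set(target_src.split())
    jaccard = len(a & b) / len(a | b) if a | b else 1.0
    return passed / max(len(unit_tests), 1) + jaccard
```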

Key findings. Our experiments led to several key findings, which are discussed in detail in the rest of the paper:

(1) For the program equivalence task, where multiple intermediate reasoning steps are needed, CoT often struggles to maintain the accuracy of these steps. In our setting, this means that the intermediate programs frequently fail to maintain the same behaviour as the source program, and that, for equivalent programs, the final program in the transformation sequence has limited similarity to the target program. This aligns with prior research, which concludes that, when multiple valid deduction paths exist, LLMs often fail to systematically explore the different options, resulting in poor long-term planning [35].

(2) While ToT does better than CoT with respect to both maintaining the behaviour of the source program, as well as increasing similarity with the target program, the results are still poor, especially for the latter aspect. We hypothesise that this is caused by the fact that the LLM has to self-evaluate programs in order to decide the next action. However, this evaluation requires domain-specific information, which was not part of the LLM’s training objective.

(3) The Agent architecture outperforms CoT and ToT in the evaluation of both the intermediate reasoning and the downstream task.

Contributions. We make the following key contributions:

• We designed an RL and LLM cooperative approach for the exploration of the solution space, suitable for tasks that require long-term planning, and where success depends on domain-specific criteria that were not explicitly considered during the LLM's training.

• We evaluated our architecture on program equivalence queries, where we investigated both the downstream task, which is the classification result, and the intermediate proof steps. For this task, our architecture compared favourably against CoT and ToT prompting.

To trigger the ToT reasoning, we used the ToT prompt proposed in [13]: “Imagine n different experts are answering this question. All experts will write down 1 step of their thinking, then share it with the group. Then all experts will go on to the next step, etc. If any expert realizes they’re wrong at any point, then they leave. The question is ...”. We experimented with a few values for the number of experts, 𝑛, and found that optimal performance is achieved with three experts. When the number exceeds three, the LLM often groups experts together, leading to a severe decline in performance.
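For reproducibility, the quoted prompt can be assembled programmatically; the helper below is our own minimal formatting sketch, parameterised by the number of experts `n`:

```python
def tot_prompt(question, n=3):
    """Build the ToT trigger prompt quoted above for `n` experts
    (n=3 was the best-performing setting in the text)."""
    return (
        f"Imagine {n} different experts are answering this question. "
        "All experts will write down 1 step of their thinking, then share "
        "it with the group. Then all experts will go on to the next step, "
        "etc. If any expert realizes they're wrong at any point, then they "
        "leave. The question is " + question
    )
```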

The final programs generated by the Agent architecture are much more similar to the target program than those generated by the CoT and ToT prompts. This supports our hypothesis that the ability to explore different paths through the search space, combined with access to domain-specific information (i.e., a syntax checker, unit tests, and a code similarity metric), leads to better performance.

While, initially, LLMs were used as black-box, monolithic entities, recently, there has been a shift towards architectures that foster some form of logical reasoning as part of the problem-solving process, sometimes by leveraging additional, possibly non-neural systems. For instance, Karpas et al. propose a neuro-symbolic architecture, dubbed the Modular Reasoning, Knowledge, and Language system, denoting a flexible architecture with multiple neural models, complemented by discrete knowledge and reasoning modules [16]. Lazaridou et al. embed the model within a simple program which, during inference, allows the system to leverage results returned from the web using Google Search without the need for a dedicated retrieval-augmented model. The Chain of Thought approach aims to improve the ability of LLMs to perform complex reasoning by generating a series of intermediate reasoning steps [41].

Other works focus on decision-making and space exploration, given that LLMs were shown to have difficulty with proof planning when using a linear search strategy. In particular, when multiple valid deduction steps are available, they are not able to systematically explore the different options [35]. Several works attempt to ameliorate this. Schlag et al. introduced LLM programs, where an LLM is embedded in a classic program to carry out more complex tasks [36]. This method recursively decomposes the main problem into subproblems until each can be solved by a single query to the model. LLM+P [25] delegates the actual planning process to a classical planner in order to solve long-horizon robot planning problems. Tree of Thoughts integrates thought sampling and value feedback, enabling effective search inside a reasoning tree [42]; it implicitly builds in decision-making and planning. A similar architecture is proposed by Long, who augments an LLM with additional modules, including a prompter Agent, a checker module, a memory module, and a ToT controller. The approaches based on ToT are the closest to our work. However, there are several differences. While [42] uses DFS/BFS/beam search, we make use of an RL Agent.

Moreover, the Agent has access to domain-specific knowledge, which in our case comprises a syntax checker, unit tests, and the CodeBLEU and Jaccard metrics.
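Of these metrics, the Jaccard similarity is straightforward to sketch. The version below is a simplified stand-in (whitespace tokenisation is our own choice; practical implementations tokenise code properly):

```python
def jaccard_similarity(code_a, code_b):
    """Token-level Jaccard similarity |A ∩ B| / |A ∪ B| over the sets of
    whitespace-separated tokens of the two code snippets."""
    a, b = set(code_a.split()), set(code_b.split())
    if not a and not b:
        return 1.0  # two empty snippets are trivially identical
    return len(a & b) / len(a | b)
```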