Thinking LLMs: General Instruction Following with Thought Generation

Paper · arXiv 2410.10630 · Published October 14, 2024
Cognitive Models · Latent Argumentation

LLMs trained in the standard alignment framework lack the basic ability of explicit thinking before answering. Thinking is important for complex questions that require reasoning and planning, but it can be applied to any task. We propose a training method for equipping existing LLMs with such thinking abilities for general instruction following, without the use of additional human data. We achieve this with an iterative search and optimization procedure that explores the space of possible thought generations, allowing the model to learn how to think without direct supervision. For each instruction, the thought candidates are scored by a judge model that evaluates only their resulting responses, and the model is then optimized via preference optimization.
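As a rough illustration of this procedure, the sketch below collects preference pairs for one iteration: sample several thought-plus-response candidates per instruction, score only the responses with a judge, and keep the best and worst full outputs as a preference pair. The helper signatures (`sample`, `judge`) and the choice of `k` are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List, Tuple

def collect_preference_pairs(
    sample: Callable[[str], Tuple[str, str]],  # instruction -> (thought, response); assumed helper
    judge: Callable[[str, str], float],        # (instruction, response) -> score; assumed helper
    instructions: List[str],
    k: int = 8,                                # candidates per instruction (illustrative)
) -> List[Tuple[str, str, str]]:
    """One TPO-style round: (instruction, chosen_output, rejected_output) triples."""
    pairs = []
    for inst in instructions:
        candidates = [sample(inst) for _ in range(k)]
        # The judge scores only the response part; thoughts are evaluated
        # implicitly through the responses they induce.
        scores = [judge(inst, response) for _, response in candidates]
        best = candidates[scores.index(max(scores))]
        worst = candidates[scores.index(min(scores))]
        # Preference pairs keep thought + response together, so preference
        # optimization (e.g. DPO) also shapes the hidden thoughts.
        pairs.append((inst, "\n".join(best), "\n".join(worst)))
    return pairs
```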

One way to increase the compute budget for harder instructions is to allow LLMs to think internally before outputting a response. This is similar to how humans take more time to think before answering complex questions. One approach is to generate thoughts as text, which takes advantage of the natural-language capabilities of LLMs: models are pre-trained on text containing human-written thoughts, which are hence encoded into the model. Chain-of-Thought (CoT) (Wei et al., 2022) is a widely used prompting technique that elicits such behavior by asking the model to write down its reasoning steps. However, the usage of CoT has been mostly limited to math and reasoning tasks; a meta-analysis by Sprague et al. (2024) found CoT methods to be unhelpful on tasks that do not involve math or logic.
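For contrast, a typical zero-shot CoT prompt simply appends a reasoning cue to the question; the reasoning steps it elicits are part of the visible answer rather than a hidden thought section:

```python
# Standard zero-shot chain-of-thought prompting: the cue "Let's think step
# by step." elicits written reasoning steps in the visible output.
question = "A train travels 120 km in 1.5 hours. What is its average speed?"
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```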

In this paper, we focus on general instruction following rather than only on math or logic tasks. We argue that “thinking” should have broad utility. For example, in a creative writing task, internal thoughts can be used to plan the overall structure and the characters. In other tasks, internal thoughts can be used to understand the user instruction better. Of course, less thinking is likely required for simpler tasks, and more thinking for more complex ones. In general, we hypothesize that such Thinking LLMs will have an advantage on all sufficiently complex tasks.

Training such Thinking LLMs is challenging due to the lack of supervised training data, as internal thoughts are often omitted in human writing.

We introduce Thought Preference Optimization (TPO), which further trains an instruction-tuned LLM to make it capable of having internal thoughts. Our method is simple and reuses many parts of existing training pipelines. The LLM is first instructed to produce an output sequence that can be divided into thought and response parts. The thought part is considered internal and is not part of the response shown to the user. We optimize this thought-and-response output through iterative Reinforcement Learning from AI Feedback (RLAIF) training. We rely on a standard judge model that is trained to evaluate responses only, and implicitly judge the quality of the thoughts via the responses they induce. This has the advantage of not requiring human-curated thoughts or a special judge model capable of evaluating thoughts.
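A minimal sketch of how the output can be split into its two parts, assuming the prompt asks the model to introduce the response with a fixed marker string; the marker below mirrors the prompt example further down, and the paper's exact delimiters may differ:

```python
RESPONSE_MARKER = "Here is my response:"  # assumed delimiter; must match the prompt

def split_thought_response(output: str) -> tuple[str, str]:
    """Split a generated sequence into (thought, response).

    The thought part stays internal and is never shown to the user; only
    the response part is returned to the user and sent to the judge.
    """
    if RESPONSE_MARKER in output:
        thought, _, response = output.partition(RESPONSE_MARKER)
        return thought.strip(), response.strip()
    # Fallback: if the model failed to emit the marker, treat the whole
    # output as the response.
    return "", output.strip()
```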

While the training process will change and optimize the type of thoughts generated, the initial thoughts are still important, as they act as a starting point. The first thought prompt given in Figure 2 (top) is more generic and leaves it up to the model what the thoughts will contain.
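The prompt below paraphrases such a generic thought prompt (the exact wording is given in Figure 2 of the paper; this is an approximation for illustration). The marker strings match the `split_thought_response()` sketch above.

```python
# Paraphrase of a generic thought prompt: it asks for internal thoughts but
# leaves their content entirely up to the model.
GENERIC_THOUGHT_PROMPT = (
    "Respond to the following user query in a comprehensive and detailed way. "
    "You can write down your thought process before responding. Write your "
    "thoughts after 'Here is my thought process:' and write your response "
    "after 'Here is my response:'.\n\n"
    "User query: {query}"
)

prompt = GENERIC_THOUGHT_PROMPT.format(query="Write a poem about autumn.")
```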

Hiding the thought part allows it to take many forms that are usually not interesting to the user: making mistakes, drafting responses and evaluating them, trying to understand the question better, etc.

While our initial thought prompting generates thoughts via the instruction-tuned model, these thoughts are not optimized to actually be useful in improving the response. We find that they typically underperform direct responses without thoughts, which instruction-tuned LLMs have been heavily optimized to produce.

Unlike in conventional RLAIF, we do not feed the whole model output to the judge. Instead, the judge sees only the response part of the output, so the thought part cannot influence its judgement. We chose this approach for several reasons. First, there is no existing judge model capable of evaluating internal thoughts. Building such a judge is inherently challenging because it is hard to collect human thoughts; even if such data were collected, it is not clear that human-written thoughts would be equally useful for LLMs. Second, the ultimate goal is to provide better responses to the user, so it may be better to optimize the final objective directly rather than rely on an auxiliary objective that might not align well with it.
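Concretely, only the response part ever reaches the judge. The sketch below builds a judge query under that constraint, reusing `split_thought_response()` from above; the scoring prompt itself is an illustrative assumption, not a specific judge model's interface.

```python
def build_judge_input(instruction: str, full_output: str) -> str:
    """Construct the judge query from the instruction and response only."""
    # Strip the hidden thought before judging: the judge never sees it,
    # so thoughts are rewarded only through the responses they produce.
    _, response = split_thought_response(full_output)
    return (
        "Review the assistant's response to the user instruction below and "
        "rate its quality from 0 to 10.\n\n"
        f"Instruction: {instruction}\n\n"
        f"Response: {response}"
    )
```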

In Figure 3, we plot the win rate across training iterations. Before training (iteration 0), the direct baseline performs much better. This is expected, as the seed model is instruction-tuned to directly output a response; simply prompting the model to write down its thought process actually hurts performance. This agrees with the findings of Sprague et al. (2024), who showed that CoT prompting helps only on math- and logic-related tasks.

This is a promising indication that the model is adapting to think in a way that uses those thoughts to improve its responses.