Can Reinforcement Learning Be Enough for Thinking?
In the context of large language models (LLMs), recent work by Guo et al. (2025) proposed a unified model in which System 2 style “thinking” emerged as a consequence of model-free RL applied to solving mathematics problems. While this “thinking” appears to resemble the thoughts a human would have when answering a given query, these outputs arise solely in service of reward maximization. The approach is interesting because it suggests that a form of System 2 processing can emerge when we view thinking as control and reinforce thought patterns that lead to reward. Our objective is to develop a domain-independent understanding of the conditions under which model-free RL will select for thinking behavior. Specifically, we want to answer the question:
Under what conditions will model-free reinforcement learning give rise to thinking as a strategy for reward maximization?
Informally, we define thinking to be actions that do not directly produce reward or affect the external state of an agent’s environment but that lead the agent to take a course of action that increases the reward it will receive in the future. To answer our research question, we first formulate a minimal extension of the classical MDP model that explicitly represents thought actions and a notion of a controllable thought state. We then show how policy initialization plays a central role in whether policy iteration will select thought actions. Under our theoretical model, we show that thinking can be viewed as selecting between a set of sub-policies that are already contained in the learning agent’s policy function, and that thought actions can be interpreted as the agent choosing to run one or more steps of policy improvement before continuing to act. We then discuss how LLMs instantiate our thought MDP formalism and provide empirical evidence that they exhibit the necessary conditions for thinking to emerge. Our final contribution is to introduce a simple domain and multi-task pre-training setup that induces the conditions under which model-free RL will discover thinking behavior. This simple domain provides a basis for future work studying agents that learn to think and act. We conclude by discussing open questions and directions for future work raised by this model of deliberative thinking in RL.
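To make the informal definition concrete, the following is a minimal sketch, not the paper’s formal model, of an MDP extended with a thought action. All names (`ThoughtMDP`, `think`, the answer actions) are illustrative assumptions. The thought action yields no reward and leaves the external state untouched, yet it writes to an internal thought state that steers the policy toward the higher-reward course of action:

```python
from dataclasses import dataclass

# Hypothetical minimal "thought MDP" (illustration only).
# The agent must answer "answer_A" or "answer_B"; the correct answer is
# determined by the query. A "think" action gives zero reward and does not
# change the environment, but it updates an internal, controllable thought
# state that the policy can condition on.

@dataclass
class ThoughtMDP:
    query: str                 # external state (fixed within an episode)
    thought: str = ""          # internal, controllable thought state

    def step(self, action):
        if action == "think":
            # Thought action: no reward, external state unchanged, but the
            # thought state now encodes the result of a useful computation.
            self.thought = "answer_is_" + self.query[-1]
            return 0.0, False
        # Answer actions end the episode; reward 1 for the correct answer.
        correct = "answer_" + self.query[-1]
        return (1.0 if action == correct else 0.0), True

def policy(env):
    # Before thinking, the thought state is empty and the policy thinks;
    # after thinking, the policy simply reads off the stored answer.
    if env.thought.endswith("A"):
        return "answer_A"
    if env.thought.endswith("B"):
        return "answer_B"
    return "think"

env = ThoughtMDP(query="task_B")
total, done = 0.0, False
while not done:
    reward, done = env.step(policy(env))
    total += reward
print(total)  # the thinking policy earns the full reward of 1.0
```

A policy that skipped the thought action would have to guess between the two answers; here the zero-reward thought action is selected precisely because it changes which external action the agent takes next.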