How does thinking emerge from policy selection in RL?
Explores whether thinking is fundamentally about selecting between existing sub-policies rather than building new reasoning from scratch. This matters for understanding how RL training unlocks latent capabilities in language models.
What is thinking, computationally? This paper proposes a minimal formalization: thinking is taking actions that don't directly produce reward or affect the external environment but that lead the agent to take a different, higher-reward course of action. The key construct is a "thought MDP" — a classical MDP extended with explicit thought actions and a controllable thought state.
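To make the construct concrete, here is a minimal sketch of what a thought MDP could look like in code. The class and field names (`ThoughtMDP`, `env_actions`, `thought_actions`, `env_step`, `thought_step`) are illustrative assumptions, not the paper's notation; the sketch only encodes the defining property that thought actions yield no reward and leave the external environment untouched, changing only the thought state.

```python
from dataclasses import dataclass
from typing import Callable, Hashable, Set, Tuple

# Sketch of a thought MDP: a classical MDP extended with thought actions
# and a controllable thought state. All names here are illustrative.
@dataclass
class ThoughtMDP:
    env_actions: Set[Hashable]      # ordinary actions: affect the environment, may yield reward
    thought_actions: Set[Hashable]  # thought actions: zero reward, environment unchanged
    env_step: Callable[[Hashable, Hashable], Tuple[Hashable, float]]
    thought_step: Callable[[Hashable, Hashable], Hashable]

    def step(self, env_state, thought_state, action):
        """Advance the joint (environment state, thought state) pair."""
        if action in self.thought_actions:
            # Thinking: only the internal thought state changes; no reward is produced.
            return env_state, self.thought_step(thought_state, action), 0.0
        # Acting: the external environment transitions and may produce reward.
        next_env_state, reward = self.env_step(env_state, action)
        return next_env_state, thought_state, reward
```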
The central theoretical result concerns the conditions under which thinking pays off. Under this formalization, thinking can be viewed as selecting between a set of sub-policies already contained in the agent's policy function: thought actions are interpretable as the agent choosing to run one or more steps of policy improvement before continuing to act. Thinking therefore doesn't require new capabilities; it requires a policy initialization rich enough to contain multiple sub-policies worth selecting between.
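A hedged sketch of that interpretation, using hypothetical helper names (`think_once`, `q_estimate`): a thought action performs one step of policy improvement by re-selecting, greedily with respect to current value estimates, which already-present sub-policy to follow. Nothing new is synthesized; the agent only chooses among behaviors its initialization already contains.

```python
def think_once(subpolicies, q_estimate, env_state):
    """One thought step: a policy-improvement step that re-selects which
    existing sub-policy to follow, without acting on the environment.

    subpolicies: list of callables mapping env_state -> action
    q_estimate:  callable (env_state, subpolicy) -> estimated return
    """
    # Policy improvement as selection: pick the sub-policy that currently
    # looks best. No new behavior is created, only chosen.
    return max(subpolicies, key=lambda pi: q_estimate(env_state, pi))


def act_after_thinking(subpolicies, q_estimate, env_state, think=True):
    """Optionally spend a thought step selecting a sub-policy, then act with it."""
    default = subpolicies[0]  # what the agent would do without thinking
    chosen = think_once(subpolicies, q_estimate, env_state) if think else default
    return chosen(env_state)
```

In this picture, RL training only needs to shape when the thought step is invoked and how the value estimates rank the sub-policies, which is why a rich initialization matters more than any new reasoning machinery.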
This reframes DeepSeek-R1's "aha moment" and similar findings. The thinking tokens that emerge during RL training aren't building new reasoning capabilities from scratch. They're learning to select which existing sub-policy to deploy. The rich policy initialization from pre-training provides the raw material; RL provides the selection pressure.
The connection to existing insights is tight. For Does RL teach reasoning or just when to use it?, the thought MDP provides the formal mechanism: "when to activate" is exactly "which sub-policy to select." And for Can models learn when to think versus respond quickly?, the thought MDP explains why adaptive thinking works: the model is learning a meta-policy over its own sub-policies.
The deeper philosophical implication is that thinking is not a unitary capability but a structural property that emerges when the right conditions are met: a rich enough policy space, a selection mechanism (RL), and a task structure where delayed action (thinking first) is rewarded. LLMs instantiate these conditions because pre-training provides the policy richness and RL provides the selection pressure.
Source: Reinforcement Learning
Related concepts in this collection
- Does RL teach reasoning or just when to use it?
  Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
  Formalizes: the thought MDP gives the mathematical structure for "when to activate" as sub-policy selection.
- Can models learn when to think versus respond quickly?
  Can a single LLM learn to adaptively choose between extended reasoning and concise responses based on task complexity? This matters because it could optimize compute efficiency without sacrificing accuracy on hard problems.
  Instantiates: the meta-policy over thinking vs. concise response is thought action selection.
- Do base models already contain hidden reasoning ability?
  Explores whether reasoning capability emerges during pre-training as a latent feature rather than being created by post-training methods like reinforcement learning or fine-tuning.
  Explains: latent capability = sub-policies in the policy initialization.
- Does RL training follow a predictable two-phase learning sequence?
  This explores whether reinforcement learning exhibits consistent phases where basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
  Connects: the planning phase is when the model learns to use thought actions effectively.
Original note title: thinking emerges under model-free RL when policy initialization provides sub-policies to select between