Reinforcement Learning for LLMs · LLM Reasoning and Architecture

How does thinking emerge from policy selection in RL?

Explores whether thinking is fundamentally about selecting between existing sub-policies rather than building new reasoning from scratch. This matters for understanding how RL training unlocks latent capabilities in language models.

Note · 2026-02-22 · sourced from Reinforcement Learning
How should we allocate compute budget at inference time? What kind of thing is an LLM really?

What is thinking, computationally? This paper proposes a minimal formalization: thinking is taking actions that don't directly produce reward or affect the external environment but that lead the agent to take a different, higher-reward course of action. The key construct is a "thought MDP" — a classical MDP extended with explicit thought actions and a controllable thought state.
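To make the construct concrete, here is a minimal sketch of how a thought MDP might be represented in code. This is an illustration of the idea, not the paper's formalism; the names (ThoughtMDP, think, and so on) are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative sketch of a thought MDP: a classical MDP augmented with
# thought actions that modify only an internal thought state, produce no
# reward, and leave the external environment untouched. All names assumed.

@dataclass
class ThoughtMDP:
    env_actions: set         # actions that affect the external environment
    thought_actions: set     # actions that only change the thought state
    transition: Callable     # (env_state, env_action) -> next env_state
    reward: Callable         # (env_state, env_action) -> float
    think: Callable          # (thought_state, thought_action) -> next thought_state

    def step(self, env_state, thought_state, action):
        if action in self.thought_actions:
            # Thinking: no reward, no effect on the external environment.
            return env_state, self.think(thought_state, action), 0.0
        # Acting: classical MDP dynamics; the thought state persists.
        return self.transition(env_state, action), thought_state, self.reward(env_state, action)
```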

The central theoretical result identifies the conditions under which thinking improves return. Under this formalization, thinking can be viewed as selecting between a set of sub-policies already contained in the agent's policy function: thought actions are interpretable as the agent choosing to run one or more steps of policy improvement before continuing to act. Thinking therefore doesn't require new capabilities, only a policy initialization rich enough to contain multiple sub-policies worth selecting between.
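A sketch of that interpretation, under assumed names and a generic value-estimate interface: one thought action scores each sub-policy already present in the initialization and repoints the thought state at the greedy one, which is a single step of policy improvement restricted to that set.

```python
import numpy as np

def thought_step(sub_policies, q_values, state, current_idx):
    """One thought action as a single step of policy improvement,
    restricted to the sub-policies the initialization already contains.

    sub_policies: list of callables, each mapping state -> action.
    q_values:     callable (state, action) -> float value estimate.
    current_idx:  index of the sub-policy the thought state points at.
    """
    # Score each sub-policy by the estimated value of the action it
    # would take now, then repoint the thought state at the greedy one.
    scores = [q_values(state, pi(state)) for pi in sub_policies]
    best_idx = int(np.argmax(scores))
    return best_idx if scores[best_idx] > scores[current_idx] else current_idx
```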

This reframes DeepSeek-R1's "aha moment" and similar findings. The thinking tokens that emerge during RL training aren't building new reasoning capabilities from scratch. They're learning to select which existing sub-policy to deploy. The rich policy initialization from pre-training provides the raw material; RL provides the selection pressure.

The connection to existing notes is tight. Where "Does RL teach reasoning or just when to use it?" poses the question, the thought MDP supplies the formal mechanism: "when to activate" is precisely "which sub-policy to select." And for "Can models learn when to think versus respond quickly?", the thought MDP explains why this works: the model is learning a meta-policy over its own sub-policies.
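One way to picture that meta-policy, as an assumed toy implementation rather than anything from either note: at each state it either emits a thought action that reselects the sub-policy or acts with the one currently selected, so deciding when to think and deciding which sub-policy to select are literally the same choice.

```python
def meta_policy(env_state, thought_state, sub_policies, q_values, margin=0.0):
    """Toy meta-policy over sub-policies: returns either a thought action
    (switch sub-policies) or an environment action (act now)."""
    current = sub_policies[thought_state]
    # Estimated value of each sub-policy's immediate action in this state.
    values = [q_values(env_state, pi(env_state)) for pi in sub_policies]
    best = max(range(len(sub_policies)), key=lambda i: values[i])
    if best != thought_state and values[best] - values[thought_state] > margin:
        return ("think", best)           # thought action: reselect sub-policy
    return ("act", current(env_state))   # act with the current sub-policy
```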

The deeper philosophical implication is that thinking is not a unitary capability but a structural property that emerges when the right conditions are met: a rich enough policy space, a selection mechanism (RL), and a task structure where delayed action (thinking first) is rewarded. LLMs instantiate these conditions because pre-training provides the policy richness and RL provides the selection pressure.
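A toy calculation, with all numbers assumed, makes the third condition concrete: a step spent thinking earns nothing and defers all later reward by the discount factor, so thinking first beats acting immediately exactly when the discounted value of the reselected sub-policy exceeds the value of the default one.

```python
# Toy numbers (assumed) for the "delayed action is rewarded" condition.
gamma = 0.99       # discount factor
v_default = 0.50   # value of acting immediately with the initial sub-policy
v_best = 0.90      # value of the sub-policy a thought step would select

# One step of thinking earns no reward and discounts everything after it,
# so thinking pays off exactly when gamma * v_best > v_default.
print(gamma * v_best, ">", v_default, "->", gamma * v_best > v_default)
# 0.891 > 0.5 -> True: here the model should think before acting.
```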


Source: Reinforcement Learning

Original note title: "thinking emerges under model-free RL when policy initialization provides sub-policies to select between"