How does policy initialization with sub-policies enable emergent thinking?

This explores a specific theory of how reasoning ('thinking') arises in AI — that it isn't a new skill the model learns, but a selection process: when a model already contains many sub-behaviors, training learns to pick between them, and that act of choosing looks like thinking.

The central idea (see 'Does thinking emerge when agents choose between learned sub-policies?') reframes thinking through a 'thought MDP': a model is initialized with a rich repertoire of sub-policies (small learned ways of behaving), and reinforcement learning does not invent reasoning so much as apply selection pressure that learns *which* sub-policy to deploy at a given moment. The provocative claim is that thinking needs no new capability; it needs a well-stocked starting point plus pressure to choose well.
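The 'selection, not creation' idea can be made concrete with a toy sketch. Everything here is a hypothetical illustration rather than the thought-MDP formalism itself: the two sub-policies, the doubling task, and the function names are assumptions. The sub-policies are fixed at initialization; the only thing training touches is a set of selection logits, updated with a plain REINFORCE rule.

```python
import math
import random

# Hypothetical sub-policies: fixed behaviours that exist before any training.
def guess_directly(x):
    return x + 1  # cheap heuristic, wrong for this toy task

def work_step_by_step(x):
    return x * 2  # happens to be the correct procedure for this toy task

SUB_POLICIES = [guess_directly, work_step_by_step]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_selector(steps=500, lr=0.5, seed=0):
    """Learn which sub-policy to deploy; the sub-policies are never modified."""
    rng = random.Random(seed)
    logits = [0.0] * len(SUB_POLICIES)  # the only trainable parameters
    for _ in range(steps):
        x = rng.randint(2, 9)
        probs = softmax(logits)
        # Sample a sub-policy index from the current selection distribution.
        k = rng.choices(range(len(SUB_POLICIES)), weights=probs)[0]
        # Reward 1 if the chosen sub-policy solves the task (doubling x).
        reward = 1.0 if SUB_POLICIES[k](x) == 2 * x else 0.0
        # REINFORCE: d log pi(k) / d logit_i = 1[i == k] - probs[i]
        for i in range(len(logits)):
            grad = (1.0 if i == k else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)

if __name__ == "__main__":
    print(train_selector())  # selection probabilities after training
```

Under these assumptions the selector ends up placing most of its probability on `work_step_by_step`, even though neither sub-policy changed. That is the sense in which the selection pressure, not a new skill, produces the improved behaviour.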

What makes this more than a single paper's framing is how much of the corpus independently converges on the same 'selection, not creation' story. Several lines of work show that base models already carry latent reasoning that minimal intervention unlocks rather than installs (see 'Do base models already contain hidden reasoning ability?'): RL steering, decoding tweaks, and feature steering all elicit ability that was already there. The same logic shows up without any training at all: modular 'cognitive tools' that isolate reasoning operations lifted GPT-4.1 on AIME2024 from 26.7% to 43.3% (see 'Can modular cognitive tools unlock reasoning without training?'), suggesting the capability sits dormant and the win is in cleanly invoking it. If thinking is selection among pre-existing sub-policies, then the bottleneck is elicitation, which is exactly what these results report.

The corpus also illuminates *how* the selection gets sharpened over training. RL appears to move in two phases (see 'Does RL training follow a predictable two-phase learning sequence?'): first nailing execution correctness, then shifting the real optimization pressure onto strategic *planning* tokens, which is essentially the system learning to choose among higher-level sub-policies once the low-level ones are reliable. That maps neatly onto the policy-initialization picture: you can only meaningfully select between strategies once the underlying procedures are consolidated. Relatedly, abstractions can structure that selection space, forcing breadth-first exploration over diverse strategies rather than committing too early to one chain (see 'Can abstractions guide exploration better than depth alone?').

There's a tension worth noticing, though. If thinking is just selection among what's already inside, where does the initial richness come from? Two notes push back on the 'nothing new is created' reading. One treats chain-of-thought as something that can be *planted earlier*, during pretraining, by rewarding reasoning steps that increase information (see 'Can chain-of-thought reasoning be learned during pretraining itself?'); that is, you can enrich the sub-policy library upstream rather than only selecting from it later. Another shows reasoning can emerge from a completely different mechanism, energy minimization at inference time, without domain-specific training at all (see 'Can energy minimization unlock reasoning without domain-specific training?'). So 'selection among sub-policies' may be one route to thinking, not the only one.

The payoff for a curious reader: this body of work quietly inverts the intuition that smarter models *learn to reason*. The recurring finding is that capable models mostly already *can* reason, and training is the slow business of teaching them when to pull which lever. That reframes a lot of debate about reasoning — it makes 'how rich was the initialization?' and 'how good is the selection pressure?' the real questions, rather than 'did the model acquire a new skill?'

Sources (7 notes)

Does thinking emerge when agents choose between learned sub-policies?

Research formalizes thinking as selecting between sub-policies already contained in a policy function through a thought MDP framework. The key finding: thinking doesn't require new reasoning capabilities but rather rich policy initialization combined with RL-driven selection pressure.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can modular cognitive tools unlock reasoning without training?

Four cognitive tools implemented as sandboxed LLM calls improved GPT-4.1 on AIME2024 from 26.7% to 43.3% without any RL training. Modularity enforces operation isolation that pure prompting cannot guarantee, eliciting pre-existing reasoning capability.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as an exploratory action during pretraining, using log-likelihood improvement as a verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
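One way to write out a 'log-likelihood improvement' reward is sketched below. This is an illustrative form, not necessarily RLP's exact objective. The notation is assumed here: $x_t$ is the next token, $x_{<t}$ the preceding context, $c_t$ a sampled chain-of-thought, $p_\theta$ the model being trained, and $\bar{p}$ a baseline that predicts without the thought.

```latex
r_t \;=\; \log p_\theta\!\left(x_t \mid x_{<t},\, c_t\right) \;-\; \log \bar{p}\!\left(x_t \mid x_{<t}\right)
```

Under this reading the reward is positive exactly when the thought makes the observed next token more likely, so no external verifier is needed: the training text itself supplies the signal.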

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.
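Inference by energy minimization can be sketched with a toy example. This is a minimal illustration under loud assumptions: the quadratic `energy` function is hand-written and hypothetical, whereas in an Energy-Based Transformer the energy would be produced by a learned network, and the function names and hyperparameters below are invented for the sketch.

```python
def energy(x, y):
    # Hypothetical toy energy: low when prediction y is compatible with input x
    # (here, compatible means y is close to 3 * x).
    return (y - 3.0 * x) ** 2

def energy_grad_y(x, y):
    # Analytic derivative of the toy energy with respect to the prediction y.
    return 2.0 * (y - 3.0 * x)

def predict_by_minimization(x, y_init=0.0, step_size=0.1, n_steps=50):
    """Start from an initial guess and refine it by gradient descent on the energy."""
    y = y_init
    for _ in range(n_steps):
        y -= step_size * energy_grad_y(x, y)
    return y

if __name__ == "__main__":
    for n in (1, 5, 50):
        y = predict_by_minimization(2.0, n_steps=n)
        print(n, y, energy(2.0, y))  # more steps: lower energy, better prediction
```

The design point is that prediction is an optimization loop rather than a single forward pass, so spending more descent steps at inference time is one way to buy a better answer.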
