Why do models follow a two-phase pattern of procedural then strategic learning?
This explores why reinforcement learning models tend to master execution mechanics first and only later optimize higher-level planning — and what makes that ordering show up so reliably.
The most direct evidence is that the two-stage pattern, procedure before strategy, isn't a quirk of one model: across eight models, RL training reliably shows a first phase where execution correctness is the bottleneck, then a second phase where strategic planning becomes the thing worth optimizing. You can even watch it in the numbers: entropy on planning tokens keeps rising while execution entropy settles, and concentrating optimization on those planning tokens is where the late gains come from (see "Does RL training follow a predictable two-phase learning sequence?"). The ordering looks less like a training schedule and more like a dependency: you can't fruitfully explore strategy until the moves you'd execute are reliable.
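The entropy comparison above can be made concrete with a small sketch. It assumes you already have the next-token probability distribution at each position and a label saying whether that position is a planning or an execution token. How tokens get labeled is not described here, so the `roles` list and both function names are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_role(distributions, roles):
    """Average per-token entropy separately for each role label.

    distributions: list of probability lists, one per token position.
    roles: parallel list of labels such as "planning" or "execution".
           The labeling scheme is an assumption for illustration only.
    """
    totals, counts = {}, {}
    for probs, role in zip(distributions, roles):
        totals[role] = totals.get(role, 0.0) + token_entropy(probs)
        counts[role] = counts.get(role, 0) + 1
    return {role: totals[role] / counts[role] for role in totals}

if __name__ == "__main__":
    # Toy data: flat distributions at planning positions, a peaked one at execution.
    dists = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01], [0.5, 0.5]]
    roles = ["planning", "execution", "planning"]
    print(mean_entropy_by_role(dists, roles))
```

Tracking these two averages over training checkpoints is one way to look for the reported signature: the execution average flattening out while the planning average keeps climbing.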
Why procedural first? Because procedure is the more transferable, more broadly supported kind of knowledge to begin with. Analysis of 5 million pretraining documents shows reasoning leans on broad procedural patterns drawn from many sources, while factual recall depends on narrow, document-specific memorization (see "Does procedural knowledge drive reasoning more than factual retrieval?"). A model arrives at RL already carrying procedural scaffolding, so the cheapest early wins are consolidating and sharpening execution it can already half-do, before it has any stable base to plan over.
There's a deeper reason the phases can't easily collapse into one: RL on verifiable rewards mostly activates and reweights strategies the model already has rather than installing new ones (see "What does reward learning actually do to model reasoning?"). A single example can trigger activation, and for models with appropriate pretraining even spurious rewards work nearly as well as correct ones. That means early training is essentially surfacing latent procedural competence, and only once that has stabilized does the harder work of choosing *which* competence to deploy (the strategic layer) become the live constraint. The same shape recurs when you add supervision: SFT-then-RL runs through a shift-readapt-overfit progression where the model first absorbs expert procedure, then must be steered to keep exploring (see "Why does SFT-then-RL training follow a predictable three-phase pattern?"), and step-wise expert-similarity rewards work best precisely as a *curriculum foundation*: dense procedural signal first, outcome-based strategic refinement after (see "Can step-wise expert rewards help small models learn hard reasoning?").
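To illustrate what a "dense procedural signal" looks like, here is a minimal sketch of a step-wise expert-similarity reward. It assumes similarity is measured as a character-level ratio from Python's standard `difflib`; the metric is not specified here, and `step_rewards` is a hypothetical helper, not the method's actual implementation.

```python
from difflib import SequenceMatcher

def step_rewards(model_steps, expert_steps):
    """Score each expert step against the model's step at the same index.

    Returns one reward in [0, 1] per expert step. A step the model never
    produced is compared against the empty string and scores 0.0.
    The similarity metric (difflib ratio) is an illustrative assumption.
    """
    rewards = []
    for i, expert in enumerate(expert_steps):
        model = model_steps[i] if i < len(model_steps) else ""
        rewards.append(SequenceMatcher(None, model, expert).ratio())
    return rewards

if __name__ == "__main__":
    expert = ["x = 2", "y = 3", "z = 1"]
    model = ["x = 2", "y = 5"]  # right, partly right, missing
    print(step_rewards(model, expert))
```

The point of the design is that a rollout with a wrong final answer still earns graded credit for the steps it got right, which is what separates this signal from an outcome-only reward.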
The corpus also hints at what the strategic phase is actually solving. Once execution is solid, the remaining failures look like planning failures: models abandon reasoning paths mid-exploration and switch ideas too soon, which a simple decoding-time penalty on thought-transition tokens mitigates without any fine-tuning (see "Do reasoning models switch between ideas too frequently?"). And the strategic layer isn't one thing: different models settle into distinct reasoning styles (minimax, trust-based, belief-anticipation), with performance tied to the structure of the problem (see "Do large language models use one reasoning style or many?"). That diversity is exactly what you'd expect to emerge in a second phase: strategy is where models can differ, because procedure is where they had to converge.
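A decoding-time penalty of the kind mentioned above can be sketched as a plain adjustment to the logits before sampling. This is an assumption-laden illustration, not the published procedure: the function name, the `penalty` value, and the idea of passing in a precomputed set of thought-transition token ids are all hypothetical.

```python
def penalize_switch_tokens(logits, switch_token_ids, penalty=3.0):
    """Return a copy of `logits` with thought-transition tokens made less likely.

    logits: list of floats, one per vocabulary id.
    switch_token_ids: ids of tokens that begin a new line of thought
                      (for example a token like "Alternatively"); which ids
                      belong here is an assumption left to the caller.
    penalty: amount subtracted from each listed logit (hypothetical default).
    """
    adjusted = list(logits)  # leave the caller's list untouched
    for token_id in switch_token_ids:
        adjusted[token_id] -= penalty
    return adjusted

if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    print(penalize_switch_tokens(logits, {1}, penalty=1.5))
```

Because the change lives entirely in the sampling step, the model's weights are never touched, which is why this kind of fix needs no retraining. When to apply the penalty, for instance only early in a thought, is left to the caller in this sketch.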
The thing worth carrying away: the two-phase pattern probably isn't something training *imposes* — it's a consequence of what RL can and can't do. If reward mostly activates existing capability, then learning has to bottom out on the broadly-supported procedural skills first and only then move the bottleneck up to the narrower, model-specific business of strategy. The phase boundary is the moment execution stops being the scarce resource.
Sources (7 notes)
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.
CHORD identifies three distinct training phases: initial capability disruption from policy shift, readaptation to expert patterns, then overfitting. Dynamically weighting SFT as an auxiliary objective within on-policy RL resolves this progression and improves stability.
Supervised Reinforcement Learning rewards models by measuring alignment with expert actions at each step, providing dense learning signals even when all rollouts fail. This approach bridges the gap between rigid token-by-token imitation (SFT) and sparse outcome-only rewards (RLVR), and works best as a curriculum foundation before outcome-based refinement.
o1-like models frequently abandon reasoning paths mid-exploration, wasting tokens on incomplete approaches. A decoding-only penalty on thought-transition tokens (TIP strategy) discourages switching, improving accuracy on challenging math without model fine-tuning.
Analysis of 22 LLMs across behavioral game theory reveals three dominant profiles: GPT-o1 uses minimax reasoning, DeepSeek-R1 uses trust-based reasoning, and GPT-o3-mini uses belief-anticipation. Performance correlates with game structure, not raw reasoning depth.