Do depth thresholds correspond to transitions between procedural and strategic learning?

This explores whether the point where reasoning stops getting deeper marks a handoff from learning *how to execute steps* (procedural) to learning *which path to take* (strategic). The corpus has a fairly direct answer.

The most direct evidence says the two-phase split is real, but it is a phase in *training time*, not necessarily a ceiling on reasoning *depth*. The clearest anchor is the finding that RL training reliably moves through two stages: first a procedural phase where getting execution correct drives the gains, then a strategic phase where planning becomes the bottleneck (see 'Does RL training follow a predictable two-phase learning sequence?'). What makes this concrete is the entropy signature: execution entropy stabilizes (the model has consolidated *how*) while planning-token entropy rises (the model is now exploring *which*). So there is a threshold, located in training time, and crossing it does correspond to a procedural-to-strategic transition.
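One way to make the entropy signature above measurable is to average per-token Shannon entropy separately over planning tokens and execution tokens, then compare the two averages across training checkpoints. The sketch below is only an illustration: the function names and the `is_planning` mask are hypothetical, and how tokens get labeled as planning versus execution is an assumption, since the notes do not specify it.

```python
import math
from typing import Sequence, Tuple


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_by_role(
    dists: Sequence[Sequence[float]],
    is_planning: Sequence[bool],
) -> Tuple[float, float]:
    """Return (mean planning entropy, mean execution entropy).

    `is_planning` is a hypothetical per-token mask; the labeling scheme
    is an assumption and is not taken from the source notes.
    """
    planning, execution = [], []
    for probs, flag in zip(dists, is_planning):
        (planning if flag else execution).append(token_entropy(probs))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(planning), mean(execution)


# Toy data: one high-uncertainty planning token, two execution tokens.
dists = [
    [0.25, 0.25, 0.25, 0.25],  # planning: uniform over four options
    [1.0, 0.0, 0.0, 0.0],      # execution: fully confident
    [0.5, 0.5, 0.0, 0.0],      # execution: split between two options
]
plan_h, exec_h = mean_entropy_by_role(dists, [True, False, False])
```

Under the two-phase reading, one would expect `exec_h` to flatten across checkpoints while `plan_h` rises once the strategic phase begins. That expectation comes from the cited note; this toy code does not demonstrate it.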

But the interesting twist is that depth itself is what the strategic phase learns to *stop relying on*. When models only push deeper along a single chain, they hit an 'underthinking' failure. Allocating test-time compute to diverse, reusable abstractions creates structured breadth-first exploration that avoids this failure, and at large budgets it outperforms simply sampling more solutions in parallel (see 'Can abstractions guide exploration better than depth alone?'). Read alongside the two-phase result, this suggests the threshold isn't 'how deep can you go' but 'when does going deeper stop paying, so strategy has to take over.' Strategic learning is partly the model discovering that breadth beats depth past a certain point.

The procedural/strategic distinction also shows up in what the two kinds of learning are made of. Procedural knowledge (transferable how-to patterns drawn from many pretraining sources) is what drives reasoning generalization, as opposed to narrow fact retrieval (see 'Does procedural knowledge drive reasoning more than factual retrieval?'). And much of what RL post-training does is *select and activate* capability the base model already has rather than build new skill (see 'What does reward learning actually do to model reasoning?' and 'Do base models already contain hidden reasoning ability?'). That reframes the threshold: the procedural phase is consolidating skills that already exist latently; the strategic phase is learning to deploy them well.

Two more notes sharpen the strategic side. SkillRL treats successes and failures asymmetrically, keeping successes as concrete procedures to imitate and failures as abstracted strategic lessons, which mirrors the procedural-then-strategic split at the level of memory (see 'Should successful and failed episodes be processed differently?'). And when models stall on a plateau, numerical rewards can't tell them *why* they failed; natural-language critiques can, which is a strategic signal that pure execution feedback lacks (see 'Can natural language feedback overcome numerical reward plateaus?'). Both point the same way: once execution is consolidated, the remaining gains come from strategy, and strategy needs richer feedback than 'right or wrong.'
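The asymmetry described above can be pictured as a memory that stores the two kinds of episode in different shapes. This is a minimal sketch of the idea only, not SkillRL's actual implementation: `Episode`, `SkillMemory`, and `consolidate` are hypothetical names, and the "abstraction" of a failure is reduced here to keeping a one-line reason, where a real system would presumably generate the lesson some other way.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Episode:
    task: str
    steps: List[str]
    success: bool
    failure_reason: str = ""  # hypothetical field; only used for failures


@dataclass
class SkillMemory:
    # Successes are kept verbatim as step-by-step procedures to imitate.
    procedures: Dict[str, List[str]] = field(default_factory=dict)
    # Failures are kept only as short lessons; the concrete steps are dropped.
    lessons: List[str] = field(default_factory=list)

    def consolidate(self, episode: Episode) -> None:
        if episode.success:
            self.procedures[episode.task] = list(episode.steps)
        else:
            self.lessons.append(episode.failure_reason)


memory = SkillMemory()
memory.consolidate(Episode("sort list", ["pick pivot", "partition", "recurse"], True))
memory.consolidate(
    Episode("parse date", ["split on '/'", "cast to int"], False,
            failure_reason="check the input format before parsing")
)
```

Dropping the failed steps is what keeps the stored context small in this toy version: only the lesson survives, which loosely echoes the note's point that the asymmetric approach uses less context than uniform consolidation.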

The thing worth carrying away: the transition isn't a depth limit you bump into — it's a change in what's scarce. Early on, correctness of steps is scarce, so learning is procedural. Once steps are reliable, *choosing the right plan* becomes scarce, and the system has to learn breadth, abstraction, and why-it-failed reasoning instead of just deeper chains.

Sources (7 notes)

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.

Can abstractions guide exploration better than depth alone?

RLAD jointly trains abstraction and solution generators, showing that allocating test-time compute to diverse abstractions outperforms parallel solution sampling at large budgets. Abstractions create structured breadth-first exploration that prevents the underthinking failure mode of depth-only reasoning chains.

Does procedural knowledge drive reasoning more than factual retrieval?

Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
