Does RL training follow a predictable two-phase learning sequence?
This note explores whether reinforcement learning exhibits consistent phases in which basic execution skills must consolidate before strategic reasoning emerges. Understanding this sequence could reveal bottlenecks in scaling reasoning capabilities.
Across eight text-only and vision-language models, RL training reveals a consistent two-phase dynamic. In the first phase, the learning bottleneck is procedural correctness — a single calculation error invalidates an entire solution, creating a powerful gradient signal that compels mastery of low-level execution tokens (arithmetic, variable substitution, formula application). In the second phase, the bottleneck shifts to strategic planning — exploring and mastering high-level planning tokens (deduction like "we can use the fact that," branching like "let's try a different approach," backtracking like "but the problem mentions that").
The phases are not mutually exclusive. Procedural refinement continues throughout training. But the primary driver of marginal performance gains shifts to strategic planning. This is why the "aha moment" phenomenon appears when it does — it represents the discovery and internalization of high-level reasoning strategies, which only becomes the active learning frontier after procedural skills are consolidated.
The entropy dynamics tell the same story. Planning tokens show increasing strategic diversification over training — the model explores new ways to combine established skills. Execution tokens show stable conditional entropy — once arithmetic is mastered, there's little incentive to find diverse ways to perform it. The performance improvement comes from discovering new combinations of established skills, which is the core function of planning.
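The entropy split described above can be measured directly from per-token predictive distributions. Below is a minimal sketch of such a diagnostic; the `PLANNING_MARKERS` set is a stand-in assumption built from the marker phrases mentioned in this note, not an official token taxonomy, and real analyses would classify tokens with a more careful heuristic or a trained classifier.

```python
import math
from collections import defaultdict

# Hypothetical marker words for planning tokens (an assumption drawn from
# the phrases quoted in the note, e.g. "let's try a different approach").
PLANNING_MARKERS = {"let's", "try", "alternatively", "but", "therefore"}

def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_by_role(tokens, dists):
    """Average per-token entropy, split by planning vs. execution role.

    tokens: generated tokens (strings).
    dists:  parallel list of predictive distributions, one per token.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for tok, dist in zip(tokens, dists):
        role = "planning" if tok.lower() in PLANNING_MARKERS else "execution"
        sums[role] += token_entropy(dist)
        counts[role] += 1
    return {role: sums[role] / counts[role] for role in sums}
```

Tracking these two averages over training checkpoints would show the pattern claimed here: planning-token entropy rising (strategic diversification) while execution-token entropy stays flat.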
This insight exposes a core inefficiency in algorithms like GRPO that apply optimization pressure uniformly across all tokens. If the learning frontier is in planning tokens but gradient signal is diluted across execution tokens, optimization is wasteful. HICRA addresses this by concentrating optimization on planning tokens, achieving significant performance gains.
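The contrast between uniform and concentrated credit assignment can be sketched as a reweighting of per-token advantages. This is an illustration of the idea, not HICRA's published formula: the `alpha` amplification rule and the boolean planning mask are assumptions for the sketch.

```python
# GRPO assigns the same group-normalized advantage to every token in a
# response; a planning-focused scheme instead concentrates extra
# optimization pressure on planning tokens.
def reweight_advantages(advantages, is_planning, alpha=1.0):
    """Amplify per-token advantages on planning tokens.

    advantages:  per-token advantage values (floats).
    is_planning: parallel booleans marking planning tokens.
    alpha:       extra weight on planning tokens (illustrative parameter).
    """
    return [a * (1.0 + alpha) if planning else a
            for a, planning in zip(advantages, is_planning)]
```

Under this kind of scheme, gradient signal that GRPO would spread evenly across already-mastered execution tokens is instead focused on the tokens at the active learning frontier.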
The connection to existing insights is illuminating. HICRA's planning tokens are likely the same phenomenon that "Which sentences actually steer a reasoning trace?" identifies from a mechanistic perspective. The two-phase dynamic also explains the topology evolution in "Do reasoning cycles in hidden states reveal aha moments?" — the graph structure reflects the transition from procedural execution (local structure) to strategic planning (global topology).
Source: Reinforcement Learning
Related concepts in this collection
-
Which sentences actually steer a reasoning trace?
Can we identify which sentences in a reasoning trace have outsized influence on the final answer? Three independent methods converge on a surprising answer about planning and backtracking.
converges: planning tokens in HICRA likely correspond to thought anchors
-
Do reasoning cycles in hidden states reveal aha moments?
What if the internal loops in model reasoning—visible in hidden-state topology—correspond to the reconsidering moments that happen during reasoning? This note explores whether graph cyclicity captures a mechanistic signature of insight.
extends: the two-phase dynamic explains how graph topology evolves during training
-
Does policy entropy collapse limit reasoning performance in RL?
As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
reframes: entropy collapse may be acceptable for execution tokens but catastrophic for planning tokens
-
Does RL teach reasoning or just when to use it?
Does reinforcement learning in thinking models actually create new reasoning abilities, or does it simply teach existing capabilities when to activate? This matters for understanding where reasoning truly emerges.
deepens: the "when" is specifically about planning tokens; execution tokens are "how"
-
What happens inside models when they suddenly generalize?
Grokking appears as an abrupt shift from memorization to generalization. But is the underlying process truly discontinuous, or does mechanistic analysis reveal continuous phases we can measure and predict?
analogous phased development: grokking's memorization-then-circuit-formation parallels the procedural-then-strategic progression; both show that generalization requires passing through a consolidation phase before higher-order structure emerges
Original note title
rl training exhibits a two-phase dynamic where procedural consolidation precedes strategic planning exploration