How does behavior cloning reduce complexity before RL training in rerankers?
This reads 'behavior cloning' as the imitation/supervised warm-start phase that runs before reinforcement learning, and asks how copying demonstrated behavior shrinks the search space RL then has to optimize over. The corpus doesn't speak to rerankers specifically, but it covers the imitation-before-RL pattern in depth.
The underlying mechanic shows up most clearly in the work on curriculum and warm-start training. The cleanest statement of it is the finding that running supervised imitation first, then RL against verifiable rewards, beats either method alone: the imitation phase exists precisely to create 'reasonable rollouts the RL phase can then sharpen' (see "Does sequencing imitation then exploration training improve reasoning?"). That's the complexity reduction in one sentence: outcome rewards are nearly useless when a fresh policy almost never produces a good trajectory to reward, so cloning demonstrated behavior lifts the policy into a region where the reward signal actually carries information.
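To make "carries information" concrete, here is a minimal sketch under stated assumptions: rewards are binary, rollouts are independent, and the trainer compares rollouts within a group, so a group where every rollout gets the same reward contributes no learning signal. The function name and the success rates (0.001 for a fresh policy, 0.3 after cloning) are hypothetical illustrations, not figures from the notes.

```python
def informative_group_fraction(p_success: float, group_size: int) -> float:
    """Probability that a group of rollouts has mixed outcomes.

    Assumes binary rewards and independent rollouts that each succeed
    with probability p_success. A group that is all-fail or all-pass
    gives a within-group comparison nothing to distinguish.
    """
    all_fail = (1.0 - p_success) ** group_size
    all_pass = p_success ** group_size
    return 1.0 - all_fail - all_pass


# Hypothetical success rates, chosen only for illustration.
for label, p in [("fresh policy", 0.001), ("after cloning", 0.3)]:
    frac = informative_group_fraction(p, group_size=8)
    print(f"{label}: p={p}, informative groups={frac:.3f}")
```

Under these assumed numbers, fewer than 1 in 100 groups gives the fresh policy anything to learn from, while the warm-started policy gets a usable comparison in more than 9 of 10 groups.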
You can see why this matters by looking at what happens when RL has to do the exploring on its own with sparse signal. Training on problems that are too hard for the current policy doesn't teach reasoning. It teaches degenerate shortcuts, because group-relative normalization treats the rare accidental success as a high-advantage trajectory and reinforces answer-repetition and computation-skipping (see "Do overly hard RLVR samples actually harm model capabilities?"). Behavior cloning heads this off by raising the baseline competence so that 'success' is common enough to be meaningful rather than accidental. In a reranker, the analogous move is cloning a teacher's ordering decisions so the policy starts from sensible rankings, and RL only has to refine the margins.
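The effect of group-relative normalization on a rare success can be worked through numerically. This is a sketch assuming a common form of the normalization (reward minus group mean, divided by group standard deviation, as in GRPO-style methods); the notes do not specify the exact formula, and the group size of 16 and the reward patterns are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps).

    Assumed form of group-relative normalization; eps avoids division
    by zero when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One accidental success among 16 rollouts on a too-hard problem.
hard = [1.0] + [0.0] * 15
# Half the rollouts succeed on a problem matched to the policy.
moderate = [1.0] * 8 + [0.0] * 8

adv_hard = group_relative_advantages(hard)
adv_moderate = group_relative_advantages(moderate)

print(f"lone success advantage:   {max(adv_hard):.3f}")
print(f"common success advantage: {max(adv_moderate):.3f}")
```

With these inputs the lone success gets an advantage near the square root of 15 (about 3.87), while each success in the balanced group gets about 1.0. The rarer the success, the harder the update pushes toward whatever produced it, whether or not it reflects sound reasoning.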
The other half of the answer is about what cloning preserves that pure RL destroys. RL reliably compresses behavioral diversity: search agents converge on narrow reward-maximizing strategies through the same entropy collapse seen in reasoning, and it's specifically supervised fine-tuning on diverse demonstrations that keeps exploration breadth alive (see "Does reinforcement learning squeeze exploration diversity in search agents?"). Relatedly, within the first epoch RL tends to collapse onto a single dominant format inherited from pretraining (see "Does RL training collapse format diversity in pretrained models?"). So cloning isn't just a head start; it's a way of installing the variety of good behaviors before RL's narrowing pressure kicks in.
There's also a structural reason the warm-start is cheap to exploit. RL doesn't rewrite the whole network. It updates only 5–30% of parameters, in sparse but nearly full-rank subnetworks that are consistent across seeds (see "Does reinforcement learning update only a small fraction of parameters?"). That fits the division of labor: behavior cloning sets the bulk of the competent behavior, and RL makes a small, targeted adjustment on top. The two-phase view reinforces this. RL itself proceeds from procedural mastery first to strategic refinement second (see "Does RL training follow a predictable two-phase learning sequence?"), and cloning essentially pre-pays the procedural-mastery phase so RL can spend its budget on the strategic part.
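One way to read the 5–30% figure is as the fraction of parameters whose values differ between the pre-RL and post-RL checkpoints. The sketch below measures that on flat lists of floats. The function name, the tolerance parameter, and the toy values are hypothetical, and the notes do not say exactly how the original measurement was made.

```python
def update_fraction(before, after, tol=0.0):
    """Fraction of parameters that changed by more than tol.

    `before` and `after` are flat sequences of floats standing in for
    the flattened weights of two checkpoints (an assumption for this
    sketch; real checkpoints would be flattened tensor by tensor).
    """
    if len(before) != len(after):
        raise ValueError("checkpoints must have the same number of parameters")
    changed = sum(1 for b, a in zip(before, after) if abs(a - b) > tol)
    return changed / len(before)


# Toy checkpoints: 10 parameters, 2 of them touched by the RL phase.
before = [0.1 * i for i in range(10)]
after = list(before)
after[2] += 0.05
after[7] -= 0.02

print(f"updated fraction: {update_fraction(before, after):.0%}")
```

On a warm-started policy, a low value here would be consistent with the division of labor described above: most weights stay where cloning left them, and RL adjusts a small subset.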
The thing you might not have expected: the value of behavior cloning isn't mainly that it saves compute, it's that it makes the reward signal *legible*. A reward you can't earn teaches nothing, or worse, teaches a shortcut. If you want the sharper contrast between imitation-foundation and reward-refinement, the curriculum note ("Does sequencing imitation then exploration training improve reasoning?") is the doorway; for the failure mode it prevents, start with the overly-hard-samples note ("Do overly hard RLVR samples actually harm model capabilities?").
Sources (6 notes)
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.
Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.
RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Across seven RL algorithms and ten LLM families, RL induces intrinsic parameter sparsity of 5–30% without explicit regularization. Critically, these sparse updates are nearly full-rank and nearly identical across random seeds, indicating structural rather than arbitrary parameter selection.
Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.