Why does mixing reasoning traces from different teachers destabilize learning?
This explores why blending chain-of-thought traces from multiple teacher models hurts a student, even when each teacher is individually good — and the corpus points to a style-and-distribution clash rather than a content problem.
This reads the question as being about *teacher heterogeneity* — what goes wrong when a student learns from a mix of reasoning traces produced by different source models. The corpus suggests the destabilizer isn't bad reasoning content; it's that traces teach by imitation of *form*, and different teachers hand the student conflicting forms to imitate at once.
Start with what a trace actually transmits. Models trained even on systematically corrupted or irrelevant traces keep their solution accuracy (see "Do reasoning traces need to be semantically correct?"), which suggests a trace works as computational scaffolding, not as a chain of meaningful logical steps. Chain-of-thought is closer to constrained imitation than inference: format effects dominate content, and structurally invalid prompts still succeed (see "What makes chain-of-thought reasoning actually work?"). So when you mix teachers, you're not blending two correct arguments; you're blending two *styles of pattern* that the student is asked to reproduce simultaneously, with no single form to converge on.
Those styles carry hidden statistical signatures that don't compose. Teachers conditioned on correct answers and verifier output produce confident, concise traces that suppress uncertainty, trading out-of-distribution robustness for in-domain polish (see "Does richer teacher context hurt student generalization?"). A cautious teacher and a confident teacher therefore push the student's calibration in opposite directions. Trace length compounds this: length tracks how close a problem is to the teacher's *training distribution*, and it correlates with difficulty only in-distribution (see "Does longer reasoning actually mean harder problems?"). Mix teachers with different training distributions and the length signal the student absorbs becomes noise: one teacher writes a long trace where another writes a short one for the same problem.
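The length argument above can be made concrete with a toy calculation. This is a minimal sketch with invented numbers, not data from the corpus: `teacher_a_len` is assumed to track difficulty perfectly (the in-distribution case), while `teacher_b_len` is assumed to reflect that teacher's own familiarity with each problem and so is nearly unrelated to difficulty. All names and values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical problem difficulties, shared by both teachers.
difficulty = [1, 2, 3, 4, 5]

# Assumption: teacher A is in-distribution, so length scales with difficulty.
teacher_a_len = [10, 20, 30, 40, 50]
# Assumption: teacher B's lengths follow its own training schemas instead.
teacher_b_len = [40, 10, 50, 20, 30]

r_a = pearson(difficulty, teacher_a_len)
r_b = pearson(difficulty, teacher_b_len)
# Pooling the two teachers doubles the problems and concatenates the lengths.
r_pooled = pearson(difficulty + difficulty, teacher_a_len + teacher_b_len)

print(f"teacher A: {r_a:.2f}  teacher B: {r_b:.2f}  pooled: {r_pooled:.2f}")
```

In this toy data the pooled correlation (0.45) falls well below teacher A's (1.0), which is the sense in which a student trained on the mix sees a weaker length-to-difficulty signal than either teacher's own distribution would suggest.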
There's also a mechanical reason mixing is fragile. Many reasoning errors are *local*: memorization driven by the immediately preceding tokens accounts for up to 67% of reasoning errors, and the share grows with complexity and distributional shift (see "Where do memorization errors arise in chain-of-thought reasoning?"). Heterogeneous teachers plausibly create that kind of shift inside a single training set: the local token patterns the student relies on stop being consistent, so the scaffolding breaks at the seams between styles.
The corpus also says the fix isn't "average everything together." Students must *selectively* incorporate teacher refinements based on compatibility with their own learning frontier, because objectively higher-quality data degrades performance when it exceeds what the student can absorb (see "Does teacher-refined data always improve student model performance?"). And quality beats quantity at the step level: local, step-level confidence catches breakdowns that global averaging masks (see "Does step-level confidence outperform global averaging for trace filtering?"). The unexpected takeaway is that more, better, and more-diverse teachers can be *worse*. What stabilizes learning is coherence with the student's own distribution, which is why staging training (imitate first, then refine) tends to outperform pooling everything at once (see "Does sequencing imitation then exploration training improve reasoning?").
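The contrast between step-level and global confidence can be sketched in a few lines. This is an illustration under stated assumptions, not the method from the cited note: each trace is assumed to arrive as a list of per-step confidence scores in [0, 1], and the function names, the window size, and `THRESHOLD` are all hypothetical.

```python
def global_confidence(step_conf):
    """Average confidence over the whole trace."""
    return sum(step_conf) / len(step_conf)


def lowest_window_confidence(step_conf, window=2):
    """Lowest mean confidence over any run of `window` consecutive steps."""
    if len(step_conf) <= window:
        return global_confidence(step_conf)
    window_means = [
        sum(step_conf[i:i + window]) / window
        for i in range(len(step_conf) - window + 1)
    ]
    return min(window_means)


# Hypothetical traces: one with a single collapsed step, one uniformly moderate.
traces = {
    "one_bad_step": [0.9, 0.9, 0.9, 0.2, 0.9, 0.9],
    "uniformly_ok": [0.75, 0.75, 0.75, 0.75, 0.75, 0.75],
}

THRESHOLD = 0.7  # assumed cutoff, chosen only for this illustration

kept_by_global = [
    name for name, conf in traces.items()
    if global_confidence(conf) >= THRESHOLD
]
kept_by_local = [
    name for name, conf in traces.items()
    if lowest_window_confidence(conf) >= THRESHOLD
]

print("global filter keeps:", kept_by_global)
print("local filter keeps: ", kept_by_local)
```

Here the global average keeps both traces, because five confident steps hide the one collapsed step, while the windowed minimum rejects `one_bad_step`. A real filter would need confidence scores from the model itself and a threshold tuned on held-out data.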
Sources (8 notes)
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.
Controlled A* maze experiments show trace length correlates with difficulty only in-distribution but decouples entirely out-of-distribution. Trace length primarily reflects recall of training schemas, not adaptive computation.
STIM framework identifies local, mid-range, and long-range memorization sources in CoT reasoning. Local memorization—based on preceding tokens—accounts for up to 67% of reasoning errors, especially as complexity increases and distributional shift occurs.
Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.
Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves comparable accuracy gains to naive majority voting with far fewer generated traces, proving trace quality matters more than quantity.
Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.