Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does teacher-refined data always improve student model performance?

Explores whether higher-quality training data from teacher models uniformly benefits student models, or whether compatibility with the student's current learning state matters for effective instruction.

Note · 2026-02-22 · sourced from Reasoning by Reflection
How do you build domain expertise into general AI models? How should researchers navigate LLM reasoning research?

Standard instruction-tuning improvement pipelines assume: teacher refines training data → student trains on refined data → student improves. Selective Reflection-Tuning challenges this with a compatibility argument: data quality is relative to the student, not absolute. A response "improved" by a GPT-4 teacher may introduce knowledge complexity or reasoning patterns that conflict with the student's current knowledge state, producing a degraded training signal despite being objectively higher quality.

The fix: after teacher refinement, have the student model evaluate each refined sample and decide whether to incorporate it. The student uses its own statistical profile as the selection criterion — what it finds tractable and useful given its current weights. Teacher-refined data the student can't process effectively is filtered out; compatible refinements are retained.
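
A minimal sketch of what that selection step could look like, assuming an IFD-style score (the ratio of the student's perplexity on the response given the instruction to its perplexity on the response alone) as the student-side statistic; the model checkpoint, field names, and decision rule here are illustrative, not the paper's exact recipe:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative student model; any causal LM checkpoint works here.
MODEL = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
student = AutoModelForCausalLM.from_pretrained(MODEL)
student.eval()

@torch.no_grad()
def ppl(text: str, context: str = "") -> float:
    """Student's perplexity on `text`, optionally conditioned on `context`."""
    ids = tok(context + text, return_tensors="pt").input_ids
    labels = ids.clone()
    if context:
        n_ctx = tok(context, return_tensors="pt").input_ids.shape[1]
        labels[:, :n_ctx] = -100  # mask context so only response tokens are scored
    loss = student(ids, labels=labels).loss  # mean NLL over scored tokens
    return math.exp(loss.item())

def ifd(instruction: str, response: str) -> float:
    """IFD-style compatibility score: below 1 means the instruction makes the
    response *easier* for the student to predict; lower is more tractable."""
    return ppl(response, context=instruction) / ppl(response)

def select(samples: list[dict]) -> list[dict]:
    """Keep the teacher's refinement only when the student scores it at least
    as tractable as the original (an illustrative decision rule)."""
    out = []
    for s in samples:
        refined = ifd(s["instruction"], s["refined_response"])
        original = ifd(s["instruction"], s["original_response"])
        chosen = "refined_response" if refined <= original else "original_response"
        out.append({"instruction": s["instruction"], "response": s[chosen]})
    return out
```

Note that `select` is parameterized by the student's own weights through `ppl`: the same teacher-refined dataset survives the filter differently under different students, which is exactly the inconsistency the compatibility argument predicts across model sizes and training stages.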

The underlying argument is metacognitive: the appropriate training signal for a model at capability level T is not the best possible response in absolute terms but the best response compatible with the model's current learning frontier. Overshoot in data quality creates a mismatch analogous to teaching advanced calculus before arithmetic is solid — the instruction is correct but the student can't absorb it.
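
One way to make this precise, with notation that is mine rather than the paper's:

```latex
% Q(y): absolute quality of candidate response y (what the teacher optimizes)
% C(y, \theta_T): student-side compatibility of y with current weights \theta_T,
%                 e.g. an IFD-style tractability score; \tau is a threshold.
y^{\ast} = \arg\max_{y \in \mathcal{Y}} Q(y)
\quad \text{subject to} \quad C(y, \theta_T) \le \tau
```

The naive pipeline solves the unconstrained problem; student-side selection reinstates the constraint.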

This adds a dimension to the SFT quality literature. Correctness of training targets is necessary but not sufficient — compatibility with the specific student's current distribution is equally required. A data-quality pipeline that doesn't account for student compatibility will produce inconsistent results across different model sizes, initializations, and training stages.

Connects to "Does supervised fine-tuning actually improve reasoning quality?": both identify SFT quality failures; this paper adds that even "better" data in absolute terms can degrade performance if the student-compatibility dimension is ignored.

Teacher benchmark scores don't predict teaching effectiveness (OpenThoughts): in SFT data curation for reasoning models, QwQ-32B outperforms DeepSeek-R1 as a teacher despite scoring lower on the target reasoning benchmarks. This extends the student-compatibility argument: even the teacher dimension is not just about absolute quality. A weaker-performing model may produce responses whose reasoning patterns are more compatible with the student's learning frontier. OpenThoughts also finds that quality beats diversity in source selection (the top 1-2 question sources outperform the top 8-16), that difficulty-based and response-length filtering outperform embedding-based or fastText filters, and that sampling 16 answers per question is an effective scaling strategy: the 16x dataset expansion from multi-answer sampling drives significant gains.
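
A sketch of the last two curation moves, multi-answer sampling and response-length filtering; `teacher_generate` and the keep fraction are hypothetical stand-ins, not OpenThoughts' actual interfaces or thresholds:

```python
from typing import Callable

def sample_answers(teacher_generate: Callable[[str], str],
                   question: str, k: int = 16) -> list[str]:
    """Multi-answer scaling: sample k answers per question, expanding the
    dataset k-fold from the same question pool."""
    return [teacher_generate(question) for _ in range(k)]

def length_filter(pairs: list[dict], keep_frac: float = 0.5) -> list[dict]:
    """Response-length filter: keep the longest responses, on the heuristic
    that longer reasoning traces carry more useful signal. The keep fraction
    is an illustrative knob."""
    ranked = sorted(pairs, key=lambda p: len(p["response"]), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_frac))]

def curate(questions: list[str],
           teacher_generate: Callable[[str], str]) -> list[dict]:
    """Expand 16x via multi-answer sampling, then filter by response length."""
    pool = [{"question": q, "response": a}
            for q in questions
            for a in sample_answers(teacher_generate, q)]
    return length_filter(pool)
```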


Source: Reasoning by Reflection

Original note title: teacher-refined instruction data requires student-model selection because refinement compatibility depends on the student's current distribution