How should training data be constructed to preserve teacher-student information gaps?

This explores how to build training data so the teacher's privileged knowledge becomes a usable learning signal — rather than collapsing the very gap that makes teaching work — and what the corpus says about constructing that asymmetry deliberately.

The corpus has a surprisingly coherent position on this, and it starts from a counterintuitive claim: the information gap is the point. One note argues that meaningful corrective feedback exists only *because* the teacher knows something the student doesn't (access to the correct answer or a verifier's output); without that asymmetry, teacher and student share identical uncertainty and there is nothing to correct (see "Why does teacher-student information asymmetry enable learning signals?"). So 'preserving the gap' isn't a side concern; it's the mechanism you're trying to harness.
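One way to picture the asymmetry at the data level is a record whose privileged fields are rendered only into the teacher's prompt. This is a minimal sketch, not a method taken from the notes: the `Problem` class, its field names, and the prompt layout are all hypothetical, chosen only to show the two views of the same example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Problem:
    """Hypothetical record; field names are illustrative assumptions."""
    question: str
    answer: str                            # privileged: teacher only
    verifier_output: Optional[str] = None  # privileged: teacher only


def teacher_prompt(p: Problem, student_attempt: str) -> str:
    """Teacher view: question, student attempt, and the privileged fields."""
    lines = [
        f"Question: {p.question}",
        f"Student attempt: {student_attempt}",
        f"Reference answer: {p.answer}",
    ]
    if p.verifier_output is not None:
        lines.append(f"Verifier output: {p.verifier_output}")
    return "\n".join(lines)


def student_prompt(p: Problem) -> str:
    """Student view: the question only; privileged fields are never rendered."""
    return f"Question: {p.question}"
```

Keeping the two renderers separate makes the gap explicit in code: anything the student should not see simply has no path into `student_prompt`.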

But here's the twist the corpus keeps returning to: you can overdo it. When teachers are conditioned on the correct answer and verifier output, they produce confident, concise traces, and students inherit that confidence wholesale. That feels like a win in-domain, but it quietly suppresses the student's ability to express uncertainty, which wrecks generalization on out-of-distribution problems that actually require epistemic caution (see "Does richer teacher context hurt student generalization?"). In other words, the teacher's privileged knowledge leaks into the *style* of the data, and the student copies the style without owning the underlying knowledge. The gap was supposed to generate a learning signal; instead it generated overconfidence.

The resolution running through several notes is that the student, not the teacher, should be the filter. Teacher-refined data, even when objectively higher quality, degrades performance when it exceeds what the student can actually absorb; students do better keeping only the refinements compatible with their own statistical profile (see "Does teacher-refined data always improve student model performance?"). So data construction isn't 'dump the teacher's best output and distill'; it's 'expose the gap, then let the student selectively close the parts it can reach.' Walmart's case makes the upside concrete: BERT cross-encoders actually *beat* their LLM teachers once trained on large enough augmented sets of teacher-labeled queries, because exposure to a much broader input distribution, smoothed by the teacher's predictions, let the students generalize better than the teacher itself did (see "Can smaller models outperform their LLM teachers with enough data?"). The teacher's role was to label breadth, not to be imitated.
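Student-side filtering can be sketched as a per-example choice between the original text and the teacher's refinement. The notes do not say how the student's "statistical profile" is measured, so the criterion here is an assumption: a hypothetical `student_logprob` scorer (for example, mean token log-probability under the student) and an arbitrary `margin`.

```python
from typing import Callable, List, Tuple


def select_compatible(
    pairs: List[Tuple[str, str]],
    student_logprob: Callable[[str], float],
    margin: float = 0.5,
) -> List[str]:
    """For each (original, refined) pair, keep the refined text only if the
    student does not find it much less likely than the original.

    Assumption: `student_logprob` returns a length-normalised log-probability
    under the student model; `margin` is an illustrative tolerance.
    """
    kept = []
    for original, refined in pairs:
        if student_logprob(refined) >= student_logprob(original) - margin:
            kept.append(refined)   # refinement is within the student's reach
        else:
            kept.append(original)  # refinement overshoots; fall back
    return kept
```

A toy usage with a lookup table standing in for the student scorer:

```python
scores = {"a": -1.0, "a+": -1.2, "b": -1.0, "b+": -3.0}
print(select_compatible([("a", "a+"), ("b", "b+")], scores.__getitem__))
```

The point of the design is that the teacher proposes and the student disposes: the decision rule reads only student-side quantities.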

There's a deeper reason this matters, and it's about where knowledge lives. Prompting and prompt optimization can only reorganize what a model already contains; they can't inject knowledge the training data never supplied (see "Can prompt optimization teach models knowledge they lack?"), and models routinely ignore in-context information when their parametric priors are strong (see "Why do language models ignore information in their context?"). That's exactly why the asymmetry has to be built into the *training data* rather than handled at inference: the gap you don't bake into the data is a gap you can't prompt your way across later. There are also gentler ways to move the distribution. Proxy-tuning at decoding time shifts behavior while leaving base weights (and stored knowledge) intact, whereas direct fine-tuning corrupts lower-layer knowledge (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?"). That is a useful lever when you want the teacher's signal to touch reasoning and style without overwriting what the student already knows.
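The decoding-time shift can be illustrated with plain logit arithmetic. The notes say only that proxy-tuning applies a distributional shift while leaving base weights untouched; the specific form below (adding the difference between a small tuned model's logits and its untuned counterpart's logits to the base model's logits) is an assumption about how such methods are commonly described, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_probs(
    base: List[float],
    tuned_small: List[float],
    untuned_small: List[float],
) -> List[float]:
    """Shift the base model's next-token logits by (tuned - untuned).

    No weights are modified; the offset is applied per decoding step.
    """
    shifted = [b + (t - u) for b, t, u in zip(base, tuned_small, untuned_small)]
    return softmax(shifted)
```

When the tuned and untuned small models agree, the offset is zero and the base distribution comes back unchanged, which is the sense in which the base model's stored knowledge is left alone.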

The thing you might not have expected to learn: 'preserving the information gap' and 'making the student match the teacher' are opposite goals. The corpus suggests the best teacher-student data keeps the asymmetry as a *source of correction* while refusing to transfer the teacher's certainty — wide labeled coverage, student-side filtering, and distributional nudges rather than wholesale imitation.

Sources (7 notes)

Why does teacher-student information asymmetry enable learning signals?

Social meta-learning requires information asymmetry—the teacher's access to correct answers or verifier output—to generate meaningful corrective signals. Without this asymmetry, teacher and student share identical uncertainty, making pedagogical correction impossible.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can smaller models outperform their LLM teachers with enough data?

Walmart's student cross-encoders outperformed their LLM teachers when trained on sufficiently large augmented datasets of teacher-labeled queries. The student's broader input distribution exposure, smoothed by teacher predictions, enabled better generalization than the teacher achieved.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.
