How does upward distillation transfer knowledge from smaller to larger networks?

This asks about "upward distillation" — transferring knowledge from a small model up into a larger one — but I should be straight with you first: the corpus has very little on that exact direction, and what it does have actually inverts the premise in interesting ways.

Upward flow is the reverse of how distillation normally works. The standard picture in this collection runs the other way: a big, well-informed teacher compresses what it knows into a smaller student. The note "Does richer teacher context hurt student generalization?" shows that even within that conventional setup the transfer is lossy in subtle ways. A teacher that sees the correct answer and verifier output hands the student confident, concise reasoning traces, but that confidence suppresses the student's ability to express uncertainty and quietly degrades its performance on out-of-distribution problems. So even "downward" distillation isn't a clean copy; it transmits style and disposition, not just facts.
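For reference, conventional teacher-to-student transfer is often implemented as a temperature-softened KL divergence between the two models' output distributions. The notes do not say which loss the cited work uses, so the sketch below is the textbook soft-label form, not the source's method. The names `softmax`, `distillation_loss` and `temperature`, and all the logits, are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Illustrative logits only (assumed, not taken from the source).
confident_teacher = [8.0, 1.0, 0.5]
hedged_student = [2.0, 1.5, 1.0]

print(distillation_loss(confident_teacher, hedged_student))     # positive
print(distillation_loss(confident_teacher, confident_teacher))  # zero
```

Minimizing a loss of this shape pulls the student toward the teacher's whole distribution, including how peaked it is, which is one way to read the claim that distillation passes on disposition as well as answers.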

The more provocative thread here is that the collection keeps questioning whether bigger is the thing you'd even want to distill *toward*. A single 7M-parameter two-layer network, recursing on its own latent reasoning state, beats DeepSeek R1, o3-mini and Gemini 2.5 Pro on ARC puzzles with roughly 0.01% of their parameters (see "Can tiny recursive networks outperform massive language models?"). If a tiny model can out-reason giant ones, the interesting transfer question isn't "how does small teach large" but "what does small *have* that large lacks". The answer there is a structural trick (recursion on latent state), not distilled knowledge.
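To make "recursion on latent state" concrete, here is a toy sketch of the general idea: one small update function is applied repeatedly to its own latent vector, so effective depth grows with the number of steps while the parameter count stays fixed. This is not the cited network's architecture or training procedure. `tiny_step`, `recurse` and the weights `W_Z` and `W_X` are hypothetical.

```python
import math

def tiny_step(z, x, w_z, w_x):
    """One shared update: z_new = tanh(W_z @ z + W_x @ x)."""
    return [
        math.tanh(
            sum(w * zj for w, zj in zip(row_z, z))
            + sum(w * xj for w, xj in zip(row_x, x))
        )
        for row_z, row_x in zip(w_z, w_x)
    ]

def recurse(x, w_z, w_x, steps):
    """Apply the same small update `steps` times, starting from a zero latent."""
    z = [0.0] * len(w_z)
    for _ in range(steps):
        z = tiny_step(z, x, w_z, w_x)
    return z

# Hypothetical fixed weights: 8 parameters in total, regardless of `steps`.
W_Z = [[0.5, -0.3], [0.2, 0.4]]
W_X = [[1.0, 0.0], [0.0, 1.0]]
x = [0.3, -0.7]

for steps in (1, 4, 16):
    print(steps, recurse(x, W_Z, W_X, steps))
```

The extra computation comes from the loop, not from additional weights, which is the sense in which the gain is structural and not a matter of stored knowledge.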

The closest thing the corpus offers to small-feeding-large is aggregation rather than distillation. Routing queries across a panel of small specialists outperforms a single frontier model: ten 7B models with a router beat GPT-4.1 and 4.5, and Avengers-Pro beats GPT-5-medium by sending each query to its best-suited model (see "Can routing beat building one better model?"). Here the capability of many small networks gets composed into something larger-acting, but through selection at inference time, not by pouring their weights into a bigger net. Selection, the work suggests, is a stronger lever than scale.
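A minimal sketch of selection at inference time, assuming queries are already embedded as vectors and that per-cluster validation scores exist for each model. It illustrates cluster-based routing in general, not Avengers-Pro's actual implementation (which, per the note, also trades accuracy against cost). The model names, centroids and scores are invented.

```python
def nearest_cluster(query_vec, centroids):
    """Return the name of the centroid closest to the query (squared Euclidean)."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda name: dist2(query_vec, centroids[name]))

def route(query_vec, centroids, cluster_scores):
    """Pick the model with the best recorded score on the query's cluster."""
    cluster = nearest_cluster(query_vec, centroids)
    scores = cluster_scores[cluster]
    return max(scores, key=scores.get)

# Hypothetical 2-D query embeddings and per-cluster validation accuracies.
centroids = {"math": [1.0, 0.0], "code": [0.0, 1.0]}
cluster_scores = {
    "math": {"small-math-7b": 0.81, "small-code-7b": 0.55},
    "code": {"small-math-7b": 0.48, "small-code-7b": 0.77},
}

print(route([0.9, 0.2], centroids, cluster_scores))  # small-math-7b
print(route([0.1, 0.8], centroids, cluster_scores))  # small-code-7b
```

No weights move between models here: the panel acts larger only because each query reaches the member that handles its cluster best.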

There's also a representational angle on what actually transfers well between systems. Discrete codes move across domains better than raw text embeddings because the discrete bottleneck strips out source-specific bias (see "Can discrete codes transfer better than text embeddings?"), and predicting latent states is exponentially more sample-efficient than predicting tokens because same-level latents are far more correlated than surface tokens (see "Why is predicting latents more sample-efficient than tokens?"). If anyone *were* to build genuine upward distillation, these point at the lever: transfer at the level of compact latent or coded structure, not surface outputs.
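The discrete bottleneck can be illustrated with product quantization, the technique the VQ-Rec source note names: a vector is split into sub-vectors and each is replaced by the index of its nearest codeword. The sketch below is a generic version under assumed inputs, not VQ-Rec's pipeline. The codebooks and embeddings are made up, and `product_quantize` is a hypothetical helper.

```python
def product_quantize(vec, codebooks):
    """Map a vector to one code index per sub-vector (nearest codeword)."""
    size = len(vec) // len(codebooks)
    codes = []
    for i, book in enumerate(codebooks):
        sub = vec[i * size:(i + 1) * size]
        codes.append(min(
            range(len(book)),
            key=lambda k: sum((s - c) ** 2 for s, c in zip(sub, book[k])),
        ))
    return codes

# Two hypothetical sub-codebooks, each with two 2-D codewords.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]

# Two slightly different embeddings of the "same" item (assumed values).
emb_a = [0.9, 0.1, 0.2, 0.8]
emb_b = [1.1, -0.05, 0.1, 1.2]

print(product_quantize(emb_a, codebooks))  # [1, 0]
print(product_quantize(emb_b, codebooks))  # [1, 0]
```

Both embeddings collapse to the same code sequence, so small source-specific variation is discarded before anything downstream sees it.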

So the honest synthesis is that this collection doesn't document upward distillation as a working technique. It documents reasons the premise is shakier than it sounds: a small model already out-reasons large ones on ARC puzzles; composing small models through routing beats a single larger one; and knowledge in transformers is flowing activation rather than a portable store (see "Do transformer models store knowledge or generate it continuously?"), which is part of why moving it between networks is hard at all. The thing worth walking away knowing: the field's energy is going into *selecting and composing* small capable models rather than distilling them upward into bigger ones.

Sources 6 notes

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Can tiny recursive networks outperform massive language models?

A single 7M-parameter two-layer network recursing on its latent reasoning state achieves 45% on ARC-AGI-1 and 8% on ARC-AGI-2, beating DeepSeek R1, o3-mini, and Gemini 2.5 Pro with 0.01% of their parameters. Recursion on latent state, not scale or hierarchy, drives the generalization gain.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Can discrete codes transfer better than text embeddings?

VQ-Rec demonstrates that mapping item text to discrete codes via product quantization, then to embeddings, improves cross-domain transfer compared to direct text encoding. The discrete intermediate reduces text bias and enables efficient per-domain fine-tuning.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Do transformer models store knowledge or generate it continuously?

Transformers organize knowledge as flowing activations rather than retrievable archives, mirroring oral cultures where knowledge exists only in performance. This explains why model knowledge is contextual, difficult to edit, and inseparable from generation.
