INQUIRING LINE

Why does information asymmetry between teacher and student enable effective feedback learning?

This explores why a teacher knowing something the student doesn't — privileged access to answers or verifier output — is the precondition that makes corrective feedback useful, and what that gap costs as well as buys.


The core idea is almost tautological once stated: if teacher and student share identical uncertainty, there is nothing to correct toward. The gradient that pulls a student toward better answers exists only because the teacher holds information the student lacks [Why does teacher-student information asymmetry enable learning signals?]. This is why reframing static tasks as pedagogical dialogues works: the student is not imitating conversation patterns, it is learning to actively extract the signal the teacher holds [Can LLMs learn to ask for feedback during problem solving?].

But the corpus complicates the simple story in an interesting direction: the *form* of the asymmetric signal matters as much as its existence. Scalar rewards are the thinnest possible channel. They tell a student how well it did but not how to change, and those are orthogonal kinds of information [Can scalar rewards capture all the information in agent feedback?]. This helps explain why models plateau under numerical rewards and then break through when handed a natural-language critique: the critique carries the directive content about *why* a failure happened, which a number throws away [Can natural language feedback overcome numerical reward plateaus?]. Richer feedback can even be converted into dense gradients by letting the policy, conditioned on retrospective evidence of its own mistakes, act as its own teacher, collapsing the asymmetry into a self-distillation loop [Can environment feedback replace scalar rewards in policy learning?].
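The evaluative/directive distinction can be made concrete with a toy sketch. It is not an implementation of Critique-GRPO or SDPO; the integer-guessing task and every function name below are assumptions chosen for illustration. The teacher knows a hidden target. One student sees only a scalar reward and has to probe to discover which way to move, while the other is told the direction outright.

```python
def evaluative_feedback(guess, target):
    """Scalar reward: how good the guess was, with no hint about how to change it."""
    return -abs(guess - target)


def directive_feedback(guess, target):
    """Direction of change: +1 means 'go higher', -1 means 'go lower', 0 means 'stop'."""
    return (target > guess) - (target < guess)


def learn_from_scalar(target, start=0, max_queries=1000):
    """Student sees only rewards, so it must probe neighbours to find a direction."""
    guess, queries = start, 0
    current = evaluative_feedback(guess, target)
    queries += 1
    while current < 0 and queries < max_queries:
        for candidate in (guess + 1, guess - 1):
            reward = evaluative_feedback(candidate, target)
            queries += 1
            if reward > current:
                # Keep the first neighbour that scores better than the current guess.
                guess, current = candidate, reward
                break
    return guess, queries


def learn_from_directive(target, start=0, max_queries=1000):
    """Student is told which way to move, so every query is also a usable step."""
    guess, queries = start, 0
    while queries < max_queries:
        step = directive_feedback(guess, target)
        queries += 1
        if step == 0:
            break
        guess += step
    return guess, queries


if __name__ == "__main__":
    print(learn_from_scalar(-5))     # (-5, 11)
    print(learn_from_directive(-5))  # (-5, 6)
```

Both students reach the target, but the scalar-only student spends extra queries recovering a direction that the directive channel hands over for free. In one dimension the overhead is small; the argument in the text is that it grows quickly once "how to change" has many more degrees of freedom than "how well".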

Here's the turn most readers won't expect: more asymmetry is not strictly better. A teacher conditioned on correct answers produces confident, concise traces, and the student inherits that confidence, including in cases where it shouldn't be confident. The result is sharper in-domain performance bought at the cost of out-of-distribution robustness, because the student learns to suppress the epistemic caution it would need on unfamiliar problems [Does richer teacher context hurt student generalization?]. Relatedly, even objectively higher-quality teacher refinements can degrade a student when they exceed its learning frontier; the student does better filtering for what is compatible with its own profile than swallowing everything the more-knowledgeable teacher offers [Does teacher-refined data always improve student model performance?].
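The student-side filtering idea can be sketched as follows. This is a minimal illustration, not the method from the cited note: the `student_difficulty` scoring function, the `frontier` threshold, and the word-count stand-in are all hypothetical. The only point it encodes is that the student, not the teacher, decides which refinements it keeps.

```python
from typing import Callable, List, NamedTuple


class Sample(NamedTuple):
    original: str  # the training example before the teacher touched it
    refined: str   # the teacher's (objectively better) rewrite


def select_for_student(
    samples: List[Sample],
    student_difficulty: Callable[[str], float],
    frontier: float,
) -> List[str]:
    """Keep a refinement only if the student scores it within its own frontier.

    `student_difficulty` is a hypothetical stand-in for whatever statistic
    the student computes about itself on a piece of text.
    """
    chosen = []
    for sample in samples:
        if student_difficulty(sample.refined) <= frontier:
            chosen.append(sample.refined)   # compatible improvement: accept it
        else:
            chosen.append(sample.original)  # beyond the frontier: keep the original
    return chosen


def word_count(text: str) -> float:
    """Toy difficulty measure (an assumption): longer text counts as harder."""
    return float(len(text.split()))


if __name__ == "__main__":
    data = [
        Sample("short answer", "a much longer and denser refined answer here"),
        Sample("plain", "plain but better"),
    ]
    print(select_for_student(data, word_count, frontier=4))
```

With a frontier of 4 words, the first refinement is rejected in favour of its original and the second is accepted. A real difficulty statistic would come from the student model itself rather than from text length.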

The asymmetry also doesn't have to come from a separate teacher at all. Reverse-curriculum methods manufacture a kind of self-asymmetry by starting the student near a known-correct end state and sliding backward, so each step reveals failure modes using only outcome feedback. That yields process-supervision granularity without a human annotator who knows the steps [Can curriculum learning approximate expensive process supervision?]. And differential handling of successes versus failures (concrete demonstrations vs. abstracted lessons) shows that the student-side processing of asymmetric signals is itself a design lever, not just the teacher's privilege [Should successful and failed episodes be processed differently?].
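The sliding start state can be sketched in a few lines. This is not R3's training loop, which would run reinforcement learning at each stage; it only shows the start-state schedule and how a binary outcome reward ends up localizing a step-level failure. The demonstration, `toy_policy`, and `outcome_reward` are all hypothetical.

```python
DEMO = ["parse", "expand", "simplify", "answer"]  # a known-correct trajectory (toy)


def reverse_curriculum(demonstration):
    """Yield (stage, prefix) pairs, starting one step from the end and sliding back."""
    total = len(demonstration)
    for kept in range(total - 1, -1, -1):
        # Stage 1 hands over all but the last step; the final stage hands over nothing.
        yield total - kept, demonstration[:kept]


def toy_policy(prefix):
    """Hypothetical student: completes the trajectory but always botches 'expand'."""
    remaining = DEMO[len(prefix):]
    return ["WRONG" if step == "expand" else step for step in remaining]


def outcome_reward(trajectory):
    """Outcome-only feedback: 1 if the whole trajectory is correct, else 0."""
    return 1 if trajectory == DEMO else 0


def first_failing_stage(demonstration, policy, reward):
    """Return the first curriculum stage at which the outcome reward drops to 0."""
    for stage, prefix in reverse_curriculum(demonstration):
        if reward(prefix + policy(prefix)) == 0:
            return stage
    return None


if __name__ == "__main__":
    stage = first_failing_stage(DEMO, toy_policy, outcome_reward)
    print(stage, DEMO[len(DEMO) - stage])  # 3 expand
```

The reward never says which step failed. The failure is pinned to "expand" only because that is the single step newly handed to the student at the stage where the reward first drops.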

So the honest answer is two-part: information asymmetry enables feedback because it's the only thing that creates a correction gradient — but the same privileged confidence that makes the teacher useful is the thing that can teach a student to be overconfident and brittle. The richest learning comes not from maximizing the gap but from channeling it as directive, not just evaluative, information — and letting the student decide how much of it to absorb.


Sources (9 notes)

Why does teacher-student information asymmetry enable learning signals?

Social meta-learning requires information asymmetry—the teacher's access to correct answers or verifier output—to generate meaningful corrective signals. Without this asymmetry, teacher and student share identical uncertainty, making pedagogical correction impossible.

Can LLMs learn to ask for feedback during problem solving?

Research shows that reformulating static tasks as pedagogical dialogues—where a teacher has privileged information and the student must learn to extract it—trains models to actively engage conversation as a problem-solving tool, not just imitate dialogue patterns.

Can scalar rewards capture all the information in agent feedback?

Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.

Can natural language feedback overcome numerical reward plateaus?

Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.

Can environment feedback replace scalar rewards in policy learning?

SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.

Does richer teacher context hurt student generalization?

Teachers conditioned on correct answers and verifier output produce confident, concise traces that students inherit. This style suppresses uncertainty expression, optimizing in-domain performance while degrading generalization to out-of-distribution problems that require epistemic caution.

Does teacher-refined data always improve student model performance?

Teacher-refined data degrades performance when it exceeds the student's learning frontier, even if objectively higher quality. Students should filter refinements using their own statistical profile to retain only compatible improvements.

Can curriculum learning approximate expensive process supervision?

R3 progressively slides the reasoning start state backward from near-completion, creating a curriculum that reveals step-level failure modes using only outcome feedback. This achieves process supervision granularity without expensive human step annotations.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Next inquiring lines