Can unified policies handle negative feedback and critique transformation simultaneously?
This explores whether a single learned policy can do two jobs at once — learn from what went wrong (negative feedback) and turn criticism into something actionable (critique transformation) — rather than splitting those into separate components.
The corpus suggests the question hides a deeper one: what kind of information feedback actually carries, and whether anything is lost by collapsing it. The most useful insight is that "negative feedback" and "critique" are not the same thing. Agent feedback splits into two orthogonal channels: an *evaluative* signal (how bad was this?) and a *directive* one (here is how to fix it). A scalar reward captures the first and throws away the second (see "Can scalar rewards capture all the information in agent feedback?"). That distinction is exactly why critique transformation matters: it recovers the directional information a thumbs-down loses.
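A minimal sketch of the two-channel view, assuming a hypothetical `Feedback` container (the class, the function name, and the second directive string are illustrative and do not come from any cited note). It shows only that two pieces of feedback can share an evaluative score while asking for different fixes, and that collapsing to a scalar erases the difference.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    """Hypothetical container for one piece of agent feedback."""
    score: float                # evaluative channel: how bad or good was it?
    directive: Optional[str]    # directive channel: how should it change?


def to_scalar_reward(fb: Feedback) -> float:
    """Collapse feedback to a scalar reward; the directive is dropped."""
    return fb.score


# Two complaints that are equally negative but point in different directions.
a = Feedback(score=-1.0, directive="prefer more romantic")
b = Feedback(score=-1.0, directive="prefer somewhere quieter")  # illustrative

same_reward = to_scalar_reward(a) == to_scalar_reward(b)  # scalar view
same_feedback = a == b                                    # full view
```

Under the scalar view the two complaints are indistinguishable; only the directive field separates them.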
There is good evidence the two can be unified. The cleanest example: a few-shot-prompted language model converts a user's complaint ("doesn't look good for a date") directly into a positive preference ("prefer more romantic"), so a retrieval system finds better matches without fine-tuning (see "Can language models bridge the gap between critique and preference?"). That is negative feedback and critique transformation happening in one pass. On the recommender side, the unified-policy case is even more direct: folding what to ask, what to recommend, and when to do each into a single policy beats optimizing them separately, because separation blocks gradient signals from informing each other (see "Can unified policy learning improve conversational recommender systems?"). The argument for unification is the same in both cases: keeping the jobs apart wastes information that wants to flow between them.
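The critique-to-preference step can be sketched as a few-shot prompt builder. This is an assumption-laden illustration, not the cited work's prompt: the instruction wording, the `build_prompt` helper, and the "too loud" query are hypothetical, and the call to an actual language model is deliberately left out.

```python
# One worked example taken from the text; a real prompt would likely use several.
FEW_SHOT = [
    ("doesn't look good for a date", "prefer more romantic"),
]


def build_prompt(critique: str, examples=FEW_SHOT) -> str:
    """Build a few-shot prompt asking a model to restate a critique
    as a positive preference (hypothetical wording)."""
    lines = ["Rewrite each critique as a positive preference."]
    for example_critique, preference in examples:
        lines.append(f"Critique: {example_critique}")
        lines.append(f"Preference: {preference}")
    # The new critique goes last, with the preference left for the model to fill in.
    lines.append(f"Critique: {critique}")
    lines.append("Preference:")
    return "\n".join(lines)


prompt = build_prompt("too loud")  # illustrative critique
```

The model's completion would then be passed to the retrieval system as an ordinary positive query, which is why no retraining of the retriever is needed.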
The reinforcement-learning side of the corpus shows why critique-as-transformation outperforms raw negative reward. Models stuck on a numerical-reward plateau start producing correct solutions once they are given a chain-of-thought critique explaining *why* they failed; the number alone never carried that information (see "Can natural language feedback overcome numerical reward plateaus?"). A related method skips the external reward model entirely: feed the policy retrospective evidence of its own mistakes in-context and it acts as its own process critic, converting rich feedback into dense gradients (see "Can environment feedback replace scalar rewards in policy learning?"). So a unified policy does not just *tolerate* both signals: the directive critique is what makes the negative signal teachable.
Here is the surprise the corpus offers: negative feedback alone is more powerful than people assume. Training on only negative samples, that is, suppressing wrong trajectories, consistently improves Pass@k and often matches full RL (PPO and GRPO), because it preserves solution diversity where positive-only reinforcement collapses it by piling probability onto a few winners (see "Does negative reinforcement alone outperform full reinforcement learning?"). Critique models reinforce this from another angle: injecting step-level critique during training keeps exploration diverse and prevents premature convergence (see "Do critique models improve diversity during training itself?"). There is also a hint that the two signal types may want *asymmetric* handling rather than identical treatment, with successes stored as concrete demonstrations and failures abstracted into lessons (see "Should successful and failed episodes be processed differently?"). That is the one caution against naive unification: a single policy can carry both, but it may need to process them differently inside.
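The mechanics behind the diversity claim can be illustrated with a toy softmax policy over four answers. This is a hedged sketch only: the probabilities, the learning rate, and the choice of which answer gets sampled are hypothetical, and a single hand-picked update does not reproduce the cited result. It shows one REINFORCE-style step penalizing a sampled wrong answer, next to one step rewarding the already-likelier correct answer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, action, reward, lr=1.0):
    """One REINFORCE update on softmax logits:
    z += lr * reward * (onehot(action) - p)."""
    p = softmax(logits)
    return [z + lr * reward * ((1.0 if i == action else 0.0) - p[i])
            for i, z in enumerate(logits)]


def share_of_b(probs):
    """Fraction of the correct-answer mass held by answer B (index 1)."""
    return probs[1] / (probs[0] + probs[1])


# Hypothetical toy: answers A and B (indices 0, 1) are correct, C and D are wrong.
start = [math.log(x) for x in (0.4, 0.3, 0.2, 0.1)]

# Negative-only step: the sampled wrong answer C is penalized.
after_negative = softmax(reinforce_step(start, action=2, reward=-1.0))
# Positive-only step: the sampled correct answer A is rewarded.
after_positive = softmax(reinforce_step(start, action=0, reward=+1.0))
```

In this toy, the negative step lowers the probability of C and leaves the split between the two correct answers close to where it started, while the positive step shifts much more of the correct mass onto A. Comparing `share_of_b(after_negative)` with `share_of_b(after_positive)` makes the gap visible.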
The honest limit: the corpus has no paper benchmarking a *single* policy that explicitly does negative-feedback learning and critique transformation side by side. What it offers instead is the architecture of the answer: feedback decomposes into evaluation plus direction, unification beats separation when it lets those channels cross-inform, and pure self-improvement without any external critique signal eventually stalls on its own circularity (see "Can models reliably improve themselves without external feedback?"). The pieces are all here, but no single note assembles them.
Sources (9 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.
Research shows that formulating attribute-asking, item-recommending, and timing decisions as a single graph-based RL policy achieves better joint optimization than isolated components. Separation prevents gradient signals from informing one another and fails to optimize conversation trajectory holistically.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
SDPO converts tokenized environment feedback into dense gradient signals by using the feedback-conditioned policy as a self-teacher. The policy, when given retrospective evidence of its mistakes in-context, implicitly acts as its own process reward model, making external reward signals unnecessary.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.