Can negative feedback through critiques achieve the same steering flexibility as positive preferences?
This explores whether telling a model what's wrong (critique, negative signal) can steer behavior as richly as telling it what you want (positive preference) — and the corpus suggests negative feedback isn't a weaker substitute but often carries information positive preferences can't.
Framed as a yes/no question, negative feedback (critiques, contradictions, "don't do that") looks like it should be the blunter instrument. The collection points somewhere more interesting: critique and preference carry *different* information, and the most flexible steering often comes from converting between them rather than choosing one.
The most direct answer is that critiques and preferences are translatable. A retrieval system can take a natural negative reaction such as "doesn't look good for a date" and have a few-shot-prompted LLM rewrite it into a positive preference like "prefer more romantic," letting the system find better matches without fine-tuning (see "Can language models bridge the gap between critique and preference?"). At the surface level, then, negative feedback achieves the same steering as a positive preference precisely *by becoming* one. But the need for translation hints at why the question matters: the raw critique held something the bare preference didn't.
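The translation step described above can be sketched in a few lines. This is a minimal illustration, not the method from the cited note: the few-shot pairs, the prompt wording, and the names `build_prompt`, `critique_to_preference`, and `augment_query` are all assumptions, and `llm` stands in for whatever text-completion callable is available.

```python
from typing import Callable, List, Tuple

# Hypothetical few-shot pairs: (negative critique, positive preference).
FEW_SHOT: List[Tuple[str, str]] = [
    ("too loud to talk", "prefer a quiet atmosphere"),
    ("way too expensive", "prefer budget-friendly options"),
]


def build_prompt(critique: str,
                 examples: List[Tuple[str, str]] = FEW_SHOT) -> str:
    """Assemble a few-shot prompt that ends where the model should answer."""
    blocks = ["Rewrite each critique as a positive preference."]
    for negative, positive in examples:
        blocks.append(f"Critique: {negative}\nPreference: {positive}")
    blocks.append(f"Critique: {critique}\nPreference:")
    return "\n\n".join(blocks)


def critique_to_preference(critique: str, llm: Callable[[str], str]) -> str:
    """Ask the (assumed) LLM callable to turn a critique into a preference."""
    return llm(build_prompt(critique)).strip()


def augment_query(query: str, preference: str) -> str:
    """Attach the positive preference to the original retrieval query."""
    return f"{query} ({preference})"
```

In use, the rewritten preference is simply appended to the query before it goes back to the retriever, so the retriever itself never has to understand negation and no model weights change.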
That "something" is the heart of it. Feedback decomposes into two orthogonal channels, *evaluative* (how good was this?) and *directive* (how should it change?), and a single scalar reward captures the first while discarding the second (see "Can scalar rewards capture all the information in agent feedback?"). A critique in natural language carries the directive channel that a thumbs-up or thumbs-down never can. This is why models stuck on performance plateaus can produce correct solutions once they are given chain-of-thought critiques: numerical rewards tell them *that* they failed but not *why* or *how to fix it* (see "Can natural language feedback overcome numerical reward plateaus?"). Negative feedback, expressed richly, can therefore steer *more* precisely than a bare positive preference signal, not less.
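The two-channel picture can be made concrete with a small data structure. This is a sketch under stated assumptions: the `Feedback` class and the functions `to_scalar_reward` and `revision_prompt` are hypothetical names invented for illustration, not an API from any of the cited work.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    score: float               # evaluative channel: how good was this?
    directive: Optional[str]   # directive channel: how should it change?


def to_scalar_reward(fb: Feedback) -> float:
    """Collapse feedback to a scalar reward; the directive is dropped."""
    return fb.score


def revision_prompt(task: str, attempt: str, fb: Feedback) -> str:
    """Build a retry prompt that uses the directive channel when present."""
    parts = [f"Task: {task}", f"Previous attempt: {attempt}"]
    if fb.directive:
        parts.append(f"Critique: {fb.directive}")
    else:
        # Without a directive, all the model learns is how badly it did.
        parts.append(f"Score: {fb.score}")
    parts.append("Revise the attempt.")
    return "\n".join(parts)
```

Two critiques that point at different mistakes collapse to the same number under `to_scalar_reward`, yet they produce different retry prompts. That difference is the information a scalar reward throws away.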
There's also a quieter, structural argument that negative signal does work positive signal cannot. Training only on negative samples, that is, suppressing wrong trajectories, often matches full RL methods such as PPO and GRPO: positive-only reinforcement piles probability onto a few winning answers and collapses diversity, while negative reinforcement prunes the bad without narrowing the good (see "Does negative reinforcement alone outperform full reinforcement learning?"). The same asymmetry shows up elsewhere. Critique models injected into the training loop keep solutions diverse instead of letting the model converge prematurely (see "Do critique models improve diversity during training itself?"). Persona consistency cannot be enforced by rewarding good answers alone; it requires *explicitly punishing contradictions*, because supervised learning never penalizes them (see "Why does supervised learning fail to enforce persona consistency?"). And treating success and failure asymmetrically, with concrete demos from wins and abstracted lessons from losses, outperforms processing them the same way (see "Should successful and failed episodes be processed differently?").
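The last point, processing wins and losses differently, lends itself to a short sketch. This illustrates the general idea only and should not be read as SkillRL's actual implementation: the `Episode` type, the `consolidate` function, and the `abstract_lesson` callable (which in practice might be an LLM summarizer) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Episode:
    task: str
    trajectory: str
    success: bool


def consolidate(
    episodes: List[Episode],
    abstract_lesson: Callable[[Episode], str],
) -> Dict[str, List[str]]:
    """Keep successes verbatim as demos; compress failures into lessons."""
    memory: Dict[str, List[str]] = {"demos": [], "lessons": []}
    for ep in episodes:
        if ep.success:
            # A win is stored as a concrete demonstration.
            memory["demos"].append(ep.trajectory)
        else:
            # A loss is stored only as an abstracted lesson, not replayed.
            memory["lessons"].append(abstract_lesson(ep))
    return memory
```

The design choice is that a failed trajectory never enters memory verbatim; only the lesson drawn from it does. This is one plausible reading of why an asymmetric scheme would need less context than a uniform one.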
The catch worth knowing: the flexibility isn't free, and pure preference optimization has its own failure modes. Optimizing for what people *prefer* in a single turn quietly erodes the model's willingness to ask clarifying questions and check understanding, an "alignment tax" where it looks helpful but fails silently across a conversation (see "Does preference optimization harm conversational understanding?"). And no feedback regime escapes the need for a real external anchor: purely internal self-correction stalls on circularity and reward hacking (see "Can models reliably improve themselves without external feedback?"). The takeaway you might not have gone looking for: critique isn't the poor cousin of preference. It's the channel that carries direction, preserves diversity, and enforces constraints, and the best systems don't pick a side; they translate between the two.
Sources (9 notes)
Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.
SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
RLHF optimizes models for single-turn helpfulness by rewarding confident responses over clarifying questions and understanding checks. This preference alignment systematically suppresses grounding acts, leaving them 77.5% below human levels and creating an alignment tax where models appear helpful but fail silently in multi-turn contexts.
Pure self-improvement stalls due to the generation-verification gap, diversity collapse, and reward hacking. Reliable improvement methods succeed by smuggling in external anchors: past model versions, third-party judges, user corrections, or tool feedback.