How do semantic reward shaping approaches compare to full critique models?
This explores the spectrum between 'reward shaping' — coaxing better behavior by sculpting a numerical signal (rubrics, confidence, dense token rewards) — and 'critique models' that produce full natural-language reasoning about what went wrong, and asks what each buys you.
This explores the gap between two ways of telling a model it did well or badly: a shaped scalar signal versus a full written critique. The corpus is unusually direct about why this gap matters — and the recurring finding is that the number throws away information the language keeps.
The cleanest statement of the difference is that agent feedback decomposes into two orthogonal channels: *evaluative* (how good was this?) and *directive* (how should it change?). A scalar reward can carry the first but structurally cannot carry the second (see: Can scalar rewards capture all the information in agent feedback?). That is the mechanism behind the more dramatic result that natural-language critiques break performance plateaus that numerical rewards get stuck on: once a model has saturated what a scalar can teach it, a chain-of-thought critique explaining *why* a solution failed reopens progress (see: Can natural language feedback overcome numerical reward plateaus?). So 'critique vs. reward shaping' isn't two flavors of the same thing; the critique contains a dimension the reward cannot represent.
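A minimal sketch of the two-channel idea, assuming a hypothetical `Feedback` record with one field per channel (the names `Feedback`, `score`, `critique`, and `to_scalar` are illustrative and do not come from the cited notes). It shows two different failures that collapse to the same scalar reward while their critiques stay distinct.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    """Hypothetical container for the two feedback channels."""
    score: float    # evaluative channel: how good was the attempt?
    critique: str   # directive channel: how should it change?


def to_scalar(fb: Feedback) -> float:
    """Collapse feedback to a scalar reward; the directive text is dropped."""
    return fb.score


# Two failed attempts with different causes (made-up examples).
a = Feedback(score=0.0, critique="Off-by-one in the loop bound; iterate through n inclusive.")
b = Feedback(score=0.0, critique="Wrong formula; use the sum of squares, not the square of the sum.")

same_reward = to_scalar(a) == to_scalar(b)   # the scalar cannot tell them apart
same_critique = a.critique == b.critique     # the critiques can
print(same_reward, same_critique)
```

Under these assumptions, a learner that only sees `to_scalar(...)` receives identical signals for both failures, which is one way to read the claim that the directive channel is lost.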
But richer isn't automatically better, and this is where the comparison gets interesting. Generative judges that reason step-by-step about a model's reasoning beat classifier-style reward models, and do it with orders of magnitude less training data (see: Can judges that reason about reasoning outperform classifier rewards?). Even reward models themselves improve when allowed to reason before scoring, effectively turning evaluation into a test-time-compute problem (see: Can reward models benefit from reasoning before scoring?). And critique in the training loop does something a final-accuracy number never shows: it preserves solution diversity and prevents the model from prematurely collapsing onto one strategy (see: Do critique models improve diversity during training itself?). The critique's value is partly that it keeps the search wide.
The most useful nuance for someone choosing between approaches is that the two aren't a binary: the strongest results come from making them *cooperate*. Rubrics work best not when their scores are mashed into a dense reward (which invites reward hacking) but when they act as a gate that accepts or rejects whole rollouts, letting token-level rewards optimize only inside valid answers (see: Can rubrics and dense rewards work together without hacking?). Other shaping signals are nearly free and surprisingly effective: a model's own answer-confidence can rank reasoning traces and even repair the calibration that RLHF damages (see: Can model confidence work as a reward signal for reasoning?), and models can be trained to internalize self-evaluation in unused sequence space at zero inference cost (see: Can models learn to evaluate their own work during training?).
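The gating pattern described above can be sketched as follows. This is a simplified illustration, not the DRO implementation: it gates individual rollouts rather than rollout groups, and the function name `gate_rollouts`, the toy rubric `cites_source`, and the reward values are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def gate_rollouts(
    rollouts: Sequence[str],
    token_rewards: Sequence[Sequence[float]],
    rubric_passes: Callable[[str], bool],
) -> List[Tuple[int, List[float]]]:
    """Return (index, token rewards) only for rollouts the rubric accepts.

    The rubric acts as a pass/fail gate; its verdict is never added
    to the token-level rewards themselves.
    """
    if len(rollouts) != len(token_rewards):
        raise ValueError("rollouts and token_rewards must have the same length")
    accepted = []
    for i, (text, rewards) in enumerate(zip(rollouts, token_rewards)):
        if rubric_passes(text):
            accepted.append((i, list(rewards)))
    return accepted


def cites_source(text: str) -> bool:
    """Toy rubric (assumption): an answer is valid only if it cites a source."""
    return "[source]" in text


rollouts = ["Answer A [source]", "Answer B"]
token_rewards = [[0.1, 0.4], [0.9, 0.9]]   # made-up dense rewards

accepted = gate_rollouts(rollouts, token_rewards, cites_source)
print(accepted)
```

In this toy run the second rollout has the higher dense rewards but fails the rubric, so it never reaches the optimizer. That is the point of keeping the rubric as a gate: a rollout cannot compensate for a rubric failure by scoring well on the token-level signal.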
The quiet lesson across these is that 'shaping' and 'critique' sit on a continuum of how much *language* you let into the signal. Push further and critique stops being a reward mechanism at all and becomes a translation layer: LLMs can convert a user's negative critique ('doesn't look good for a date') straight into a positive, retrievable preference, no fine-tuning required (see: Can language models bridge the gap between critique and preference?). If you came here for a verdict, it's this: scalar shaping is cheaper and safer against hacking when fenced properly, but full critiques are the only thing that carries directional 'how to fix it' information, and the frontier work spends that information rather than discarding it.
Sources (9 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Critique-GRPO shows that models stuck on performance plateaus can generate correct solutions when given chain-of-thought critiques, revealing that numerical rewards lack critical information about why failures occur and how to improve.
StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.
Three independent teams (RRM, RM-R1, DeepSeek-GRM) discovered that adding chain-of-thought reasoning before reward scoring enables adaptive test-time compute scaling for evaluation. Reasoning-based approaches raise the capability ceiling of reward models beyond what outcome-based evaluation achieves.
Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.
DRO shows that using rubrics to accept or reject rollout groups—rather than converting rubric scores into dense rewards—prevents reward hacking. This separation preserves the categorical strength of rubrics while letting token-level rewards optimize within valid answers.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Few-shot LLM prompting can convert natural negative feedback like "doesn't look good for a date" into positive preferences like "prefer more romantic," enabling retrieval systems to find better-matching recommendations without fine-tuning.