Can natural language feedback overcome numerical reward plateaus?
Exploring whether chain-of-thought critiques can push past performance ceilings that scaling data alone cannot break in reinforcement learning for reasoning tasks.
Three failure modes of purely numerical RL for reasoning: (1) performance plateaus despite an 8x scaling of training examples (from 4k to 32k); (2) self-reflection behaviors during RL, often celebrated as "aha moments," that contribute minimally to successful problem-solving; (3) persistent failures on certain problems despite extensive trial-and-error training. The common cause: numerical feedback carries little information about WHY a response is correct or incorrect and HOW to improve it.
Critique-GRPO demonstrates that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems when provided with chain-of-thought critiques. The key is integrating both natural language feedback (NLF) and numerical feedback within online RL. The model learns from initial responses and critique-guided refinements simultaneously while maintaining exploration.
This is significant because it challenges the implicit assumption that RL's scalar learning signal is sufficient for arbitrarily complex reasoning. Given the findings of "Does reflection in reasoning models actually correct errors?", the ineffectiveness of self-reflection during RL training is predictable: the model cannot generate useful critiques of its own failures. External critiques break the ceiling because they provide the information that numerical rewards lack: specific identification of where the reasoning went wrong.
The practical architecture has three components: (1) the model generates initial responses; (2) a reasoning-based reward model generates CoT critiques identifying flaws; (3) a shaping function enhances learning from valid refinements and heavily penalizes failed refinements. This approach encourages the model to integrate targeted refinements while preserving exploration.
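A minimal sketch of this loop, assuming hypothetical helpers (policy.generate, critique_model.critique, verify, grpo_update) stand in for the real components; the bonus/penalty shaping below illustrates the idea and is not necessarily the exact shaping function used in Critique-GRPO:

```python
import numpy as np

def shaped_group_advantages(initial_rewards, refinement_rewards,
                            bonus=1.0, penalty=1.0):
    """Amplify rewards for valid refinements, penalize failed ones, then apply
    GRPO-style group normalization over initial and refined samples together."""
    refinement_rewards = np.asarray(refinement_rewards, dtype=float)
    shaped = np.where(refinement_rewards > 0,
                      refinement_rewards + bonus,   # enhance learning from valid refinements
                      -penalty)                     # heavily penalize failed refinements
    rewards = np.concatenate([np.asarray(initial_rewards, dtype=float), shaped])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def critique_grpo_step(policy, critique_model, problem, group_size=8):
    # (1) Sample a group of initial responses from the current policy.
    initial = [policy.generate(problem) for _ in range(group_size)]
    initial_rewards = [float(verify(problem, r)) for r in initial]

    # (2) A reasoning-based reward model writes a CoT critique of each response,
    #     identifying where its reasoning went wrong.
    critiques = [critique_model.critique(problem, r) for r in initial]

    # (3) The policy refines each response conditioned on its critique.
    refined = [policy.generate(problem, prior=r, critique=c)
               for r, c in zip(initial, critiques)]
    refined_rewards = [float(verify(problem, r)) for r in refined]

    advantages = shaped_group_advantages(initial_rewards, refined_rewards)
    # Policy-gradient update over both initial responses and refinements,
    # so the model integrates targeted fixes while still exploring on its own.
    grpo_update(policy, initial + refined, advantages)
```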
As explored in "Do critique models improve diversity during training itself?", the NLF mechanism works by expanding the effective exploration space: critiques point toward regions of the solution space that numerical rewards alone cannot identify.
Semantic reward shaping as lightweight NLF: The Semantic Reward Shaping paper proposes a complementary mechanism: using a small encoder-only transformer to compute cosine similarity between generated explanations and ground-truth references. This provides a dense, semantically rich reward signal within GRPO — not as information-rich as full CoT critiques, but vastly cheaper and faster than LLM-as-judge evaluation. The approach combines semantic similarity reward with auxiliary correctness and formatting rewards, significantly improving explanation faithfulness over SFT baselines. This occupies a middle ground between brittle keyword metrics (ROUGE) and expensive LLM-based critiques — suggesting the NLF principle scales down to lightweight implementations when full CoT critique is impractical.
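A minimal sketch of such a shaped reward, assuming a sentence-transformers encoder and illustrative weights (the paper's exact encoder, auxiliary terms, and coefficients may differ):

```python
from sentence_transformers import SentenceTransformer, util

# Small encoder-only model: cheap, dense semantic signal (no LLM-as-judge call).
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_reward(explanation: str, reference: str,
                    answer_correct: bool, well_formatted: bool,
                    w_sem: float = 0.6, w_ans: float = 0.3, w_fmt: float = 0.1) -> float:
    # Cosine similarity between the generated explanation and the ground-truth
    # reference acts as the semantic component of the GRPO reward.
    emb = encoder.encode([explanation, reference], convert_to_tensor=True)
    sim = util.cos_sim(emb[0], emb[1]).item()
    # Combine with auxiliary correctness and formatting rewards.
    return (w_sem * sim
            + w_ans * (1.0 if answer_correct else 0.0)
            + w_fmt * (1.0 if well_formatted else 0.0))
```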
Textual gradients as generalized NLF: TextGrad (2406.07496) formalizes the broader principle: natural language criticism can serve as "textual gradients" propagated through arbitrary computation graphs including LLM API calls, simulators, and external solvers. Each AI system component is a node in a computation graph; textual feedback describes how variables should change to improve the system. This extends NLF from RL plateau-breaking to general AI system optimization — the same principle (informative language feedback > scalar signal) applies at the system level, not just the training level.
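The principle can be illustrated with a toy, self-contained computation graph; the class and hard-coded critic below are illustrative stand-ins for demonstration, not TextGrad's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class TextVariable:
    value: str                                     # text held at this node (prompt, answer, ...)
    feedback: list = field(default_factory=list)   # accumulated textual "gradients"
    parents: list = field(default_factory=list)    # upstream variables that produced this one

    def backward(self, critique: str) -> None:
        # Record feedback on this node, then translate it into feedback for each
        # parent; in a real system an LLM performs this backward translation.
        self.feedback.append(critique)
        for parent in self.parents:
            parent.backward(f"To fix the downstream issue ({critique}), "
                            f"revise: {parent.value!r}")

prompt = TextVariable("Solve the problem. Show your work.")
answer = TextVariable("x = 7 (no justification given)", parents=[prompt])

# An external evaluator (LLM judge, unit test, solver) produces a textual loss,
# which propagates back through the graph as feedback on how variables should change.
answer.backward("The answer lacks step-by-step justification.")
print(prompt.feedback)  # feedback a textual optimizer would use to rewrite the prompt
```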
Source: Reinforcement Learning, Reward Models; enriched from LLM Architecture
Related concepts in this collection
- Do critique models improve diversity during training itself?
  Explores whether critique integrated into the training loop, beyond test-time scoring, actively maintains solution diversity and prevents the model from converging too narrowly during iterative self-training.
  extends: NLF is the mechanism by which critique-driven exploration improves diversity
- Does reflection in reasoning models actually correct errors?
  When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies.
  explains: why self-reflection fails to break plateaus; external critique is needed
- Does revising your own reasoning actually help or hurt?
  Self-revision in reasoning models often degrades accuracy, while external critique improves it. Understanding what makes revision helpful or harmful could reshape how we design systems that need to correct themselves.
  directly supports: external NLF breaks plateaus; internal reflection does not
- Does policy entropy collapse limit reasoning performance in RL?
  As reinforcement learning models become more confident in their policy choices, entropy drops and performance plateaus. Can we identify and counteract this bottleneck to sustain scaling?
  connects: NLF may work by re-expanding entropy in the specific regions where the model has collapsed
Original note title
natural language feedback breaks rl performance plateaus that scaling numerical rewards alone cannot resolve