Reinforcement Learning for LLMs

Can natural language feedback overcome numerical reward plateaus?

Exploring whether chain-of-thought critiques can push reinforcement learning for reasoning tasks past performance ceilings that scaling training data alone cannot break.

Note · 2026-02-22 · sourced from Reinforcement Learning

Three failure modes of purely numerical RL for reasoning: (1) performance plateaus despite 8x scaling of training examples (from 4k to 32k); (2) self-reflection behaviors during RL, often celebrated as "aha moments," contribute minimally to successful problem-solving; (3) persistent failures on certain problems despite extensive trial-and-error training. The common cause: numerical feedback contains limited information about WHY a response is correct or incorrect and HOW to improve.

Critique-GRPO demonstrates that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems when provided with chain-of-thought critiques. The key is integrating both natural language feedback (NLF) and numerical feedback within online RL. The model learns from initial responses and critique-guided refinements simultaneously while maintaining exploration.
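
A minimal sketch of what such a training step could look like, assuming hypothetical `policy`, `critic_llm`, and `reward_fn` interfaces; this illustrates the procedure described above, not the paper's implementation:

```python
# Sketch of a Critique-GRPO-style step. `policy`, `critic_llm`, and
# `reward_fn` are hypothetical stand-ins for the policy model, the
# critique-generating reward model, and the numerical correctness reward.

def critique_grpo_step(policy, critic_llm, reward_fn, problem, group_size=8):
    # 1) Sample a group of initial responses (standard GRPO exploration).
    initial = [policy.generate(problem) for _ in range(group_size)]
    initial_rewards = [reward_fn(problem, r) for r in initial]

    # 2) Obtain a chain-of-thought critique for each response, then let the
    #    policy refine its answer conditioned on that critique (the NLF path).
    refinements, refinement_rewards = [], []
    for response in initial:
        critique = critic_llm.critique(problem, response)
        refined = policy.generate(problem, feedback=critique)
        refinements.append(refined)
        refinement_rewards.append(reward_fn(problem, refined))

    # 3) One policy update over both sets of trajectories, so the model
    #    learns from its own exploration and from critique-guided refinements.
    policy.grpo_update(
        trajectories=initial + refinements,
        rewards=initial_rewards + refinement_rewards,
    )
```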

This is significant because it challenges the implicit assumption that RL's learning signal is sufficient for arbitrarily complex reasoning. Given the finding in "Does reflection in reasoning models actually correct errors?", the ineffectiveness of self-reflection during RL training is predictable: the model cannot generate useful critiques of its own failures. External critiques break the ceiling because they supply the information that numerical rewards lack: a specific identification of where the reasoning went wrong.

The practical architecture has three components: (1) the model generates initial responses; (2) a reasoning-based reward model generates CoT critiques identifying flaws; (3) a shaping function enhances learning from valid refinements and heavily penalizes failed refinements. This approach encourages the model to integrate targeted refinements while preserving exploration.
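
A hedged sketch of the shaping step (3); the bonus and penalty magnitudes are illustrative assumptions, not the paper's values:

```python
# Illustrative shaping function: amplify reward on refinements that succeed
# and penalize refinements that fail despite having a critique available.

def shape_refinement_reward(raw_reward: float, is_refinement: bool,
                            bonus: float = 1.0, penalty: float = 2.0) -> float:
    if not is_refinement:
        return raw_reward                  # initial responses pass through unchanged
    if raw_reward > 0:
        return raw_reward * (1.0 + bonus)  # valid refinement: stronger learning signal
    return raw_reward - penalty            # failed refinement: heavy penalty
```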

As argued in "Do critique models improve diversity during training itself?", the NLF mechanism works by expanding the effective exploration space: critiques point toward regions of the solution space that numerical rewards cannot identify.

Semantic reward shaping as lightweight NLF: The Semantic Reward Shaping paper proposes a complementary mechanism: using a small encoder-only transformer to compute cosine similarity between generated explanations and ground-truth references. This provides a dense, semantically rich reward signal within GRPO — not as information-rich as full CoT critiques, but vastly cheaper and faster than LLM-as-judge evaluation. The approach combines semantic similarity reward with auxiliary correctness and formatting rewards, significantly improving explanation faithfulness over SFT baselines. This occupies a middle ground between brittle keyword metrics (ROUGE) and expensive LLM-based critiques — suggesting the NLF principle scales down to lightweight implementations when full CoT critique is impractical.
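
A rough sketch of such a lightweight semantic reward, using sentence-transformers as the small encoder; the model choice and the weights on the auxiliary terms are illustrative assumptions, not the paper's configuration:

```python
# Lightweight semantic reward: score a generated explanation against a
# ground-truth reference with a small encoder, then combine with auxiliary
# correctness and formatting rewards as described above.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # small encoder-only model

def semantic_reward(generated: str, reference: str) -> float:
    embs = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(embs[0], embs[1]).item()   # dense reward in [-1, 1]

def total_reward(generated: str, reference: str,
                 answer_correct: bool, well_formatted: bool,
                 w_sem: float = 1.0, w_ans: float = 1.0, w_fmt: float = 0.2) -> float:
    return (w_sem * semantic_reward(generated, reference)
            + w_ans * float(answer_correct)
            + w_fmt * float(well_formatted))
```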

Textual gradients as generalized NLF: TextGrad (2406.07496) formalizes the broader principle: natural language criticism can serve as "textual gradients" propagated through arbitrary computation graphs including LLM API calls, simulators, and external solvers. Each AI system component is a node in a computation graph; textual feedback describes how variables should change to improve the system. This extends NLF from RL plateau-breaking to general AI system optimization — the same principle (informative language feedback > scalar signal) applies at the system level, not just the training level.
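
A minimal illustration of the textual-gradient pattern under a hypothetical `llm` completion helper; this sketches the criticize-then-rewrite loop, not the TextGrad library's actual API:

```python
# Textual-gradient idea: criticize an output with respect to an objective,
# propagate that criticism to an upstream variable, then rewrite the variable.

def llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM provider")

def textual_gradient(variable: str, output: str, objective: str) -> str:
    # "Backward pass": describe how the upstream variable should change.
    return llm(
        f"Output:\n{output}\n\nObjective: {objective}\n\n"
        f"Explain how this upstream text should change to improve the output:\n{variable}"
    )

def apply_gradient(variable: str, feedback: str) -> str:
    # "Optimizer step": rewrite the variable according to the feedback.
    return llm(f"Rewrite the text below to follow the feedback.\n\n"
               f"Text:\n{variable}\n\nFeedback:\n{feedback}")

def optimize_prompt_once(prompt: str, question: str, objective: str) -> str:
    answer = llm(f"{prompt}\n\nQ: {question}")        # forward pass
    fb = textual_gradient(prompt, answer, objective)  # textual gradient
    return apply_gradient(prompt, fb)                 # update the upstream prompt
```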


Source: Reinforcement Learning, Reward Models; enriched from LLM Architecture
