Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

Paper · arXiv 2506.03106 · Published June 3, 2025
Reinforcement Learning

We demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided refinements simultaneously while maintaining exploration.

Reinforcement learning (RL) has been a key driver of recent advancements in enhancing the reasoning capabilities of large language models (LLMs) [1, 2, 3, 4]. In particular, reinforcement learning with numerical feedback, typically in the form of scalar rewards and often referred to as the R1-Zero training paradigm [2], enables base LLMs to learn from their own generations through trial-and-error learning. High-quality generations are rewarded positively, while low-quality generations are penalized. This paradigm has revolutionized the post-training pipeline for LLMs, shifting from imitation learning of expert demonstrations to learning from the model’s own generations (i.e., experiences) [5, 6], resulting in significant performance improvements.
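To make the numerical-feedback setup concrete, the following is a minimal sketch (in Python, not from the paper) of how a group of sampled responses might be scored with scalar rewards and converted into GRPO-style group-relative advantages; `is_correct` is a hypothetical verifier used only for illustration.

```python
# Minimal sketch of numerical feedback in the R1-Zero paradigm:
# score a group of sampled responses with scalar rewards, then
# compute group-relative advantages as in GRPO [16].
from statistics import mean, pstdev

def is_correct(response: str, answer: str) -> bool:
    # Placeholder verifier: exact match against the reference answer.
    return response.strip() == answer.strip()

def group_relative_advantages(responses, answer):
    # Numerical feedback: +1 for a correct response, 0 otherwise.
    rewards = [1.0 if is_correct(r, answer) else 0.0 for r in responses]
    mu, sigma = mean(rewards), pstdev(rewards)
    # Normalize each reward within its sampling group.
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

advantages = group_relative_advantages(
    ["x = 4", "x = 5", "x = 4", "x = 3"], answer="x = 4"
)
print(advantages)  # correct samples get positive weight, incorrect negative
```

Note that the scalar carries no information about why a response failed; that gap motivates the limitations discussed next.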

However, this paradigm exhibits notable limitations. (i) Performance Plateaus: Scaling the number of training examples by 8x (from 4k to 32k) fails to improve peak performance. (ii) Limited Effectiveness of Self-Reflection: Increased self-reflection behaviors during RL finetuning, often touted as crucial “Aha moments,” contribute minimally to successful problem-solving. (iii) Persistent Failures: The models exhibit persistent failures on certain problems despite extensive trial-and-error finetuning. We hypothesize that a key cause of these plateaus and persistent failures is the limited information contained within numerical feedback regarding why a response is correct or incorrect and how to improve it. Furthermore, the limited effectiveness of self-reflection behaviors compounds these challenges. Together, these limitations underscore the need for external oversight or richer feedback mechanisms to support more effective learning.

To address these limitations, natural language feedback (NLF), typically in the form of textual critiques, offers a promising avenue. NLF provides detailed and targeted insights into flaws in model-generated outputs, enabling both accurate evaluation and effective response refinement [8, 9]. However, existing approaches often fail to fully exploit the potential of textual critiques. Many studies [10, 11, 12, 13, 14] primarily use critiques for evaluation, transforming them into numerical rewards for model improvement via RL algorithms such as Proximal Policy Optimization (PPO) [15] or Group Relative Policy Optimization (GRPO) [16]. This transformation often discards valuable constructive information embedded within the critiques. Some studies [9, 17] utilize critiques to generate refinements and finetune models on these refinements through supervised learning. While effective, these offline approaches are limited by their inability to support consistent exploration and online refinement. This raises a key research question: Can we incorporate critiques into an online reinforcement learning framework to enable LLMs to spontaneously learn from both initial generations and refinements?
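To illustrate the contrast, the sketch below shows, under hypothetical helpers (`generate_fn` stands in for an LLM call and is an assumption, not any paper's API), the two ways prior work typically consumes a critique: collapsing it into a scalar reward for RL, versus using it to produce a refinement for offline supervised finetuning.

```python
def critique_to_reward(critique: str) -> float:
    # (a) Evaluation-only use [10-14]: collapse the full critique into a
    # single scalar, here by reading a final verdict line. All constructive
    # detail in the critique is discarded.
    verdict = critique.splitlines()[-1].lower()
    if "incorrect" in verdict:
        return 0.0
    return 1.0 if "correct" in verdict else 0.0

def critique_to_refinement(generate_fn, question, response, critique):
    # (b) Refinement use [9, 17]: condition the model on the critique to
    # rewrite its answer; the (question, refinement) pairs are then used
    # for offline supervised finetuning.
    prompt = (
        f"Question: {question}\n"
        f"Previous answer: {response}\n"
        f"Critique: {critique}\n"
        "Rewrite the answer, fixing every issue identified in the critique."
    )
    return generate_fn(prompt)
```

Path (a) preserves the RL loop but loses the critique's content; path (b) preserves the content but happens offline, off-policy. Critique-GRPO aims to keep both.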

To address the challenges of RL with solely numerical feedback, we first examine the potential of RL-finetuned models (which exhibit performance plateaus) to refine their responses using critiques. Specifically, we assess whether these models can generate correct refinements on persistently failed problems by leveraging critiques. The results in Section 3 demonstrate that RL-finetuned models exhibit effective refinements when provided with chain-of-thought (CoT) critiques [17, 11], which offer a step-by-step analysis of whether the generated response is correct or not. Building on this, we propose Critique-GRPO, a novel framework that enables LLMs to learn from both natural language feedback (NLF) and numerical feedback during online RL for effective policy optimization. As illustrated in Figure 1(a), Critique-GRPO allows the model to learn from both initial sampled responses and their refinements using critiques generated by a reasoning-based reward model (highlighted in green). This approach encourages the model to integrate targeted refinements while preserving exploration. Additionally, a shaping function [18] is applied to enhance learning from valid refinements and heavily penalize failed refinements, which often contain unresolved errors.
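For concreteness, here is a minimal sketch of one Critique-GRPO step as described above, assuming hypothetical callables `policy_sample` (the policy), `critic` (the reasoning-based reward model producing CoT critiques), `refine` (the policy conditioned on a critique), and `verify` (a correctness checker); the shaping coefficients are illustrative assumptions rather than the paper's exact shaping function [18].

```python
def shaping(advantage: float, is_refinement: bool, solved: bool) -> float:
    # Illustrative shaping: amplify the signal from valid refinements and
    # heavily penalize failed ones (a failed refinement's advantage is
    # typically negative, so scaling it up increases the penalty).
    # The coefficients 1.5 and 2.0 are assumptions, not from the paper.
    if not is_refinement:
        return advantage
    return 1.5 * advantage if solved else 2.0 * advantage

def critique_grpo_step(question, answer,
                       policy_sample, critic, refine, verify, G=4):
    # 1) Sample a group of initial responses from the current policy.
    responses = [policy_sample(question) for _ in range(G)]
    # 2) A reasoning-based reward model produces a CoT critique per response.
    critiques = [critic(question, r) for r in responses]
    # 3) The policy refines each response conditioned on its critique.
    refinements = [refine(question, r, c) for r, c in zip(responses, critiques)]
    # 4) Score the union of responses and refinements with numerical
    #    feedback and compute group-relative advantages, as in GRPO [16].
    pool = responses + refinements
    rewards = [1.0 if verify(p, answer) else 0.0 for p in pool]
    mu = sum(rewards) / len(pool)
    sd = (sum((r - mu) ** 2 for r in rewards) / len(pool)) ** 0.5 + 1e-6
    advantages = [(r - mu) / sd for r in rewards]
    # 5) Shape refinement advantages only; initial responses pass through.
    shaped = [
        shaping(a, is_refinement=(i >= G), solved=(rewards[i] > 0.5))
        for i, a in enumerate(advantages)
    ]
    # The policy-gradient update would weight each sequence's
    # log-probability by its shaped advantage (update omitted here).
    return pool, shaped
```

Because failed refinements receive an amplified negative weight, the policy is discouraged from superficially echoing a critique without resolving the underlying errors, while successful refinements contribute a strengthened learning signal alongside the ordinary on-policy samples.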