Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem
Recent studies have shown that even RL on a single problem (Wang et al., 2025a) can unleash the reasoning capabilities of powerful base LLMs. However, RL is not only expensive but also unstable: even one-shot RL requires hundreds of GPU hours. This raises a critical question: Is there a more efficient way to unleash the reasoning potential of these powerful base LLMs? In this work, we demonstrate that Critique Fine-Tuning (CFT) on only one problem can effectively unleash this potential. Our method constructs critique data by collecting diverse model-generated solutions to a single problem and using teacher LLMs to provide detailed critiques.
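To make the data-construction step concrete, the sketch below outlines one plausible implementation in Python. The helper names (`sample_solution`, `teacher_critique`), the prompt wording, and the candidate count are illustrative assumptions, not the paper's released code; the two stubs stand in for whatever inference API (e.g., an OpenAI-style client or a local serving endpoint) is actually used.

```python
# Hypothetical sketch of the one-problem critique-data pipeline described above.

def sample_solution(problem: str, temperature: float = 1.0) -> str:
    """Sample one candidate solution from the base/solver model (assumed helper)."""
    raise NotImplementedError

def teacher_critique(problem: str, solution: str) -> str:
    """Ask a stronger teacher LLM to critique the candidate solution (assumed helper)."""
    raise NotImplementedError

def build_cft_dataset(problem: str, num_candidates: int = 100) -> list[dict]:
    """Collect diverse solutions to ONE problem and pair each with a teacher critique."""
    dataset = []
    for _ in range(num_candidates):
        # High-temperature sampling encourages diverse solutions (and diverse errors).
        solution = sample_solution(problem, temperature=1.0)
        critique = teacher_critique(problem, solution)
        dataset.append({
            # Input seen by the student during fine-tuning: problem + candidate solution.
            "input": f"Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\nCritique:",
            # Supervision target: the teacher's detailed critique.
            "target": critique,
        })
    return dataset
```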
Among various post-training methods, reinforcement learning with verifiable rewards (RLVR) (Guo et al., 2025) has shown particular promise in enhancing reasoning ability by enabling models to learn through trial-and-error exploration (Zeng et al., 2025; Ma et al., 2025). Interestingly, recent studies reveal that even a single training example can significantly improve model performance through 1-shot RLVR (Wang et al., 2025a). These findings suggest that base models inherently possess substantial reasoning potential, which can be effectively unleashed with minimal and targeted training signals.
Recently, Critique Fine-Tuning (CFT) has emerged as a promising alternative (Wang et al., 2025b). By enabling models to learn from critiques of diverse incorrect solutions, CFT enhances the model's exposure to varied reasoning patterns and mitigates overfitting. Specifically, CFT introduces diversity by having teacher models critique a wide range of candidate answers to a given problem, exposing the LLM to multiple perspectives and error types and thereby more effectively unleashing its reasoning potential.
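Training on such data is ordinary supervised fine-tuning on (input, critique) pairs; the CFT-specific choice is that supervision falls on the critique tokens rather than on a reference solution. A minimal sketch, assuming a Hugging Face-style causal LM and the dataset format from the sketch above (the masking detail is our illustration of the standard setup, not the paper's exact recipe):

```python
import torch

def cft_loss(model, tokenizer, example: dict) -> torch.Tensor:
    """Next-token prediction loss, masked so only the critique tokens are supervised."""
    prompt_ids = tokenizer(example["input"], return_tensors="pt").input_ids
    target_ids = tokenizer(example["target"], return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    # -100 is the ignore index: no loss on the prompt (problem + candidate solution).
    labels[:, : prompt_ids.shape[1]] = -100
    return model(input_ids=input_ids, labels=labels).loss
```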
This leads to the question:
Can critiques from a single problem suffice to unleash LLMs' reasoning potential, achieving RLVR-level effectiveness at minimal cost?