Train Long, Think Short: Curriculum Learning for Efficient Reasoning
Recent work on enhancing the reasoning abilities of large language models (LLMs) has introduced explicit length control as a means of constraining computational cost while preserving accuracy. However, existing approaches rely on fixed-length training budgets, which do not take advantage of the natural progression from exploration to compression during learning. In this work, we propose a curriculum learning strategy for length-controlled reasoning using Group Relative Policy Optimization (GRPO). Our method starts with generous token budgets and gradually tightens them over training, encouraging models to first discover effective solution strategies and then distill them into more concise reasoning traces. We augment GRPO with a reward function that balances three signals: task correctness (via verifier feedback), length efficiency, and formatting adherence (via structural tags). Experiments on GSM8K, MATH500, SVAMP, College Math, and GSM+ demonstrate that curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency.
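Concretely, the training signal can be pictured as a decaying token budget combined with a composite reward. The sketch below is a minimal illustration, not the exact implementation: the linear schedule, the weights `w_len` and `w_fmt`, and the helpers `verify_answer` and `has_valid_tags` are all assumptions introduced here for clarity.

```python
def verify_answer(response: str, answer: str) -> bool:
    # Hypothetical stand-in for the verifier: exact match on the final line.
    lines = response.strip().splitlines()
    return bool(lines) and lines[-1].strip() == answer.strip()

def has_valid_tags(response: str) -> bool:
    # Hypothetical structural check backing the formatting reward.
    return "<think>" in response and "</think>" in response

def token_budget(step: int, total_steps: int,
                 start_budget: int = 512, final_budget: int = 128) -> int:
    # Curriculum schedule: start generous, tighten linearly toward the final cap.
    frac = min(step / max(total_steps, 1), 1.0)
    return round(start_budget + frac * (final_budget - start_budget))

def reward(response: str, answer: str, n_tokens: int, budget: int,
           w_len: float = 0.5, w_fmt: float = 0.2) -> float:
    # Three signals: task correctness (verifier feedback), length efficiency
    # relative to the current budget, and formatting adherence via tags.
    correct = 1.0 if verify_answer(response, answer) else 0.0
    length = max(0.0, 1.0 - n_tokens / budget)  # shorter traces score higher
    fmt = 1.0 if has_valid_tags(response) else 0.0
    return correct + w_len * length + w_fmt * fmt
```

Early in training, `token_budget` leaves room for exploratory, verbose solutions; as it shrinks, the length term increasingly rewards traces that compress the same reasoning into fewer tokens.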
To date, two main paradigms have emerged for improving reasoning in LLMs. The first relies on supervised fine-tuning (SFT) on datasets with chain-of-thought (CoT) annotations, in which human experts provide intermediate reasoning steps. Although SFT is straightforward to implement, it depends on costly data collection and may generalize poorly beyond the training distribution. The second paradigm uses reinforcement learning (RL) to directly optimize model behavior through feedback on complete reasoning traces. RL-based methods require no explicit reasoning annotations, can learn from sparse rewards, and have achieved state-of-the-art performance in recent systems.
Within the RL category, Group Relative Policy Optimization (GRPO) has shown particular promise. GRPO fine-tunes LLMs without a separate value function by sampling a group of candidate responses per prompt and normalizing rewards across that group. This group-relative normalization stabilizes learning from sparse correctness signals and encourages the model to prefer responses that are strong relative to the rest of the sampled group.
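As a concrete illustration, the group-relative advantage in GRPO is computed by standardizing rewards within each group of sampled responses. The sketch below assumes one scalar reward per response and adds a small epsilon for numerical stability (an implementation detail assumed here, not prescribed by the method).

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per sampled response for a prompt.
    # Each response is scored relative to its own group: subtract the group
    # mean and divide by the group standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled responses for one prompt with a sparse 0/1
# correctness reward. Correct responses receive positive advantages,
# incorrect ones negative, with no learned value function as a baseline.
adv = group_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
```

These normalized advantages then weight a clipped, PPO-style policy-gradient update, which is how GRPO turns a sparse correctness signal into a stable per-response learning signal.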