LLMs Can Easily Learn to Reason from Demonstrations: Structure, Not Content, Is What Matters!
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements needed to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k Long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%. More importantly, we find that the structure of the Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers achieves only 3.2% lower accuracy than one trained on fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models.
We first show that, surprisingly, an LLM can be cheaply and easily taught to produce Long CoT responses, significantly improving its reasoning capabilities.
Even further, the model can achieve o1-preview performance by updating fewer than 5% of its parameters with LoRA fine-tuning (Hu et al., 2021). We show that the model successfully learns to reflect on and revise its intermediate thoughts (e.g., frequently using reasoning keywords such as “Alternatively” and “Wait, but”) and adopts long, coherent CoTs to tackle challenging problems (Fig. 1).
Moreover, we identify the structure of the Long CoT, rather than the specific content of individual reasoning steps, as the key characteristic of distilled data for eliciting strong reasoning performance. To test this, we conduct two sets of controlled studies that alter either the content of individual reasoning steps or the overall logical structure. To alter content, we perturb samples by replacing numbers with random digits or deleting reasoning keywords.
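The content perturbations above can be sketched as simple text transformations. This is a minimal illustration, not the paper's actual pipeline: the keyword list and the 50% digit-replacement probability are assumptions taken from the surrounding description.

```python
import random
import re

# Hypothetical keyword list for illustration; the exact set of reasoning
# keywords used in the experiments is an assumption here.
REASONING_KEYWORDS = ["Alternatively,", "Wait, but", "Let me double-check."]

def perturb_digits(text: str, prob: float = 0.5, seed: int = 0) -> str:
    """Replace each digit with a random digit with probability `prob`,
    corrupting numbers in math derivations while keeping structure intact."""
    rng = random.Random(seed)
    return re.sub(
        r"\d",
        lambda m: str(rng.randint(0, 9)) if rng.random() < prob else m.group(0),
        text,
    )

def delete_keywords(text: str) -> str:
    """Remove reflection/backtracking keywords from a Long CoT response."""
    for kw in REASONING_KEYWORDS:
        text = text.replace(kw, "")
    return text
```

Note that `perturb_digits` changes only what a number says, never where it appears, so the step-by-step shape of the derivation survives the perturbation.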
Surprisingly, we find that these perturbations have little impact on model performance: even when 50% of the numbers in training samples are randomly changed, the model's accuracy drops by only 3.3% on the most challenging math benchmark, AIME 2024, compared to training with correct samples. To alter the global reasoning structure, we separate responses into reasoning steps and randomly shuffle, insert, or delete these steps. We observe that the trained model is far more sensitive to structural perturbations that break the logical coherence of the Long CoT. For example, when 67% of the reasoning steps in training samples are shuffled, accuracy drops by 13.3% on AIME 2024 relative to training with correct samples.
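A structural perturbation along these lines can be sketched as follows. The segmentation rule (paragraph breaks as step boundaries) and the in-place partial shuffle are assumptions made for illustration; the experiments may segment and permute steps differently.

```python
import random

def split_into_steps(response: str) -> list:
    """Split a Long CoT response into reasoning steps, assuming blank-line
    paragraph breaks mark step boundaries (an illustrative choice)."""
    return [s for s in response.split("\n\n") if s.strip()]

def shuffle_steps(steps: list, frac: float = 0.67, seed: int = 0) -> list:
    """Randomly permute a fraction `frac` of the reasoning steps among
    themselves, leaving the remaining steps in their original positions."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(range(len(steps)), int(round(frac * len(steps)))))
    permuted = chosen[:]
    rng.shuffle(permuted)
    out = steps[:]
    for src, dst in zip(chosen, permuted):
        out[dst] = steps[src]  # move each selected step to a shuffled slot
    return out
```

Because shuffling reorders whole steps, it preserves every token of local content while destroying the logical flow between steps, which is exactly the contrast these studies isolate.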
- Related Work
Test-Time Scaling for Large Language Models. Scaling test-time compute has proven effective in enhancing the reasoning capabilities of LLMs. Approaches can be broadly categorized into two directions: single long CoT and repeatedly sampled CoT. The former trains models, such as OpenAI o1, DeepSeek R1, and Qwen QwQ, to generate individual, long CoT responses with in-context reflection and backtracking to handle complex reasoning tasks (Guo et al., 2025; Jaech et al., 2024; Team, 2024). Alternatively, repeated sampling methods, such as Best-of-N or search-guided generation (e.g., MCTS), improve reasoning performance by sampling multiple responses from the model, sometimes guided by search algorithms and reward models (Snell et al., 2024; Brown et al., 2024). In this paper, we focus on distilling the ability to generate individual Long CoTs, and show it can be done in a data- and parameter-efficient manner.
Training to Improve Reasoning Capabilities of LLMs. LLM reasoning capabilities can be improved by approaches such as iterative self-improvement and reinforcement learning (RL) (Zelikman et al., 2022; Lightman et al., 2023; Lambert et al., 2024; Yuan et al., 2024; Guo et al., 2025). More recently, Tulu-3 introduced Reinforcement Learning with Verifiable Rewards (RLVR) to improve performance on tasks such as math and coding (Hendrycks et al., 2021c; Jain et al., 2024; LI et al., 2024). PRIME proposes an RL-based method that requires no process labels (Yuan et al., 2024). The recent release of DeepSeek R1 (Guo et al., 2025) demonstrates that LLMs can learn to produce long CoT and improve reasoning through a pure RL-based approach. Instead of bootstrapping reasoning ability, this paper focuses on the surprising data- and parameter-efficiency of distilling reasoning abilities from an existing reasoning model into an LLM.
- Long CoT: Structure Is The Key
Motivated by the observation that fine-tuning with a small number of samples can significantly enhance model reasoning performance, we investigate the key factors driving this improvement. Specifically, we explore the contributions of two dimensions to the learning process:
(1) The local content within a reasoning step, including the correctness of the final answer, the numbers in math derivations, and the use of reasoning keywords.
(2) The global reasoning structure, including reflection, self-validation, and backtracking across multiple reasoning steps to form a logically coherent Long CoT.
To understand their impact, we conduct two studies: (1) we perturb the content within individual reasoning steps, such as the final answer, numerical digits, and reasoning keywords (§4.1), and (2) we modify the global reasoning structure by inserting, deleting, and shuffling reasoning steps (§4.2). We compare the performance of models trained on perturbed samples against both the base Qwen2.5-32B-Instruct model (i.e., Original) and a model trained on correct, unperturbed samples (i.e., Correct), as shown in Tab. 2. Our findings show that the learning process is highly sensitive to modifications of the global reasoning structure, but remarkably tolerant of errors in the local content.