Beyond Scaling Law: A Data-Efficient Distillation Framework for Reasoning

Paper · arXiv 2508.09883 · Published August 13, 2025
Tags: RLVR · Training · Fine Tuning · Reinforcement Learning

Large language models (LLMs) demonstrate remarkable reasoning capabilities in tasks such as algorithmic coding and mathematical problem-solving. Recent methods have improved reasoning through expanded corpora and multistage training that combines reinforcement learning and supervised fine-tuning. Although some methods suggest that a small but targeted dataset can incentivize reasoning through distillation alone, a reasoning scaling law is still taking shape, driving up computational costs. To address this, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation. Inspired by the on-policy learning and diverse roll-out strategies of reinforcement learning, the key idea of our approach is threefold: (1) We identify that benchmark scores alone do not determine an effective teacher model; through comprehensive comparisons of leading reasoning LLMs, we develop a method to select an optimal teacher. (2) While scaling distillation can enhance reasoning, it often degrades out-of-domain performance; a carefully curated, smaller corpus achieves a balanced trade-off between in-domain and out-of-domain capabilities. (3) Diverse reasoning trajectories encourage the student model to develop robust reasoning skills.

In recent months, large language models (LLMs) with Chain-of-Thought (CoT) reasoning (Wei et al. 2023) have emerged as one of the most promising pathways toward Artificial General Intelligence (AGI). In the pursuit of emergent reasoning capabilities, two key techniques (DeepSeek-AI et al. 2025) have attracted the most attention from the community: Reinforcement Learning with Verifiable Reward (RLVR) and Supervised Fine-tuning (SFT) on distilled reasoning trajectories. DeepSeek-R1-Zero (DeepSeek-AI et al. 2025) showed that large-scale RLVR, together with the proposed Group Relative Policy Optimization (GRPO) (Shao et al. 2024), can incentivize reasoning ability from scratch. Meanwhile, the open-sourced DeepSeek-R1-Distill-Qwen series and related studies (NovaSky Team 2025; Bespoke Labs 2025; Muennighoff et al. 2025; Ye et al. 2025) showed that distillation can be more practical than RLVR when a powerful reasoning LLM is available as the teacher. Building on this, recent distillation methods foster the reasoning of LLMs in two directions: (1) enlarging the distilled CoT corpus, combined with one-stage selection; and (2) multistage training with iterative RLVR and distillation (He et al. 2025; Wen et al. 2025). A potential scaling law takes shape, as illustrated in Figure 1. To lift the scaling curve, researchers have proposed several metrics for filtering high-quality examples from the distilled CoT corpus (Muennighoff et al. 2025), including difficulty, token length, and domain diversity. These achievements raise a natural question: can we break free of this potential scaling law? Specifically, can we push the boundary of reasoning ability with limited examples? Inspired by these improvements and systematic analyses, we propose a data-efficient distillation framework (DED) that optimizes the Pareto frontier of reasoning distillation.
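To make the corpus-filtering idea above concrete, here is a minimal sketch, not the paper's released code, of selecting distilled CoT examples by the signals mentioned: difficulty, token length, and domain diversity. The field names (`cot`, `domain`, `teacher_pass_rate`) and all thresholds are illustrative assumptions.

```python
# Minimal sketch (assumed data schema, not the paper's pipeline): filter a
# distilled CoT corpus by token length, difficulty, and per-domain quota.

def filter_cot_corpus(examples, max_per_domain=500,
                      min_tokens=256, max_tokens=8192,
                      max_pass_rate=0.75):
    """Keep long-enough, hard-enough traces, capped per domain for diversity."""
    kept, per_domain = [], {}
    for ex in examples:
        n_tokens = len(ex["cot"].split())                    # crude length proxy
        too_easy = ex["teacher_pass_rate"] > max_pass_rate   # difficulty proxy
        if not (min_tokens <= n_tokens <= max_tokens) or too_easy:
            continue
        domain = ex["domain"]
        if per_domain.get(domain, 0) >= max_per_domain:      # preserve domain balance
            continue
        per_domain[domain] = per_domain.get(domain, 0) + 1
        kept.append(ex)
    return kept
```

Any such filter trades corpus size against coverage; the point of the sketch is only to show where difficulty, length, and domain signals enter the selection.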

Our framework integrates three key innovations: (1) a practical strategy for selecting the teacher model; (2) a practical compression of the number of questions to limit the damage to out-of-domain (OOD) capabilities; and (3) diverse problem-solving trajectories for each question. To demonstrate the effectiveness of our framework, we trained and open-sourced NTele-32B-V1, a state-of-the-art (SOTA) reasoning model at its parameter scale. We evaluate our model on the most commonly used reasoning benchmarks. As illustrated in Figure 1, our model greatly outperforms other baselines, especially considering the limited scale of its training corpus.
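As an illustration of innovation (3), and of how a compressed question set and diverse roll-outs fit together, the sketch below builds an SFT corpus by sampling several independent trajectories per question from a chosen teacher and keeping only those a verifier accepts. `teacher.generate` and `verify` are assumed placeholder interfaces, not APIs from the paper.

```python
# Minimal sketch under assumed interfaces: distill multiple diverse trajectories
# per question from a selected teacher model, keep verifier-approved ones, and
# format them as SFT examples for the student.

import random

def build_distillation_set(questions, teacher, verify,
                           rollouts_per_question=8, keep_per_question=4,
                           temperature=1.0):
    sft_examples = []
    for q in questions:
        # Diverse roll-outs: sample several independent CoT trajectories.
        candidates = [teacher.generate(q["prompt"], temperature=temperature)
                      for _ in range(rollouts_per_question)]
        # Keep only trajectories whose final answer the verifier accepts.
        correct = [c for c in candidates if verify(q, c)]
        # Retain a few distinct solutions per question rather than one,
        # so the student sees varied reasoning paths for the same problem.
        for cot in random.sample(correct, min(keep_per_question, len(correct))):
            sft_examples.append({"prompt": q["prompt"], "completion": cot})
    return sft_examples
```

The resulting examples would then be used for ordinary supervised fine-tuning of the student; the sketch omits teacher selection and corpus compression, which precede this step.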