Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst

Paper · arXiv 2505.14116 · Published May 20, 2025
Self Refinement · Self Consistency · Feedback

In this work, we introduce the Self-Reasoning Language Model (SRLM), in which the model itself synthesizes longer CoT data and iteratively improves its performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) that show how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model's initial performance but also yields more stable and consistent improvements in subsequent iterations.

However, the scarcity of such longer CoT data remains a significant obstacle to advancing inference-time scaling, especially for instructions without verifiable answers in general instruction-tuning data.

Several prior studies have explored approaches for obtaining better CoT tuning datasets. Most of them utilize either the LLM itself (Wang et al., 2023c; Huang et al., 2023; Wang et al., 2023a) or external reasoning models (Li et al., 2024a; Hu et al., 2024) to generate (or refine) new (or existing) responses in the instruction-tuning dataset, leading to substantial improvements on various downstream tasks. Despite their effectiveness, these methods still face several limitations. On the one hand, most of them focus on questions with verifiable answers, such as math and code (Huang et al., 2023; DeepSeek-AI et al., 2025), making them infeasible for general instruction-tuning datasets. On the other hand, another line of work typically assumes access to more powerful models to refine each sample iteratively (Xu et al., 2023; Li et al., 2024a). Such methods suffer from performance plateaus or even degradation (Ding et al., 2024) and are inherently constrained by the capability ceiling of the powerful model.

To this end, we present Self-Reasoning Language Models (SRLM), which can unfold their own reasoning rationales and iteratively optimize themselves, leading to enhanced overall capability. Specifically, we first create only a small amount of reasoning catalyst data, comprising demonstrations of how to enrich short CoT rationales into longer and more comprehensive ones augmented with various meta-reasoning skills. After incorporating the reasoning catalyst data into the original instruction-tuning data, the tuned model not only inherits the basic reasoning capabilities from the instruction-tuning dataset but also learns how to refine reasoning, resulting in a Self-Reasoning Language Model. Consequently, the SRLM can refine its own reasoning rationales at each iteration through reasoning expansion and selection. During this process, the model generates enriched reasoning rationale candidates for the same instructions in the original instruction-tuning dataset. These candidates are then filtered and selected using three proposed selectors, without any prior assumption about the instruction or answer. Finally, the newly selected instruction-tuning dataset is combined with the reasoning catalyst data to create the training data for the next iteration of SRLM, which is initialized from the same base model. A sketch of this loop follows.
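The following Python skeleton is our own illustration of the loop just described, not the authors' released code: `fine_tune`, `expand`, and `select` are hypothetical callables standing in for the training, reasoning-expansion, and selector stages.

```python
from typing import Any, Callable, Dict, List

Example = Dict[str, str]  # {"instruction": ..., "response": ...}

def srlm_iterate(
    fine_tune: Callable[[List[Example]], Any],     # trains the base model on a dataset
    expand: Callable[[Any, str, int], List[str]],  # samples k enriched rationale candidates
    select: Callable[[List[str]], str],            # applies the selectors to pick one
    sft_data: List[Example],
    catalyst_data: List[Example],
    iterations: int = 3,
    k: int = 4,
) -> Any:
    """Sketch of the SRLM loop: train, expand rationales, select, repeat."""
    current = list(sft_data)
    model = None
    for _ in range(iterations):
        # Each round is initialized from the same base model and trained on the
        # current instruction data plus the fixed reasoning-catalyst set.
        model = fine_tune(current + catalyst_data)
        # Reasoning expansion and selection rebuild the instruction-tuning set
        # from the model's own enriched rationales (no gold answers assumed).
        current = [
            {"instruction": ex["instruction"],
             "response": select(expand(model, ex["instruction"], k))}
            for ex in current
        ]
    return model
```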

As shown in Figure 2, our method consists of two main phases. In Phase 1, we leverage existing LLMs to acquire seed reasoning-augmentation data, which functions as a reasoning catalyst that teaches the model to enrich existing reasoning rationales (§3.2.1). In Phase 2, the model is fine-tuned on both the original instruction-tuning data and the reasoning catalyst data, and is then iteratively refined using self-generated rationales, resulting in self-improved models (§3.2.2).
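As a concrete illustration of Phase 1, the snippet below shows one way catalyst data could be constructed by prompting an existing LLM to unfold a short rationale into a longer one. The prompt wording and the `call_llm` helper are assumptions made for this sketch, not the paper's actual prompt or API.

```python
# Hedged sketch of Phase 1: ask an existing LLM to unfold a short CoT into a
# longer one, yielding one reasoning-catalyst demonstration. `call_llm` is a
# hypothetical stand-in for any text-completion client.
UNFOLD_PROMPT = (
    "Rewrite the answer below with a longer, more comprehensive reasoning chain. "
    "Make hidden steps explicit (decompose the problem, verify intermediate "
    "results) before restating the final answer.\n\n"
    "Question: {question}\nOriginal answer: {answer}\n\nExpanded answer:"
)

def make_catalyst_example(call_llm, question: str, short_answer: str) -> dict:
    longer = call_llm(UNFOLD_PROMPT.format(question=question, answer=short_answer))
    # The demonstration teaches the mapping (question, short CoT) -> long CoT,
    # which is what lets the tuned model later enrich its own rationales.
    return {
        "instruction": f"Unfold the reasoning for:\nQ: {question}\nA: {short_answer}",
        "response": longer,
    }
```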