SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs
Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token decoding, which constrains reasoning within the discrete vocabulary space and may not always be optimal. While recent efforts explore continuous-space reasoning, they often require full-model fine-tuning and suffer from catastrophic forgetting, limiting their applicability to state-of-the-art LLMs that already perform well in zero-shot settings given a proper instruction. To address this challenge, we propose a novel approach for continuous-space reasoning that does not require modifying the underlying LLM. Specifically, we employ a lightweight fixed assistant model to speculatively generate instance-specific soft thought tokens as the initial chain of thoughts, which are then mapped into the LLM's representation space via a trainable projection module. Experimental results on five reasoning benchmarks demonstrate that our method enhances LLM reasoning performance through supervised, parameter-efficient fine-tuning. Source code is available at https://github.com/xuyige/SoftCoT.
In recent years, Large Language Models (LLMs) have become a cornerstone of Natural Language Processing (NLP), exhibiting advanced natural language understanding and generation (Brown et al., 2020; Du et al., 2022; Chowdhery et al., 2023; OpenAI, 2023; Touvron et al., 2023; Dubey et al., 2024; Yang et al., 2024). Scaling model sizes has not only improved instruction following (Kojima et al., 2022) but also triggered emergent reasoning abilities, as first evidenced by chain-of-thought (CoT) prompting (Wei et al., 2022). CoT prompts LLMs to generate intermediate reasoning steps before providing the final answer, which not only enhances interpretability but also improves performance on a range of reasoning-intensive tasks (Zhang et al., 2023; Sprague et al., 2024). CoT has inspired many advanced prompting frameworks, marking a paradigm shift from scaling training-time compute (Kojima et al., 2022) to scaling inference-time compute (Wang et al., 2023; Yao et al., 2023) to further boost LLM performance.
Nevertheless, CoT's effectiveness depends on the quality of the intermediate thoughts, as the autoregressive generation process can propagate errors. To mitigate this challenge, methods like self-consistency (Wang et al., 2023) generate multiple reasoning paths, while Tree-of-Thought (Yao et al., 2023) and Graph-of-Thought (Besta et al., 2024) frameworks organize these paths to select higher-quality steps. Despite these improvements, such methods are computationally inefficient due to the need for extensive thought sampling. To enhance CoT efficiency, recent research explores skipping the decoding of hard tokens at intermediate steps. Methods like Compressed CoT (Cheng and Van Durme, 2024) and Coconut (Hao et al., 2024) conduct reasoning in a continuous space by using latent representations instead of discrete token sequences. Their results show that a short sequence of continuous representations can outperform long discrete reasoning chains. Yet, these methods require full-model fine-tuning, which incurs substantial computational costs, risks catastrophic forgetting, and limits their transferability across tasks.
We empirically observed that fine-tuning LLaMA-3.1-8B (Dubey et al., 2024) for continuous-space reasoning using a language modeling objective (as employed by Coconut and CCoT) results in performance degradation compared to zero-shot CoT (Tables 2 and 3). Drawing on a widely accepted definition of catastrophic forgetting, i.e., the degradation of previously learned capabilities after fine-tuning on new data (Kalajdzievski, 2024; Lobo et al., 2024), we conjecture that this drop in reasoning performance is attributable to catastrophic forgetting. This phenomenon appears particularly pronounced in already capable instruction-tuned models such as LLaMA-3.1-8B-Instruct and Qwen2.5-7B-Instruct, which exhibit strong zero-shot CoT reasoning abilities. Thus, the methodology of Coconut, which is based on GPT-2 (Radford et al., 2019), may not be directly applicable to more recent models such as the LLaMA-3.1 and Qwen2.5 series. It is therefore crucial to explore alternative methodologies that mitigate catastrophic forgetting while effectively leveraging continuous reasoning techniques in large-scale, instruction-tuned models, which is the main research goal of this work. To the best of our knowledge, we are the first to systematically identify and address this forgetting issue.
To mitigate catastrophic forgetting, a straightforward approach is to freeze the backbone LLM and instead optimize an external model for reasoning. Inspired by prompt tuning (Lester et al., 2021) and speculative decoding (Leviathan et al., 2023), we propose to utilize an auxiliary small assistant model to generate a sequence of "thought" tokens conditioned on a task instruction followed by a specific instance (Li et al., 2023; Shao et al., 2023), as sketched below. These tokens serve as instance-specific prompts that adapt to different problems to boost the LLM's reasoning. Such an auxiliary prompting mechanism allows the LLM to achieve better generalization while preserving its pre-trained knowledge.
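As a minimal sketch of this idea (not the released SoftCoT code; the assistant model name, prompt wording, and generation settings are illustrative assumptions), a small frozen assistant can draft instance-specific thought tokens that later serve as an auxiliary prompt for the backbone LLM:

```python
# Illustrative sketch: a small, frozen assistant drafts instance-specific
# "thought" tokens; no assistant parameters are updated during training.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

assistant_name = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed lightweight assistant
tok = AutoTokenizer.from_pretrained(assistant_name)
assistant = AutoModelForCausalLM.from_pretrained(assistant_name)
assistant.eval()  # the assistant stays fixed

prompt = (
    "Draft brief reasoning hints for the problem below.\n"          # task instruction
    "Q: Tom has 3 apples and buys 5 more. How many does he have?"   # specific instance
)
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    draft = assistant.generate(**inputs, max_new_tokens=32, do_sample=False)

# Keep only the newly generated positions: these are the draft thought tokens.
thought_tokens = draft[0, inputs["input_ids"].shape[1]:]
print(tok.decode(thought_tokens, skip_special_tokens=True))
```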
To exploit continuous-space reasoning, we use the last-layer hidden states from the small assistant model as the "soft" thought tokens, rather than the discrete tokens obtained after vocabulary mapping. Staying in the latent space avoids the information loss inherent in autoregressive decoding. However, a representational gap between the assistant model and the LLM may hinder effective knowledge transfer. To bridge this gap, we train a projection module to map the soft thought tokens generated by the assistant model into the LLM's representation space. Training the projection module for each task can be seen as soft prompt tuning for the LLM. The overall Soft Chain-of-Thought (SoftCoT) reasoning framework is illustrated in Figure 1.
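A compact sketch of the resulting pipeline, under stated assumptions (a single linear layer standing in for the projection module, random tensors standing in for the assistant's hidden states and the embedded question, and illustrative hidden sizes), is shown below:

```python
# Illustrative sketch of SoftCoT-style soft thought injection; the linear
# projection, dimensions, and stand-in tensors are assumptions, not the
# authors' released implementation.
import torch
import torch.nn as nn

assistant_dim, llm_dim = 896, 4096  # assumed assistant / LLM hidden sizes
num_thoughts = 4                    # illustrative number of soft thought tokens

# The projection module is the only trainable component; both the assistant
# model and the backbone LLM remain frozen.
projection = nn.Linear(assistant_dim, llm_dim)

# Stand-in for the assistant's last-layer hidden states at the thought positions.
assistant_states = torch.randn(1, num_thoughts, assistant_dim)
soft_thoughts = projection(assistant_states)  # (1, num_thoughts, llm_dim)

# Stand-in for the backbone LLM's embedded question tokens; with HuggingFace
# models, the concatenation below can be fed via `inputs_embeds` instead of
# `input_ids`, so the soft thoughts act as an instance-specific soft prompt.
question_embeds = torch.randn(1, 20, llm_dim)
llm_inputs = torch.cat([soft_thoughts, question_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 24, 4096])

# Training backpropagates the answer's language-modeling loss through the
# frozen LLM into `projection` alone, i.e., per-task soft prompt tuning.
```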