Can continuous reasoning avoid forgetting in instruction-tuned models?
Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
Continuous-space reasoning methods like Coconut and Compressed CoT have shown promising results by replacing discrete token sequences with latent representations. However, these methods require full-model fine-tuning, and when applied to already-capable instruction-tuned models like LLaMA-3.1-8B-Instruct and Qwen2.5-7B-Instruct, performance drops below the zero-shot CoT baseline. The degradation is attributable to catastrophic forgetting: fine-tuning for continuous-space operation overwrites the strong reasoning capability these models already have.
This is an important practical finding because it reveals a gap between proof-of-concept (Coconut works on GPT-2) and deployment reality (Coconut's approach fails on the models people actually use). The capability that makes instruction-tuned models valuable is exactly what full fine-tuning compromises.
SoftCoT resolves this by architectural separation: freeze the backbone LLM entirely and delegate continuous thought generation to a small auxiliary assistant model. The assistant generates a sequence of "soft thought tokens" — last-layer hidden states conditioned on the task instruction and specific instance. These soft thoughts are mapped into the LLM's representation space via a trainable projection module, then prepended as instance-specific prompts.
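To make the data flow concrete, here is a minimal PyTorch sketch of that architecture, assuming Hugging Face-style causal LMs. The module names, dimensions, and the single linear projection are illustrative assumptions, not details taken from the paper:

```python
import torch
import torch.nn as nn

class SoftThoughtProjector(nn.Module):
    """Trainable projection from the assistant's hidden-state space into
    the frozen backbone's embedding space. A single linear layer is an
    assumption for illustration; the paper's exact module may differ."""

    def __init__(self, assistant_dim: int, backbone_dim: int):
        super().__init__()
        self.proj = nn.Linear(assistant_dim, backbone_dim)

    def forward(self, assistant_hidden: torch.Tensor) -> torch.Tensor:
        # assistant_hidden: (batch, num_soft_thoughts, assistant_dim)
        return self.proj(assistant_hidden)


def softcot_forward(backbone, assistant, projector, input_ids, assistant_ids):
    """One forward pass: the assistant emits soft thoughts as last-layer
    hidden states (never decoded to tokens), the projector maps them into
    the backbone's space, and they are prepended as an instance prompt."""
    with torch.no_grad():  # assistant is frozen; see training sketch below
        out = assistant(assistant_ids, output_hidden_states=True)
        soft_thoughts = out.hidden_states[-1]       # (B, T_a, assistant_dim)

    projected = projector(soft_thoughts)            # (B, T_a, backbone_dim)

    # Backbone stays frozen; soft thoughts act as instance-specific prompts.
    token_embeds = backbone.get_input_embeddings()(input_ids)
    inputs_embeds = torch.cat([projected, token_embeds], dim=1)
    return backbone(inputs_embeds=inputs_embeds)
```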
The design draws on two established ideas. From prompt tuning: the soft thoughts function as learned instance-adaptive prompts that tailor the LLM's behavior per problem. From speculative decoding: a small model generates proposals that a large model consumes. The projection module bridges the representational gap between assistant and backbone, and training this module for each task is equivalent to soft prompt tuning.
By staying in the latent space (using hidden states rather than decoded tokens from the assistant), SoftCoT avoids the information loss inherent in autoregressive decoding while preserving the backbone's pre-trained knowledge completely.
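Continuing the sketch above, training then reduces to optimizing only the projection module while both LLMs stay frozen, which is what makes the procedure equivalent to soft prompt tuning. The dataloader, label masking, and learning rate below are hypothetical:

```python
def lm_loss(logits, labels):
    # Standard next-token loss; -100 masks positions that should not be
    # trained on (the prepended soft-thought slots and question tokens).
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return nn.functional.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )

# Freeze both LLMs: the backbone keeps its pre-trained knowledge intact,
# and only the small projector receives gradient updates.
for p in backbone.parameters():
    p.requires_grad_(False)
for p in assistant.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

for batch in dataloader:  # hypothetical task-specific dataloader
    optimizer.zero_grad()
    outputs = softcot_forward(
        backbone, assistant, projector,
        batch["input_ids"], batch["assistant_ids"],
    )
    # labels are assumed left-padded with -100 over the soft-thought
    # positions so their length lines up with outputs.logits.
    loss = lm_loss(outputs.logits, batch["labels"])
    loss.backward()  # gradients flow only into the projector
    optimizer.step()
```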
The contrast with "Can we explore multiple reasoning paths without committing to one token?" is instructive: Soft Thinking is training-free and operates within a single model by modifying inference. SoftCoT requires training the projection module that couples assistant and backbone, but achieves cross-model continuous reasoning: the assistant can be small and cheap while the backbone remains frozen and capable. They address different deployment scenarios: Soft Thinking for zero-cost enhancement, SoftCoT for task-specific optimization without risking the backbone.
The forgetting finding also validates the architectural choice in "Can models reason without generating visible thinking tokens?": Coconut's continuous-thought approach works when training from scratch but fails as a retrofit to existing capable models. This suggests the field needs both training-time latent reasoning architectures (for new models) and inference-time or frozen-backbone approaches (for enhancing existing models).
Source: SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs (paper)
Related concepts in this collection
- Can we explore multiple reasoning paths without committing to one token?
  Standard language models pick one token at each step, collapsing uncertainty and forcing single reasoning trajectories. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
  Relation: complementary approaches (training-free single-model vs. trained cross-model); SoftCoT's forgetting finding validates Soft Thinking's design.
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  Relation: Coconut works when trained from scratch but fails as a retrofit; SoftCoT provides the retrofit-safe alternative.
- Does LLM forgetting mean knowledge loss or alignment loss?
  When language models lose performance on old tasks after learning new ones, is the underlying knowledge actually erased, or does the model simply lose its ability to apply it? Understanding this distinction could reshape how we think about AI safety and continual learning.
  Relation: SoftCoT's catastrophic forgetting finding is the genuine version; full fine-tuning for continuous reasoning destroys capability that cannot be trivially recovered, unlike spurious task-alignment loss.
- Can latent thought vectors scale language models beyond parameters?
  Explores whether explicit latent thought vectors with dual-rate learning create new scaling dimensions independent of model size. This matters because it suggests alternatives to simply building larger models.
  Relation: LTMs train from scratch with latent vectors; SoftCoT retrofits latent reasoning onto existing models via a frozen backbone plus assistant; different solutions to the same goal.
Original note title: SoftCoT preserves frozen LLM reasoning by delegating continuous thought generation to a lightweight assistant model, avoiding catastrophic forgetting from full continuous-space training.