LLM Reasoning and Architecture · Reinforcement Learning for LLMs

Can continuous reasoning avoid forgetting in instruction-tuned models?

Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?

Note · 2026-04-20 · sourced from Cognitive Models Latent

Continuous-space reasoning methods like Coconut and Compressed CoT have shown promising results by replacing discrete token sequences with latent representations. However, these methods require full-model fine-tuning — and when applied to already-capable instruction-tuned models like LLaMA-3.1-8B-Instruct and Qwen2.5-7B-Instruct, performance degrades below zero-shot CoT. The degradation is attributable to catastrophic forgetting: the models already have strong reasoning capability that fine-tuning for continuous-space operations destroys.

This is an important practical finding because it reveals a gap between proof-of-concept (Coconut works on GPT-2) and deployment reality (Coconut's approach fails on the models people actually use). The capability that makes instruction-tuned models valuable is exactly what full fine-tuning compromises.

SoftCoT resolves this by architectural separation: freeze the backbone LLM entirely and delegate continuous thought generation to a small auxiliary assistant model. The assistant generates a sequence of "soft thought tokens" — last-layer hidden states conditioned on the task instruction and specific instance. These soft thoughts are mapped into the LLM's representation space via a trainable projection module, then prepended as instance-specific prompts.
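The data flow above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the dimensions, the single-linear-layer projection, and the random stand-in values for the assistant's hidden states and the backbone's input embeddings are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical): small assistant, larger backbone.
D_ASSIST, D_BACKBONE = 64, 128
NUM_SOFT_THOUGHTS = 4

# Assistant's last-layer hidden states at the soft-thought positions,
# conditioned on the task instruction and instance (random stand-ins here).
soft_thoughts = rng.normal(size=(NUM_SOFT_THOUGHTS, D_ASSIST))

# Trainable projection module, sketched as a single linear map from the
# assistant's representation space into the backbone's embedding space.
W_proj = rng.normal(size=(D_ASSIST, D_BACKBONE)) * 0.02
b_proj = np.zeros(D_BACKBONE)
projected = soft_thoughts @ W_proj + b_proj  # (4, 128)

# Frozen backbone's embeddings for the actual input tokens (stand-ins).
input_embeds = rng.normal(size=(10, D_BACKBONE))

# Projected soft thoughts are prepended as an instance-specific soft prompt.
full_input = np.concatenate([projected, input_embeds], axis=0)
print(full_input.shape)  # (14, 128)
```

The backbone never sees assistant tokens, only continuous vectors already mapped into its own embedding space.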

The design draws on two established ideas. From prompt tuning: the soft thoughts function as learned instance-adaptive prompts that tailor the LLM's behavior per problem. From speculative decoding: a small model generates proposals that a large model consumes. The projection module bridges the representational gap between assistant and backbone, and training this module for each task is equivalent to soft prompt tuning.
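Because both the assistant and the backbone stay frozen, gradients reach only the projection parameters, so per-task training reduces to optimizing a small map, just as in soft prompt tuning. A hedged numpy sketch of that update loop, where a squared-error target is a stand-in for the backbone's actual loss signal and every dimension is made up:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes; note the backbone's parameters never appear here.
d_assist, d_backbone, n_thoughts = 8, 16, 3

soft_thoughts = rng.normal(size=(n_thoughts, d_assist))  # from frozen assistant
W = rng.normal(size=(d_assist, d_backbone)) * 0.1        # trainable projection
target = rng.normal(size=(n_thoughts, d_backbone))       # stand-in loss target

init_loss = float(np.mean((soft_thoughts @ W - target) ** 2))

lr = 0.1
for _ in range(200):
    pred = soft_thoughts @ W
    # Gradient of mean squared error w.r.t. W; only W is ever updated.
    grad_W = soft_thoughts.T @ (2.0 * (pred - target) / pred.size)
    W -= lr * grad_W

final_loss = float(np.mean((soft_thoughts @ W - target) ** 2))
print(final_loss < init_loss)  # True: training touched only the projection
```

The frozen-backbone guarantee falls out of the structure: no update rule ever references backbone weights, so catastrophic forgetting is impossible by construction.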

By staying in the latent space (using hidden states rather than decoded tokens from the assistant), SoftCoT avoids the information loss inherent in autoregressive decoding while preserving the backbone's pre-trained knowledge completely.

The contrast with "Can we explore multiple reasoning paths without committing to one token?" is instructive: Soft Thinking is training-free and operates within a single model by modifying inference. SoftCoT requires training the assistant and projection module but achieves cross-model continuous reasoning: the assistant can be small and cheap while the backbone remains frozen and capable. They address different deployment scenarios: Soft Thinking for zero-cost enhancement, SoftCoT for task-specific optimization without backbone risk.

The forgetting finding also validates the architectural choice in "Can models reason without generating visible thinking tokens?": Coconut's continuous thought approach works when training from scratch but fails as a retrofit to existing capable models. This suggests the field needs both training-time latent reasoning architectures (for new models) and inference-time or frozen-backbone approaches (for enhancing existing models).


Source: Cognitive Models Latent · Paper: SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs


SoftCoT preserves frozen LLM reasoning by delegating continuous thought generation to a lightweight assistant model — avoiding catastrophic forgetting from full continuous-space training