Can we explore multiple reasoning paths without committing to one token?
Standard language models pick one token at each step, collapsing uncertainty and forcing a single reasoning trajectory. Could preserving the full probability distribution across token embeddings enable implicit parallel exploration instead?
Standard CoT commits to a single token at each step, collapsing the probability distribution. This forces a single reasoning trajectory, which can lead down incorrect paths, especially for problems with high uncertainty or multiple plausible directions. Soft Thinking takes a different approach: instead of selecting one token, it constructs a new embedding from the probability-weighted mixture of ALL token embeddings — a "concept token" that preserves the full next-token distribution.
Each concept token encapsulates multiple meanings from related discrete tokens, enabling smooth transitions in a continuous concept space rather than discrete jumps between fixed semantic points. The concept token naturally preserves a "superposition" of possible reasoning paths that are implicitly explored in parallel.
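In code, this amounts to taking the expectation of the input embedding table under the next-token distribution. A minimal PyTorch sketch (the function name and the temperature argument are illustrative, not the paper's exact implementation):

```python
import torch

def concept_token(logits: torch.Tensor, embedding_matrix: torch.Tensor,
                  temperature: float = 1.0) -> torch.Tensor:
    """Probability-weighted mixture of all token embeddings.

    logits:           [vocab_size] next-token logits from the LM head
    embedding_matrix: [vocab_size, d_model] input embedding table
    Returns a [d_model] continuous embedding that is fed back as the next
    input, instead of the embedding of a single sampled token.
    """
    probs = torch.softmax(logits / temperature, dim=-1)  # keep the full distribution
    return probs @ embedding_matrix                       # weighted sum over the vocabulary
```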
Two mechanisms make this work:
Continuous concept space. The probability-weighted interpolation across embeddings creates a space where nearby points represent related but distinct meanings. The model can express intermediate concepts that don't correspond to any single token — capturing abstract reasoning that falls between discrete words.
Cold Stop. The entropy of the output distribution is monitored at each step. When the model shows high confidence (low entropy) over several consecutive steps, reasoning terminates early. This prevents two problems: unnecessary computation when the model has already converged on an answer, and generation collapse (repetition) caused by out-of-distribution concept tokens that weren't seen during training.
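A minimal sketch of the Cold Stop check under the same assumptions; the entropy threshold and the patience (number of consecutive low-entropy steps) are hypothetical hyperparameters, not the paper's reported settings:

```python
import torch

def entropy(probs: torch.Tensor, eps: float = 1e-12) -> float:
    """Shannon entropy of a next-token distribution (natural log)."""
    return float(-(probs * (probs + eps).log()).sum())

def should_cold_stop(entropy_history: list[float],
                     threshold: float = 0.1, patience: int = 3) -> bool:
    """Terminate reasoning once the model has been confident (entropy below
    `threshold`) for `patience` consecutive steps."""
    if len(entropy_history) < patience:
        return False
    return all(h < threshold for h in entropy_history[-patience:])
```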
The empirical results validate both mechanisms: pass@1 accuracy improves by up to 2.48 points while reducing token usage by up to 22.4% compared to standard CoT. The efficiency gain comes from Cold Stop, while the accuracy gain comes from implicit parallel exploration.
The contrast with Coconut is instructive. "Can models reason without generating visible thinking tokens?" describes reasoning in continuous latent space, but Coconut requires training modifications. Soft Thinking achieves a similar effect, continuous-space reasoning with implicit path exploration, without any training: it changes only the inference procedure and can be applied to any existing model. This makes it complementary to "Why does parallel reasoning outperform single chain thinking?": Soft Thinking achieves parallelism within a single generation stream rather than through multiple independent samples.
SoftCoT validates the training-free design by showing the failure mode of the alternative. When capable instruction-tuned models (LLaMA3.1-8B-Instruct, Qwen2.5-7B-Instruct) are fine-tuned for continuous reasoning using Coconut/CCoT's language modeling objective, performance degrades below zero-shot CoT: catastrophic forgetting destroys the reasoning capability that makes these models useful. SoftCoT's solution (freeze the LLM and delegate continuous thought generation to a small assistant model with a trainable projection) is architecturally distinct from Soft Thinking but shares the same premise: don't modify the backbone. Where Soft Thinking modifies inference within one model, SoftCoT introduces a cross-model architecture for task-specific continuous reasoning. The forgetting finding is the strongest practical argument for training-free or frozen-backbone approaches to continuous-space reasoning. See "Can continuous reasoning avoid forgetting in instruction-tuned models?".
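For contrast, a minimal sketch of SoftCoT's frozen-backbone idea, assuming a HuggingFace-style backbone that accepts inputs_embeds; the class, argument names, and shapes are hypothetical and the assistant model is left abstract:

```python
import torch
import torch.nn as nn

class FrozenBackboneSoftCoT(nn.Module):
    """Sketch: a small trainable assistant produces continuous 'thought'
    vectors, a trainable linear projection maps them into the frozen LLM's
    embedding space, and the backbone itself receives no gradient updates."""

    def __init__(self, assistant: nn.Module, backbone: nn.Module,
                 d_assistant: int, d_backbone: int):
        super().__init__()
        self.assistant = assistant                      # small, trainable
        self.backbone = backbone                        # capable instruction-tuned LLM
        self.proj = nn.Linear(d_assistant, d_backbone)  # trainable bridge
        for p in self.backbone.parameters():            # freezing the backbone avoids
            p.requires_grad = False                     # the catastrophic forgetting above

    def forward(self, assistant_inputs, prompt_embeds):
        thoughts = self.assistant(assistant_inputs)     # [batch, k, d_assistant]
        soft_thoughts = self.proj(thoughts)             # [batch, k, d_backbone]
        # Prepend projected thoughts as extra input embeddings for the frozen LLM.
        inputs_embeds = torch.cat([soft_thoughts, prompt_embeds], dim=1)
        return self.backbone(inputs_embeds=inputs_embeds)
```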
Source: Cognitive Models Latent
Related concepts in this collection
- Why does parallel reasoning outperform single chain thinking?
  Does dividing a fixed token budget across multiple independent reasoning paths beat spending it all on one long chain? This explores how breadth and diversity in reasoning compare to depth.
  Soft Thinking achieves implicit parallelism within a single stream rather than across samples
- Can models reason without generating visible thinking tokens?
  Explores whether intermediate reasoning must be verbalized as text tokens, or if models can think in hidden continuous space. Challenges a foundational assumption about how language models scale their reasoning capabilities.
  Coconut requires training; Soft Thinking is training-free; both operate in continuous concept space
- Can minimal reasoning chains match full explanations?
  Does removing all explanatory text from chain-of-thought reasoning preserve accuracy? This tests whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained.
  CoD reduces tokens via brevity; Soft Thinking reduces tokens via Cold Stop; both challenge the "more tokens = better reasoning" assumption
- Does more thinking time always improve reasoning accuracy?
  Explores whether extending a model's thinking tokens linearly improves performance, or if there's a point beyond which additional reasoning becomes counterproductive.
  Cold Stop provides a principled mechanism for avoiding overthinking
- Can continuous reasoning avoid forgetting in instruction-tuned models?
  Full fine-tuning for continuous-space reasoning degrades performance in capable instruction-tuned models. Why does this happen, and can architectural changes prevent it?
  Validates training-free design: full fine-tuning for continuous reasoning causes catastrophic forgetting on capable models
Original note title: soft thinking generates continuous concept tokens that implicitly explore multiple reasoning paths in parallel without training