Soft Tokens, Hard Truths
Large Language Models (LLMs) have achieved impressive success across a wide range of reasoning tasks, particularly when enhanced with Chain-of-Thought (CoT) prompting, where models generate intermediate “thinking tokens” before producing final answers. While effective, standard CoT is constrained by the discreteness of language tokens: each intermediate step must be sampled sequentially, which can limit expressivity and hinder exploration of diverse reasoning paths. This contrasts sharply with human cognition, which often operates over abstract and fluid concepts rather than rigid linguistic symbols. Motivated by this gap, recent work has explored enabling LLMs to reason in continuous concept spaces, a direction often termed “continuous CoTs” (Hao et al., 2024) or “Soft Thinking” (Zhang et al., 2025).
From a theoretical perspective, continuous reasoning offers significant potential. Reasoning by Superposition (Zhu et al., 2025a) shows that continuous thought vectors can act as superposition states, encoding multiple search frontiers in parallel and enabling efficient breadth-first reasoning. This construction allows a shallow transformer to solve problems such as directed-graph reachability far more efficiently than discrete CoT, which is forced into sequential exploration and risks becoming trapped in local solutions. Complementing this theoretical view, Soft Thinking (Zhang et al., 2025) proposes replacing discrete ("hard") tokens with concept tokens, probability-weighted mixtures of token embeddings that retain the full distributional information of the next-token prediction. This enables the model to implicitly follow multiple reasoning paths simultaneously, yielding empirical improvements in both accuracy and token efficiency.
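A concept token of the kind described above can be sketched in a few lines: take the model's next-token logits, turn them into a probability distribution, and form the convex combination of the embedding rows. The function name, temperature parameter, and toy sizes below are illustrative, not the implementation of any of the cited works.

```python
import numpy as np

def concept_token(logits, embedding_matrix, temperature=1.0):
    """Probability-weighted mixture of token embeddings (a "concept token").

    Illustrative sketch: `logits` is a next-token logit vector over the
    vocabulary, and `embedding_matrix` holds one embedding row per token.
    """
    z = logits / temperature
    z = z - z.max()                      # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()  # softmax over the vocabulary
    return probs @ embedding_matrix      # convex combination of embeddings

# Toy example: vocabulary of 4 tokens, embedding dimension 3.
rng = np.random.default_rng(0)
E = rng.normal(size=(4, 3))
logits = np.array([2.0, 1.0, 0.0, -1.0])
soft = concept_token(logits, E)
print(soft.shape)  # (3,)
```

Note the two limiting cases: with uniform logits the concept token is the mean of all embeddings, and with a single dominant logit it collapses to that token's embedding, i.e. to discrete decoding.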
Despite these promising claims, the practical benefits of applying continuous reasoning at inference time on top of discrete-token base models remain contested. In particular, Wu et al. (2025) critically re-examine Soft Thinking and find that vanilla implementations often underperform their discrete counterparts. Their analysis suggests that, when given soft inputs, LLMs default to relying on the single highest-probability token, effectively reducing Soft Thinking to greedy decoding. Moreover, existing soft-thinking methods are restricted to inference-time use on models trained with discrete CoTs.
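The collapse to greedy decoding has a simple geometric reading: models trained on discrete tokens tend to produce sharply peaked next-token distributions, and under a peaked distribution the probability-weighted mixture is nearly identical to the embedding of the argmax token. The sketch below illustrates this with a random toy embedding table; all names and sizes are hypothetical.

```python
import numpy as np

# Toy illustration of the failure mode reported by Wu et al. (2025): with a
# sharply peaked next-token distribution, the soft (mixture) input is almost
# indistinguishable from the embedding of the single argmax token.
rng = np.random.default_rng(1)
E = rng.normal(size=(1000, 64))     # toy embedding table: 1000 tokens, dim 64

def soft_input(logits):
    """Probability-weighted mixture of embedding rows (softmax over logits)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p @ E

logits = rng.normal(size=1000)
logits[42] += 20.0                  # one strongly dominant logit
soft = soft_input(logits)
hard = E[logits.argmax()]           # the greedy (argmax) token's embedding

cos = soft @ hard / (np.linalg.norm(soft) * np.linalg.norm(hard))
print(round(cos, 3))                # cosine similarity close to 1.0
```

In this regime the soft input carries essentially no extra information beyond the greedy token, which is consistent with vanilla Soft Thinking behaving like greedy decoding on models trained only with discrete CoTs.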
Training continuous-token reasoning models has proven difficult, either because of the computational cost of full backpropagation through all steps of continuous reasoning (which limited the CoT to 6 steps in Hao et al. (2024)), or because the continuous reasoning must be strongly grounded in ground-truth discrete reasoning traces (Shen et al., 2025). This is why several of the works above restrict themselves to applying continuous reasoning at inference time, without training (Zhang et al., 2025; Wu et al., 2025).
In this work, we address these limitations by developing an approach that reinforces continuous CoTs with controlled noise, making them amenable to reinforcement learning (RL) training. We theoretically outline two types of continuous-CoT learning, with soft and fuzzy tokens (see Figure 1), and provide extensive empirical evidence: Llama-3.x and Qwen-2.5 models are trained on several mathematical datasets (GSM8K, MATH, DeepScaleR) and evaluated on a variety of mathematical and out-of-domain benchmarks. Our contributions and findings are as follows: