Thinkless: LLM Learns When to Think

Paper · arXiv 2505.13379 · Published May 19, 2025

Reasoning Language Models, capable of extended chain-of-thought reasoning, have demonstrated remarkable performance on tasks requiring complex logical inference. However, applying elaborate reasoning to all queries often results in substantial computational inefficiencies, particularly when many problems admit straightforward solutions. This motivates an open question: Can LLMs learn when to think? To answer this, we propose Thinkless, a learnable framework that empowers an LLM to adaptively select between short-form and long-form reasoning, based on both task complexity and the model’s ability. Thinkless is trained under a reinforcement learning paradigm and employs two control tokens, <short> for concise responses and <think> for detailed reasoning. At the core of our method is a Decoupled Group Relative Policy Optimization (DeGRPO) algorithm, which decomposes the learning objective of hybrid reasoning into two components: (1) a control token loss that governs the selection of the reasoning mode, and (2) a response loss that improves the accuracy of the generated answers. This decoupled formulation enables fine-grained control over the contributions of each objective, stabilizing training and effectively preventing the collapse observed in vanilla GRPO.

Despite promising results, a central challenge persists: determining when a model should engage in elaborate reasoning. Many existing approaches address this by incorporating manually designed heuristics, such as fixed computational budgets [1] or prompt-level control signals like “reasoning on/off” [4, 36]. However, these strategies inherently rely on human prior knowledge and may yield suboptimal or inappropriate control decisions. This underscores a fundamental open question: Can an LLM learn to decide when to think, guided by the complexity of the task and its own capability?

Motivated by this, we explore the fundamental form of hybrid reasoning, where the model is tasked with autonomously deciding whether to generate a short-form or long-form response based on the input query. This decision is guided by three core factors: (1) the complexity of the query, as simpler questions generally merit concise responses, while more intricate ones may necessitate extended reasoning; (2) the capability of the model, since more powerful models are better positioned to employ short reasoning without sacrificing accuracy, whereas less capable models may benefit from longer responses to preserve performance; and (3) the user’s tolerance for the trade-off between efficiency and accuracy, which determines the acceptable level of performance degradation when opting for shorter reasoning. Naturally, reinforcement learning [30, 12, 43] offers a framework to unify these factors, as it allows the model to learn from interactions that reflect both environmental feedback and user-defined preferences. Through iterative exploration and reward-driven updates, the model progressively acquires the ability to make autonomous, context-aware decisions about its reasoning strategy, balancing accuracy and efficiency in a dynamic and data-driven manner.

Building on these insights, we propose Thinkless, a reinforcement learning framework designed to train a hybrid reasoning model capable of selecting between short-form and long-form responses. As illustrated in Figure 3, Thinkless employs two control tokens, <think> and <short>, which are generated as the first token in the model’s output to signal the intended inference style.
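As a concrete illustration, the sketch below shows how a caller might read the reasoning mode off the first generated token. The checkpoint id and the literal token strings <short> / <think> are assumptions for illustration, not the released artifacts.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint id and token strings, for illustration only.
MODEL_ID = "thinkless-hybrid-1.5b"
CONTROL_TOKENS = {"<short>", "<think>"}

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "What is 17 * 23?"
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=512)
new_tokens = out[0, inputs["input_ids"].shape[1]:]

# The first generated token signals the reasoning mode; the rest is the answer.
mode = tok.decode(new_tokens[:1]).strip()
answer = tok.decode(new_tokens[1:], skip_special_tokens=True)
assert mode in CONTROL_TOKENS, f"unexpected control token: {mode}"
print(mode, answer)
```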

Distillation for Warm-up. In the warm-up phase, the model aligns its response style with the designated control tokens via a distillation process. Specifically, it learns to imitate the behavior of two expert models: a reasoning model and a standard instruction-following model, each conditioned on a specific control token (<think> or <short>). Additionally, the model is trained on paired long-form and short-form responses for each query, ensuring it can generate both styles with comparable likelihood. This initialization establishes a clear and robust mapping between control tokens and response formats, providing diverse outputs for subsequent reinforcement learning.

Reinforcement Learning with Decoupled GRPO. In the reinforcement learning phase, the model is optimized to select the appropriate inference mode based on performance feedback. A natural starting point for this task is the vanilla Group Relative Policy Optimization (GRPO) [12] framework. However, when applied to hybrid reasoning, vanilla GRPO treats all tokens, including the control token and the response tokens, uniformly. This introduces a critical imbalance: since the response often spans hundreds to thousands of tokens and the lengths of long and short responses vary significantly, the single control token may receive weak and biased gradient signals, ultimately leading to mode collapse in the early stages of training. To address this, we propose a tailored method for hybrid reasoning, termed Decoupled Group Relative Policy Optimization (DeGRPO). As illustrated in Figure 1, DeGRPO explicitly separates the hybrid reasoning objective into two components: (1) Mode Selection, which governs how quickly the policy adapts based on the model’s current accuracy; and (2) Accuracy Improvement, which refines the response content to improve answer correctness under the selected reasoning mode. These two components are inherently interdependent, and effective training requires carefully balancing the learning signals for the control token and the response tokens.
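To make the decoupling concrete, here is a minimal PyTorch sketch of a DeGRPO-style loss under stated assumptions: group-normalized advantages, a length-normalized response term, and a single balancing coefficient alpha whose placement and value are illustrative rather than taken from the paper.

```python
import torch

def degrpo_loss(logp_control, logp_response, mask, advantages, alpha=1.0):
    """DeGRPO-style objective: decoupled losses for control and response tokens.

    logp_control:  (G,)       log-probability of each rollout's control token
    logp_response: (G, T_max) per-token log-probabilities of the response
    mask:          (G, T_max) 1.0 for real response tokens, 0.0 for padding
    advantages:    (G,)       group-normalized rewards, A_i = (r_i - mean r) / std r
    alpha:         balancing coefficient between the two terms (assumed placement)
    """
    # (1) Mode Selection: every rollout contributes exactly one unit-weight
    #     control-token gradient, regardless of its response length.
    control_loss = -(advantages * logp_control).mean()

    # (2) Accuracy Improvement: length-normalized policy gradient over the
    #     response tokens, so long and short rollouts are weighted comparably.
    mean_logp = (logp_response * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    response_loss = -(advantages * mean_logp).mean()

    return alpha * control_loss + response_loss

# Toy usage: 4 rollouts of one query, responses padded to 8 tokens.
G, T = 4, 8
loss = degrpo_loss(
    torch.randn(G), torch.randn(G, T), torch.ones(G, T), torch.randn(G)
)
print(loss.item())
```

The key design choice this sketch captures is that the control token is pulled out of the length normalization, so its update magnitude no longer depends on how many response tokens happen to follow it.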

While prior work has largely focused on compressing reasoning paths to reduce token generation, an alternative path to efficiency is hybrid reasoning, which dynamically selects the appropriate inference behavior based on task complexity [2]. This approach allows models to flexibly alternate between short-form responses and long-chain reasoning as needed. Hybrid reasoning can be realized either through collaborative systems involving multiple models [28, 21] or within a single unified model [2, 36, 4]. In multi-model frameworks, routing mechanisms [28] or speculative decoding techniques [21] are commonly employed; for example, a lightweight model may generate a preliminary answer that a larger model verifies or refines.

After the distillation phase, the model can produce both long- and short-form answers. What it still lacks is a mechanism for deciding which reasoning mode suits a particular input $x$. To supply this capability, we frame mode selection as a reinforcement-learning problem and optimize a policy $\pi_\theta(c, a \mid x) = \pi_\theta(c \mid x)\,\pi_\theta(a \mid x, c)$, where the first token $c \in \mathcal{C} = \{\texttt{<short>}, \texttt{<think>}\}$ serves as a control token that determines the reasoning mode, and the subsequent tokens $(a_1, \ldots, a_T)$ constitute the generated response.
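For reference, the group-relative advantage used by GRPO-style methods can be computed as below. The binary correctness reward in the example is a common choice for verifiable math tasks and stands in for whatever reward shaping the method actually uses.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages over G rollouts of the same query,
    as in GRPO: A_i = (r_i - mean(r)) / (std(r) + eps)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a binary verifier reward (1.0 = correct final answer) over
# 4 rollouts sampled from pi(c | x) * pi(a | x, c).
rewards = torch.tensor([1.0, 0.0, 1.0, 1.0])
print(group_advantages(rewards))  # correct rollouts get positive advantage
```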

To construct long-short paired responses for the distillation phase, we utilize long-form data from open-source datasets [35, 10] generated by the DeepSeek-R1-671B model [12], which is well-suited for multi-step reasoning. The corresponding short-form answers are derived using Qwen2.5-Math-1.5B-Instruct [41], a compact instruction-tuned model optimized for concise mathematical responses. The hybrid model is directly fine-tuned on this paired dataset via supervised fine-tuning, enabling it to accommodate both long and short reasoning styles. The model is then further optimized using the Decoupled Group Relative Policy Optimization (DeGRPO) algorithm.
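A minimal sketch of assembling the paired SFT corpus might look like the following; the field names and the plain string concatenation of control tokens are assumptions about the data format, not the released pipeline.

```python
def build_paired_sft_corpus(queries, long_answers, short_answers):
    """Pair each query with both a long-form (R1-distilled) and a short-form
    (Qwen2.5-Math-Instruct) completion, each prefixed by its control token,
    so both styles appear with comparable frequency during fine-tuning."""
    examples = []
    for q, long_a, short_a in zip(queries, long_answers, short_answers):
        examples.append({"prompt": q, "completion": "<think>" + long_a})
        examples.append({"prompt": q, "completion": "<short>" + short_a})
    return examples

corpus = build_paired_sft_corpus(
    ["What is 2 + 2?"],
    ["Let us reason step by step... so the answer is 4."],
    ["4"],
)
print(len(corpus))  # two training examples per query, one per mode
```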

Mode Collapse in RL. To further analyze how the model learns a reasonable policy, we visualize the RL training process. Figure 3 (a) illustrates the mode collapse issue in standard GRPO, where the model develops an excessive preference for either long or short outputs during training. In conventional GRPO, the gradient on the control token is normalized by the total length of the response, which introduces an imbalance between long and short outputs. Specifically, long-chain samples, having more tokens, receive slower updates on the <think> token, while samples encouraging <short> dominate the updates. This imbalance causes the model to collapse rapidly into a single reasoning mode early in training.
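To make the scale of this imbalance concrete, consider the control token's share of a length-normalized sample loss (illustrative lengths, not figures from the paper):

```latex
% Under vanilla GRPO, the control token carries weight 1/(T_i + 1) of its
% rollout's loss, where T_i is the response length. For example:
\[
w_{\text{control}} = \frac{1}{T_i + 1}, \qquad
\underbrace{\tfrac{1}{1001}}_{\texttt{<think>},\ T_i = 1000}
\;\ll\;
\underbrace{\tfrac{1}{51}}_{\texttt{<short>},\ T_i = 50}
\]
```

Under these illustrative lengths, a short rollout updates the control token roughly twenty times faster than a long one, which is precisely the drift that DeGRPO's unit-weight control-token term is designed to remove.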

It can be observed that samples assigned to the short reasoning mode are typically simple arithmetic problems that do not require deep or complex reasoning. In contrast, questions routed to the thinking mode tend to be more complex, involving multiple conditions and concepts. Overall, the results reflect a well-calibrated policy that adapts reasoning depth based on task complexity.