Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Can models learn when to think versus respond quickly?

Can a single LLM learn to adaptively choose between extended reasoning and concise responses based on task complexity? This matters because it could optimize compute efficiency without sacrificing accuracy on hard problems.

Note · 2026-02-22 · sourced from Reasoning o1 o3 Search
How should we allocate compute budget at inference time?

The question "can LLMs learn when to think?" has a concrete answer. Thinkless trains a single model to adaptively select between extended chain-of-thought reasoning and concise direct responses, guided by three factors: task complexity, model capability, and the user's efficiency-accuracy tolerance.

The mechanism: the model emits one of two control tokens (<think> or <short>) as its first output token, signaling the reasoning mode. A distillation warm-up phase aligns each token with an expert's behavior — a reasoning model for <think>, a compact instruction model for <short>. Then RL optimizes the routing policy.
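A minimal sketch of the warm-up stage, assuming two expert callables (build_warmup_pairs, long_cot_expert, and short_expert are placeholder names, not the paper's API): each prompt enters the SFT set twice, once per mode, with the matching control token prepended to that expert's completion.

```python
# Hedged sketch of distillation warm-up data construction. The two expert
# functions stand in for the teachers: a reasoning model for <think> and a
# compact instruction model for <short>.
THINK, SHORT = "<think>", "<short>"

def build_warmup_pairs(prompts, long_cot_expert, short_expert):
    """Pair each prompt with both experts' completions, each behind its control token."""
    examples = []
    for p in prompts:
        examples.append({"prompt": p, "completion": THINK + long_cot_expert(p)})
        examples.append({"prompt": p, "completion": SHORT + short_expert(p)})
    return examples
```

Supervised fine-tuning on these pairs teaches the model both behaviors; the subsequent RL stage only has to learn which control token to emit first.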

The critical technical contribution is DeGRPO (Decoupled Group Relative Policy Optimization). Vanilla GRPO treats all tokens uniformly, but the control token is a single token while the response spans hundreds to thousands. Because GRPO normalizes each rollout's loss by its length, every token's gradient is scaled by 1/|o|: response tokens dominate the aggregate update, leaving the control token with a weak, biased signal, and a 200-token <short> rollout moves the control token roughly ten times harder than a 2,000-token <think> trace. The model rapidly collapses to one mode — typically <short>, since short samples update the control token faster.

DeGRPO separates two objectives: (1) mode selection — how quickly the policy adapts based on current accuracy, and (2) accuracy improvement — refining answer content within the selected mode. This decoupling stabilizes training and prevents the mode collapse observed in all vanilla GRPO experiments.
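As a concrete sketch of the decoupling, the simplified PyTorch loss below normalizes the control-token term and the response term separately, so mode selection and answer refinement each get a clean gradient. Full GRPO's ratio clipping and KL terms are omitted, and the balancing coefficient alpha (its placement and default value here) is illustrative rather than the paper's exact formulation.

```python
# Simplified DeGRPO-style loss: decouple the single control token from the
# response tokens so neither term's normalization drowns out the other.
# Omits GRPO's ratio clipping and KL penalty; `alpha` is illustrative.
import torch

def degrpo_loss(logps, advantages, lengths, alpha=0.001):
    """
    logps:      (G, T) per-token log-probs for a group of G rollouts;
                column 0 is the <think>/<short> control token
    advantages: (G,) group-relative advantages (rewards z-scored in the group)
    lengths:    (G,) response lengths in tokens, excluding the control token
    """
    # Mode-selection term: one token per rollout, normalized by group size
    # only, so every rollout votes on the routing decision with equal weight.
    mode_loss = -(logps[:, 0] * advantages).mean()

    # Accuracy term: per-rollout length normalization keeps long <think>
    # traces and short <short> answers on equal footing.
    mask = (torch.arange(logps.size(1) - 1)[None, :] < lengths[:, None]).float()
    per_rollout = (logps[:, 1:] * advantages[:, None] * mask).sum(1) / lengths.clamp(min=1)
    resp_loss = -per_rollout.mean()

    # A small `alpha` slows the mode-selection update relative to answer
    # refinement, preventing the routing policy from collapsing early.
    return alpha * mode_loss + resp_loss
```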

The result: the model self-calibrates. Simple arithmetic routes to <short>; problems combining multiple conditions and concepts route to <think>. The learned policy reflects a well-calibrated difficulty assessment without explicit difficulty labels during training.

This is the concrete instantiation of Does RL teach reasoning or just when to use it? RL doesn't teach the model to reason — it teaches it to recognize when reasoning is worth the compute. The capability comes from pre-training and distillation; RL manages the deployment decision. The design premise aligns with Do base models already contain hidden reasoning ability?: if reasoning capability is already latent, then what's needed is not more capability training but a routing mechanism, and the DeGRPO control token is exactly that mechanism.

The connection to Can we allocate inference compute based on prompt difficulty? is architecturally direct. Compute-optimal scaling proposes adaptive budget allocation as a principle. Thinkless implements it as a learned routing mechanism inside a single model.

Three-mode taxonomy with two knowledge boundaries (from Arxiv/Routers): The Fast, Slow, and Tool-augmented Thinking survey formalizes the decision space Thinkless operates in. Two knowledge boundaries define the taxonomy: (1) a fast/slow boundary separating intuitive from deliberative processes (System 1 vs System 2), and (2) an internal/external boundary distinguishing parameter-grounded reasoning from tool-augmented reasoning. This extends Thinkless's binary think/short routing to a three-mode decision: fast thinking (direct generation), slow thinking (CoT/self-reflection/verification), and tool-augmented thinking (calculators, code interpreters, search). Selection mechanisms are either implicit (learned end-to-end during post-training, no explicit control signal) or explicit (rule-based or model-based external routing). Thinkless is an implicit selector for the fast/slow boundary; extending it to the internal/external boundary would require a third mode for tool invocation decisions.
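To make the taxonomy concrete, here is a toy explicit (rule-based) selector over the three modes. The boundary predicates and the threshold are hypothetical illustrations of the survey's two knowledge boundaries, not a published algorithm.

```python
# Toy explicit router over the survey's three-mode taxonomy. The two boundary
# checks mirror the two knowledge boundaries; `difficulty` and
# `needs_external_knowledge` are assumed, caller-supplied predicates.
from enum import Enum
from typing import Callable

class Mode(Enum):
    FAST = "fast"   # direct generation (System 1)
    SLOW = "slow"   # CoT, self-reflection, verification (System 2)
    TOOL = "tool"   # calculator, code interpreter, search

def route(prompt: str,
          difficulty: Callable[[str], float],
          needs_external_knowledge: Callable[[str], bool],
          slow_threshold: float = 0.5) -> Mode:
    """Check the internal/external boundary first, then the fast/slow boundary."""
    if needs_external_knowledge(prompt):      # boundary 2: internal vs. external
        return Mode.TOOL
    if difficulty(prompt) > slow_threshold:   # boundary 1: fast vs. slow
        return Mode.SLOW
    return Mode.FAST
```

An implicit selector like Thinkless drops the first branch and learns the second end-to-end, emitting the decision as a control token instead of consulting an external rule.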


Source: Reasoning o1 o3 Search

Original note title: hybrid reasoning via decoupled RL learns when to engage extended thinking versus giving concise responses based on task complexity and model capability