Does gradually tightening token budgets beat fixed-budget training?
Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it bears directly on how to teach models to think effectively while remaining concise.
Existing length-control approaches for reasoning use fixed token budgets during training. Train Long Think Short proposes instead a curriculum: start with generous budgets and gradually tighten them. The intuition is that learning has two phases — exploration (discovering effective strategies) and compression (distilling strategies into concise traces) — and these phases have different budget needs.
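To make the schedule concrete, here is a minimal sketch of such a budget curriculum; the linear decay and the specific budgets (2048 tokens annealing to 512) are illustrative assumptions, not the paper's published schedule.

```python
# Minimal sketch of a "train long, think short" budget curriculum.
# The decay shape and budget values are assumptions for illustration.

def token_budget(step: int, total_steps: int,
                 initial_budget: int = 2048,
                 final_budget: int = 512) -> int:
    """Linearly anneal the per-response token budget over training."""
    frac = min(step / total_steps, 1.0)
    return int(initial_budget + frac * (final_budget - initial_budget))

# Early training leaves room for exploration; late training forces
# compression of the strategies that survived.
assert token_budget(0, 10_000) == 2048
assert token_budget(10_000, 10_000) == 512
```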
The reward function balances three signals: task correctness via verifier feedback, length efficiency, and formatting adherence via structural tags. The curriculum acts on the length-efficiency signal, making it progressively more demanding as training proceeds.
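A hedged sketch of how the three signals might combine is below; the weights, the over-budget penalty shape, and the function signature are assumptions for illustration, not the paper's exact reward.

```python
# Illustrative three-signal reward; weights and penalty form are assumed.

def reward(correct: bool, n_tokens: int, budget: int, well_formatted: bool,
           w_len: float = 0.3, w_fmt: float = 0.1) -> float:
    r_task = 1.0 if correct else 0.0               # verifier feedback
    # Length efficiency: penalize tokens beyond the current budget.
    # The curriculum shrinks `budget` (see token_budget above), so this
    # term grows progressively more demanding over training.
    r_len = -w_len * max(0.0, (n_tokens - budget) / budget)
    r_fmt = w_fmt if well_formatted else 0.0       # structural tags present
    return r_task + r_len + r_fmt
```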
Across five benchmarks (GSM8K, MATH500, SVAMP, College Math, GSM+), curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. The key is that the generous early phase allows the model to explore diverse solution strategies without being penalized for verbosity, then the tightening phase forces compression of only the strategies that actually work.
This connects to the broader overthinking cluster. Given the inverted-U relationship between chain-of-thought length and accuracy (see "Why does chain of thought accuracy eventually decline with length?"), the curriculum approach may naturally navigate that curve: the model finds the peak during exploration, then descends toward conciseness. And if minimal reasoning chains can match full explanations (see "Can minimal reasoning chains match full explanations?"), the compression phase is not sacrificing quality; it is removing the filler that makes no real progress.
The deeper principle is that exploration and exploitation require different resource allocations, and temporal scheduling of these allocations (generous first, tight later) outperforms any fixed compromise. This generalizes beyond token budgets to task ordering.
Cognitive science grounding: The CURIOUS algorithm (Colas et al.) provides the developmental robotics foundation for this principle. In open-ended environments, autonomous agents that bias attention toward goals maximizing absolute learning progress naturally self-organize a developmental curriculum — focusing sequentially on goals of increasing complexity, and importantly, refocusing on goals that are being forgotten. The robustness to distracting goals and changing body properties suggests the learning-progress signal is a more general curriculum principle than task difficulty alone. The connection to RL reasoning training: the "generous early, tight later" budget curriculum may succeed precisely because early generosity maximizes learning progress (many strategies discovered per token), while later tightening maximizes efficiency (compression without new discovery needed).
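The learning-progress signal is concrete enough to sketch. Below is an illustrative goal selector that tracks per-goal competence and samples the goal with the highest absolute learning progress; the window size, the epsilon-greedy mixing, and all names are assumptions, not the CURIOUS implementation.

```python
# Sketch of learning-progress-based goal selection in the spirit of
# CURIOUS (Colas et al.); details here are illustrative assumptions.
import random
from collections import deque

class LPGoalSelector:
    def __init__(self, n_goals: int, window: int = 50, eps: float = 0.2):
        self.history = [deque(maxlen=window) for _ in range(n_goals)]
        self.eps = eps

    def update(self, goal: int, success: float) -> None:
        self.history[goal].append(success)

    def learning_progress(self, goal: int) -> float:
        h = self.history[goal]
        if len(h) < 4:
            return 1.0  # optimistic init: unexplored goals get attention
        half = len(h) // 2
        older = sum(list(h)[:half]) / half
        recent = sum(list(h)[half:]) / (len(h) - half)
        # Absolute LP: rising AND falling competence both attract
        # attention, so goals being forgotten are refocused on.
        return abs(recent - older)

    def select(self) -> int:
        if random.random() < self.eps:  # keep some undirected exploration
            return random.randrange(len(self.history))
        lps = [self.learning_progress(g) for g in range(len(self.history))]
        return max(range(len(lps)), key=lambda g: lps[g])
```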
Backward transfer extends curriculum from temporal budgets to task ordering: Omni-Thinker's BWT-guided scheduling reveals that the dimension that matters for multi-task RL is not just how much compute each task gets, but which tasks come first. Structured domains (math, coding) decrease output entropy, while creative domains (writing, dialogue) increase it. Training structured tasks first and creative tasks second preserves both capabilities; training creative tasks first risks having structured training collapse the entropy that creative training expanded. The ordering effect is predictable from backward-transfer measurements, giving practitioners a principled scheduling criterion (see "Does training order reshape how models handle different task types?"). Together with the token-budget curriculum, this suggests RL training benefits from curriculum design along multiple dimensions simultaneously: budget generosity over time AND task-type ordering across the training run.
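As a rough illustration of how backward transfer can drive the ordering decision, the sketch below applies the standard continual-learning BWT definition to two hypothetical orderings; all numbers are invented for illustration.

```python
# Backward transfer (BWT) for task A after subsequently training on task B:
#   BWT(A) = acc(A after training A then B) - acc(A right after training A)
# Negative BWT means the later task degraded the earlier one.

def backward_transfer(acc_after_both: float, acc_after_first: float) -> float:
    return acc_after_both - acc_after_first

# Hypothetical measurements for two orderings of math (structured) and
# writing (creative):
bwt_math_then_writing = backward_transfer(0.78, 0.80)   # -0.02: mild loss
bwt_writing_then_math = backward_transfer(0.55, 0.74)   # -0.19: severe loss
# Schedule the ordering with the less negative BWT: structured tasks first.
```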
Source: Reinforcement Learning, Reward Models
Related concepts in this collection
- Does training order reshape how models handle different task types? Explores whether the sequence of multi-task RL training systematically affects model capabilities across structured and creative domains, and whether this ordering effect can be predicted and optimized. (extends curriculum from temporal budgets to task ordering: BWT-guided scheduling is a second curriculum dimension)
- Why does chain of thought accuracy eventually decline with length? Explores why longer reasoning chains don't always improve answers, and how the optimal length shifts based on task difficulty and model capability. (supports: curriculum navigates the inverted-U naturally)
- Can minimal reasoning chains match full explanations? Tests whether removing all explanatory text from chain-of-thought reasoning preserves accuracy, i.e., whether verbose intermediate steps are necessary for solving problems or just artifacts of how language models are trained. (validates: compression phase removes filler without losing quality)
- Can we reward reasoning steps without human annotation? Existing RL for reasoning uses only final-answer rewards, causing models to produce wastefully long chains; asks whether information theory can provide dense, automatic feedback for individual reasoning steps. (connects: curriculum approach may achieve similar efficiency gains through a simpler mechanism)
- Can language models improve themselves without any external training data? Explores whether two language models playing against each other, one generating questions and one solving them, can create a self-improving loop, which would eliminate dependence on human-labeled datasets. (SQLM's proposer-solver dynamic creates an emergent curriculum analogous to the budget curriculum: the proposer automatically calibrates problem difficulty to the solver's frontier, neither too easy nor too hard, producing the same explore-then-compress dynamic through adversarial generation rather than temporal budget scheduling)
- Can adaptive guidance from solution traces reduce reward sparsity in RL? When reinforcement learning struggles with hard problems due to sparse rewards and zero-advantage rollouts, asks whether providing partial solution traces as adaptive guidance helps the model learn more efficiently; matters because standard RL wastes compute on unsolvable problems. (GHPO operationalizes adaptive curriculum via a different lever: instead of tightening budget over time, it provides solution-trace guidance calibrated to problem difficulty, converting zero-advantage rollouts into learning signal)
- Can reinforcement learning optimize therapy dialogue in real time? Asks whether RL systems trained on working alliance scores can recommend therapy topics that improve clinical outcomes during live sessions, testing whether validated clinical constructs can serve as reward signals for dialogue optimization. (R2D2's three-level architecture, from backbone RL to content-enriched to personalized, mirrors the curriculum principle in a clinical domain: progressive specialization from general therapeutic strategies to disorder-specific to patient-personalized policies)
Original note title: curriculum budgets that start generous and gradually tighten outperform fixed-budget rl for reasoning efficiency