Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does gradually tightening token budgets beat fixed budget training?

Can models learn reasoning more efficiently by starting with generous token allowances and progressively constraining them, rather than training with fixed budgets from the start? This matters because it addresses how to teach models to think effectively while remaining concise.

Note · 2026-02-22 · sourced from Reinforcement Learning
How should we allocate compute budget at inference time?

Existing length-control approaches for reasoning use fixed token budgets during training. Train Long, Think Short proposes a curriculum instead: start with generous budgets and gradually tighten them. The intuition is that learning has two phases, exploration (discovering effective strategies) and compression (distilling strategies into concise traces), and that these phases have different budget needs.
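The schedule itself can be sketched in a few lines. This is a minimal illustration of "generous first, tight later"; the linear anneal and the specific start/end budgets are my assumptions, not the paper's reported hyperparameters:

```python
def budget_at_step(step: int, total_steps: int,
                   start_budget: int = 2048, end_budget: int = 512) -> int:
    """Linearly anneal the token budget from generous to tight.

    Illustrative values: early training allows long exploratory traces
    (start_budget), and the allowance shrinks toward end_budget as
    training progresses, forcing compression of what was discovered.
    """
    frac = min(step / max(total_steps, 1), 1.0)  # progress in [0, 1]
    return round(start_budget + frac * (end_budget - start_budget))
```

Any monotone-decreasing schedule (cosine, stepwise) fits the same two-phase story; the essential property is only that early budgets permit verbose exploration and late budgets do not.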

The reward function balances three signals: task correctness via verifier feedback, length efficiency, and formatting adherence via structural tags. The curriculum aspect controls the length efficiency signal over training, becoming progressively more demanding.
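A hedged sketch of how the three signals might combine, with the curriculum entering through the current budget. The weights, the linear length bonus, and the fixed over-budget penalty are illustrative assumptions, not the paper's exact reward:

```python
def reward(correct: bool, n_tokens: int, budget: int,
           well_formatted: bool,
           w_len: float = 0.5, w_fmt: float = 0.1) -> float:
    """Combine verifier correctness, length efficiency relative to the
    current curriculum budget, and formatting adherence into one scalar."""
    r = 1.0 if correct else 0.0
    if n_tokens <= budget:
        # Shorter traces earn a larger share of the length bonus.
        r += w_len * (1.0 - n_tokens / budget)
    else:
        # Exceeding the (shrinking) budget is penalized.
        r -= w_len
    if well_formatted:
        r += w_fmt
    return r
```

Because `budget` shrinks over training, the same trace length that was rewarded early becomes penalized late, which is exactly how the curriculum makes the length signal "progressively more demanding" without changing the reward code itself.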

Across five benchmarks (GSM8K, MATH500, SVAMP, College Math, GSM+), curriculum-based training consistently outperforms fixed-budget baselines at the same final budget, achieving higher accuracy and significantly improved token efficiency. The key is that the generous early phase allows the model to explore diverse solution strategies without being penalized for verbosity, then the tightening phase forces compression of only the strategies that actually work.

This connects to the broader overthinking cluster. Given that chain-of-thought accuracy eventually declines with length (see Why does chain of thought accuracy eventually decline with length?), the curriculum approach may naturally navigate this inverted-U: the model finds the peak during exploration, then descends toward conciseness. And given that minimal reasoning chains can match full explanations (see Can minimal reasoning chains match full explanations?), the compression phase is not sacrificing quality — it is removing the filler that makes no real progress.

The deeper principle is that exploration and exploitation require different resource allocations, and temporal scheduling of these allocations (generous first, tight later) outperforms any fixed compromise. This generalizes beyond token budgets to task ordering.

Cognitive science grounding: The CURIOUS algorithm (Colas et al.) provides the developmental robotics foundation for this principle. In open-ended environments, autonomous agents that bias attention toward goals maximizing absolute learning progress naturally self-organize a developmental curriculum — focusing sequentially on goals of increasing complexity, and importantly, refocusing on goals that are being forgotten. The robustness to distracting goals and changing body properties suggests the learning-progress signal is a more general curriculum principle than task difficulty alone. The connection to RL reasoning training: the "generous early, tight later" budget curriculum may succeed precisely because early generosity maximizes learning progress (many strategies discovered per token), while later tightening maximizes efficiency (compression without new discovery needed).
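The learning-progress criterion can be made concrete with a toy sketch. This is my illustration of the general idea, not Colas et al.'s implementation: track recent competence per goal and favor the goal whose competence is changing fastest in absolute value, which automatically re-surfaces goals that are being forgotten (competence dropping) as well as goals being mastered (competence rising):

```python
from collections import deque

def pick_goal(history: dict[str, deque], eps: float = 1e-6) -> str:
    """Select the goal with maximal absolute learning progress.

    history maps each goal name to a deque of recent success scores
    (0/1 or accuracies). Absolute LP is estimated as |mean of the
    newer half - mean of the older half| of that window.
    """
    def abs_lp(scores: deque) -> float:
        half = len(scores) // 2
        if half == 0:
            return eps  # unexplored goals keep a small nonzero priority
        s = list(scores)
        older = sum(s[:half]) / half
        recent = sum(s[half:]) / (len(s) - half)
        return abs(recent - older) + eps
    return max(history, key=lambda g: abs_lp(history[g]))
```

A flat competence curve (fully mastered or hopelessly hard) yields near-zero LP, so distractor goals are ignored; a collapsing curve yields high LP, so forgotten goals get refocused, matching the robustness properties described above.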

Backward transfer extends curriculum from temporal budgets to task ordering: Omni-Thinker's BWT-guided scheduling reveals that the dimension that matters for multi-task RL is not just how much compute per task, but which tasks come first. Structured domains (math, coding) decrease output entropy while creative domains (writing, dialogue) increase it. Training structured tasks first, then creative tasks, preserves both capabilities. Training creative tasks first risks having structured training collapse the entropy creative training expanded. The ordering effect is predictable from backward transfer measurements, giving practitioners a principled scheduling criterion. See Does training order reshape how models handle different task types?. Together with token-budget curriculum, this suggests RL training benefits from curriculum design along multiple dimensions simultaneously — budget generosity over time AND task type ordering across the training run.
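A BWT-based scheduling criterion might be sketched as follows. The greedy rule and function names are my assumptions, not Omni-Thinker's implementation; the point is only that measured backward transfer yields an ordering mechanically:

```python
def order_tasks(tasks: list[str], bwt: dict[str, dict[str, float]]) -> list[str]:
    """Order tasks so the most destructive-when-trained-later come first.

    bwt[a][b] is the measured accuracy change on task b after
    subsequently training on task a (negative = forgetting). A task
    that erases other capabilities when trained late (e.g. a structured
    task collapsing creative entropy) should be scheduled early, before
    the capabilities it would erase are built.
    """
    def harm_if_later(t: str) -> float:
        # Total damage task t inflicts on the others if trained after them.
        return sum(min(bwt[t].get(o, 0.0), 0.0) for o in tasks if o != t)
    # Most negative (most harmful later) sorts first.
    return sorted(tasks, key=harm_if_later)
```

On the math-vs-writing example above: if math training late hurts writing (negative BWT) while writing training late is roughly neutral for math, this rule schedules math first, recovering the "structured first, creative second" ordering.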


Source: Reinforcement Learning, Reward Models

curriculum budgets that start generous and gradually tighten outperform fixed-budget rl for reasoning efficiency