Rethinking Thinking Tokens: LLMs as Improvement Operators

Paper · arXiv 2510.01123 · Published October 1, 2025
Reinforcement Learning · Novel Architectures · Inference-time Scaling

Reasoning training incentivizes LLMs to produce long chains of thought (long CoT), which, among other things, allows them to explore solution strategies with self-checking. This results in higher accuracy, but inflates context length, token/compute cost, and answer latency. We ask: Can current models leverage their metacognition to provide other combinations on this Pareto frontier, e.g., better accuracy with lower context length and/or latency? Abstractly, we view the model as an improvement operator on its own “thoughts”, with a continuum of possible strategies. We study an inference family, Parallel-Distill-Refine (PDR), which performs the following: (i) generate diverse drafts in parallel; (ii) distill them into a bounded, textual workspace; and (iii) refine conditioned on this workspace, producing an output that seeds the next round. Importantly, context length (hence compute cost) is controllable via the degree of parallelism and is no longer conflated with the total number of generated tokens. We report PDR instantiations of current models that give better accuracy than long CoT while incurring lower latency. Setting the degree of parallelism to 1 yields a special case, Sequential Refinement (SR), which iteratively improves a single candidate answer and outperforms long CoT (at the cost of higher latency). The success of such model orchestrations raises the question of whether further training could shift the Pareto frontier. To this end, we train an 8B thinking model with Reinforcement Learning (RL) to make it consistent with PDR as the inference method. On math tasks with verifiable answers, iterative pipelines surpass single-pass baselines at matched sequential budgets, with PDR delivering the largest gains (+11% on AIME 2024 and +9% on AIME 2025).
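
As a concrete illustration of the three steps, below is a minimal Python sketch of a PDR round loop. The `llm` callable, the prompt templates, and the token/round settings are hypothetical stand-ins for this writeup, not the paper's implementation.

```python
def pdr_round_loop(llm, problem, rounds=3, parallelism=4):
    """Hypothetical PDR loop; `llm(prompt, **kwargs) -> str` is an assumed model wrapper."""
    workspace, answer = "", ""
    for _ in range(rounds):
        seed = f"Workspace:\n{workspace}\nCurrent best attempt:\n{answer}"
        # (i) generate diverse drafts in parallel (independent samples)
        drafts = [
            llm(f"Problem: {problem}\n{seed}\nPropose a solution.", temperature=1.0)
            for _ in range(parallelism)
        ]
        # (ii) distill the drafts into a bounded, textual workspace
        workspace = llm(
            "Summarize the drafts below into a short workspace: agreements, "
            "contradictions, intermediate results, open subgoals.\n\n" + "\n---\n".join(drafts),
            max_tokens=512,  # keeps the carried state small
        )
        # (iii) refine conditioned on this workspace; the output seeds the next round
        answer = llm(f"Problem: {problem}\nWorkspace:\n{workspace}\nWrite the improved solution.")
    return answer
```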

What is the best possible task accuracy achievable after fixing constraints on the inference process, e.g., (i) total tokens across all generations, (ii) max depth of the generation chain (“latency”), (iii) total context length, and (iv) total compute (which depends on all of the above in complicated, system-dependent ways)?
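
To make these budgets concrete, the back-of-envelope accounting below (an illustration for this writeup, not a formula from the paper) contrasts a single long trace with a PDR schedule: the sequential depth and per-call context stay bounded even when the total number of generated tokens grows with parallelism.

```python
def long_cot_budget(trace_tokens):
    # A single long trace: total tokens, serial depth, and context all coincide.
    return {"total": trace_tokens, "sequential": trace_tokens, "context": trace_tokens}

def pdr_budget(rounds, parallelism, draft_tokens, summary_tokens):
    # Per round: parallel drafts, one distill call, one refine call (drafts run concurrently).
    total = rounds * (parallelism * draft_tokens + summary_tokens + draft_tokens)
    sequential = rounds * (draft_tokens + summary_tokens + draft_tokens)  # longest serial path
    context = parallelism * draft_tokens + summary_tokens  # largest single-call input (the distill step)
    return {"total": total, "sequential": sequential, "context": context}

# Example: pdr_budget(rounds=3, parallelism=4, draft_tokens=2000, summary_tokens=500)
# has a shorter serial path and a much smaller per-call context than long_cot_budget(24000),
# even though it generates more tokens in total.
```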

The confounding factor is that iteration alone does not guarantee progress. Simply asking the model to “try again” risks forgetting useful partial results and repeating earlier mistakes. Naïvely appending all prior attempts to the context recreates long-context failures and scales cost with the number of rounds. Current models suffer from anchoring biases (see Figures 6 and 8) as well as forgetfulness. A viable scheme needs a compact state that (i) carries forward salient facts and intermediate results, (ii) flags disagreements and open subgoals, and (iii) remains bounded so that each generation (and the overall context size) stays short.
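
One way to picture such a compact state is as a small, explicitly bounded record. The schema below is purely illustrative; the field names and the `max_items` bound are assumptions of this writeup, not the paper's workspace format.

```python
from dataclasses import dataclass, field

@dataclass
class Workspace:
    """Hypothetical compact state carried between rounds."""
    facts: list[str] = field(default_factory=list)          # salient facts and intermediate results
    disagreements: list[str] = field(default_factory=list)  # points where candidates contradict each other
    open_subgoals: list[str] = field(default_factory=list)  # what remains to be shown
    max_items: int = 10                                      # hard bound on carried state

    def render(self) -> str:
        """Serialize to bounded text; clipping keeps every round's context short."""
        def clip(items):
            return items[-self.max_items:]
        return "\n".join(
            ["Facts:", *clip(self.facts),
             "Disagreements:", *clip(self.disagreements),
             "Open subgoals:", *clip(self.open_subgoals)])
```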

This paper studies inference strategies that generate many tokens with a compact context size. Instead of long chains of thought, inference has phases that generate solutions within the allowed context/token budget and then write a bounded, round-wise summary/report (e.g., listing agreements, contradictions, intermediate results, and open subgoals). The next phase starts with only this summary and uses the available workspace for fresh generations (which benefit from the accumulated wisdom in the summary). Iterating this process can produce long “thinking”, albeit with a bounded context size.
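
The parallelism-1 special case, Sequential Refinement (SR), reduces to repeatedly improving a single candidate from a bounded report on the previous attempt. A minimal sketch, under the same hypothetical `llm` interface as above:

```python
def sequential_refinement(llm, problem, rounds=4):
    """SR: the parallelism-1 special case; one candidate is improved round by round."""
    candidate = llm(f"Problem: {problem}\nPropose a solution.")
    for _ in range(rounds):
        # Each phase's output is distilled into a bounded, round-wise report.
        report = llm(
            "Write a short report on the attempt below: verified intermediate results, "
            f"likely errors, open subgoals.\n\n{candidate}",
            max_tokens=512,
        )
        # The next phase sees only the problem and the report, not the full prior trace.
        candidate = llm(
            f"Problem: {problem}\nReport on the previous attempt:\n{report}\n"
            "Write an improved solution."
        )
    return candidate
```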

Its effectiveness hinges on four meta-skills: verification (detect and localize errors via self-judging and cross-candidate checks), refinement (use feedback/context to improve the artifact), compression (retain only the salient past history via bounded summaries rather than full replay), and diversification (exploratory variation to avoid consensus collapse).
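
As a toy surrogate for the cross-candidate part of verification, a consensus check over the drafts' final answers might look like the following; in the paper such checks are carried out by the model itself via self-judging, so this is only an illustration.

```python
from collections import Counter

def cross_candidate_check(final_answers):
    """Toy cross-candidate verification over exact-match answers.
    Returns the consensus answer, its support, and the dissenting answers."""
    counts = Counter(final_answers)
    consensus, votes = counts.most_common(1)[0]
    dissenters = [a for a in final_answers if a != consensus]
    return consensus, votes / len(final_answers), dissenters

# cross_candidate_check(["42", "42", "17", "42"]) -> ("42", 0.75, ["17"])
```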

Learning to improve short-context iteration. It is also of interest to teach the model a policy that effectively leverages this improvement operator. Standard RL training for reasoning models typically optimizes a single, long chain-of-thought conditioned on the prompt, with reward on the final answer (Shao et al., 2024; Guo et al., 2025). PDR, by contrast, comprises multiple short iterations that read a bounded summary, write a refinement, and re-synthesize a fresh summary. This creates a train-test mismatch in the information flow (short updates vs. one long trace). To make sure training is consistent with deployment, we optimize an objective that unrolls the operator itself during training: sample M short drafts, distill them into a compact summary, and condition on the prompt plus that summary to produce a refined attempt. We use verifiable rewards to supervise the end-to-end computation. This objective narrows the train–test gap.
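
A schematic of such an operator-consistent training step is sketched below. The `policy.sample` / `extract_answer` interfaces and the plain REINFORCE-style surrogate loss (no baseline, no clipping) are assumptions for illustration, not the paper's exact algorithm.

```python
def pdr_training_step(policy, extract_answer, problem, reference_answer, M=4):
    """One operator-consistent update. Assumed interfaces: `policy.sample(prompt)` returns an
    object with `.text` and `.log_prob`; `extract_answer` parses the final answer from text."""
    # Unroll the operator: M short drafts -> compact summary -> refined attempt.
    drafts = [policy.sample(f"Problem: {problem}\nPropose a solution.") for _ in range(M)]
    summary = policy.sample(
        "Distill the drafts into a compact summary: agreements, contradictions, open subgoals.\n\n"
        + "\n---\n".join(d.text for d in drafts)
    )
    refined = policy.sample(
        f"Problem: {problem}\nSummary:\n{summary.text}\nWrite the final solution."
    )

    # Verifiable reward on the end-to-end computation (exact match on the final answer).
    reward = 1.0 if extract_answer(refined.text) == reference_answer else 0.0

    # REINFORCE-style surrogate loss over the whole unrolled computation,
    # so the information flow during training matches PDR-style inference.
    log_prob = sum(d.log_prob for d in drafts) + summary.log_prob + refined.log_prob
    return -reward * log_prob
```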

In this paper, we initiate the exploration of a broader design space around “long CoT.” We study two operators in this design space, SR and PDR, which give better accuracy than standard long CoT while offering the benefit of a smaller context size. Empirically, compact-memory iteration outperforms long-trace baselines at matched sequential budgets (Bseq). PDR yields the largest gains (e.g., +11% on AIME 2024 and +9% on AIME 2025), showing that evidence accumulation via bounded summaries can substitute for long reasoning traces while holding latency fixed. Beyond inference orchestration, making training consistent with inference via an operator-consistent RL objective further improves performance (∼5% on AIME 2024 and AIME 2025), suggesting that models can learn the meta-skills that make iteration effective. Iterative reasoning improves when diversity, verification, and refinement become reliably good; by measuring and training these micro-skills directly, we can accelerate the gains of improvement operators under fixed latency budgets.