Learning to Think: Information-Theoretic Reinforcement Fine-Tuning for LLMs
Reinforcement fine-tuning with test-time compute scaling has substantially improved the reasoning abilities of large language models (LLMs). However, existing methods overlook the trade-off between reasoning effectiveness and computational efficiency, often encouraging unnecessarily long reasoning chains and wasting tokens. To address this, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework that enables LLMs to achieve optimal reasoning with fewer tokens. Specifically, L2T treats each query-response interaction as a hierarchical session of multiple episodes and proposes a universal dense process reward, i.e., one that quantifies the episode-wise information gain in model parameters, requiring no extra annotations or task-specific evaluators. We further propose a method to estimate this reward efficiently based on PAC-Bayes bounds and the Fisher information matrix. Theoretical analyses show that it significantly reduces computational complexity while maintaining high estimation accuracy. By immediately rewarding each episode’s contribution and penalizing excessive updates, L2T optimizes the model via reinforcement learning to make full use of each episode and achieve effective updates.
Recent results [24, 19, 56, 43] in LLM reasoning show that scaling test-time compute can substantially improve reasoning capabilities, e.g., [35, 5] demonstrated that generating more tokens during inference yields logarithmic-linear gains. Based on this, a new class of reasoning models [43, 53, 9, 49] has coupled test-time compute scaling with reinforcement learning (RL), achieving state-of-the-art (SOTA) results on various challenging benchmarks [15, 61, 16]. These models employ chain-of-thought (CoT) tokens to guide multi-step reasoning and maintain logical consistency throughout the solution process [50, 52, 58]; by extending and optimizing CoT paths to produce trajectories longer than typical correct solutions, they more thoroughly explore the solution space and thereby boost final answer accuracy [43, 35, 20].
Although existing methods have demonstrated strong performance, they still struggle to balance reasoning effectiveness with efficiency. Specifically, existing approaches typically rely on final outcome rewards for policy optimization, providing no feedback on intermediate reasoning steps. Under such sparse feedback, extending the chain incurs no cost, and even a tiny accuracy gain from a large number of extra steps is treated as a positive signal [59, 55]. Consequently, the models favor “one more thought” and continually lengthen their CoTs, resulting in redundant computation. Our experiments in Subsection 3.2 further demonstrate this (Figure 1): existing outcome-reward-based RL methods often lead LLMs to consume more than twice the tokens actually needed. Furthermore, by evaluating across different reasoning tasks, we find that this redundancy not only wastes resources but sometimes degrades reasoning accuracy. For example, on difficult questions (e.g., Tier 4 multi-stage math questions [13]), moderate chain extensions improve coverage of critical steps, whereas on simple tasks (e.g., the Tier 1 question “12 + 5”), overly long reasoning chains may reduce overall accuracy. Since real-world tasks vary, no fixed chain length is optimal for all cases. Therefore, designing effective dense process rewards to assess the contribution of each reasoning step is both necessary and valuable. Such rewards guide the model to generate the tokens that most benefit the answer, ensuring accuracy within a minimal token budget.
To this end, we propose Learning to Think (L2T), an information-theoretic reinforcement fine-tuning framework for LLMs. At its core, L2T proposes a universal information-theoretic dense process reward, which quantifies the information gain in model parameters. It consists of (i) a fitting information gain term that drives the model to capture correctness-critical information in each update; and (ii) a compression penalty that discourages over-optimization, further preserving efficiency. By treating each question-answer pair as a session of multiple episodes and immediately rewarding each episode, it makes the model focus on per-episode progress, thus curbing redundant reasoning steps and the resulting computational waste. This reward is independent of input format, label type, and task domain, and requires no extra annotations. We leverage this reward to train the LLM (i.e., the policy) via reinforcement learning so that, at each reasoning step, it generates the tokens that contribute most to answer correctness. Specifically, L2T comprises three stages: (i) Problem reformulation (Subsection 4.1): we treat each question-answer interaction as a hierarchical session of multiple episodes, where each episode represents a segment of the reasoning chain that underpins dense reward calculation and optimization; (ii) Reward design (Subsection 4.2): upon episode completion, we calculate the information-theoretic reward via PAC-Bayes bounds and the Fisher information matrix; based on this, we halt unproductive reasoning and thus balance depth with efficiency (see the illustrative sketch below); (iii) LLM fine-tuning (Subsection 4.3): we optimize the LLM by maximizing cumulative reward across tasks via reinforcement learning, ensuring high accuracy and computational efficiency.
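To make the reward concrete, the sketch below gives one plausible instantiation consistent with the description above; the notation ($\theta_k$, $\lambda$, $F$) and the quadratic Fisher approximation are our illustrative assumptions rather than the exact definitions used by L2T. Let $\theta_{k-1}$ and $\theta_k$ denote the model parameters before and after incorporating episode $k$, and let $\lambda > 0$ trade off fitting against compression. The episode-wise reward can then take the schematic form
$$
r_k \;=\; \underbrace{\Delta I_{\mathrm{fit}}(k)}_{\text{fitting information gain}} \;-\; \lambda\,\underbrace{\Delta I_{\mathrm{comp}}(k)}_{\text{compression penalty}},
\qquad
\Delta I_{\mathrm{comp}}(k) \;\approx\; \tfrac{1}{2}\,(\theta_k - \theta_{k-1})^{\top} F(\theta_{k-1})\,(\theta_k - \theta_{k-1}),
$$
where $F(\theta_{k-1})$ is the Fisher information matrix. The quadratic form is the standard second-order surrogate for the KL divergence between the parameter distributions before and after the update; such a surrogate is what makes a cheap estimate (e.g., via a diagonal Fisher approximation) possible, in the same spirit as the PAC-Bayes-based estimator mentioned above.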