Revisiting LLM Reasoning via Information Bottleneck
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). By leveraging simple rule-based rewards, RLVR effectively incentivizes LLMs to produce extended chain-of-thought (CoT) reasoning trajectories, progressively guiding them toward correct answers. However, existing approaches remain largely heuristic and intuition-driven, limiting the development of principled methodologies. In this paper, we present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle, introducing IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable across diverse prompts. We derive a practical token-level surrogate objective and propose an efficient approximation, resulting in a lightweight IB regularization method. This technique integrates seamlessly into existing RL-based post-training frameworks without additional computational overhead, requiring only a one-line code modification. Empirically, we validate IB regularization across multiple mathematical reasoning benchmarks and RL algorithms, demonstrating consistent improvements in LLM reasoning performance.
Accordingly, prior works often advocate heuristically maintaining high generation entropy, i.e., encouraging token-level uncertainty, during post-training [4, 5, 6]. In contrast, another line of research suggests that explicitly reducing entropy or uncertainty, even in the absence of reward signals, can improve reasoning performance [7, 8, 9]. These conflicting findings underscore the need for a rigorous theoretical understanding of LLM reasoning, which remains elusive yet essential.
Specifically, IBRO encourages reasoning processes to maximize informativeness with respect to (w.r.t.) correct answers while minimizing dependency on irrelevant, prompt-specific details. We then derive a token-level surrogate IBRO objective (Theorem 1) and establish a high-probability generalization bound that theoretically justifies the IBRO formulation (Theorem 2). To facilitate practical implementation, we derive an efficient approximation of the IBRO objective, yielding a novel IB regularization term. Concretely, IB regularization modulates each token's entropy according to its corresponding advantage, incentivizing higher entropy for critical tokens and penalizing uninformative ones. Our IB regularization integrates seamlessly into existing RL-based post-training frameworks, introducing negligible computational overhead and requiring only a single line of code modification.
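To make this mechanism concrete, the sketch below illustrates one way such an advantage-modulated entropy term could be folded into a PyTorch-based RL post-training loss. The function and variable names, the coefficient `beta`, and the exact sign convention are illustrative assumptions for exposition, not the paper's released implementation.

```python
import torch

def ib_regularized_loss(policy_loss: torch.Tensor,
                        logits: torch.Tensor,
                        advantages: torch.Tensor,
                        response_mask: torch.Tensor,
                        beta: float = 1e-3) -> torch.Tensor:
    """Sketch of an advantage-modulated entropy regularizer (assumed form).

    policy_loss:   scalar RL objective (e.g., a PPO/GRPO loss)
    logits:        [batch, seq_len, vocab] policy logits over response tokens
    advantages:    [batch, seq_len] token-level advantage estimates
    response_mask: [batch, seq_len] 1 for response tokens, 0 elsewhere
    """
    log_probs = torch.log_softmax(logits, dim=-1)
    # Token-level entropy H_t = -sum_v p(v) log p(v)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # [batch, seq_len]
    # Weight each token's entropy by its (detached) advantage: positive
    # advantages raise entropy on critical tokens, negative ones lower it.
    weighted = advantages.detach() * entropy * response_mask
    ib_bonus = weighted.sum() / response_mask.sum().clamp(min=1)
    # The "one-line modification": subtract the IB bonus from the RL loss.
    return policy_loss - beta * ib_bonus
```

Note that in a GRPO-style setup, where all tokens of a response share a single advantage, this form reduces to encouraging entropy on high-advantage responses and suppressing it on low-advantage ones.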