Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models

Paper · arXiv 2504.02902 · Published April 3, 2025

Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanism has shown promise in enhancing task performance, recent studies suggest that it may also introduce undesirable biases—most notably, self-bias, the tendency of LLMs to favor their own prior outputs. In this work, we extend this line of inquiry by investigating the impact of self-improvement on confidence estimation. We evaluate three representative self-improvement paradigms—basic prompting, Chain-of-Thought (CoT) prompting, and tuning-based methods—and find that iterative self-improvement can lead to systematic overconfidence, evidenced by a steadily increasing Expected Calibration Error (ECE) and lower accuracy on high-confidence predictions. We then explore the integration of confidence calibration techniques with self-improvement. Specifically, we compare three strategies: (1) applying calibration after multiple rounds of self-improvement, (2) calibrating before self-improvement, and (3) applying calibration iteratively at each self-improvement step. Our results show that iterative calibration is the most effective at reducing ECE and improving calibration. Our work pioneers the study of self-improving LLMs from a calibration perspective, offering valuable insights into balancing model performance and reliability.

1 Introduction

The development of Large Language Models (LLMs) has catalyzed transformative changes across numerous domains, from natural language understanding and generation (Storks et al., 2019; Weld et al., 2022) to assisting in complex question-answering and decision-making processes (Li et al., 2025b; 2024a; Tan et al., 2024). Among the emerging techniques supporting these applications is self-improvement (Bai et al., 2022; Kim et al., 2023), wherein LLMs iteratively review their own responses and refine their outputs based on self-generated feedback to enhance performance. This process fosters human-like reflective thinking and has proven effective across a range of tasks and applications (Tong et al., 2024; Pan et al., 2024; Li et al., 2024b).
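The iterative review-and-refine loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate`, `critique`, and `refine` are hypothetical stand-ins for LLM calls that produce an initial answer, self-generated feedback, and a revised answer, respectively.

```python
def self_improve(question, generate, critique, refine, n_rounds=3):
    """Run a generic self-improvement loop for `n_rounds` iterations.

    `generate`, `critique`, and `refine` are caller-supplied functions
    standing in for LLM calls (hypothetical interfaces for illustration).
    Returns the final answer and the full revision history.
    """
    answer = generate(question)          # initial response
    history = [answer]
    for _ in range(n_rounds):
        feedback = critique(question, answer)        # self-generated feedback
        answer = refine(question, answer, feedback)  # revise using feedback
        history.append(answer)
    return answer, history
```

In the tuning-based variants studied later, the `critique` and `refine` steps are performed by a fine-tuned model rather than by prompting alone, but the loop structure is the same.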

However, some recent studies also report cases where LLM-based self-improvement does not bring a significant boost and can even degrade the model’s performance (Zhang et al., 2024a; Wu et al., 2024). One contributing factor to this counterintuitive outcome is self-bias (Xu et al., 2024b; Wataoka et al., 2024; Li et al., 2025a)—the tendency of LLMs to favor their own generated content. This cognitive bias impedes LLMs from providing impartial feedback on their outputs, thereby hindering effective self-correction and self-improvement.

Building on this insight, we pose our first research question: Will self-improvement also introduce bias into confidence estimation? As LLMs become increasingly integral to both research and industry applications (Zhu et al., 2025), the ability to accurately express confidence or uncertainty in their outputs is crucial (Su et al., 2024), particularly in high-risk scenarios (Thirunavukarasu et al., 2023; Li et al., 2024c). If self-improvement methods introduce self-bias into confidence estimation, this could pose a significant threat to LLM safety and reliability, creating substantial challenges in the pursuit of trustworthy AI (Sun et al., 2024; Huang et al., 2025). To investigate this, we examine three types of self-improvement methods in our experiments: Basic prompting, Chain-of-Thought (CoT) prompting, and Tuning-based approaches (First et al., 2023; Han et al., 2024; Zhang et al., 2024b; Akyürek et al., 2023; Xie et al., 2025). We implement each method and analyze its impact on LLMs’ confidence estimation performance. Our results reveal a clear trend of increasing overconfidence as self-improvement iterations progress, leading to a continuously rising Expected Calibration Error (ECE) score (Guo et al., 2017).
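For readers unfamiliar with the metric, ECE measures the gap between a model's stated confidence and its empirical accuracy by binning predictions by confidence. A minimal sketch of the standard binned estimator from Guo et al. (2017), assuming confidences in [0, 1] and binary correctness labels:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |confidence - accuracy| per bin.

    confidences: predicted confidence for each example, in [0, 1].
    correct: 1 if the prediction was right, 0 otherwise.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # half-open bins (lo, hi]; include 0.0 in the first bin
        mask = (confidences > lo) & (confidences <= hi)
        if lo == 0.0:
            mask |= confidences == 0.0
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()  # mean stated confidence in bin
        avg_acc = correct[mask].mean()       # empirical accuracy in bin
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return ece
```

An overconfident model, e.g. one that reports 90% confidence while being right only half the time, yields a large ECE; the rising ECE across self-improvement iterations reported above reflects exactly this widening confidence-accuracy gap.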