Post-Training Large Language Models via Reinforcement Learning from Self-Feedback
Large Language Models (LLMs) often produce plausible but poorly calibrated answers, limiting their reliability on reasoning-intensive tasks. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model’s own confidence as an intrinsic reward, mimicking how humans learn in the absence of external feedback. After a frozen LLM generates several chain-of-thought solutions, we define and compute the confidence of each final answer span and rank the traces accordingly. These synthetic preferences are then used to fine-tune the policy with standard preference optimization, similar to RLHF yet requiring no human labels, gold answers, or externally curated rewards.
RLSF simultaneously (i) refines the model’s probability estimates – restoring well-behaved calibration – and (ii) strengthens step-by-step reasoning, yielding improved performance on arithmetic reasoning and multiple-choice question answering.
By turning a model’s own uncertainty into useful self-feedback, RLSF establishes reinforcement learning on intrinsic model behaviour as a principled and data-efficient component of the LLM post-training pipeline and motivates further research on intrinsic rewards for post-training.
1 Introduction
Recent advances in large language models (LLMs) have led to impressive capabilities in text generation and comprehension (Brown et al., 2020; Ouyang et al., 2022). Nevertheless, performance often degrades on tasks that demand logical reasoning, a critical limitation when LLMs are deployed in domains such as legal analysis, scientific computation, and decision support (Kambhampati, 2024). While contextually appropriate text can be produced with ease, consistency and accuracy across extended chains of reasoning are frequently not maintained.
It has also been observed that the output of an LLM is largely uncalibrated: its confidence is not predictive of its accuracy, particularly after reinforcement learning from human feedback (RLHF) is applied (Bai et al., 2022). Such miscalibration results in overconfidence during complex reasoning tasks (OpenAI et al., 2024a; Tian et al., 2023). In human learning, by contrast, confidence plays a critical role: humans use it as an intrinsic reward in the absence of external feedback (Ptasczynski et al., 2022).
Confidence as Reward Our approach is based on a simple observation: in a well-calibrated model, answer confidence is correlated with the presence of reasoning, which in turn leads to higher-quality answers. We build on this observation and use confidence as an intrinsic reward signal in reinforcement learning.
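To make this concrete, one simple way to score a decoded trace (our illustration, not a definition given in this section) is the average margin between the top-1 and top-2 token probabilities over the final answer span; any per-token notion of confidence plugs in the same way. The sketch below assumes the per-step next-token distributions and an answer-span mask are already available.

```python
import torch

def answer_confidence(step_probs: torch.Tensor, answer_mask: torch.Tensor) -> float:
    """Confidence of the final answer span of one decoded trace.

    step_probs:  (T, V) next-token probability distributions along the trace.
    answer_mask: (T,) boolean mask marking positions belonging to the answer span.

    Confidence here is the mean top-1 vs. top-2 probability margin over the
    answer tokens (one possible definition; mean top-1 probability would be
    another and plugs in identically).
    """
    top2 = step_probs[answer_mask].topk(k=2, dim=-1).values  # (A, 2)
    margins = top2[:, 0] - top2[:, 1]                        # per-token margin
    return margins.mean().item()
```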
To construct our preference dataset, we apply Chain-of-Thought decoding to have the LLM generate a collection of candidate answers. The generated beams are ranked by the model’s answer confidence to produce a preference dataset, which is then used to train a reward model that assesses answer quality, as in Figure 1. This reward model is then used to fine-tune the original LLM via reinforcement learning.
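The sketch below illustrates how confidence-ranked candidates could be turned into (chosen, rejected) pairs for downstream reward modelling or preference optimization. The pairing scheme (best trace against each lower-ranked one) and the `Trace` container are assumptions made for illustration, not the paper’s exact recipe.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    prompt: str
    completion: str    # chain-of-thought plus final answer
    confidence: float  # e.g. from answer_confidence() above

def build_preference_pairs(traces: list[Trace]) -> list[dict]:
    """Rank candidate traces for one prompt by answer confidence and emit
    synthetic (chosen, rejected) pairs; no human labels or gold answers used."""
    ranked = sorted(traces, key=lambda t: t.confidence, reverse=True)
    best = ranked[0]
    return [
        {"prompt": best.prompt, "chosen": best.completion, "rejected": t.completion}
        for t in ranked[1:]
    ]
```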
Practically, we implement this in a similar spirit to RLHF, and we call this post-training step Reinforcement Learning from Self-Feedback (RLSF). In particular, our proposed method can be inserted as an additional step in the customary model post-training pipeline (Kumar et al., 2025), augmenting typical techniques such as Supervised Fine-Tuning (SFT), Preference Optimization (PO), and task-specific reward modelling like RLVR (Lambert et al., 2025).
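As a hedged illustration of the preference-optimization step on these synthetic pairs, the snippet below shows a standard DPO-style objective. The abstract describes the step only as standard preference optimization, so the specific objective, the folding of the reward model into the policy update, and the `beta` hyperparameter are assumptions here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor, policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss on confidence-ranked pairs.

    Inputs are summed log-probabilities of the chosen / rejected completions
    under the trainable policy and the frozen reference model (the pre-RLSF
    checkpoint). beta scales the implicit KL penalty toward the reference.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```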
Supervised fine-tuning on diverse datasets has been shown to preserve or even improve token-level calibration (Kuhn et al., 2022; Xiao et al., 2022). However, instruction-tuned or aligned models trained with reinforcement learning from human feedback (RLHF) often exhibit degraded calibration. This is attributed to the reward signals in RLHF optimizing for human preference and fluency rather than correctness or calibrated uncertainty.
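For concreteness, miscalibration of this kind is commonly quantified with the expected calibration error (ECE); the standard binned estimator below is an illustrative addition, not a metric defined in this section.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned ECE: mean |accuracy - confidence| over equal-width confidence
    bins, weighted by the fraction of samples falling in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```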