Post-Completion Learning for Language Models
Current language model training paradigms typically terminate learning upon reaching the end-of-sequence (<eos>) token, overlooking the potential learning opportunities in the post-completion space. We propose Post-Completion Learning (PCL), a novel training framework that systematically utilizes the sequence space after model output completion to enhance both reasoning and self-evaluation abilities. PCL enables models to continue generating self-assessments and reward predictions during training, while maintaining efficient inference by stopping at the completion point.
To fully utilize this post-completion space, we design a white-box reinforcement learning method: the model evaluates its own output according to the reward rules, and the resulting score is then aligned with the reward functions for supervision. We implement dual-track SFT to optimize both reasoning and evaluation capabilities, and mix it with RL training to achieve multi-objective hybrid optimization.
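As a concrete illustration of this white-box supervision, the following minimal Python sketch (with assumed function names and a simple exact-match reward rule; not the paper's actual implementation) shows how a model's self-assessed score could be aligned with a rule-based reward function:

```python
# Minimal sketch of white-box reward alignment: the model's self-assessed score
# (emitted in the post-completion segment) is supervised toward the score that
# the external rule-based reward function would assign. Names and the reward
# rule are illustrative assumptions, not taken from the paper.

def rule_based_reward(answer: str, reference: str) -> float:
    """Example reward rule: exact-match correctness (real rules may be richer)."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def reward_alignment_loss(self_assessed_score: float, answer: str, reference: str) -> float:
    """Squared error between the model's self-evaluation and the external reward."""
    target = rule_based_reward(answer, reference)
    return (self_assessed_score - target) ** 2

# Example: the model rated its own (incorrect) answer 0.9, so the alignment loss is high.
loss = reward_alignment_loss(0.9, answer="42", reference="41")
print(loss)  # 0.81
```

In a hybrid objective, such an alignment term would be combined with the SFT losses on the reasoning and evaluation tracks and the RL objective, rather than used in isolation.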
Experimental results on different datasets and models demonstrate consistent improvements over traditional SFT and RL methods. Our method provides a new technical path for language model training that enhances output quality while preserving deployment efficiency.
Large language models have demonstrated remarkable capabilities across various natural language processing tasks (Brown et al. 2020; Ouyang et al. 2022; Dubey et al. 2024; Yang et al. 2025). However, improving the reasoning quality and output reliability of these models remains a significant challenge. Current training methods primarily fall into two categories: supervised fine-tuning (SFT) approaches that directly train models on high-quality demonstration data (Wei et al. 2021; Chung et al. 2024), and reinforcement learning-based methods such as RLHF that optimize model behavior through external reward signals (Schulman et al. 2017; Shao et al. 2024). While each approach has its merits, both suffer from inherent limitations. Supervised fine-tuning methods, though stable during training, are constrained by their passive learning nature: models learn to mimic high-quality demonstrations without developing the ability to assess and improve their own reasoning processes (Hong, Dragan, and Levine 2024).
Reinforcement learning approaches can optimize model behavior through reward signals but rely on external reward models, making the training process complex and lacking transparency (Shao et al. 2025). Additionally, both paradigms share a common limitation: they terminate the learning process immediately upon reaching the end-of-sequence token, thereby missing valuable opportunities to utilize the sequence space after model output completion. In conventional training paradigms, models stop generating upon reaching the end-of-sequence (<eos>) token, and the training process terminates accordingly. However, this practice actually wastes a valuable learning opportunity. Humans, after completing a task, often engage in self-reflection and quality assessment; this “post-thinking” process is crucial for improving future performance (Schon and DeSanctis 1986). This observation leads us to a fundamental question: Can language models continue learning after “completing” their output, thereby developing self-evaluation and quality awareness?
To address this question, we propose Post-Completion Learning (PCL), a novel language model training paradigm as shown in Figure 1. The core innovation of PCL lies in utilizing the post-completion space that has been neglected, enabling simultaneous enhancement of both reasoning and self-evaluation abilities through continued learning after output completion.
Our approach begins by discovering and leveraging the post-completion space—the sequence space after <eos> that has been ignored in traditional training. By inserting a temporary termination marker <-- post-completion ->, we create a “post-thinking” space for models with zero inference cost, as the model stops generation at this marker during deployment while the self-evaluation capabilities remain internalized. To effectively utilize this space, we design a white-box reinforcement learning paradigm where models explicitly learn to understand and compute reward functions, internalizing the reward model as their own evaluation capability and achieving a transformation from “passive reward acceptance” to “active self-evaluation”. To train these dual capabilities effectively, we develop a unified hybrid training framework that combines SFT and RL within the same sequence generation framework, using a dual-track training strategy to simultaneously optimize reasoning and evaluation capabilities.
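To make the mechanism concrete, the sketch below (with an assumed marker string, segment layout, and helper names; not the paper's exact format) illustrates how a training sequence could append a post-completion segment after the temporary termination marker, while inference simply stops at that marker:

```python
# Sketch of the post-completion sequence layout. The marker and <eos> strings,
# and the structure of the self-evaluation segment, are assumptions made for
# illustration only.

EOS = "<eos>"
POST_MARKER = "<post-completion>"  # assumed form of the temporary termination marker

def build_training_sequence(question: str, reasoning: str, answer: str,
                            self_evaluation: str, predicted_reward: float) -> str:
    """Training target: reasoning and answer, then the marker, then the
    post-completion segment with self-assessment and reward prediction."""
    completion = f"{reasoning}\n{answer}\n{POST_MARKER}"
    post_completion = f"{self_evaluation}\npredicted reward: {predicted_reward:.2f}\n{EOS}"
    return f"{question}\n{completion}\n{post_completion}"

def inference_stop_strings() -> list[str]:
    # At deployment, generation halts at the marker, so the self-evaluation
    # segment is never produced and inference cost is unchanged.
    return [POST_MARKER, EOS]
```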
Related Work
Chain-of-Thought and Self-Correction Methods
Chain-of-thought (CoT) reasoning has evolved from basic few-shot prompting methods to sophisticated reasoning systems (Wei et al. 2022; Ouyang et al. 2022; Zhang et al. 2022). Researchers have debated the optimal placement of reasoning content—whether CoT should precede the final answer or follow it for verification purposes. However, we argue that these approaches are not mutually exclusive: models can benefit from both pre-answer reasoning for problem-solving and post-answer reasoning for self-evaluation. Crucially, the self-evaluation component need not be generated during inference, thereby avoiding computational overhead while still enhancing model capabilities during training.
Self-correction methods such as Self-Refine (Madaan et al. 2023) and Reflexion (Shinn et al. 2023) represent important advances in model self-improvement capabilities with post-answer verification. However, these methods primarily focus on output optimization rather than process improvement (Huang et al. 2023; Kamoi et al. 2024). The STaR method (Zelikman et al. 2022) and Constitutional AI (Bai et al. 2022) achieve capability enhancement through self-taught reasoning but remain limited to external guidance during inference.
The key distinction of PCL from these methods lies in developing effective training paradigms that enhance models’ introspective abilities during training, rather than relying on additional content generation during inference. PCL achieves the goal of “training-time reflection, inference-time efficiency” through post-completion space learning, internalizing self-evaluation capabilities into the model itself.
Reinforcement Learning Optimization and Reward Modeling
RLHF methods have developed from the early work of Christiano et al. (2017) to the three-stage training paradigm of InstructGPT (Ouyang et al. 2022), becoming the standard approach for language model alignment. However, existing RLHF methods primarily rely on external reward models or functions, suffering from issues like opaque reward modeling and reward hacking.
Constitutional AI (Bai et al. 2022) achieves a degree of white-box evaluation through in-context principle-based assessment, but still lacks the complete transparency of reward mechanisms that PCL provides. Process reward models (PRMs) outperform outcome reward models on complex reasoning tasks (Lightman et al. 2023; Uesato et al. 2022; Zhang et al. 2025), supporting PCL’s design philosophy of evaluating complete reasoning processes.
Methods like Direct Preference Optimization (DPO) (Rafailov et al. 2023) simplify the training process by eliminating independent reward models and directly optimizing policies from preference data. In addition, Group Relative Policy Optimization (GRPO) (Shao et al. 2024) further advances this direction by enabling stable group-wise policy optimization that replaces a learned value baseline with group-relative reward normalization, demonstrating improved sample efficiency in mathematical reasoning tasks.
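For reference, the group-relative advantage at the core of GRPO can be sketched as follows (a standard formulation in our own notation, not quoted from Shao et al. 2024):

```latex
% Group-relative advantage for a group of G sampled completions of the same
% prompt with rewards r_1, ..., r_G: each completion's reward is standardized
% within its group, removing the need for a separate critic model.
\[
  \hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\bigl(\{r_1,\dots,r_G\}\bigr)}
                       {\operatorname{std}\bigl(\{r_1,\dots,r_G\}\bigr)},
  \qquad i = 1,\dots,G .
\]
```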
PCL achieves the transformation from “passive reward acceptance” to “active self-evaluation” by having models explicitly learn reward function computation processes, representing an important advance in white-box reinforcement learning. Unlike traditional approaches that rely on external supervision, PCL internalizes the evaluation mechanisms during training, enabling models to perform quality assessment autonomously.
Self-Supervised Learning and Meta-Learning
Self-supervised learning has established a solid foundation through BERT’s bidirectional learning and GPT’s autoregressive modeling (Devlin et al. 2019; Brown et al. 2020), but recent studies reveal critical meta-cognitive deficiencies. Language models demonstrate significant metacognitive inadequacies, consistently failing to recognize knowledge limitations and exhibiting overconfidence (Groot and Valdenegro-Toro 2024). Recent self-improvement approaches face fundamental limitations from the “sharpening mechanism”: self-improvement cannot create information that does not exist within the model (Huang et al. 2024). This highlights PCL’s value—systematically analyzing completion results to identify knowledge gaps rather than redistributing existing knowledge.
Meta-learning research has established foundational paradigms for learning from limited data through classical approaches (Vinyals et al. 2016; Snell, Swersky, and Zemel 2017; Finn, Abbeel, and Levine 2017). Meta-learning trains models on a distribution of tasks to learn generalizable knowledge. PCL extends this meta-learning foundation by enabling models to develop persistent self-evaluation capabilities rather than relying solely on episodic adaptation. Our method bridges these areas by combining training-time optimization with white-box reward evaluation, internalizing meta-cognitive development directly into training rather than relying on post-hoc reflection or external feedback mechanisms.