Can models learn to evaluate their own work during training?
Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.
Current training paradigms terminate learning at the end-of-sequence token, leaving the sequence space after the model's output entirely unused. Post-Completion Learning (PCL) systematically exploits this neglected space. A temporary termination marker (<-- post-completion -->) opens a "post-thinking" space in which the model continues generating self-assessments and reward predictions during training, while inference stops at the marker, so deployment incurs zero additional cost.
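A minimal sketch of the mechanism, assuming a plain-text marker and an illustrative post-completion format (the helper names and the exact content after the marker are assumptions, not the method's specification):

```python
# Sketch: how a PCL training sequence could be laid out, and how inference
# truncates at the temporary marker. Illustrative assumptions throughout.

POST_COMPLETION_MARKER = "<-- post-completion -->"

def build_training_sequence(prompt: str, reasoning: str, answer: str,
                            self_assessment: str, predicted_reward: float) -> str:
    """Training target: normal output, then the marker, then post-completion text."""
    return (
        f"{prompt}\n"
        f"{reasoning}\n"
        f"{answer}\n"
        f"{POST_COMPLETION_MARKER}\n"              # generation stops here at inference
        f"Self-assessment: {self_assessment}\n"    # seen only during training
        f"Predicted reward: {predicted_reward:.2f}"
    )

def truncate_at_marker(generated: str) -> str:
    """At deployment, cut at the marker so the post-thinking space costs nothing."""
    return generated.split(POST_COMPLETION_MARKER)[0].rstrip()
```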
The core innovation is white-box reinforcement learning: the model explicitly learns to understand and compute reward functions, internalizing the reward model as its own evaluation capability. This transforms the model from "passive reward acceptance" (external reward signal tells it what's good) to "active self-evaluation" (it learns to compute quality assessments itself).
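A toy sketch of what "computing the reward function" could look like, assuming a simple rule-based reward (correctness plus a small format bonus, both invented here for illustration): the post-completion target spells out the computation, so the model learns to reproduce the evaluation rather than merely receive its result.

```python
# Sketch of the white-box idea: the reward is an explicit, inspectable function,
# and the post-completion target writes out that computation. The specific
# reward terms and weights below are assumptions, not the paper's definition.

def reward_function(answer: str, gold: str, well_formatted: bool) -> float:
    """Rule-based reward the model is trained to understand and compute."""
    correctness = 1.0 if answer.strip() == gold.strip() else 0.0
    format_bonus = 0.1 if well_formatted else 0.0
    return correctness + format_bonus

def self_evaluation_target(answer: str, gold: str, well_formatted: bool) -> str:
    """Textual target placed after the marker: the reward computation, written out."""
    correctness = 1.0 if answer.strip() == gold.strip() else 0.0
    format_bonus = 0.1 if well_formatted else 0.0
    total = correctness + format_bonus
    return f"correctness={correctness}, format_bonus={format_bonus}, reward={total:.2f}"
```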
Implementation uses dual-track SFT: one track optimizes reasoning, the other optimizes evaluation capability. These are mixed with RL training for multi-objective hybrid optimization. The model learns both to solve problems and to assess its own solutions — but critically, only the problem-solving capability is active during inference. The self-evaluation is internalized during training, shaping the model's generation without requiring explicit self-assessment at inference time.
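A hedged sketch of the multi-objective mix, assuming token masks that separate the reasoning/answer span from the post-completion span and an externally computed RL loss (the function name, weights, and averaging scheme are illustrative, not the method's exact objective):

```python
import torch
import torch.nn.functional as F

# Sketch of a dual-track SFT objective mixed with an RL term: one cross-entropy
# term over reasoning/answer tokens, another over post-completion
# (self-evaluation) tokens, plus a policy-gradient loss computed elsewhere.

def pcl_hybrid_loss(logits: torch.Tensor,          # [T, V] token logits
                    targets: torch.Tensor,          # [T] target token ids
                    reasoning_mask: torch.Tensor,   # [T] bool, pre-marker tokens
                    post_mask: torch.Tensor,        # [T] bool, post-marker tokens
                    rl_loss: torch.Tensor,          # scalar RL objective
                    w_reason: float = 1.0,
                    w_eval: float = 1.0,
                    w_rl: float = 1.0) -> torch.Tensor:
    token_loss = F.cross_entropy(logits, targets, reduction="none")  # [T]
    reasoning_mask = reasoning_mask.float()
    post_mask = post_mask.float()
    loss_reason = (token_loss * reasoning_mask).sum() / reasoning_mask.sum().clamp(min=1.0)
    loss_eval = (token_loss * post_mask).sum() / post_mask.sum().clamp(min=1.0)
    return w_reason * loss_reason + w_eval * loss_eval + w_rl * rl_loss
```

The point this sketch tries to capture is that the self-evaluation track shares parameters and gradients with the reasoning track, which is how the internalized evaluation can shape generation even though the post-marker tokens are never produced at inference.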
This addresses three limitations simultaneously:
- SFT's passive learning — models learn to mimic demonstrations without developing self-assessment ability
- RL's external dependency — reward models are opaque external components; PCL internalizes the evaluation
- Self-correction's inference cost — methods like Self-Refine require additional generation passes; PCL's self-evaluation is absorbed into training
The parallel with human cognition is direct: "Humans, after completing a task, often engage in self-reflection and quality assessment — this post-thinking process is crucial for improving future performance." PCL operationalizes this for LLMs.
This connects to What limits how much models can improve themselves? — PCL attempts to close the gap by training the verifier and generator as the same model, with the verification capability internalized rather than external. It also complements Does reflection in reasoning models actually correct errors? — PCL's self-evaluation is trained against ground-truth reward functions, not against the model's own prior outputs, potentially avoiding the confirmatory pattern.
Source: Novel Architectures
Related concepts in this collection
- What limits how much models can improve themselves? Explores whether self-improvement has fundamental boundaries set by how well models can verify versus generate solutions, and what this means across different task types. PCL addresses this by co-training generation and verification in the same model.
- Does reflection in reasoning models actually correct errors? When reasoning models reflect on their answers, do they genuinely fix mistakes, or merely confirm what they already decided? Understanding this matters for designing better training and inference strategies. PCL's evaluation is trained against external reward functions, potentially avoiding confirmatory bias.
- Can model confidence work as a reward signal for reasoning? Explores whether using a language model's own confidence scores as training rewards can simultaneously improve reasoning accuracy and restore calibration that standard RLHF damages. Related in that both use the model's own assessment capability as a training signal.
- Do reward models actually consider what the prompt asks? Explores whether standard reward models evaluate responses based on prompt context or just response quality alone. This matters because if models ignore prompts, they'll fail to align with what users actually want. PCL internalizes reward computation, potentially avoiding the prompt-insensitivity problem.
Original note title: post-completion learning uses the ignored post-eos space to internalize self-evaluation during training with zero inference cost