Reinforcement Learning for LLMs

Can models learn to evaluate their own work during training?

Explores whether language models can internalize reward function computation as part of training, transforming external feedback into internal self-assessment capability without slowing inference.

Note · 2026-02-23 · sourced from Novel Architectures

Current training paradigms terminate learning at the end-of-sequence token, leaving the sequence space after the model's output unused. Post-Completion Learning (PCL) systematically exploits this neglected space. A temporary termination marker (<-- post-completion -->) creates a "post-thinking" space where the model continues generating self-assessments and reward predictions during training, while inference stops at the marker, so deployment incurs zero additional cost.
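
A minimal sketch of what this layout could look like, assuming a simple string format (the `build_training_example` and `inference_output` helpers and the exact marker and EOS strings here are illustrative, not the paper's implementation):

```python
# Illustrative layout of a PCL-style training example (assumed format).
POST_MARKER = "<-- post-completion -->"  # temporary termination marker
EOS = "</s>"

def build_training_example(prompt: str, solution: str, self_assessment: str) -> str:
    """Training sequence: the solution ends at the marker, then the
    post-completion space holds the self-assessment, then the real EOS."""
    return f"{prompt}{solution}{POST_MARKER}{self_assessment}{EOS}"

def inference_output(full_generation: str) -> str:
    """At deployment, decoding stops at the marker, so the post-completion
    segment is never generated and adds no inference cost."""
    return full_generation.split(POST_MARKER)[0]

example = build_training_example(
    prompt="Q: 2 + 3 = ?\nA: ",
    solution="5",
    self_assessment=" The answer matches the reference (5), so reward = 1.0.",
)
print(inference_output(example))  # -> "Q: 2 + 3 = ?\nA: 5"
```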

The core innovation is white-box reinforcement learning: the model explicitly learns to understand and compute reward functions, internalizing the reward model as its own evaluation capability. This transforms the model from "passive reward acceptance" (external reward signal tells it what's good) to "active self-evaluation" (it learns to compute quality assessments itself).
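
One way to picture this internalization, as a sketch: the supervision target in the post-completion space verbalizes the reward computation itself, so the model is trained to reproduce that computation rather than merely receive its result (the exact-match `reward_fn` and the assessment template below are assumptions for illustration):

```python
# Sketch: turn a ground-truth reward function into an explicit, textual
# evaluation target that the model learns to generate after the marker.

def reward_fn(prediction: str, reference: str) -> float:
    """Example rule-based reward: exact-match correctness."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0

def make_self_assessment_target(prediction: str, reference: str) -> str:
    """Verbalize the reward computation so the model can internalize it."""
    reward = reward_fn(prediction, reference)
    verdict = "matches" if reward == 1.0 else "does not match"
    return (f" Self-check: the answer '{prediction.strip()}' {verdict} "
            f"the reference, so the predicted reward is {reward}.")

print(make_self_assessment_target("5", "5"))
```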

Implementation uses dual-track SFT: one track optimizes reasoning, the other optimizes evaluation capability. These are mixed with RL training for multi-objective hybrid optimization. The model learns both to solve problems and to assess its own solutions — but critically, only the problem-solving capability is active during inference. The self-evaluation is internalized during training, shaping the model's generation without requiring explicit self-assessment at inference time.
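
A rough sketch of how such a hybrid objective could be combined, assuming a PPO-style RL term and simple loss weights (the weights, the masking convention, and the clipped surrogate are assumptions about one reasonable instantiation, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def pcl_hybrid_loss(logits, labels_reasoning, labels_eval,
                    logprobs_new, logprobs_old, advantages,
                    w_reason=1.0, w_eval=1.0, w_rl=1.0, clip_eps=0.2):
    """Multi-objective hybrid loss: two SFT tracks plus a clipped policy-gradient term.
    labels_* use -100 to mask tokens a given track does not supervise
    (e.g. the reasoning track masks the post-completion span, and vice versa)."""
    vocab = logits.size(-1)
    sft_reason = F.cross_entropy(logits.view(-1, vocab),
                                 labels_reasoning.view(-1), ignore_index=-100)
    sft_eval = F.cross_entropy(logits.view(-1, vocab),
                               labels_eval.view(-1), ignore_index=-100)
    ratio = torch.exp(logprobs_new - logprobs_old)          # PPO-style ratio
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
    rl = -surrogate.mean()
    return w_reason * sft_reason + w_eval * sft_eval + w_rl * rl
```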

This addresses three limitations simultaneously:

  1. SFT's passive learning — models learn to mimic demonstrations without developing self-assessment ability
  2. RL's external dependency — reward models are opaque external components; PCL internalizes the evaluation
  3. Self-correction's inference cost — methods like Self-Refine require additional generation passes; PCL's self-evaluation is absorbed into training (see the sketch after this list)
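
To make the cost comparison in (3) concrete, here is a hypothetical decoding loop (not any real library's API): with PCL the marker simply acts as a stop string in a single pass, whereas Self-Refine-style correction needs at least one further pass conditioned on the draft.

```python
# Hypothetical decoding loop illustrating the inference-cost difference.
def generate(next_token, prompt, stop_strings, max_tokens=512):
    """Minimal loop that halts as soon as any stop string appears.
    next_token is an assumed callable returning one decoded token."""
    text = prompt
    for step in range(1, max_tokens + 1):
        text += next_token(text)
        for stop in stop_strings:
            if stop in text:
                return text.split(stop)[0], step
    return text, max_tokens

# PCL at deployment: a single pass that stops at the marker, so the
# self-evaluation span costs nothing at inference time.
#   answer, n = generate(next_token, prompt, ["<-- post-completion -->", "</s>"])
#
# Self-Refine-style correction: at least one additional pass over the draft.
#   critique, m = generate(next_token, prompt + answer + "\nCritique: ", ["</s>"])
```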

The parallel with human cognition is direct: "Humans, after completing a task, often engage in self-reflection and quality assessment — this post-thinking process is crucial for improving future performance." PCL operationalizes this for LLMs.

This connects to What limits how much models can improve themselves? — PCL attempts to close the gap by training the verifier and generator as the same model, with the verification capability internalized rather than external. It also complements Does reflection in reasoning models actually correct errors? — PCL's self-evaluation is trained against ground-truth reward functions, not against the model's own prior outputs, potentially avoiding the confirmatory pattern.


Source: Novel Architectures

Original note title: post-completion learning uses the ignored post-eos space to internalize self-evaluation during training with zero inference cost