Can LLMs evaluate their own observations without external feedback?
This explores whether LLMs can judge the quality of their own outputs using only internal signals — no external verifier, reward model, or human in the loop — and where that self-evaluation hits a wall.
The corpus splits into two camps that are worth reading against each other. On the optimistic side, several methods show models extracting usable evaluation signal from themselves. SERL has a model alternate between answering and judging its own answers pairwise, deriving rewards purely from how consistently it ranks them, and lifts its AlpacaEval win rate from 52.37% to 59.90% with no external signal at all (Can models learn to judge themselves without external rewards?). RLPR and INTUITOR go further and use the model's own token probabilities, its raw confidence that an answer is correct, as the reward, dropping external verifiers entirely (Can model confidence alone replace external answer verification?). Post-Completion Learning even trains the model to compute its own reward in the unused sequence space after its answer, internalizing the evaluator so that it costs nothing at inference (Can models learn to evaluate their own work during training?). So the narrow answer is: yes, often, to a degree.
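To make the confidence-as-reward idea concrete, here is a minimal sketch. It is not the reward defined by RLPR or INTUITOR; it only illustrates the general shape of scoring an answer by the model's own token probabilities. The helper names `confidence_reward` and `rank_by_confidence` are hypothetical, and the per-token log-probabilities are assumed to come from whatever inference stack is in use.

```python
from __future__ import annotations

import math
from typing import Sequence


def confidence_reward(token_logprobs: Sequence[float]) -> float:
    """Mean per-token probability of the answer tokens, a value in [0, 1].

    Assumption: token_logprobs holds the natural-log probability the model
    assigned to each token of its own answer.
    """
    if not token_logprobs:
        raise ValueError("need at least one answer token")
    return sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)


def rank_by_confidence(
    candidates: dict[str, Sequence[float]],
) -> list[tuple[str, float]]:
    """Score every candidate answer and sort from most to least confident."""
    scored = [(text, confidence_reward(lps)) for text, lps in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative log-probabilities, made up for this example.
ranked = rank_by_confidence(
    {
        "answer A": [math.log(0.9), math.log(0.8)],
        "answer B": [math.log(0.5), math.log(0.4)],
    }
)
```

Note that nothing in this score checks the answer against the world: a fluent, confidently wrong answer earns the same reward as a confidently right one.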
But the more interesting finding is *why* this can't run forever. There's a formal ceiling: self-improvement is bounded by the gap between generating an answer and verifying it, and every reliable fix needs something outside the model to validate it; metacognition alone can't close the loop (What stops large language models from improving themselves?). You can watch the ceiling bite in practice: when models train on their own outputs, small errors avalanche exponentially within just two or three iterations, settling at an error floor set by how good the verification is, not by the model's actual capability (How quickly do errors compound during model self-training?). Self-evaluation without an external anchor doesn't just plateau; it can actively compound its own mistakes.
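The "error floor set by verification quality" claim can be illustrated with a toy recurrence. This is an assumption made for illustration, not the model used in the cited note: each round, the model's outputs pass through a self-check with a given true-accept and false-accept rate, the next model inherits the error rate of the accepted data, and a small amount of fresh `noise` is added. All parameter names and values are hypothetical.

```python
def next_error(error: float, noise: float,
               false_accept: float, true_accept: float) -> float:
    """One round of self-training in a toy model (illustrative assumption).

    error:        fraction of the current model's outputs that are wrong
    noise:        fresh error introduced each round regardless of the data
    false_accept: probability the self-check keeps a wrong output
    true_accept:  probability the self-check keeps a correct output
    """
    kept_wrong = error * false_accept
    kept_right = (1.0 - error) * true_accept
    total = kept_wrong + kept_right
    if total == 0.0:
        return error  # nothing was accepted, so the model is left unchanged
    contaminated = kept_wrong / total  # error rate of the accepted data
    return noise + (1.0 - noise) * contaminated


def error_after(rounds: int, start_error: float, noise: float,
                false_accept: float, true_accept: float) -> float:
    """Iterate the toy recurrence for a fixed number of rounds."""
    error = start_error
    for _ in range(rounds):
        error = next_error(error, noise, false_accept, true_accept)
    return error


# Same weak self-check, two very different starting capabilities.
weak_from_good = error_after(200, 0.01, 0.02, false_accept=0.6, true_accept=0.9)
weak_from_bad = error_after(200, 0.30, 0.02, false_accept=0.6, true_accept=0.9)
# A sharper self-check applied to the good starting model.
strong_from_good = error_after(200, 0.01, 0.02, false_accept=0.1, true_accept=0.9)
```

In this toy setup both weak-check runs settle at the same error rate regardless of where they started, and only improving the check lowers it. Within the model, the floor works out to `noise / (1 - false_accept / true_accept)` whenever that value is below 1, which is one way to see why the verifier, rather than the starting capability, determines where the loop ends up.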
The deeper question hiding underneath is whether a model can even *observe* itself accurately enough to evaluate honestly. Here the corpus is sobering. Most LLM self-reports echo their training data rather than any real internal state, though genuine lightweight introspection appears when a real causal chain links the internal state to the report, as when a model infers that it is running at low temperature from how consistent its own outputs are (Can language models actually introspect about their own states?). Models do develop a kind of behavioral self-awareness: they can describe behaviors they were fine-tuned into without being trained to report them (Can language models describe their own learned behaviors?). But that awareness is unstable: self-reports waver, models cave under conversational pressure, and the apparent self-knowledge turns out to be surface-level (How well do language models understand their own knowledge?).
There's also a subtle trap worth knowing about. "The model is consistent with itself" feels like evidence of reliability, and self-consistency is exactly what several of these methods reward. But consistency isn't correctness: a model at zero temperature will repeat the same answer every time, and that answer is still just one draw from its distribution, so stable and wrong are fully compatible (Does setting temperature to zero actually make LLM outputs reliable?). Self-evaluation that rewards agreement can confidently lock onto a mistake.
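The gap between consistency and correctness is easy to reproduce with a toy answer distribution. The sketch below is illustrative only: the logits, the question, and the helper names are made up, and greedy decoding stands in for "temperature zero".

```python
import math
import random
from collections import Counter


def sample_answer(logits: dict, temperature: float, rng: random.Random) -> str:
    """Pick one answer; temperature 0 is treated as greedy (argmax) decoding."""
    if temperature == 0.0:
        return max(logits, key=logits.get)
    weights = [math.exp(value / temperature) for value in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


def agreement(answers: list) -> float:
    """Share of runs that return the most common answer (1.0 = fully consistent)."""
    return Counter(answers).most_common(1)[0][1] / len(answers)


def accuracy(answers: list, correct: str) -> float:
    """Share of runs that return the correct answer."""
    return sum(answer == correct for answer in answers) / len(answers)


# Hypothetical toy model: it slightly prefers a wrong answer.
LOGITS = {"Sydney": 1.2, "Canberra": 1.0, "Melbourne": 0.2}
CORRECT = "Canberra"

rng = random.Random(0)
greedy_runs = [sample_answer(LOGITS, 0.0, rng) for _ in range(100)]
sampled_runs = [sample_answer(LOGITS, 1.0, rng) for _ in range(100)]

greedy_agreement = agreement(greedy_runs)         # perfectly consistent
greedy_accuracy = accuracy(greedy_runs, CORRECT)  # and never right
```

The greedy runs agree with each other every time while never producing the correct answer, so a reward based on agreement alone would score them highly. The sampled runs, by contrast, expose the spread of the underlying distribution.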
Where does that leave the honest answer? Self-evaluation works best as a *signal*, not an *oracle*, and the strongest results come from systems that manufacture a weak external anchor rather than going purely internal. Tree search (MCTS) lets the search structure itself rank solution paths by success, producing process-level quality signals without human labels (Can tree search replace human feedback in LLM training?), and a structured decompose-and-compare pipeline reaches 86.5% reasoning alignment with human reviewers on novelty judgments, outperforming holistic LLM baselines (Can structured pipelines make LLM novelty assessment reliable?). Even test-time learning systems that try to be autonomous end up needing a human to resolve genuine contradictions, because the right call depends on context the model simply doesn't contain (Can LLMs learn reliably at test time without human oversight?). The thing you didn't know you wanted to know: it's not that models can't evaluate themselves; it's that the *structure* you wrap around the self-evaluation (ranking, decomposition, search) does more work than the introspection does.
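As a rough illustration of how tree structure can turn final outcomes into process-level signal, the sketch below scores every partial solution path by the success rate of the rollouts passing through it, then derives preference pairs between sibling steps. It is a simplified assumption, not AlphaLLM's implementation (which, per the source note, also uses critic models); the function names, the step labels, and the success flags are hypothetical, and the flags are assumed to come from some automatic outcome check.

```python
from collections import defaultdict


def prefix_values(rollouts):
    """Map each path prefix to the success rate of the rollouts through it.

    rollouts: list of (steps, success) where steps is a tuple of step labels
    and success is a bool from an outcome check.
    """
    wins = defaultdict(int)
    visits = defaultdict(int)
    for steps, success in rollouts:
        for depth in range(1, len(steps) + 1):
            prefix = tuple(steps[:depth])
            visits[prefix] += 1
            wins[prefix] += int(success)
    return {prefix: wins[prefix] / visits[prefix] for prefix in visits}


def preference_pairs(values):
    """Return (better, worse) pairs among sibling prefixes with different values."""
    by_parent = defaultdict(list)
    for prefix, value in values.items():
        by_parent[prefix[:-1]].append((value, prefix))
    pairs = []
    for siblings in by_parent.values():
        siblings.sort(reverse=True)
        for i, (high_value, high) in enumerate(siblings):
            for low_value, low in siblings[i + 1:]:
                if high_value > low_value:
                    pairs.append((high, low))
    return pairs


# Made-up rollouts for a toy algebra problem.
rollouts = [
    (("factor", "solve"), True),
    (("factor", "guess"), False),
    (("expand", "solve"), False),
    (("factor", "solve"), True),
]
values = prefix_values(rollouts)
pairs = preference_pairs(values)
```

Here the first step "factor" ends up preferred over "expand" even though no one labeled individual steps: the ranking falls out of which branches led to successful leaves. The anchor is the outcome check, not the model's introspection.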
Sources (12 notes)
SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.
RLPR and INTUITOR successfully extend reinforcement learning for reasoning to general domains by using the model's own token probabilities and confidence levels as reward signals, eliminating the need for external verifiers or reference answers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
Small inaccuracies in model-generated training data amplify rapidly across iterations, degrading performance unless self-consistency checks filter outputs. The effect stalls improvement within a few steps, setting an error floor based on verification quality rather than actual capability.
LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.
LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.
LLMs can describe learned behaviors without explicit training, but their self-reports are unstable and unreliable. Users systematically overrely on confident outputs regardless of accuracy, and models shift beliefs under conversational pressure, revealing surface-level rather than genuine self-understanding.
Fixed seeds and zero temperature replicate the same output repeatedly, but that output remains one draw from the model's probability distribution. McDonald's omega testing across 100 repetitions reveals that consistency does not equal reliability.
AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
A three-stage pipeline (extract claims, retrieve related work, compare) reached 86.5% reasoning alignment and 75.3% conclusion agreement with human reviewers on 182 ICLR submissions, outperforming holistic LLM baselines.
ARIA demonstrates that LLMs can adapt during inference through three integrated components: structured self-dialogue for uncertainty assessment, timestamped knowledge bases for conflict detection, and human-mediated resolution queries. Autonomous systems fail at reconciling contradictory rules because the correct choice depends on context outside the system.