Can language models accurately evaluate the quality of their own reasoning?
This note explores whether a model can judge its own thinking: not just produce an answer, but reliably tell good reasoning from bad. The corpus is mostly skeptical, with a few engineered escape hatches.
This question is really asking whether a model can be its own judge: can it look at a chain of reasoning it generated and tell whether that reasoning is actually sound? The corpus leans skeptical, and the most direct evidence is a structural bias. Models systematically over-trust answers they generated themselves, because a high-probability output simply *feels* more correct during evaluation; the model is grading its own work with a thumb on the scale (see "Why do models trust their own generated answers?"). The fix that paper points to is telling: self-agreement breaks only when the model is forced to compare its answer against broader alternatives rather than re-inspect it in isolation. Evaluation improves when it stops being purely self-referential.
There's a deeper problem underneath the bias, which is that the thing being evaluated may not be what it appears to be. Reasoning traces turn out to be closer to persuasive performance than to a faithful record of computation: invalid logical steps score nearly as well as valid ones, and deliberately corrupted traces generalize about as well as clean ones (see "Do reasoning traces show how models actually think?"). If the visible reasoning isn't what's actually producing the answer, then a model 'evaluating its reasoning' is partly evaluating a story it told after the fact. That gap shows up even inside the network: some models compute the correct answer in their early layers and then overwrite it with format-compliant filler tokens, so the trace you'd ask them to grade isn't where the real work happened (see "Do transformers hide reasoning before producing filler tokens?").
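The source note behind the last claim describes a logit lens analysis: intermediate hidden states are projected through the output (unembedding) matrix to see which token each layer would predict. The following is a minimal sketch of that idea, not the study's code. The vocabulary, the `unembed` matrix, and the `hidden_states` values are hand-made toy numbers chosen for illustration, and the final normalization layer that real implementations usually apply before unembedding is omitted.

```python
def logit_lens(hidden_states, unembed, vocab, top_k=2):
    """Project each layer's hidden state through the unembedding matrix
    and return the top_k tokens by logit for that layer."""
    per_layer = []
    for h in hidden_states:
        # One logit per vocabulary entry: dot product of its row with h.
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        order = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
        per_layer.append([vocab[i] for i in order[:top_k]])
    return per_layer


# Toy setup (assumed values, not real model activations).
vocab = ["7", "...", "the"]
unembed = [
    [1.0, 0.0],    # direction for the answer token "7"
    [0.0, 1.0],    # direction for the filler token "..."
    [-1.0, -1.0],  # direction for an unrelated token
]
hidden_states = [
    [2.0, 0.5],  # early layer: the answer direction dominates
    [1.5, 1.0],  # middle layer: answer still ahead
    [1.0, 3.0],  # final layer: filler dominates, answer drops to rank 2
]

lens = logit_lens(hidden_states, unembed, vocab)
for layer, tokens in enumerate(lens):
    print(f"layer {layer}: top tokens = {tokens}")
```

In this toy run the answer token leads in the early layers and falls to second place in the final layer, which mirrors the pattern the note reports: the answer stays recoverable from lower-ranked predictions even when the emitted token is filler.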
The sharpest theoretical limit is the generation-verification gap: self-improvement is formally bounded, and every reliable correction requires something *external* to validate and enforce it. A model can't metacognize its way past this ceiling; introspection alone can't manufacture a trustworthy verifier (see "What stops large language models from improving themselves?"). This is the crux of the answer: accurate self-evaluation, in the strong sense, runs into a wall that internal reflection alone can't climb.
That said, the corpus isn't a flat 'no'; it shows engineered ways to bend the constraint. A model's own answer-span confidence can be turned into a usable reward signal that ranks reasoning traces, strengthening step-by-step reasoning while actually *restoring* calibration that RLHF had degraded, all without human labels (see "Can model confidence work as a reward signal for reasoning?"). And 'post-completion learning' trains a model to compute its own reward in the unused sequence space after its output, internalizing evaluation during training at zero inference cost (see "Can models learn to evaluate their own work during training?"). The pattern across both: self-evaluation becomes reliable when it's grounded in a learned, calibrated signal rather than left as free-floating self-judgment.
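To make the confidence-as-reward idea concrete, here is a rough sketch of how answer-span confidence could rank reasoning traces and yield synthetic preference pairs. It is an illustration under stated assumptions, not the RLSF authors' implementation: the geometric-mean confidence measure, the `min_margin` filter, and the names `answer_span_confidence` and `build_preference_pairs` are all hypothetical, and the log-probabilities are made-up numbers.

```python
import math
from itertools import combinations


def answer_span_confidence(token_logprobs):
    """Geometric-mean probability of the answer tokens.
    This particular confidence measure is an assumption for illustration."""
    if not token_logprobs:
        raise ValueError("answer span must contain at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def build_preference_pairs(traces, min_margin=0.05):
    """Rank traces by answer-span confidence and emit (chosen, rejected) pairs.

    `traces` maps a trace id to the log-probabilities of its answer tokens.
    Pairs whose confidence gap is below `min_margin` are skipped as too close
    to call (a hypothetical noise filter).
    """
    scored = {tid: answer_span_confidence(lps) for tid, lps in traces.items()}
    ranked = sorted(scored, key=scored.get, reverse=True)
    pairs = []
    # combinations() keeps ranked order, so `better` always outranks `worse`.
    for better, worse in combinations(ranked, 2):
        if scored[better] - scored[worse] >= min_margin:
            pairs.append((better, worse))
    return ranked, pairs


# Made-up answer-token log-probabilities for three sampled traces.
traces = {
    "trace_a": [-0.05, -0.10],
    "trace_b": [-0.70, -0.90],
    "trace_c": [-0.06, -0.11],
}

ranked, pairs = build_preference_pairs(traces)
print("ranking:", ranked)
print("preference pairs:", pairs)
```

The resulting pairs could feed a standard preference-optimization step. The margin filter drops the near-tie between `trace_a` and `trace_c`, so only clearly separated traces become training signal.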
Worth knowing as a twist: some of what looks like a model misjudging its own reasoning isn't an evaluation failure at all. Reasoning 'collapses' often turn out to be execution failures: the model knows the algorithm but can't carry out enough steps in text alone, and tool access dissolves the supposed cliff (see "Are reasoning model collapses really failures of reasoning?"). Likewise, failures cluster around unfamiliar instances rather than genuine complexity (see "Do language models fail at reasoning due to complexity or novelty?"). So before asking whether a model can grade its reasoning, it's worth asking whether the reasoning it's grading even reflects what the model can actually do.
Sources (8 notes)
LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.
LLM reasoning traces perform as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.
Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.