Can language models accurately evaluate the quality of their own reasoning?

This note explores whether a model can judge its own thinking: not just produce an answer, but reliably tell good reasoning from bad. The corpus is mostly skeptical, with a few engineered escape hatches.

This question is really asking whether a model can be its own judge: can it look at a chain of reasoning it generated and tell whether that reasoning is actually sound? The corpus leans skeptical, and the most direct evidence is a structural bias. Models systematically over-trust answers they generated themselves, because a high-probability output simply *feels* more correct during evaluation; the model is grading its own work with a thumb on the scale (Why do models trust their own generated answers?). The fix that paper points to is telling: self-agreement breaks only when the model is forced to compare its answer against broader alternatives rather than re-inspect it in isolation. Evaluation improves when it stops being purely self-referential.
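As a rough illustration of the difference between re-inspecting an answer in isolation and comparing it against alternatives, the sketch below builds the two kinds of evaluation prompt. It assumes prompt-based evaluation; the function names, the prompt wording, and the example question are all hypothetical and are not taken from the cited work.

```python
import random


def isolated_prompt(question, own_answer):
    """Self-check in isolation: the setup the text says invites over-trust."""
    return (
        f"Question: {question}\n"
        f"Proposed answer: {own_answer}\n"
        "Is the proposed answer correct? Reply yes or no."
    )


def comparative_prompt(question, own_answer, alternatives, seed=0):
    """Place the model's own answer among alternatives, shuffled so that
    its origin is not marked (hypothetical prompt wording)."""
    candidates = [own_answer] + [a for a in alternatives if a != own_answer]
    random.Random(seed).shuffle(candidates)  # seeded for reproducibility
    # Letter labels A, B, C, ... (assumes at most 26 candidates).
    lines = [f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(candidates)]
    prompt = (
        f"Question: {question}\n"
        + "Candidate answers:\n"
        + "\n".join(lines)
        + "\nJustify or refute each candidate, then name the best-supported one."
    )
    return prompt, candidates


prompt, order = comparative_prompt(
    "What is the capital of Australia?",
    own_answer="Sydney",
    alternatives=["Canberra", "Melbourne"],
)
print(prompt)
```

Shuffling and dropping the "proposed answer" label is a design choice of this sketch: the evaluator has to weigh every candidate instead of confirming the one it already produced.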

There's a deeper problem underneath the bias, which is that the thing being evaluated may not be what it appears to be. Reasoning traces turn out to be closer to persuasive performance than to a faithful record of computation: invalid logical steps score nearly as well as valid ones, and deliberately corrupted traces generalize about as well as clean ones (Do reasoning traces show how models actually think?). If the visible reasoning isn't what's actually producing the answer, then a model 'evaluating its reasoning' is partly evaluating a story it told after the fact. That gap shows up even inside the network: some models compute the correct answer in their early layers and then suppress it in the final layers to emit format-compliant filler tokens, so the trace you'd ask them to grade isn't where the real work happened (Do transformers hide reasoning before producing filler tokens?).

The sharpest theoretical limit is the generation-verification gap: self-improvement is formally bounded, and every reliable correction requires something *external* to validate and enforce it. A model can't introspect its way past this ceiling; metacognition alone can't manufacture a trustworthy verifier (What stops large language models from improving themselves?). This is the crux of the answer: accurate self-evaluation, in the strong sense, runs into a wall that internal reflection alone can't climb.
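A toy sketch can make the role of an external check concrete. It does not model the formal bound. It only contrasts picking the answer the generator rates highest with picking an answer that passes an independent check. The task, the candidate answers, and the confidence values are made up for illustration.

```python
def select_by_self_confidence(candidates):
    """Pick the candidate the generator itself rates highest."""
    return max(candidates, key=lambda c: c["self_confidence"])


def select_by_external_check(candidates, verifier):
    """Return the first candidate that passes an external check, else None."""
    for candidate in candidates:
        if verifier(candidate["answer"]):
            return candidate
    return None


def verify_product(answer, a=17, b=24):
    """External verifier for the toy task: recompute a * b independently."""
    return answer == a * b


# Hypothetical candidates for "17 * 24"; the confidences are invented.
candidates = [
    {"answer": 418, "self_confidence": 0.93},  # most confident, but wrong
    {"answer": 408, "self_confidence": 0.71},  # correct
    {"answer": 398, "self_confidence": 0.40},
]

self_pick = select_by_self_confidence(candidates)
checked_pick = select_by_external_check(candidates, verify_product)
print(self_pick["answer"], checked_pick["answer"])
```

Here the verifier is trivially cheap because the task is arithmetic. For open-ended reasoning, finding a check of comparable reliability is the hard part, which is where the limit described above applies.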

That said, the corpus isn't a flat 'no': it shows engineered ways to bend the constraint. A model's own answer-span confidence can be turned into a usable reward signal that ranks reasoning traces, strengthening step-by-step reasoning while actually *restoring* calibration that RLHF had degraded, all without human labels (Can model confidence work as a reward signal for reasoning?). And 'post-completion learning' trains a model to compute its own reward in the unused sequence space after its output, internalizing evaluation during training at zero inference cost (Can models learn to evaluate their own work during training?). The pattern across both: self-evaluation becomes reliable when it's grounded in a learned, calibrated signal rather than left as free-floating self-judgment.
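The confidence-as-reward idea can be sketched as a small ranking step that turns sampled traces into synthetic preference pairs. This is a minimal sketch under stated assumptions, not the cited method's implementation: confidence is assumed here to be the geometric-mean probability of the answer tokens, and the dictionary keys, the `min_margin` threshold, and the sample values are hypothetical.

```python
import math
from itertools import combinations


def answer_span_confidence(token_logprobs):
    """Geometric-mean probability of the answer tokens (an assumed definition)."""
    if not token_logprobs:
        raise ValueError("answer span must contain at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def build_preference_pairs(traces, min_margin=0.05):
    """Rank traces by confidence and emit (chosen, rejected) pairs.

    `traces` is a list of dicts with hypothetical keys "trace" and
    "answer_logprobs". Pairs whose confidence gap is below `min_margin`
    are skipped as too close to call.
    """
    scored = sorted(
        ((answer_span_confidence(t["answer_logprobs"]), t["trace"]) for t in traces),
        key=lambda pair: pair[0],
        reverse=True,
    )
    pairs = []
    # combinations() keeps the sorted order, so the first item of each
    # pair always has confidence >= the second.
    for (hi_conf, hi_trace), (lo_conf, lo_trace) in combinations(scored, 2):
        if hi_conf - lo_conf >= min_margin:
            pairs.append({"chosen": hi_trace, "rejected": lo_trace})
    return pairs


# Made-up log-probabilities for the answer tokens of three sampled traces.
samples = [
    {"trace": "trace A", "answer_logprobs": [-0.05, -0.10]},
    {"trace": "trace B", "answer_logprobs": [-1.20, -0.90]},
    {"trace": "trace C", "answer_logprobs": [-0.07, -0.09]},
]
pairs = build_preference_pairs(samples)
print(pairs)
```

Pairs like these could feed a standard preference-optimization step. Whether that improves calibration depends on the training setup described in the cited note, which this sketch does not reproduce.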

Worth knowing as a twist: some of what looks like a model misjudging its own reasoning isn't an evaluation failure at all. Reasoning 'collapses' often turn out to be execution failures: the model knows the algorithm but can't carry out enough steps in text alone, and tool access dissolves the supposed cliff (Are reasoning model collapses really failures of reasoning?). Likewise, failures cluster around unfamiliar instances rather than genuine complexity (Do language models fail at reasoning due to complexity or novelty?). So before asking whether a model can grade its reasoning, it's worth asking whether the reasoning it's grading even reflects what the model can actually do.

Sources (8 notes)

Why do models trust their own generated answers?

LLMs exhibit structural bias toward validating their own outputs because high-probability generated answers feel more correct during evaluation. Comparing answers against broader alternatives breaks this self-agreement loop.

Do reasoning traces show how models actually think?

LLM reasoning traces function as persuasive appearances rather than reliable explanations of computation. Invalid logical steps perform nearly as well as valid ones, and corrupted traces generalize comparably, showing that semantic correctness is not what produces the performance gains.

Do transformers hide reasoning before producing filler tokens?

Logit lens analysis shows models trained with hidden CoT tokens compute correct answers in layers 1-3, then actively suppress these representations in final layers to produce format-compliant filler output. The reasoning is fully recoverable from lower-ranked token predictions.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can model confidence work as a reward signal for reasoning?

RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Are reasoning model collapses really failures of reasoning?

Models confined to text-only generation cannot execute multi-step procedures at scale, even when they know the underlying algorithm. Tool-enabled models solve problems beyond the supposed reasoning cliff, suggesting the bottleneck is procedural execution bandwidth.

Do language models fail at reasoning due to complexity or novelty?

Large reasoning models (LRMs) don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so a reasoning chain of any length tends to succeed when the model was trained on similar instances.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-examining whether language models can accurately evaluate their own reasoning. A curated library (spanning 2023–2026) surfaced these claims—treat them as dated; your job is to test whether they still hold.

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026 across reasoning architecture, self-evaluation bias, and learned reward signals.

• Models systematically over-trust answers they generated themselves due to high-probability outputs feeling correct during self-grading; self-agreement breaks only when forced to compare against broader alternatives rather than re-inspecting in isolation (2024-03, arXiv:2403.09972).

• Reasoning traces behave more like persuasive performance than faithful computation records—invalid logical steps score nearly as well as valid ones, and corrupted traces generalize comparably to clean ones (2024-12, arXiv:2412.04537).

• The generation-verification gap imposes a formal ceiling: self-improvement is bounded, and every reliable correction requires something external to validate and enforce it; pure introspection cannot manufacture a trustworthy verifier (2024-12, arXiv:2412.02674).

• Model confidence (answer-span confidence) can be converted into a usable reward signal that ranks reasoning traces and restores calibration degraded by RLHF, without human labels (2025-07, arXiv:2507.21931).

• Reasoning collapses often reflect execution failures—models know the algorithm but cannot carry out enough steps in text alone; tool access dissolves the supposed cliff (2026-02, arXiv:2602.06176).

Anchor papers (verify; mind their dates):
- arXiv:2403.09972 (2024-03): Self-detection bias in self-evaluation.
- arXiv:2412.04537 (2024-12): Hidden computations and reasoning traces.
- arXiv:2412.02674 (2024-12): Self-improvement bounds and external verification.
- arXiv:2507.21931 (2025-07): Self-feedback as learned reward signal.

Your task:

(1) RE-TEST EACH CONSTRAINT. For the over-trust bias, self-agreement failure, generation-verification gap, and execution-collapse claims: have newer models (GPT-4o, o1, Claude 3.5), scaling laws, or training methods (e.g., process reward models, synthetic verifiers, constitutional AI) since RELAXED or OVERTURNED these limits? Separate the durable question (can models introspect soundly?) from perishable limitations (specific training regimes, model scales). Cite what resolved each, plainly flagging what still holds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months. Has any recent paper (late 2025 or 2026) claimed that models *can* self-evaluate accurately under certain conditions, or demonstrated a method that dissolves the generation-verification gap?

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If external verification is always required, what is the minimal external signal needed to bootstrap self-evaluation, and can models learn to generalize from it? (b) If reasoning traces are post-hoc narratives, what architectural change would surface the actual computation pathway so models can authentically grade their own work?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
