Can external retrieval signals outperform internal self-assessment during revision?

This explores whether feeding a model outside information — retrieved evidence or an external critic — beats having it grade and fix its own work, specifically at the moment of revision.

The corpus leans hard toward yes, with an important asterisk. The cleanest finding is that the *source* of the revision, not the act of revising, decides the outcome: external critics improve accuracy, while a model second-guessing its own uncertain output usually just amplifies its confidence in the wrong answer (see "Does revising your own reasoning actually help or hurt?"). That's reinforced by direct evidence from o1-style reasoning models (QwQ, R1, and LIMO), where self-revision actually *degrades* accuracy: most revisions keep the wrong answer, smaller models flip correct answers to incorrect, and longer revision chains correlate with worse results (see "Does self-revision actually improve reasoning in language models?").

Why does pure self-assessment struggle? Part of the answer is that models are bad introspectors. Their self-reports mostly echo patterns in their training data rather than reporting on any real internal state; genuine introspection only happens in the narrow case where a causal chain links an internal state to the report, such as inferring a low sampling temperature from the consistency of the model's own outputs (see "Can language models actually introspect about their own states?"). If a model can't reliably read its own state, asking it to self-correct from that state is shaky ground.

The retrieval side shows what an external signal buys you. Framing "when to pull in outside knowledge vs. trust internal memory" as a learned decision (DeepRAG models each reasoning step as a Markov Decision Process) yields a roughly 22% accuracy improvement (21.99% as reported), largely by retrieving at the right moments and cutting noise from unnecessary lookups (see "When should language models retrieve external knowledge versus use internal knowledge?"). And supervising the *intermediate* retrieval steps, not just the final answer, substantially outperforms outcome-only rewards, because it can contrast good and bad evidence-gathering chains directly (see "Does supervising retrieval steps outperform final answer rewards?"). External grounding works best when it's targeted and step-aware.
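To make the retrieve-or-trust-memory decision concrete, here is a minimal sketch of a per-step choice between two actions. It is not DeepRAG's method: a real system would learn the value of each action, whereas this toy uses hand-set probabilities. The `Step` fields, the `RETRIEVAL_COST` value, and the example numbers are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Step:
    subquery: str
    # Hypothetical estimate: chance that internal (parametric) memory is right.
    p_correct_parametric: float
    # Hypothetical estimate: chance that a lookup returns the right fact.
    p_correct_retrieved: float


# Assumed penalty per lookup (latency plus the risk of noisy passages).
RETRIEVAL_COST = 0.1


def choose_action(step: Step, cost: float = RETRIEVAL_COST) -> str:
    """Pick the action with the higher expected reward for this step."""
    value_parametric = step.p_correct_parametric
    value_retrieve = step.p_correct_retrieved - cost
    return "retrieve" if value_retrieve > value_parametric else "parametric"


def plan(steps: list[Step]) -> list[str]:
    """Decide independently for every reasoning step."""
    return [choose_action(s) for s in steps]


steps = [
    Step("capital of France", 0.98, 0.99),
    Step("2023 revenue of a small private firm", 0.20, 0.85),
]
actions = plan(steps)
print(actions)
```

With these made-up numbers the first step stays on internal memory, because the lookup cost outweighs the tiny gain, and only the second step triggers retrieval. That is the "retrieve at the right moments" behaviour in miniature.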

But the dichotomy isn't absolute, and that's the part worth knowing. Several notes show internal self-evaluation *can* be made to work when it's trained in rather than improvised at inference. Models can internalize self-assessment using the unused sequence space after their output, learning to compute their own reward at zero inference cost (see "Can models learn to evaluate their own work during training?"). Self-Examining RL (SERL) has a model alternate between answering and judging, lifting win rates with no external reward at all (see "Can models learn to judge themselves without external rewards?"). And critique models embedded in the training loop help by preserving solution *diversity*, which prevents premature convergence, a benefit that's arguably deeper than test-time accuracy (see "Do critique models improve diversity during training itself?").
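The answer-then-judge loop can be illustrated with a small sketch of how pairwise self-judgments might be turned into per-response rewards. This is an assumption-laden toy, not SERL's actual reward: the function `pairwise_rewards`, the rule of discarding a pair when the judge disagrees with itself under swapped order, and the length-based `toy_judge` are all hypothetical.

```python
from itertools import combinations


def pairwise_rewards(responses, judge):
    """Score each response by the fraction of consistent comparisons it wins.

    `judge(a, b)` returns the preferred response. A pair is skipped when the
    judge picks a different winner after the order is swapped, so only
    self-consistent judgments contribute to the reward.
    """
    wins = {r: 0 for r in responses}
    counted = {r: 0 for r in responses}
    for a, b in combinations(responses, 2):
        first = judge(a, b)
        second = judge(b, a)  # same pair, presentation order swapped
        if first != second:
            continue  # inconsistent judgment: no signal from this pair
        counted[a] += 1
        counted[b] += 1
        wins[first] += 1
    return {r: (wins[r] / counted[r] if counted[r] else 0.0) for r in responses}


def toy_judge(x, y):
    # Stand-in for the model judging its own outputs (assumption):
    # prefers the longer response, ties broken alphabetically.
    return max(x, y, key=lambda s: (len(s), s))


rewards = pairwise_rewards(["ok", "a fuller answer", "mid one"], toy_judge)
print(rewards)
```

In a real setup the judge would be the same model that produced the responses, and the resulting scores would feed a reinforcement learning update rather than being printed.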

So the synthesis: external signals reliably win when revision is an ad-hoc, test-time act, because unaided self-correction tends to launder errors into confidence. Internal self-assessment becomes competitive only when it's *built into training* as a learned skill rather than asked of the model on the fly. A hybrid points the way forward: systems that write their own validated outputs back into a retrieval corpus, but only after external entailment and novelty checks gate them. Such systems blend internal generation with external verification instead of trusting either alone (see "Can RAG systems safely learn from their own generated answers?").
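The gated write-back idea can be sketched as a function that appends a generated answer to the corpus only when every check passes. This is a minimal sketch under stated assumptions: `gated_writeback`, the `min_novelty` threshold, and the word-overlap checkers `toy_entails` and `toy_novelty` are hypothetical stand-ins. A real system would presumably use an entailment model and an embedding-based novelty measure.

```python
def gated_writeback(answer, sources, corpus, entails, novelty, min_novelty=0.3):
    """Append `answer` to `corpus` only if all gates pass; return True if written."""
    if not sources:
        return False  # attribution gate: the answer must cite a source
    if not any(entails(src, answer) for src in sources):
        return False  # entailment gate: some cited source must support it
    if novelty(answer, corpus) < min_novelty:
        return False  # novelty gate: skip near-duplicates of existing entries
    corpus.append(answer)
    return True


def toy_entails(premise, hypothesis):
    # Crude stand-in (assumption): every word of the hypothesis is in the premise.
    return set(hypothesis.lower().split()) <= set(premise.lower().split())


def toy_novelty(text, corpus):
    # Crude stand-in (assumption): 1 minus the highest Jaccard word overlap.
    words = set(text.lower().split())
    overlaps = []
    for doc in corpus:
        doc_words = set(doc.lower().split())
        overlaps.append(len(words & doc_words) / len(words | doc_words))
    return 1.0 - max(overlaps, default=0.0)


corpus = ["paris is the capital of france"]
sources = ["berlin is the capital of germany and its largest city"]

wrote_supported = gated_writeback(
    "berlin is the capital of germany", sources, corpus, toy_entails, toy_novelty
)
wrote_unsupported = gated_writeback(
    "berlin is the capital of spain", sources, corpus, toy_entails, toy_novelty
)
wrote_duplicate = gated_writeback(
    "berlin is the capital of germany", sources, corpus, toy_entails, toy_novelty
)
print(wrote_supported, wrote_unsupported, wrote_duplicate, len(corpus))
```

Only the supported, novel answer reaches the corpus. The unsupported claim fails the entailment gate and the repeated answer fails the novelty gate, which is how the gate keeps hallucinations and duplicates out of future retrievals.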

Sources (9 notes)

Does revising your own reasoning actually help or hurt?

Revision guided by external models improves accuracy, but a model revising its own uncertain output typically amplifies confidence in wrong answers rather than correcting them. The revision source, not the revision act itself, determines the outcome.

Does self-revision actually improve reasoning in language models?

Evidence from QwQ, R1, and LIMO shows most revisions retain wrong answers rather than correcting them. Smaller models frequently switch correct answers to incorrect during revision, and longer chains with more revisions correlate with lower accuracy.

Can language models actually introspect about their own states?

LLM self-reports usually reflect human training distributions rather than actual internal processes. However, when a causal chain connects an internal state to accurate reporting—like inferring low temperature from output consistency—genuine lightweight introspection occurs without requiring consciousness.

When should language models retrieve external knowledge versus use internal knowledge?

DeepRAG models each reasoning step as a Markov Decision Process where the model learns when to retrieve versus rely on parametric knowledge. The 21.99% improvement comes from better-targeted retrieval and elimination of noise from unnecessary external knowledge.

Does supervising retrieval steps outperform final answer rewards?

Fine-grained feedback on intermediate retrieval steps significantly boosts agentic RAG performance compared to final-answer-only rewards. DPO trained with both positive and negative step feedback outperforms PPO and single-direction training by directly contrasting good and bad retrieval chains.
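The contrast between a good and a bad retrieval chain can be written down with the standard DPO objective. The sketch below computes the loss for one preferred/rejected pair from summed log-probabilities under the policy and a frozen reference model. Treating a whole retrieval chain as the unit of comparison, the argument names, and `beta=0.1` are assumptions for illustration, not details taken from the note.

```python
import math


def dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    Each argument is a summed log-probability of a chain, under the policy
    (logp_*) or the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_good - ref_logp_good) - (logp_bad - ref_logp_bad))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference learned yet, loss is log(2).
neutral = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Policy raises the good chain and lowers the bad one: loss drops below log(2).
improved = dpo_loss(-0.5, -2.0, -1.0, -1.0)
print(neutral, improved)
```

Because the loss depends on the gap between the two chains, it needs both a positive and a negative example per step, which matches the note's point about training with feedback in both directions.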

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Can models learn to judge themselves without external rewards?

SERL enables self-improving language models by having them alternate between generating responses and judging them pairwise, deriving rewards from ranking consistency and self-consistency of judgments. On AlpacaEval, this reached 59.90% win rate without external signals, up from 52.37%.

Do critique models improve diversity during training itself?

Step-level critique in the training loop counteracts tail narrowing and maintains solution diversity across self-training iterations. This training-time benefit—preventing premature convergence—is more fundamental than test-time accuracy gains.

Can RAG systems safely learn from their own generated answers?

Systems can add generated answers to their retrieval corpus when outputs pass entailment verification, source attribution checks, and novelty detection. This prevents hallucinations from polluting future retrievals while allowing genuine knowledge accumulation.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

As an LLM researcher, test whether external retrieval signals truly outperform internal self-assessment during revision—a question that may have shifted since these findings (2024–2025).

What a curated library found — and when (dated claims, not current truth):
• External critics improve revision accuracy; model self-correction alone amplifies confidence in wrong answers (2024–2025).
• Self-revision in o1-like models *degrades* accuracy; longer chains correlate with worse results (~2025).
• Models are poor introspectors—self-reports mostly reflect training data, not true internal state; genuine introspection is rare (~2025).
• Learned retrieval decisions (framed as MDPs) yield ~22% accuracy gains by retrieving at optimal moments (~2025).
• Process-level supervision (intermediate steps) outperforms outcome-only rewards for retrieval-guided reasoning (~2025).
• Internal self-assessment *can* work when trained in (e.g., post-completion learning, Self-Examining RL) rather than improvised at inference (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2404.09129 (2024) — reflective thinking limits
• arXiv:2507.20252 (2025) — post-completion learning
• arXiv:2502.12215 (2025) — o1-like test-time scaling
• arXiv:2506.05068 (2025) — introspection in LLMs

Your task:
(1) RE-TEST EACH CONSTRAINT. For every finding, judge whether newer training methods (self-reward models, diffusion-guided reasoning, temporal self-rewarding ~2025), retrieval orchestration (caching, agentic multi-hop), or evals have RELAXED the gap between external and internal signals. Separate the durable question (when *should* you retrieve vs. introspect?) from the perishable claim (current models fail at self-assessment). Cite what resolved it.
(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially unifying approaches (UR2, UR3 if it exists) that may dissolve the external/internal dichotomy.
(3) Propose 2 research questions that ASSUME the regime may have moved: e.g., *Can online learning of retrieval policy during deployment outpace pre-trained critic models?* or *Does embedding critique into the forward pass (rather than post-hoc revision) change the accuracy-speed tradeoff?*

Cite arXiv IDs; flag anything you cannot ground in a real paper.
