INQUIRING LINE

Why does most refinement in iterative models maintain answers rather than improve them?

This explores why an AI that revises its own answer over multiple passes usually ends up restating the same answer instead of making it better — and what the corpus says is missing.


The corpus converges on a single root cause: a model can generate a revision, but it can't reliably tell whether the revision is actually better. Self-improvement is formally bounded by what researchers call the generation-verification gap: every reliable fix requires something outside the model to validate and enforce it, and no amount of metacognition lets the model escape that ceiling on its own (see "What stops large language models from improving themselves?"). Without an external check, 'refine' collapses into 'rephrase.'
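A toy sketch can make the gap concrete. It is an illustration under loud assumptions, not a method from the cited notes: the "answer" is an integer, the revisions are a fixed list of hypothetical edits (`DELTAS`), and `external_verifier` stands in for any grader that lives outside the model. The only difference between the two runs is whether a revision has to pass that grader before it is kept.

```python
from typing import Callable, Iterable, Optional

def refine(start: int,
           deltas: Iterable[int],
           verifier: Optional[Callable[[int], float]] = None) -> int:
    """Apply a sequence of proposed revisions to `start`.

    Without a verifier every proposal is accepted, because the loop has
    nothing to compare it against. With one, a proposal is kept only if
    it scores strictly higher than the current answer.
    """
    current = start
    for delta in deltas:
        candidate = current + delta  # the revision proposed on this pass
        if verifier is None:
            current = candidate      # blind acceptance
        elif verifier(candidate) > verifier(current):
            current = candidate      # externally validated improvement
    return current

TARGET = 10                      # hypothetical ground truth, hidden from the reviser
DELTAS = [3, -3, 2, -2, 1, -1]   # hypothetical revisions: each edit is later undone

def external_verifier(x: int) -> float:
    """Toy external grader: higher is better, 0 is a perfect answer."""
    return -abs(x - TARGET)

unverified = refine(4, DELTAS)
verified = refine(4, DELTAS, external_verifier)
print(unverified, verified)  # prints: 4 10
```

The same six revisions leave the unverified loop exactly where it started, while the verified loop keeps the three that help and discards the three that undo them.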

Underneath that, several notes suggest the model isn't really doing the iterative work we imagine it's doing. When asked to run iterative numerical methods, LLMs don't actually execute the procedure step by step; they recognize a problem as template-similar to something memorized and emit a plausible-looking value, a failure that persists across model scale (see "Do large language models actually perform iterative optimization?"). Extended chains of thought expose this in a telling way: reasoning variants produce *more text* on constraint-bound numerical tasks without producing *more computation*, and so don't systematically beat plain models (see "Do reasoning models actually beat standard models on optimization?"). So a revision pass adds words around the same memorized guess rather than recomputing toward a better one.

When models do change something, it's often the surface. Supervised fine-tuning teaches outputs to *look* correct (clean JSON, valid identifiers, expected sections) without making them physically feasible, because the model learns the surface features of good solutions rather than the reasoning to construct them (see "Does supervised fine-tuning actually improve reasoning on optimization problems?"). That's the mechanism of 'maintain, don't improve' in miniature: the refinement edits the packaging, not the substance. And there's a deeper version: a model can hold all the linearly-decodable features a task needs while its internal organization is fractured, so its answer can be 'right' on the metric yet brittle, with nothing structured inside to revise toward (see "Can models be smart without organized internal structure?").
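The difference between packaging and substance is easy to show with two checks on the same output. The sketch below is a hypothetical example: the field names (`demand_mw`, `generation_mw`), the 100 MW unit limit, and the balance rule are invented for illustration and are far simpler than a real power-flow problem. One function checks only that the output looks right, the other checks whether it could actually work.

```python
import json

REQUIRED_KEYS = {"demand_mw", "generation_mw"}  # hypothetical expected sections

def looks_correct(raw: str) -> bool:
    """Surface check: the output parses as JSON and has the expected sections."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, dict) and REQUIRED_KEYS.issubset(doc)

def is_feasible(raw: str, limit_mw: float = 100.0, tol: float = 1e-6) -> bool:
    """Substance check (toy constraints): every unit stays within its
    limit and total generation matches demand."""
    doc = json.loads(raw)
    gen = doc["generation_mw"]
    within_limits = all(0.0 <= g <= limit_mw for g in gen.values())
    balanced = abs(sum(gen.values()) - doc["demand_mw"]) <= tol
    return within_limits and balanced

# Well-formed output that violates both toy constraints:
# g1 exceeds the unit limit, and 120 + 20 does not meet a demand of 150.
output = '{"demand_mw": 150, "generation_mw": {"g1": 120, "g2": 20}}'
print(looks_correct(output), is_feasible(output))  # prints: True False
```

A refinement pass that optimizes only for the first check can polish this output indefinitely without ever making the second one pass.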

There's also a noise problem. Sequential revision reproduces the same failure as token-level overthinking: it accumulates noise across iterations with no guarantee of improvement, just at a slower tempo (see "Do iterative refinement methods suffer from overthinking?"). Reasoning models compound this by wandering and switching paths prematurely, abandoning promising directions rather than carrying them forward (see "Why do reasoning models abandon promising solution paths?"). So even when a better answer is reachable, the refinement loop is as likely to drift away from it as toward it.
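One way to picture "as likely to drift away as toward" is a deliberately crude model, which is an assumption of this sketch and not something the cited notes specify: treat each unguided revision as shifting the answer's error by +1 or -1 with equal probability. Enumerating every possible path gives the exact average outcome without any random sampling.

```python
from itertools import product

def error_stats_after(start_error: int, passes: int) -> tuple[float, float]:
    """Exact (mean error, mean squared error) after `passes` unguided revisions.

    Toy noise model: each revision shifts the error by +1 or -1 with equal
    probability, so all 2**passes equally likely paths are enumerated.
    """
    finals = [start_error + sum(path)
              for path in product((-1, 1), repeat=passes)]
    mean_error = sum(finals) / len(finals)
    mean_squared = sum(e * e for e in finals) / len(finals)
    return mean_error, mean_squared

for passes in (0, 2, 4, 8):
    print(passes, error_stats_after(2, passes))
```

Under this assumption the mean error never moves from its starting value, while the mean squared error equals `start_error**2 + passes`: more passes buy more spread, not a better answer on average.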

The interesting turn is what *does* break the stalemate, and it's the same thing in every case: an external signal. The Darwin Gödel Machine gets real, open-ended improvement precisely by replacing introspection with empirical benchmarking and keeping an archive of variants; it improves because reality grades each attempt (see "Can AI systems improve themselves through trial and error?"). The ACE framework gets gains by treating context as an evolving playbook with structured incremental updates instead of full rewrites, which stops each iteration from erasing what the last one learned (see "Can context playbooks prevent knowledge loss during iteration?"). The throughline worth taking away: refinement maintains rather than improves whenever the loop has no grader outside itself. Give it an empirical test, a preserved memory, or a verifier, and 'revise' starts to mean something.
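The contrast between full rewrites and incremental updates can be sketched in a few lines. This is not the ACE framework's implementation: the function names are hypothetical, the "playbook" is just a list of strings, and the full rewrite is modeled as a lossy step that keeps only the newest `budget` entries, a stand-in for the detail erosion that compression can cause.

```python
def rewrite_update(playbook: list[str], lesson: str, budget: int = 2) -> list[str]:
    """Regenerate the whole context on every pass.

    Lossy stand-in: only the newest `budget` entries survive the rewrite.
    """
    return (playbook + [lesson])[-budget:]

def incremental_update(playbook: list[str], lesson: str) -> list[str]:
    """Add the new lesson as a delta, leaving earlier entries untouched
    and skipping lessons that are already recorded."""
    return playbook if lesson in playbook else playbook + [lesson]

# Hypothetical lessons learned over four iterations (the last is a repeat).
lessons = ["check units", "validate JSON schema",
           "re-run constraint check", "check units"]

rewritten: list[str] = []
incremental: list[str] = []
for lesson in lessons:
    rewritten = rewrite_update(rewritten, lesson)
    incremental = incremental_update(incremental, lesson)

print(rewritten)
print(incremental)
```

After four iterations the rewritten playbook has lost "validate JSON schema", while the incremental one still holds all three distinct lessons, which is the sense in which a preserved memory lets each pass build on the last.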


Sources (9 notes)

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Do reasoning models actually beat standard models on optimization?

Reasoning variants with extended CoT show no consistent advantage over standard models on constraint-bound numerical tasks like optimal power flow. Extended thinking produces more text, not more iterative computation, suggesting the bottleneck is numeric procedure rather than reasoning steps.

Does supervised fine-tuning actually improve reasoning on optimization problems?

Supervised fine-tuning makes model outputs look correct—proper JSON structure, valid identifiers, expected sections—without making them physically feasible. The model learns surface features of solutions, not the reasoning to construct valid ones.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do iterative refinement methods suffer from overthinking?

Sequential revision methods share the same failure architecture as token-level overthinking: they accumulate noise without guaranteed improvement. Progressive Draft Refinement avoids this by compressing memory between iterations, outperforming longer reasoning traces at matched compute.

Why do reasoning models abandon promising solution paths?

Reasoning LLMs exhibit two reinforcing failures: wandering (invalid exploration) and underthinking (premature path-switching). Decoding-level interventions like thought-switching penalties improve accuracy without fine-tuning, suggesting viable solutions exist but are abandoned prematurely.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can context playbooks prevent knowledge loss during iteration?

The ACE framework treats contexts as evolving playbooks using generation-reflection-curation loops rather than full rewrites. This prevents knowledge loss from compression and detail erosion, achieving +10.6% on agentic tasks and +8.6% on finance without labeled supervision.
