INQUIRING LINE

Can instance seeds work for tasks beyond language understanding benchmarks?

This explores whether seeding a model from example instances — bootstrapping it from a handful of worked cases — actually generalizes to real tasks, or only to the language-understanding benchmarks those instances were drawn from.


This reads the question as a worry about transfer: if you seed a model with example instances, does that competence travel beyond the benchmark it was sampled from, or does it stay trapped there? The corpus doesn't use the phrase "instance seeds" verbatim, but it speaks directly to the underlying tension — and the honest headline is that seeding from benchmark-style instances tends *not* to transfer on its own, while seeding paired with an external check sometimes does.

The skeptical evidence is the loudest signal here. When researchers built out-of-distribution variants of training problems (the N-1 test sets), even RL-fine-tuned models that aced the in-distribution set dropped sharply the moment the instance shape changed, suggesting the training sharpened template-matching rather than installing a transferable procedure (see "Do fine-tuned language models actually learn optimization procedures?"). The same pattern shows up when models are asked to actually *run* an iterative method: they recognize a problem as template-similar and emit plausible-but-wrong values instead of executing the steps (see "Do large language models actually perform iterative optimization?"). And you can predict in advance which non-benchmark tasks will break: framing the model as an autoregressive probability machine correctly forecasts that low-probability targets (counting letters, reversing the alphabet) collapse even when they're logically trivial (see "Can we predict where language models will fail?"). A benchmark score, in other words, certifies the benchmark, not the territory next door.
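The drop described above can be made concrete as a difference between two accuracies. The following is a minimal sketch, not the evaluation used in the cited notes: the names `accuracy`, `transfer_gap`, and `template_model`, the exact-match scoring, and the toy arithmetic data are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]  # (prompt, expected answer); hypothetical format


def accuracy(model: Callable[[str], str], examples: Sequence[Example]) -> float:
    """Exact-match accuracy of `model` over `examples`."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for prompt, expected in examples if model(prompt).strip() == expected)
    return correct / len(examples)


def transfer_gap(
    model: Callable[[str], str],
    in_dist: Sequence[Example],
    out_dist: Sequence[Example],
) -> float:
    """In-distribution accuracy minus out-of-distribution accuracy.

    A large positive gap is what template-matching would look like.
    """
    return accuracy(model, in_dist) - accuracy(model, out_dist)


# Toy stand-in for a model that memorized its training templates.
_memorized = {"2+2": "4", "3+3": "6"}


def template_model(prompt: str) -> str:
    # Falls back to a plausible-looking but unearned answer on unseen shapes.
    return _memorized.get(prompt, "4")


in_dist = [("2+2", "4"), ("3+3", "6")]
out_dist = [("2+3", "5"), ("4+4", "8")]
gap = transfer_gap(template_model, in_dist, out_dist)
```

With this toy data the model is perfect in distribution and wrong on every shifted instance, so the gap is as large as it can be. A real evaluation would need many more instances and a scoring rule suited to the task.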

But the corpus also shows where instance-seeding genuinely escapes the benchmark, and the common ingredient is an external validator rather than the seeds themselves. Asymmetric self-play seeds a proposer that *generates its own calibrated problems*, and the solver improves with no human labels because a majority-vote check supplies the signal, giving an automatic curriculum that scales past any fixed test set (see "Can language models improve themselves without any external training data?"). The Darwin Gödel Machine does the same thing with code-editing agents: it seeds an evolutionary archive of variants and keeps whatever empirically benchmarks better, reaching real-task gains on SWE-bench and Polyglot (see "Can AI systems improve themselves through trial and error?"). Even small models pick up reliable function-calling when DPO seeds them with a teacher's *correct and incorrect* example pairs; the negative instances are what target the format failures plain imitation misses (see "Can small models match large models on function calling?").
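As an illustration of how a majority-vote check can stand in for human labels, here is a minimal sketch. It is not the SQLM implementation: the function name `majority_vote`, the `min_agreement` threshold, and the rule of discarding problems without a clear majority are assumptions chosen for the example.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def majority_vote(
    answers: Sequence[str], min_agreement: float = 0.5
) -> Optional[Tuple[str, float]]:
    """Return (pseudo-label, agreement share) if one answer wins a clear majority.

    Returns None when the sampled answers are too scattered to trust, so the
    problem contributes no training signal. The threshold is a hypothetical knob.
    """
    if not answers:
        return None
    winner, count = Counter(answers).most_common(1)[0]
    share = count / len(answers)
    if share <= min_agreement:
        return None
    return winner, share


# Four sampled solver answers to one proposer-generated problem.
label = majority_vote(["12", "12", "15", "12"])

# No consensus: the problem is skipped rather than given a guessed label.
skipped = majority_vote(["1", "2", "3"])
```

The agreement share could also serve as a rough difficulty signal for the proposer, since problems that are always or never agreed on teach the solver little. That use is a design assumption here, not something the source note specifies.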

The through-line connecting both halves is the generation–verification gap: a model cannot validate its own improvement from the inside, so every reliable step beyond its current ability needs something external to confirm it (see "What stops large language models from improving themselves?"). That's why "potemkin" understanding matters as a warning: a model can correctly explain a concept, fail to apply it, and even recognize the failure, because explanation and execution run on disconnected pathways (see "Can LLMs understand concepts they cannot apply?"). Seeds that look like they encode a skill may only encode the explanation of one.
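The idea that each step must be confirmed from outside can be sketched as an archive loop in which a candidate is kept only when an external evaluator scores it above its parent. This is a deliberately simplified toy, not the Darwin Gödel Machine's algorithm: candidates are plain numbers, and `evolve`, `mutate`, and `evaluate` are hypothetical names.

```python
import random
from typing import Callable, List, Tuple

Entry = Tuple[float, float]  # (candidate, externally measured score)


def evolve(
    seed: float,
    mutate: Callable[[float, random.Random], float],
    evaluate: Callable[[float], float],
    steps: int,
    rng: random.Random,
) -> List[Entry]:
    """Grow an archive, admitting a child only if the evaluator prefers it.

    `evaluate` plays the role of the external check; the candidate never
    scores itself.
    """
    archive: List[Entry] = [(seed, evaluate(seed))]
    for _ in range(steps):
        parent, parent_score = rng.choice(archive)
        child = mutate(parent, rng)
        child_score = evaluate(child)
        if child_score > parent_score:  # external confirmation gate
            archive.append((child, child_score))
    return archive


def toy_evaluate(x: float) -> float:
    # Stand-in benchmark: higher is better, best possible score at x = 3.
    return -((x - 3.0) ** 2)


def toy_mutate(x: float, rng: random.Random) -> float:
    return x + rng.uniform(-1.0, 1.0)


archive = evolve(0.0, toy_mutate, toy_evaluate, steps=200, rng=random.Random(0))
best_candidate, best_score = max(archive, key=lambda entry: entry[1])
```

Because every admitted child beat its parent on the external score, no entry in the archive scores below the seed. Removing the gate and letting candidates judge themselves would lose exactly that guarantee.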

So the answer the corpus implies is sharper than yes-or-no: instance seeds carry you beyond language-understanding benchmarks *only when the seeds come bundled with a verifier that lives outside the model*, such as a self-generated curriculum, an empirical benchmark loop, or explicit negative examples. Seed without that external check and you mostly relocate the benchmark rather than leave it. The real takeaway is that the question of transfer is a question about who gets to say the model got it right.


Sources (8 notes)

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can we predict where language models will fail?

By framing LLMs as autoregressive probability machines, researchers predicted tasks with low-probability target responses would be systematically harder, even when logically simple. Experiments confirmed predictions like backwards alphabet and letter counting.

Can language models improve themselves without any external training data?

SQLM uses a proposer-solver framework where the proposer generates calibrated problems and the solver learns via majority-vote verification. Both agents improve through RL alone, creating an automatic curriculum that scales without human labels or ground-truth answers.

Can AI systems improve themselves through trial and error?

DGM replaces formal proofs with empirical benchmarking and maintains an evolutionary archive of agent variants, achieving 2.5× improvement on SWE-bench and 2.2× on Polyglot by discovering capabilities like better code editing and context management.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

What stops large language models from improving themselves?

Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.

Can LLMs understand concepts they cannot apply?

Models can explain concepts accurately, fail to apply them, and recognize the failure—a triple pattern incompatible with human cognition. This indicates functionally disconnected explanation and execution pathways rather than simple knowledge gaps.
