What emergent behaviors do models develop when trained on underspecified pedagogical tasks?

This explores what new, unprogrammed habits a model picks up when its training data is deliberately incomplete or loosely specified — the 'pedagogical' framing being how the gaps in the teaching shape what the model learns to do on its own.

The corpus has a surprisingly coherent story here, told from several angles. The cleanest example: when models are trained only on fully specified problems and then meet a problem missing key information, some learn to stop and ask a clarifying question instead of guessing. Nobody trained that move directly; it emerged as a meta-strategy of treating the conversation itself as a place to find what's missing (see: Can models learn to ask clarifying questions without explicit training?). The interesting twist is that the gap in the task drew out a behavior the complete training examples never demonstrated.

But 'underspecified' cuts a second way, and the tension is worth noticing. One line of work argues that much of instruction tuning is *more* underspecified than we think: models trained on semantically empty or even deliberately wrong instructions perform about as well as those given correct ones, because what actually transfers is the shape of the expected output, not understanding of the task (see: Does instruction tuning teach task understanding or output format?). So one underspecified setup grows a genuine new skill (asking questions), while another reveals the model was only ever learning format. The difference seems to be whether the task structure rewards seeking missing information or just matching an answer template.
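To make the 'format, not understanding' point concrete, here is a minimal sketch of a guesser that never reads the instruction and only knows which outputs are well-formed for each task. The task names, label sets, and examples are hypothetical, and this is an illustration of the idea rather than the baseline used in the cited work.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical toy tasks: each maps to its allowed output labels.
LABEL_SPACE: Dict[str, List[str]] = {
    "sentiment": ["positive", "negative"],
    "entailment": ["entailment", "neutral", "contradiction"],
}

def format_only_guess(task: str, rng: random.Random) -> str:
    """Ignore the instruction entirely; answer with a well-formed label."""
    return rng.choice(LABEL_SPACE[task])

def exact_match(examples: List[Tuple[str, str]], rng: random.Random) -> float:
    """Fraction of examples where the blind guess equals the gold label."""
    hits = sum(format_only_guess(task, rng) == gold for task, gold in examples)
    return hits / len(examples)

# Hypothetical evaluation set: (task, gold label) pairs.
examples = [("sentiment", "positive"), ("entailment", "neutral")] * 500
score = exact_match(examples, random.Random(0))
print(f"format-only exact match: {score:.3f}")
```

Knowing the output space alone already puts the score well above zero, which is why a trained model landing close to such a guesser says little about whether it understood the task.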

A recurring theme is that these 'emergent' behaviors are often latent capabilities being *elicited* rather than newly built. Base models already carry reasoning circuitry that minimal training merely selects and surfaces (see: Do base models already contain hidden reasoning ability?), and models turn out to track what they do and don't know through entity-recognition mechanisms that steer hallucination and refusal, a form of self-knowledge that wasn't explicitly trained (see: Do models know what they don't know?). In the same family, fine-tuned models can accurately describe their own learned behaviors without ever being trained to introspect (see: Can language models describe their own learned behaviors?). Loose training doesn't write new abilities so much as unlock and expose ones already encoded.

Where the supervision signal is genuinely missing, models can also manufacture their own. Self-play loops co-evolve skills with no human labels by having the model generate its own curriculum and verdicts (see: Can language models learn skills without human supervision?); post-completion learning lets a model internalize self-evaluation in the unused space after its output (see: Can models learn to evaluate their own work during training?); and training on messy search traces, with mistakes and backtracking included rather than only clean optimal answers, produces better problem-solvers that build internal world models for exploration (see: Does training on messy search processes improve reasoning?). Underspecification here becomes a feature: the gaps are exactly what force the model to develop strategy rather than memorize a path.
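As a rough picture of what a 'messy search trace' looks like as training text, the sketch below runs a depth-first search on a toy number puzzle and writes every step, dead ends included, into one string. The puzzle, the two operations, and the trace wording are all assumptions made for illustration; they are not the format used in the cited work.

```python
from typing import List

def search_trace(start: int, target: int, max_depth: int) -> str:
    """Depth-first search that records every step, dead ends included."""
    lines: List[str] = []

    def dfs(value: int, depth: int) -> bool:
        if value == target:
            lines.append(f"reached {value}: goal")
            return True
        # Both operations only increase the value, so overshooting is final.
        if depth == max_depth or value > target:
            lines.append(f"reached {value}: dead end, backtrack")
            return False
        for name, nxt in (("*2", value * 2), ("+3", value + 3)):
            lines.append(f"try {value} {name} = {nxt}")
            if dfs(nxt, depth + 1):
                return True
        lines.append(f"exhausted {value}: backtrack")
        return False

    dfs(start, 0)
    return "\n".join(lines)

# Hypothetical puzzle: get from 2 to 11 using *2 and +3.
trace = search_trace(start=2, target=11, max_depth=3)
print(trace)
```

A clean, optimal-only version of this trace would keep just the three successful moves. The serialized version also keeps the wrong turn to 16 and the backtrack, which is the kind of signal the messy-trace approach trains on.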

One caution worth carrying out of this: be skeptical of the word 'emergent' itself. A pointed result shows that sharp, surprising capability jumps often vanish when you measure with continuous metrics instead of all-or-nothing ones: the 'emergence' was partly a measurement choice, not a real discontinuity (see: Are LLM emergent abilities real or measurement artifacts?). So when you see a model 'spontaneously' develop a behavior under loose training, the honest question is whether something genuinely new appeared, an existing capability got elicited, or the metric made smooth improvement look like a leap.
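The metric effect is easy to reproduce with arithmetic. The sketch below assumes a model whose per-token accuracy rises smoothly and scores it with an all-or-nothing exact-match metric on a ten-token answer. The accuracy values and answer length are made up, and treating token errors as independent is a simplifying assumption.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """All-or-nothing metric: every token in the answer must be right.

    Assumes token errors are independent, which is a simplification.
    """
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for successively larger models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
ANSWER_LENGTH = 10

for p in per_token:
    em = exact_match_rate(p, ANSWER_LENGTH)
    print(f"per-token {p:.2f} -> exact-match {em:.3f}")
```

The per-token column climbs steadily, while the exact-match column sits near zero for the smaller models and then shoots up at the end. Both columns describe the same underlying outputs; only the scoring rule differs.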

Sources (9 notes)

Can models learn to ask clarifying questions without explicit training?

Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions, and the reported score of 43% sits barely above a random baseline's 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Do models know what they don't know?

Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.

Can language models describe their own learned behaviors?

LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.

Can language models learn skills without human supervision?

Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.

Can models learn to evaluate their own work during training?

Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.

Does training on messy search processes improve reasoning?

Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.

Are LLM emergent abilities real or measurement artifacts?

Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a capability researcher re-testing claims about emergent behavior in underspecified training regimes. The question remains: What genuine new behaviors do models develop when trained on incomplete or loosely specified tasks — and which are elicited latent capabilities or measurement artifacts?

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026. Key claims:
• Models trained only on fully-specified problems spontaneously learn to ask clarifying questions when encountering underspecified tasks (~2023), but instruction tuning teaches output-format distribution, not task understanding (~2023).
• Much "emergence" vanishes under continuous metrics; sharp capability jumps are partly measurement artifacts, not fundamental discontinuities (2023).
• Base models already encode latent reasoning and self-knowledge circuitry; minimal training elicits rather than builds (~2024–2025).
• Self-play without labels, post-completion learning in unused token space, and training on messy search traces co-evolve skills without human supervision (~2024–2025).
• Models develop behavioral self-awareness and can introspect on learned behaviors without explicit training (~2025).

Anchor papers (verify; mind their dates):
• 2304.15004 (Apr 2023): Are Emergent Abilities a Mirage?
• 2305.11383 (May 2023): Do Models Really Learn to Follow Instructions?
• 2411.14257 (Nov 2024): Knowledge Awareness and Hallucinations
• 2501.11120 (Jan 2025): LLMs Aware of Their Learned Behaviors

Your task:
(1) RE-TEST the elicitation-vs.-emergence tension. Since Jan 2025, has evidence strengthened that underspecified training *elicits* latent capabilities rather than forging new ones? Or have newer models, better interpretability, or continuous measurement methods clarified whether something genuinely *novel* still emerges? Where does the boundary hold?
(2) Surface the strongest *contradicting* work from the last 6 months — papers showing that underspecification does produce truly novel behaviors unattributable to latent structure, or that "behavioral self-awareness" is confabulation, not genuine introspection.
(3) Propose 2 research questions assuming the regime may have moved: (a) If elicitation dominates, how can we design underspecified tasks that force *architectural* novelty rather than just unmasking the existing? (b) If measurement artifact remains, what continuous or compositional metric would finally settle whether emergent behavior is real?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
