What emergent behaviors do models develop when trained on underspecified pedagogical tasks?
This explores what new, unprogrammed habits a model picks up when its training data is deliberately incomplete or loosely specified — the 'pedagogical' framing being how the gaps in the teaching shape what the model learns to do on its own.
The corpus has a surprisingly coherent story here, told from several angles. The cleanest example: when models are trained only on fully specified problems and then meet a problem missing key information, some learn to stop and ask a clarifying question instead of guessing. Nobody trained that move directly; it emerged as a meta-strategy of treating the conversation itself as a place to find what's missing (see "Can models learn to ask clarifying questions without explicit training?"). The interesting twist is that the gap in the task taught a behavior the complete examples never demonstrated.
But 'underspecified' cuts a second way, and the tension is worth noticing. One line of work argues that much of instruction tuning is *more* underspecified than we think: models trained on semantically empty or even deliberately wrong instructions perform about as well as those given correct ones, because what actually transfers is the shape of the expected output, not understanding of the task (see "Does instruction tuning teach task understanding or output format?"). So one underspecified setup grows a genuine new skill (asking questions), while another reveals the model was only ever learning format. The difference seems to be whether the task structure rewards seeking missing information or merely matching an answer template.
A recurring theme is that these 'emergent' behaviors are often latent capabilities being *elicited* rather than newly built. Base models already carry reasoning circuitry that minimal training merely selects and surfaces (see "Do base models already contain hidden reasoning ability?"), and models turn out to track what they do and don't know through entity-recognition mechanisms that steer hallucination and refusal, a form of self-knowledge that wasn't explicitly trained (see "Do models know what they don't know?"). In the same family, fine-tuned models can accurately describe their own learned behaviors without ever being trained to introspect (see "Can language models describe their own learned behaviors?"). Loose training doesn't write new abilities so much as unlock and expose ones already encoded.
Where the supervision signal is genuinely missing, models can also manufacture their own. Self-play loops co-evolve skills with no human labels by having the model generate its own curriculum and verdicts (see "Can language models learn skills without human supervision?"); post-completion learning lets a model internalize self-evaluation in the unused space after its output (see "Can models learn to evaluate their own work during training?"); and training on messy search traces, with mistakes and backtracking included rather than only clean optimal answers, produces better problem-solvers that build internal world models for exploration (see "Does training on messy search processes improve reasoning?"). Underspecification here becomes a feature: the gaps are exactly what force the model to develop strategy rather than memorize a path.
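To make the contrast between a messy search trace and a clean optimal answer concrete, here is a minimal sketch. It is not the format used in the Stream of Search work; the toy subset-sum task, the trace vocabulary (`try`, `undo`, `backtrack`, `goal`), and the function names are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple


def search_with_trace(numbers: List[int], target: int) -> Tuple[Optional[List[int]], str]:
    """Depth-first subset-sum search over positive integers.

    Every step is logged, including dead ends and backtracking, so the
    returned string is a serialized record of the whole exploration.
    """
    trace: List[str] = []

    def dfs(start: int, chosen: List[int], total: int) -> Optional[List[int]]:
        if total == target:
            trace.append(f"goal {chosen} sum={total}")
            return list(chosen)
        for i in range(start, len(numbers)):
            n = numbers[i]
            if total + n > target:
                # Overshoot: record the failed attempt instead of hiding it.
                trace.append(f"try {n} sum={total + n} too-big backtrack")
                continue
            trace.append(f"try {n} sum={total + n}")
            chosen.append(n)
            found = dfs(i + 1, chosen, total + n)
            if found is not None:
                return found
            chosen.pop()
            trace.append(f"undo {n} backtrack")
        return None

    solution = dfs(0, [], 0)
    return solution, "\n".join(trace)


def optimal_only_trace(solution: List[int]) -> str:
    """Serialize only the successful path, with no record of mistakes."""
    total = 0
    lines = []
    for n in solution:
        total += n
        lines.append(f"try {n} sum={total}")
    lines.append(f"goal {solution} sum={total}")
    return "\n".join(lines)


# Hypothetical example instance: find numbers that sum to 12.
solution, messy = search_with_trace([6, 5, 4, 3], 12)
clean = optimal_only_trace(solution)
print(messy)
print("---")
print(clean)
```

The messy string contains the failed branch through 6 before the successful path, while the clean string contains only the final route. A model trained on the first kind of text sees how to recover from a dead end; a model trained on the second sees only where to end up.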
One caution worth carrying out of this: be skeptical of the word 'emergent' itself. A pointed result shows that sharp, surprising capability jumps often vanish when you measure with continuous metrics instead of all-or-nothing ones; the 'emergence' was partly a measurement choice, not a real discontinuity (see "Are LLM emergent abilities real or measurement artifacts?"). So when you see a model 'spontaneously' develop a behavior under loose training, the honest question is whether something genuinely new appeared, an existing capability got elicited, or the metric made smooth improvement look like a leap.
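The measurement point can be illustrated with a small numerical sketch. Everything here is an assumption made for illustration: the logistic curve for per-token accuracy, the 20-token answer length, and the independence of token errors are hypothetical choices, not values from the cited work.

```python
import math


def per_token_accuracy(scale: float) -> float:
    """Hypothetical smooth curve: accuracy rises gradually with log10(scale)."""
    return 1.0 / (1.0 + math.exp(-(math.log10(scale) - 8.0)))


def exact_match(p: float, answer_len: int) -> float:
    """All-or-nothing score: chance that every token in the answer is right.

    Assumes token errors are independent, which is a simplification.
    """
    return p ** answer_len


ANSWER_LEN = 20  # hypothetical answer length in tokens
scales = [10.0 ** k for k in range(6, 12)]
rows = [
    (s, per_token_accuracy(s), exact_match(per_token_accuracy(s), ANSWER_LEN))
    for s in scales
]

for s, p, em in rows:
    print(f"scale={s:.0e}  per-token={p:.3f}  exact-match={em:.6f}")
```

Under these assumptions the per-token column climbs steadily across every scale, while the exact-match column sits near zero for most of the range and only lifts off at the largest scales. The underlying improvement is the same in both columns; only the metric differs.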
Sources (9 notes)
Models trained via SML on complete problems generalize to underspecified tasks by asking for needed information and delaying answers. The training paradigm instills a meta-strategy of using conversation as an information source, addressing the premature-answering failure mode.
Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions: 43% against a random baseline of 42.6%. The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Sparse autoencoders revealed that language models develop causal mechanisms for detecting whether they know facts about entities. These mechanisms actively steer both hallucination and refusal behavior, and persist from base models into finetuned chat versions.
LLMs fine-tuned on datasets exhibiting specific behaviors accurately describe those behaviors without any training to self-report. This suggests behavioral regularities are encoded and accessible in ways that factual knowledge often is not.
Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Stream of Search pretraining, which represents exploration and backtracking as serialized strings, achieves 25% higher accuracy than optimal-trajectory-only training. Models learn internal world models for search and adaptive strategies rather than fixed external methods.
Sharp, unpredictable capability transitions vanish when using continuous metrics instead of discontinuous ones. The same model outputs show smooth predictable improvement with scale, suggesting emergence is a measurement choice rather than a real behavioral change.