How can language models extract more value from fewer demonstrations?
This explores how models squeeze more learning out of a small number of examples — the sample-efficiency problem — by looking at what kind of signal each demonstration carries and where the hard limits are.
The corpus suggests the answer is less about *more data* than about *richer signal per example*. The most direct lever is making each demonstration do double duty by pairing right and wrong answers. Small models fine-tuned with DPO on a teacher's correct-and-incorrect function-calling pairs beat the same models trained on correct examples alone, because the explicit negative shows the model exactly which format failures to avoid: a contrastive example teaches a boundary, not just a target (see "Can small models match large models on function calling?"). The same instinct shows up in a different guise: instead of importing labeled preferences, a model can mine signal it already produces. Using the model's own answer-span confidence to rank its reasoning traces creates synthetic preferences that sharpen step-by-step reasoning with zero human labels (see "Can model confidence work as a reward signal for reasoning?"), and 'post-completion learning' reuses the normally discarded sequence space after a model finishes answering to train it to grade its own work, adding learning signal at zero inference cost (see "Can models learn to evaluate their own work during training?").
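To make the "boundary, not just a target" point concrete, here is a minimal sketch of the standard DPO objective for a single correct/incorrect pair. The function name `dpo_pair_loss`, its arguments, and the default `beta` are illustrative assumptions, and the log-probabilities are taken as given numbers; nothing here is drawn from the cited note's actual training setup.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    The loss depends only on how much more the policy prefers the chosen
    answer over the rejected one, relative to a frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits)
    return softplus(-logits)


# Policy identical to the reference: no preference learned yet, loss is log(2).
baseline = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)
# Policy has moved toward the correct call and away from the incorrect one.
improved = dpo_pair_loss(-4.0, -6.0, -5.0, -5.0)
print(baseline, improved)
```

The rejected term is what a correct-only objective lacks: raising the probability of the right answer and lowering the probability of a specific wrong one both reduce the loss, so each pair carries two signals.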
The more ambitious version of the same idea is to stop needing demonstrations at all and manufacture the missing feedback. A three-role self-play loop (a Challenger that escalates difficulty, a Reasoner that attempts, and a neutral Judge that gives binary verdicts) co-evolves skills with no human supervision, effectively generating its own curriculum and reward (see "Can language models learn skills without human supervision?"). On the architecture side, latent-thought models add a scaling dimension that isn't parameters or data: a fast-learning set of latent vectors gives strong few-shot reasoning with far better sample efficiency than scaling the model up (see "Can latent thought vectors scale language models beyond parameters?").
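The control flow of the three-role loop described above can be sketched with a toy task. Everything in this block is a hypothetical stand-in: the real roles are language models, whereas here the task is summing integers, the Reasoner's "skill" is an integer, and the Judge checks the answer exactly. The sketch only shows how a binary verdict can drive both the curriculum and the learner's update.

```python
from typing import List, Tuple


def challenger(difficulty: int, round_idx: int) -> Tuple[int, ...]:
    """Toy Challenger: a task is `difficulty` positive integers to sum."""
    return tuple((round_idx + i) % 9 + 1 for i in range(difficulty))


def reasoner(task: Tuple[int, ...], skill: int) -> int:
    """Toy Reasoner: can only add up the first `skill` numbers."""
    return sum(task[:skill])


def judge(task: Tuple[int, ...], answer: int) -> bool:
    """Toy Judge: binary verdict, no partial credit."""
    return answer == sum(task)


def self_play(rounds: int) -> List[Tuple[int, int, bool]]:
    """Run the loop and record (difficulty, skill, verdict) per round."""
    difficulty, skill = 1, 1
    history = []
    for round_idx in range(rounds):
        task = challenger(difficulty, round_idx)
        verdict = judge(task, reasoner(task, skill))
        history.append((difficulty, skill, verdict))
        if verdict:
            difficulty += 1  # Challenger escalates after a success
        else:
            skill += 1  # Reasoner updates after a failure
    return history


history = self_play(rounds=6)
for difficulty, skill, verdict in history:
    print(difficulty, skill, verdict)
```

In this toy the two sides simply alternate, with difficulty staying one step ahead of skill. A real system would also need some safeguard against the Challenger producing tasks that are unsolvable or the Judge being gamed, which this sketch does not model.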
But the corpus also draws a sharp line around what few demonstrations can ever buy you. Prompting and prompt optimization operate entirely inside a model's existing training distribution: they reorganize and activate knowledge that's already there, but cannot inject knowledge the model never learned (see "Can prompt optimization teach models knowledge they lack?"). Worse, when in-context examples conflict with strong parametric priors, the model often ignores the demonstration entirely; textual prompting alone can't override what training baked in (see "Why do language models ignore information in their context?"). So 'more value from fewer demonstrations' has a ceiling: examples are excellent at *steering* latent capability and terrible at *adding* it.
The deepest version of that ceiling is structural. Self-improvement, meaning squeezing value out of the model's own outputs rather than fresh data, is formally bounded by the generation-verification gap: every reliable improvement needs something external that can verify and enforce it, so a model can't bootstrap past its limits through metacognition alone (see "What stops large language models from improving themselves?"). This is exactly why the confidence-as-reward and self-play tricks work: they smuggle in a *verifier* (a confidence signal, a binary judge) to stand in for the missing external check.
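As a rough illustration of a confidence signal standing in for a verifier, the sketch below ranks sampled reasoning traces by answer-span confidence and turns the ranking into (chosen, rejected) pairs. The `Trace` type, the use of the geometric-mean token probability as the confidence score, and the `min_gap` threshold are all assumptions made for this example, not details taken from the cited notes.

```python
import math
from typing import List, NamedTuple, Tuple


class Trace(NamedTuple):
    reasoning: str
    answer: str
    answer_token_logprobs: List[float]  # log-probs of the answer tokens only


def answer_confidence(trace: Trace) -> float:
    """Geometric-mean probability of the answer span (assumed scoring rule)."""
    logprobs = trace.answer_token_logprobs
    if not logprobs:
        return 0.0  # an empty answer span carries no confidence
    return math.exp(sum(logprobs) / len(logprobs))


def build_preference_pairs(
    traces: List[Trace], min_gap: float = 0.1
) -> List[Tuple[Trace, Trace]]:
    """Pair the most confident trace against clearly less confident ones."""
    ranked = sorted(traces, key=answer_confidence, reverse=True)
    if not ranked:
        return []
    best = ranked[0]
    best_conf = answer_confidence(best)
    return [
        (best, other)
        for other in ranked[1:]
        # Skip near-ties: a tiny gap is too weak a signal to train on.
        if best_conf - answer_confidence(other) >= min_gap
    ]


traces = [
    Trace("careful derivation", "42", [-0.1, -0.1]),
    Trace("shaky guess", "17", [-1.0, -1.0]),
    Trace("alternate derivation", "42", [-0.15]),
]
pairs = build_preference_pairs(traces)
for chosen, rejected in pairs:
    print(chosen.answer, ">", rejected.answer)
```

The pairs produced this way could feed a preference objective such as DPO. Note that the signal is only as good as the model's calibration: if confidence does not track correctness, the ranking reinforces whatever the model already believes.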
The quiet payoff here is a reframe. The question that looks like 'how do I learn from less data' is really two questions — how rich is the signal in each example (negatives and self-generated preferences beat plain positives), and is there a verifier in the loop (without one, extra demonstrations just re-shuffle what the model already believes). Fewer demonstrations work when each one carries a contrast and a check, not when you simply hand the model more correct answers to imitate.
Sources (8 notes)
Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.
RLSF uses answer-span confidence to rank reasoning traces, creating synthetic preferences that strengthen step-by-step reasoning while reversing RLHF's calibration degradation—without requiring human labels or external verifiers.
Post-Completion Learning exploits unused sequence space after model output to train self-assessment capabilities during training while maintaining zero inference cost. The model learns to compute its own reward functions, internalizing evaluation rather than relying on external reward models.
Ctx2Skill's three-role self-play loop manufactures missing feedback through internal signals: the Challenger escalates difficulty as curriculum, the Judge gives binary verdicts as reward, and both sides evolve via natural-language skill edits. Success requires balancing adversarial pressure against a generalization safeguard to prevent collapse.
Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.
Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.
Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.
Self-improvement in LLMs is formally bounded by the generation-verification gap, meaning every reliable fix requires something external to validate and enforce it. Models cannot escape this constraint through metacognition alone.