INQUIRING LINE

What failure modes do imitation and outcome methods each address?

This explores the division of labor between imitation-based training (copying demonstrations, SFT, distillation) and outcome-based training (rewarding final results, RLVR) — what each one is actually good at fixing, and where each one breaks.


The corpus is unusually clear that imitation methods (copying good demonstrations) and outcome methods (rewarding correct final results) fail in opposite directions, which is exactly why the strongest results come from chaining them. Imitation's job is to install plausible structure: it teaches a model what a reasonable answer or reasoning trace looks like. Its characteristic failure is that structure is all it installs. "Can imitating ChatGPT fool evaluators into thinking models improved?" shows that imitation models learn ChatGPT's confident, fluent style well enough to fool human evaluators while leaving factuality and generalization on novel tasks unimproved: the ceiling stays pinned to the base model. "Why does chain-of-thought reasoning fail in predictable ways?" sharpens the point. Chain-of-thought is itself a form of constrained imitation, pattern-matching the shape of reasoning rather than performing inference, which is why its failures are bounded by the training distribution and why a trace can be structurally coherent while its content is wrong.

Outcome methods address precisely that gap: they don't care whether the trace looks right, only whether the answer is. But they have their own signature failure, which is that they are blind to everything that isn't the final result. "Why do outcome-based reward models fail at intermediate step evaluation?" shows that outcome reward models systematically underrate good intermediate steps because they were trained only on final outcomes, producing high false-negative rates in the middle of a reasoning chain. And when the outcome signal is too sparse to be informative, outcome training actively corrodes capability: "Do overly hard RLVR samples actually harm model capabilities?" finds that near-impossible problems push models toward answer-repetition and computation-skipping shortcuts that then contaminate skills the model already had.
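The sparse-signal failure is easier to see with numbers. The sketch below assumes GRPO-style group-relative normalization (each rollout's reward minus the group mean, divided by the group standard deviation), which the RLVR source note names but does not spell out. The function name, the group size of 16, and the binary rewards are illustrative assumptions, not details taken from the note.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Assumed form: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Near-impossible prompt: one accidental success among 16 rollouts.
hard = group_relative_advantages([1.0] + [0.0] * 15)

# Moderately hard prompt: 8 successes among 16 rollouts.
medium = group_relative_advantages([1.0] * 8 + [0.0] * 8)

print(round(hard[0], 2), round(medium[0], 2))
```

Under these assumptions the lone lucky success receives an advantage of about 3.9, against 1.0 for each success in the balanced group, so whatever shortcut produced the accident is reinforced almost four times as strongly per sample.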

The lateral insight is that these two failure profiles are complementary, not competing. "Does sequencing imitation then exploration training improve reasoning?" makes the dependency explicit: run imitation first to create reasonable rollouts, and only then does the outcome reward become informative enough to sharpen them. Each method supplies what the other lacks, and the sequence beats either alone. The same asymmetry shows up at the level of individual trajectories. "Should successful and failed episodes be processed differently?" argues for treating successes as concrete demonstrations (imitate them) and failures as abstracted lessons (learn the outcome signal from them), and "Can agents learn better from their failures than successes?" adds that strategy-level hints drawn from both successes and failures outperform success-only memory. "Why do correct code trajectories teach models to tolerate errors?" operationalizes the asymmetry in training: filter positive trajectories for quality, but preserve diverse failures as negative signal, which lets a 14B model reach frontier math performance.
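The asymmetric treatment of successes and failures can be sketched as a rollout-selection step. This is a minimal illustration of the idea only, not the GRPO-RoC procedure itself: the `Trajectory` fields, the `tool_errors` quality proxy, and the `select_rollouts` function are all hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str
    correct: bool     # did the final answer pass the verifier?
    tool_errors: int  # hypothetical quality proxy: failed tool calls in the trace


def select_rollouts(trajs, n_pos, n_neg, seed=0):
    """Keep the cleanest successes, but sample failures without filtering."""
    rng = random.Random(seed)
    positives = [t for t in trajs if t.correct]
    negatives = [t for t in trajs if not t.correct]

    # Successes are imitated, so only the cleanest traces are kept.
    clean = sorted(positives, key=lambda t: t.tool_errors)[:n_pos]

    # Failures are negative signal, so they are sampled uniformly
    # to preserve their variety rather than ranked by quality.
    kept_neg = rng.sample(negatives, min(n_neg, len(negatives)))
    return clean + kept_neg


rollouts = [
    Trajectory("a", True, 0),
    Trajectory("b", True, 3),
    Trajectory("c", True, 1),
    Trajectory("d", False, 2),
    Trajectory("e", False, 0),
    Trajectory("f", False, 5),
]
batch = select_rollouts(rollouts, n_pos=2, n_neg=2)
```

In this toy batch the correct-but-messy trace "b" is dropped, so the model is never rewarded for a success that tolerated three tool errors, while failures are retained regardless of how messy they are.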

What the reader probably didn't expect: the choice between imitation and outcome isn't just an efficiency trade-off, it determines the *direction* of a model's residual failures. "Does training objective determine which direction models fail at abstention?" shows that reasoning-trained (outcome-reward) models under-abstain and over-answer because abstention is never rewarded, while safety-trained models over-abstain and refuse benign questions. So the methods don't just fix different problems; they leave behind different, predictable failure signatures. Knowing which objective dominated training tells you which way the model will be miscalibrated before you ever test it.
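The under-abstention result follows from a short expected-reward argument. The derivation below is a sketch under an assumed reward scheme (1 for a correct answer, 0 for a wrong answer, 0 for abstaining); the symbols $p$ and $c$ are introduced here for illustration and do not come from the source note.

```latex
% p = the model's probability of answering correctly on a given question.
\begin{align}
\mathbb{E}[R \mid \text{answer}]  &= p \cdot 1 + (1 - p) \cdot 0 = p \\
\mathbb{E}[R \mid \text{abstain}] &= 0
\end{align}
% Since p \ge 0, answering is never worse than abstaining, however unsure the model is.

% Hypothetical variant: a wrong answer costs c > 0.
\begin{align}
\mathbb{E}[R \mid \text{answer}] &= p - (1 - p)\,c \\
\text{answer is preferred} &\iff p > \frac{c}{1 + c}
\end{align}
```

With no penalty for wrong answers the threshold collapses to zero, so a purely outcome-rewarded policy has no incentive to abstain at any confidence level.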


Sources (9 notes)

Can imitating ChatGPT fool evaluators into thinking models improved?

Imitation models fool human evaluators by mimicking ChatGPT's confident, fluent style while failing to improve factuality or generalization on novel tasks. The ceiling is set by base model capability, not fine-tuning method—better fundamentals, not shortcuts, drive real improvement.

Why does chain-of-thought reasoning fail in predictable ways?

CoT guides models to pattern-match reasoning structure rather than perform genuine inference. This explains distribution-bounded failures, why structural coherence matters more than content correctness, and why performance optimizes against interpretability.

Why do outcome-based reward models fail at intermediate step evaluation?

ORMs systematically underestimate intermediate steps due to training only on final outcomes, producing high false-negative rates. PRMs solve this with step-level feedback but demand costly skilled annotation, revealing a core trade-off in reward model design.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Does sequencing imitation then exploration training improve reasoning?

Running Supervised RL first to establish reasoning foundations, then RLVR to refine against verifiable rewards, substantially outperforms both methods in isolation. The imitation phase makes outcome rewards informative by creating reasonable rollouts the RL phase can then sharpen.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.

Can agents learn better from their failures than successes?

ReasoningBank shows that storing strategy-level reasoning hints from both self-judged successes and failures outperforms success-only memory and raw trajectory storage. Coupled with test-time scaling, memory and compute compound rather than substitute, creating a novel scaling law where accuracy improves through cumulative interaction history.

Why do correct code trajectories teach models to tolerate errors?

GRPO-RoC filters positive trajectories for quality while preserving diverse failures as negative signal, allowing a 14B model to reach frontier math performance in 510 RL steps, surpassing much larger models with cleaner reasoning.

Does training objective determine which direction models fail at abstention?

Reasoning-trained models under-abstain and over-answer because abstention is unrewarded, while safety-trained models over-abstain and refuse benign questions. This reveals that calibration is not a single fixable axis but a characteristic failure signature that depends on which objective dominated training.
