How does a single training example trigger phase transitions in reasoning output?
This explores how a *single* training example can flip a model's reasoning behavior so dramatically, and why the corpus concludes that the example isn't teaching new skills: it's flipping a switch on capability that was already there.
The corpus's striking answer is that the example isn't building reasoning; it's *activating* reasoning the model already had. The clearest data point comes from RLVR (reinforcement learning with verifiable rewards): a single training example lifts math performance from 36% to 73.6%, and, stranger still, test accuracy keeps climbing for 1,400 steps *after* training accuracy has already hit 100% (see "Can a single training example unlock mathematical reasoning?"). That post-saturation generalization is the signature of a phase transition: the model isn't memorizing the example, it's being tipped into a different operating regime.
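One way to make "post-saturation generalization" concrete is to measure how much test accuracy still improves after training accuracy first saturates. The sketch below is a minimal illustration under stated assumptions: the function name `post_saturation_gain` is hypothetical, the curves are made-up placeholder values rather than results from the cited note, and each list entry stands for one logged checkpoint.

```python
def post_saturation_gain(train_acc, test_acc, saturation=1.0):
    """Measure test-accuracy improvement after training accuracy saturates.

    Returns (saturation_index, checkpoints_until_test_peak, test_gain),
    or None if training accuracy never reaches `saturation`.
    """
    for sat_idx, acc in enumerate(train_acc):
        if acc >= saturation:
            break
    else:
        return None  # training accuracy never saturated

    # Best test checkpoint at or after the saturation point.
    peak_idx = max(range(sat_idx, len(test_acc)), key=lambda i: test_acc[i])
    return sat_idx, peak_idx - sat_idx, test_acc[peak_idx] - test_acc[sat_idx]


# Placeholder curves for illustration only (NOT data from the source).
train = [0.2, 0.6, 1.0, 1.0, 1.0, 1.0]
test = [0.36, 0.45, 0.50, 0.60, 0.70, 0.69]

result = post_saturation_gain(train, test)
print(result)
```

A large positive third value, reached many checkpoints after saturation, is the pattern the paragraph above describes: the training example is already solved, yet held-out performance keeps rising.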
Why would one example be enough? Because the reasoning capacity is latent in the base model, waiting to be elicited. Five independent methods (RL steering, critique fine-tuning, decoding tweaks, sparse-autoencoder feature steering, and RLVR) all unlock reasoning that already exists in base-model activations (see "Do base models already contain hidden reasoning ability?"). The bottleneck is elicitation, not acquisition. Seen this way, a single example works like a key, not a curriculum: it selects an existing mode rather than creating a new one. A complementary framing argues that RL post-training teaches the model *when* to reason, not *how*: hybrid models recover 91% of the gains by routing tokens alone, and activation vectors for reasoning strategies exist before any RL touches the weights (see "Does RL post-training create reasoning or just deploy it?").
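To illustrate what "steering" an existing activation direction can mean, here is a minimal sketch of generic difference-of-means activation steering in pure Python. It is an assumption-laden toy, not the procedure used in the cited notes: the function names, the three-dimensional vectors, and the scale `alpha` are all hypothetical, and real work would operate on a model's hidden states rather than hand-written lists.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(reasoning_acts, plain_acts):
    """Difference of mean activations: 'reasoning' minus 'plain' examples."""
    mu_reasoning = mean_vector(reasoning_acts)
    mu_plain = mean_vector(plain_acts)
    return [r - p for r, p in zip(mu_reasoning, mu_plain)]


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction by scale alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical values, for illustration only).
reasoning_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
plain_acts = [[0.0, 0.0, 1.0], [2.0, 0.0, 1.0]]

direction = steering_vector(reasoning_acts, plain_acts)
steered = steer([0.5, 0.5, 0.5], direction, alpha=2.0)
print(direction, steered)
```

The point of the sketch is that no weights change: the direction is read off activations the model already produces, which is the sense in which such methods select a mode rather than create one.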
If reasoning is being switched on rather than taught, you'd expect the *content* of the training signal to matter less than its role as a trigger, and that's exactly what shows up. Models trained on deliberately corrupted, semantically irrelevant reasoning traces perform comparably to those trained on correct ones, sometimes generalizing *better* out of distribution (see "Do reasoning traces need to be semantically correct?"). The traces act as computational scaffolding, not meaningful logic. That fits the broader finding that chain-of-thought is constrained imitation of reasoning *form* rather than genuine inference (see "Does chain-of-thought reasoning reveal genuine inference or pattern matching?" and "What makes chain-of-thought reasoning actually work?"). A trigger that activates a latent pattern doesn't need to be a good example; it just needs to point at the right pattern.
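One simple way to build "semantically irrelevant" traces for such a comparison is to pair each problem with a trace written for a different problem while leaving the final answer untouched. The sketch below assumes that scheme; it is only one possible corruption strategy, not necessarily the one used in the cited work, and the `swap_traces` helper and the record fields (`question`, `trace`, `answer`) are hypothetical.

```python
import random


def swap_traces(examples, seed=0):
    """Give every problem a trace taken from a different problem.

    Questions and answers stay intact; only the trace is mismatched.
    """
    if len(examples) < 2:
        raise ValueError("need at least two examples to swap traces")
    rng = random.Random(seed)
    order = list(range(len(examples)))
    rng.shuffle(order)
    # A cyclic shift along the shuffled order guarantees that no example
    # ends up with its own trace.
    donor = {order[i]: order[(i + 1) % len(order)] for i in range(len(order))}
    return [
        {
            "question": ex["question"],
            "trace": examples[donor[i]]["trace"],
            "answer": ex["answer"],
        }
        for i, ex in enumerate(examples)
    ]


# Toy records for illustration only.
examples = [
    {"question": "2 + 3 = ?", "trace": "Add two and three.", "answer": "5"},
    {"question": "4 * 6 = ?", "trace": "Multiply four by six.", "answer": "24"},
    {"question": "9 - 1 = ?", "trace": "Subtract one from nine.", "answer": "8"},
]
corrupted = swap_traces(examples)
```

Training one model on `examples` and another on `corrupted`, then comparing held-out accuracy, is the shape of experiment the paragraph above refers to.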
The deeper question the phase-transition framing raises: where did the latent capability come from in the first place? Analysis of 5 million pretraining documents suggests that reasoning rides on broad, transferable *procedural* knowledge absorbed across many sources, distinct from the narrow, document-specific memorization behind factual recall (see "Does procedural knowledge drive reasoning more than factual retrieval?"). So pretraining lays down the reasoning machinery diffusely; a single later example just trips the switch. Some researchers are now trying to move that activation earlier, planting CoT as an exploratory action *during* pretraining with information-gain rewards (see "Can chain-of-thought reasoning be learned during pretraining itself?").
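An information-gain reward of this kind can be read as a log-likelihood improvement: how much more probable the observed next token becomes when the model conditions on its own chain of thought. The sketch below assumes that reading; the function name and the probabilities are hypothetical, and details such as which model supplies the no-thought baseline are not specified here.

```python
import math


def information_gain_reward(p_with_thought, p_without_thought):
    """Log-likelihood improvement (in nats) on the observed next token.

    Positive when the thought made the true token more likely,
    negative when it made it less likely.
    """
    for p in (p_with_thought, p_without_thought):
        if not 0.0 < p <= 1.0:
            raise ValueError("probabilities must be in (0, 1]")
    return math.log(p_with_thought) - math.log(p_without_thought)


# Hypothetical probabilities assigned to the true next token.
helpful = information_gain_reward(0.60, 0.30)  # thought doubled the probability
harmful = information_gain_reward(0.10, 0.30)  # thought hurt the prediction
neutral = information_gain_reward(0.30, 0.30)  # thought made no difference
print(helpful, harmful, neutral)
```

Because the reward depends only on the model's own predictive probabilities, it needs no external verifier, which is what makes it usable on ordinary pretraining text.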
The catch worth knowing: a switch flipped this cheaply has limits that the headline gains can hide. CoT degrades predictably under distribution shift, turning fluent but logically inconsistent once you move outside the training distribution (see "Does chain-of-thought reasoning actually generalize beyond training data?"), and failures track *instance-level novelty*, not task complexity, because models fit instance patterns rather than general algorithms (see "Do language models fail at reasoning due to complexity or novelty?"). So a single example can produce a dramatic jump on familiar territory and still leave the underlying reasoning brittle where it counts. The phase transition is real, but it's a transition in *deployment*, not in raw capability.
Sources (10 notes)
A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.
Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.
Evidence shows base models already contain reasoning capability in latent form; RL training optimizes when that capability is deployed rather than creating it. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.
Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.
CoT works by constraining models to reproduce familiar reasoning patterns from training, not by enabling novel symbolic reasoning. Performance degrades predictably under distribution shifts—the signature of imitation rather than capability emergence.
CoT systems reproduce the form of reasoning through pattern matching rather than performing genuine logical inference. This explains why format effects dominate content, why structurally invalid prompts succeed, and why stronger reasoning models become less instruction-compliant.
Analysis of 5 million pretraining documents shows reasoning relies on broad, transferable procedural knowledge from diverse sources, unlike factual recall which depends on narrow, document-specific memorization of target facts.
RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.
DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.
LRMs don't break at complexity thresholds but at instance-novelty boundaries. Models fit instance-based patterns rather than generalizable algorithms, so any reasoning chain succeeds if trained on similar instances, regardless of length.