How much training data is truly necessary to unlock latent model reasoning?

This note explores whether reasoning has to be taught with large datasets, or whether it is already latent in a trained model and needs only a small nudge to surface, and where the limits of that nudge lie.

The question is read here as being less about "how do we teach a model to reason" and more about "how little does it take to switch on reasoning a model already has." On that framing, the corpus is surprisingly emphatic: the answer is often *startlingly little*. The strongest version comes from work showing that base models already contain latent reasoning ability, and that five quite different techniques (RL steering, critique fine-tuning, decoding tweaks, sparse-feature steering, and RLVR, i.e. reinforcement learning with verifiable rewards) all converge on the same conclusion: post-training *selects* reasoning that is already there rather than creating it (see "Do base models already contain hidden reasoning ability?"). The bottleneck is elicitation, not acquisition.

How little? In the RLVR setting, a single training example can be enough to activate a model's reasoning and, strikingly, spurious or even incorrect rewards work nearly as well as correct ones, provided the model was pretrained appropriately (see "What does reward learning actually do to model reasoning?"). Reward learning here is not injecting a new skill; it is reweighting strategies the model already knows. The same shape shows up in cheaper interventions: a single steering vector extracted from just 50 paired examples can cut chain-of-thought length by two-thirds without losing accuracy, with no retraining at all (see "Can we steer reasoning toward brevity without retraining?"). If reasoning behavior lives along a steerable direction in activation space, then "training data" starts to look like the wrong unit of measurement entirely.
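To make the steering-vector idea concrete, here is a minimal sketch of one common way such a vector can be built: take the mean difference between paired activations and add it back at inference time. This is an assumption for illustration, not necessarily how the cited method extracts its vector. The function names `extract_steering_vector` and `steer`, the 16-dimensional size, and the synthetic data standing in for real model activations are all hypothetical.

```python
import numpy as np

def extract_steering_vector(verbose_acts, concise_acts):
    """Mean difference between paired activations (concise minus verbose).

    Assumption: difference-of-means is used here as an illustrative
    extraction rule; the cited method may differ.
    """
    verbose_acts = np.asarray(verbose_acts, dtype=float)
    concise_acts = np.asarray(concise_acts, dtype=float)
    if verbose_acts.shape != concise_acts.shape:
        raise ValueError("paired activations must have the same shape")
    return (concise_acts - verbose_acts).mean(axis=0)

def steer(hidden, vector, strength=1.0):
    """Shift a hidden state (or a batch of them) along the steering vector."""
    return np.asarray(hidden, dtype=float) + strength * vector

# Synthetic stand-in for real activations: 50 pairs, 16 dimensions.
# The "concise" member of each pair differs from the "verbose" one
# mainly along a single planted direction, plus a little noise.
rng = np.random.default_rng(0)
true_direction = np.zeros(16)
true_direction[3] = 2.0
verbose = rng.normal(size=(50, 16))
concise = verbose + true_direction + 0.01 * rng.normal(size=(50, 16))

vector = extract_steering_vector(verbose, concise)
steered = steer(verbose[0], vector, strength=1.0)
```

In this toy setup the 50 pairs are enough to recover the planted direction closely, because averaging cancels the per-pair noise. Nothing is trained: the only "data" is the set of pairs used to compute one mean.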

But "unlock" has a hard ceiling, and this is the part a curious reader might not expect. You can only activate what is already in the distribution. Prompt optimization cannot supply knowledge a model never learned; it can only reorganize what exists (see "Can prompt optimization teach models knowledge they lack?"). And the reasoning that gets unlocked is semantic, not symbolic: strip the familiar meaning out of a task and performance collapses even when the correct rules are sitting right there in the prompt (see "Do large language models reason symbolically or semantically?"). So minimal data unlocks minimal-distance reasoning: it surfaces capability but does not extend the frontier.

There is also a sharp warning against assuming "more/harder data = more reasoning." Training on near-impossible RLVR problems does not stretch the model; it teaches degenerate shortcuts, such as answer repetition and computation-skipping, that then contaminate capabilities the model already had (see "Do overly hard RLVR samples actually harm model capabilities?"). The right data is calibrated data, not abundant data. Meanwhile, the gap between reasoning and non-reasoning models does not close by throwing inference compute at it: the training regime installs a *protocol* that makes the extra tokens productive, which no amount of test-time budget recreates (see "Can non-reasoning models catch up with more compute?").
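The source note on overly hard samples attributes the damage to group-relative normalization, which scores each rollout against the others sampled for the same prompt. The sketch below shows why a rare accidental success ends up with a very large advantage. It assumes a GRPO-style standardization (reward minus group mean, divided by group standard deviation); the exact formula, the function name `group_relative_advantages`, and the `eps` constant are illustrative assumptions rather than details taken from the note.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumption: GRPO-style normalization, (r - mean) / (std + eps),
    using the population standard deviation.
    """
    rewards = np.asarray(rewards, dtype=float)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + eps)

# One lucky success among 8 rollouts on a near-impossible problem.
small_group = group_relative_advantages([1, 0, 0, 0, 0, 0, 0, 0])

# The same single success among 64 rollouts: rarer, so weighted even more.
large_group = group_relative_advantages([1] + [0] * 63)

# No success at all: every advantage is zero, so there is no learning signal.
no_success = group_relative_advantages([0] * 8)
```

With one success in 8 rollouts, the successful trajectory gets an advantage of roughly 2.65; with one in 64 it rises to roughly 7.9. The harder the problem, the rarer the success and the more strongly that one trajectory is reinforced, whether or not it got there by sound reasoning.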

The quietly radical thread underneath all of this: if reasoning is latent and data is almost incidental to eliciting it, maybe the lever is not data at all but *architecture and inference*. Energy-based transformers reach System-2-style deliberation purely from unsupervised learning, with no domain-specific reasoning data or scaffolding (see "Can energy minimization unlock reasoning without domain-specific training?"). Latent-thought models add scaling dimensions that are independent of parameters and of training corpus size (see "Can latent thought vectors scale language models beyond parameters?"), and stochastic latent reasoning scales reasoning *wider*, by sampling parallel trajectories, rather than demanding more training (see "Can reasoning systems scale wider instead of only deeper?" and "Can stochastic latent reasoning help models explore multiple solutions?"). Taken together, the corpus suggests the honest answer to "how much data?" is: far less than the field assumed for elicitation, and possibly the wrong question for everything beyond it.

Sources (11 notes)

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

What does reward learning actually do to model reasoning?

Research shows RLVR improves sampling efficiency within existing capability boundaries without expanding them. A single training example suffices for activation, and spurious rewards work nearly as well as correct ones for models with appropriate pretraining.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Do overly hard RLVR samples actually harm model capabilities?

Training on nearly-impossible problems causes models to learn degenerate shortcuts rather than genuine reasoning, and these shortcuts contaminate pre-existing capabilities. Group-relative normalization treats rare accidental successes as high-advantage trajectories, reinforcing answer repetition and computation-skipping instead of sound reasoning patterns.

Can non-reasoning models catch up with more compute?

Reasoning models persistently outperform non-reasoning models regardless of inference budget because training instills a reasoning protocol that makes additional tokens productive. The gap is fundamentally about deployment mechanisms and training structure, not raw capability.

Can energy minimization unlock reasoning without domain-specific training?

Energy-Based Transformers assign energy values to input-prediction pairs and use gradient descent minimization for inference, yielding 35% higher training scaling rates and 29% more inference-compute gains than Transformer++, while generalizing better on out-of-distribution data without domain-specific scaffolding.

Can latent thought vectors scale language models beyond parameters?

Latent-Thought Language Models achieve superior sample and parameter efficiency by coupling fast local variational learning with slow global decoder learning. This dual-rate scheme scales few-shot reasoning across both model and latent size, creating independent scaling dimensions beyond traditional parameter scaling.

Can reasoning systems scale wider instead of only deeper?

GRAM shows that stochastic latent transitions enabling parallel trajectory sampling sidestep the serial latency cost of depth-only scaling. Width matches token-level parallelism benefits: independent paths sample the solution space without variance inflation.

Can stochastic latent reasoning help models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent distributions over solutions rather than single predictions. This allows handling of ambiguous problems and multiple valid strategies that deterministic designs cannot represent.
