How much reasoning catalyst data is actually needed for improvement?

This reads the question as: if reasoning data acts as a 'catalyst' (activating ability rather than teaching it from scratch), how much of it do you actually need — and the corpus suggests the answer is 'surprisingly little,' because what the data does is unlock latent capability, not install new skills.

Across the corpus, the through-line is that improving a model takes less reasoning data than you'd think, because the data tends to *activate* ability the model already has rather than build it. The clearest version of this comes from work arguing that RL post-training teaches a model *when* to reason, not *how*: base models already contain reasoning strategies in latent form, and hybrid setups recover 91% of the performance gains just by routing which tokens get the reasoning treatment (see 'Does RL post-training create reasoning or just deploy it?'). If the capability pre-exists, the data's job is to flip a switch, and flipping a switch doesn't take much.

The most striking evidence that *quantity isn't the lever* is that the traces don't even need to be correct. Models trained on deliberately corrupted, irrelevant reasoning steps perform comparably to models trained on clean traces, and sometimes generalize better out-of-distribution (see 'Do reasoning traces need to be semantically correct?'). That implies the trace functions as computational scaffolding, a prompt to 'think in this shape,' rather than as a lesson the model memorizes. If the *content* of the data barely matters, piling on more of it is the wrong place to spend effort.
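A comparison like this needs a control set whose traces are irrelevant to their questions. The Python sketch below shows one hypothetical way to build one: keep each question and answer paired, but borrow the trace from a different example. It is not the procedure used in the cited work, which this note does not describe; the function `swap_traces` and the field names `question`, `trace`, and `answer` are assumptions made for illustration.

```python
def swap_traces(examples, shift=1):
    """Return copies of `examples` where each trace comes from another example.

    Hypothetical control-set construction: question and answer stay paired,
    but the reasoning trace is taken from the example `shift` positions later
    (wrapping around), so it is irrelevant to the question it accompanies.
    """
    n = len(examples)
    if n < 2 or shift % n == 0:
        raise ValueError("need at least two examples and a non-zero shift")
    return [
        {
            "question": ex["question"],
            "trace": examples[(i + shift) % n]["trace"],  # borrowed trace
            "answer": ex["answer"],
        }
        for i, ex in enumerate(examples)
    ]


# Toy data, invented for this example only.
clean = [
    {"question": "2 + 3?", "trace": "Add 2 and 3.", "answer": "5"},
    {"question": "4 * 6?", "trace": "Multiply 4 by 6.", "answer": "24"},
    {"question": "9 - 1?", "trace": "Subtract 1 from 9.", "answer": "8"},
]
corrupted = swap_traces(clean)
```

Training one model on `clean` and another on `corrupted`, then comparing final-answer accuracy, is the shape of the experiment. If the two come out close, the trace is doing something other than teaching its content.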

Where effort *does* pay off is selection and structure over volume. Step-level confidence filtering matches the accuracy of naive majority voting while generating far fewer traces, because a local check catches a single broken reasoning step that a global average would mask (see 'Does step-level confidence outperform global averaging for trace filtering?'). On the judging side, generative reward models that reason about each step outperform classifier-style ones with *orders of magnitude less training data* (see 'Can judges that reason about reasoning outperform classifier rewards?'). Both point the same direction: a small amount of well-chosen, well-structured signal outperforms a large undifferentiated pile.
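The difference between a global and a step-level check is easy to see in a small Python sketch. It assumes each trace carries one confidence value per reasoning step in the range 0 to 1 (for example, a mean token probability); that representation, the function names, the threshold, and the toy numbers are all assumptions for illustration, not the cited method's actual scoring.

```python
from collections import Counter
from statistics import mean


def global_confidence(steps):
    """Average confidence over the whole trace (assumes a non-empty list)."""
    return mean(steps)


def local_confidence(steps, window=1):
    """Lowest sliding-window average: one weak stretch drags the score down."""
    if len(steps) <= window:
        return mean(steps)
    return min(mean(steps[i:i + window]) for i in range(len(steps) - window + 1))


def majority_vote(traces):
    """Naive baseline: every trace gets a vote, whatever its quality."""
    return Counter(answer for answer, _ in traces).most_common(1)[0][0]


def filtered_vote(traces, threshold, window=1):
    """Vote only among traces whose weakest step clears the threshold."""
    kept = [a for a, steps in traces if local_confidence(steps, window) >= threshold]
    return Counter(kept).most_common(1)[0][0] if kept else None


# Toy (answer, per-step confidences) pairs, invented for this example.
traces = [
    ("42", [0.9, 0.9, 0.8]),
    ("17", [0.95, 0.2, 0.95, 0.95]),  # high average, one broken step
    ("17", [0.9, 0.95, 0.3, 0.95]),   # high average, one broken step
]

naive = majority_vote(traces)
filtered = filtered_vote(traces, threshold=0.5)
```

Here both "17" traces have a global average above 0.75, so an average-based filter at 0.5 would keep them and the vote would go to "17". The step-level filter drops them because each contains a step below 0.5, leaving "42". The same per-step score could also be checked while a trace is still being generated, which is where the early stopping described in the source note would come from.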

There's also a cautionary flip side: more isn't just unnecessary, it can be actively harmful. Supervised fine-tuning that raises benchmark accuracy can simultaneously cut Information Gain by 38.9%, a sign that right answers are being produced via post-hoc rationalization rather than genuine inferential steps (see 'Does supervised fine-tuning improve reasoning or just answers?'). And reported gains can be illusory: behavioral activation and benchmark improvement are separable, so some of what looks like 'the data worked' is memorization on contaminated test sets, not new reasoning (see 'Can genuine reasoning activation coexist with contaminated benchmarks?'). Even at inference, more isn't better: accuracy peaks and then *falls* as thinking tokens climb, in one reported case from 87.3% at ~1,100 tokens to 70.3% at ~16K (see 'Does more thinking time always improve reasoning accuracy?').
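The inference-time point can be made concrete with the two figures reported in 'Does more thinking time always improve reasoning accuracy?': 87.3% accuracy at roughly 1,100 thinking tokens and 70.3% at roughly 16K. The Python sketch below computes the size of that drop and shows the habit it argues for, picking the budget with the best measured accuracy rather than the largest one. The function names are assumptions, and 16K is approximated as 16,000.

```python
# Figures reported in the source note; the token counts are approximate.
reported = {1_100: 87.3, 16_000: 70.3}  # thinking tokens -> accuracy (%)


def best_budget(accuracy_by_budget):
    """Pick the budget with the highest measured accuracy, not the largest."""
    return max(accuracy_by_budget, key=accuracy_by_budget.get)


def accuracy_drop(accuracy_by_budget, short, long):
    """Return the absolute (points) and relative (fraction) loss from short to long."""
    absolute = accuracy_by_budget[short] - accuracy_by_budget[long]
    relative = absolute / accuracy_by_budget[short]
    return absolute, relative


budget = best_budget(reported)
absolute, relative = accuracy_drop(reported, 1_100, 16_000)
print(budget, round(absolute, 1), round(relative * 100, 1))
```

That is a 17.0-point absolute drop, about 19.5% relative to the shorter budget. Two points are not enough to locate the peak itself; doing that would require measuring accuracy at intermediate budgets on a held-out set.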

The less obvious takeaway: 'how much data' may be the wrong question entirely. If reasoning data is a catalyst that activates a latent capacity, the binding constraints are *what you select* and *whether you're measuring real reasoning or just answer-matching*, not how many traces you can amass. A small, well-filtered set of examples that trips the switch can beat a large corpus, and a large corpus can quietly teach the model to fake it.

Sources (7 notes)

Does RL post-training create reasoning or just deploy it?

Evidence shows base models already contain reasoning capability in latent form; RL training optimizes deployment timing rather than creating capability. Hybrid models recover 91% of performance gains by routing tokens only, and activation vectors for reasoning strategies exist before any RL.

Do reasoning traces need to be semantically correct?

Models trained on systematically irrelevant traces maintain solution accuracy and sometimes improve out-of-distribution generalization, suggesting traces function as computational scaffolding rather than meaningful reasoning steps.

Does step-level confidence outperform global averaging for trace filtering?

Local step-level confidence catches reasoning breakdowns that global averaging masks and enables early stopping before traces complete. This approach achieves accuracy gains comparable to naive majority voting with far fewer generated traces, suggesting that trace quality matters more than quantity.

Can judges that reason about reasoning outperform classifier rewards?

StepWiser demonstrates that training judges to produce reasoning chains about policy reasoning—rather than classify steps—yields better judgment accuracy and data efficiency. Independent confirmation from GenPRM and ThinkPRM shows generative PRMs outperform discriminative ones with orders of magnitude less training data.

Does supervised fine-tuning improve reasoning or just answers?

Supervised fine-tuning improves final-answer accuracy on benchmarks but cuts Information Gain by 38.9 percent, meaning models generate correct answers through post-hoc rationalization rather than genuine inferential steps. Standard metrics miss this degradation because they only measure final correctness.

Can genuine reasoning activation coexist with contaminated benchmarks?

RLVR (reinforcement learning with verifiable rewards) activates genuine reasoning patterns, while benchmark improvements may reflect memorization of contaminated datasets. The two operate at different measurement levels and can coexist without contradiction.

Does more thinking time always improve reasoning accuracy?

Increasing thinking tokens from ~1,100 to ~16K reduced benchmark accuracy from 87.3% to 70.3%, revealing a non-monotonic relationship where models overthink easy problems and underthink hard ones.
