INQUIRING LINE

What makes content informative and not-yet-mastered for reinforcement during pretraining?

This explores what signal tells a model that a piece of training text is worth reinforcing during pretraining — what makes content count as both informative (it teaches something) and not-yet-mastered (the model doesn't already know it).


The cleanest answer in the corpus comes from PretrainZero, which shows that the gains from running reinforcement learning during ordinary pretraining come not from new data but from *which* content gets reinforced: the method actively selects passages the model hasn't yet mastered, and this beats both standard pretraining and randomly chosen reinforcement (Can reinforcement learning improve models during general pretraining?). So "not-yet-mastered" is operationalized as a selection target: the model's own uncertainty becomes the filter.
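One way to picture "uncertainty as the filter" is to rank passages by how surprised the model is by them. The sketch below is only an illustration under assumptions: the note does not say how PretrainZero scores passages, so mean surprisal is a stand-in, and the names `mean_surprisal` and `select_unmastered` and the per-token probabilities are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def mean_surprisal(token_probs: Sequence[float]) -> float:
    """Average negative log-probability (in nats) assigned to the true tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def select_unmastered(passages: Dict[str, Sequence[float]], top_k: int) -> List[str]:
    """Return the ids of the top_k passages with the highest mean surprisal.

    `passages` maps a passage id to the probabilities the model assigned to
    each of its true tokens. Both the scoring rule and the top-k cut are
    assumptions made for illustration, not the method from the cited work.
    """
    ranked = sorted(passages, key=lambda pid: mean_surprisal(passages[pid]), reverse=True)
    return ranked[:top_k]


# Hypothetical per-token probabilities for three passages.
example = {
    "well_known": [0.9, 0.95],
    "poorly_predicted": [0.2, 0.1],
    "in_between": [0.5, 0.5],
}
chosen = select_unmastered(example, top_k=2)
```

Here `chosen` keeps the two passages the model predicts worst and drops the one it already predicts well, which is the selection behavior the paragraph describes.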

The complementary idea is how to define "informative" without a human grader or a verifier. RLP reframes chain-of-thought as an *exploratory action* taken during pretraining and rewards it by how much it improves the model's prediction of the next text: log-likelihood improvement serves as a verifier-free, information-gain reward (Can chain-of-thought reasoning be learned during pretraining itself?). That's the crisp operational definition the question is reaching for: content is informative exactly when conditioning on it (or on a reasoning step) raises the probability the model assigns to what comes next. Information gain and not-yet-mastered are two sides of one coin: a passage only carries gain if the model couldn't already predict it.
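The reward described above can be written as a difference of two log-likelihoods: the one the model assigns to the upcoming text with the reasoning step in context, minus the one it assigns without it. The following is a minimal sketch under assumptions, not RLP's actual implementation; the function names are hypothetical, and any baseline or normalization the real method may use is left out.

```python
import math
from typing import Sequence


def sequence_log_likelihood(token_probs: Sequence[float]) -> float:
    """Sum of log-probabilities over the tokens of the upcoming text."""
    return sum(math.log(p) for p in token_probs)


def information_gain_reward(
    probs_with_thought: Sequence[float],
    probs_without_thought: Sequence[float],
) -> float:
    """Log-likelihood improvement from conditioning on a reasoning step.

    Positive when the reasoning step makes the next text more predictable,
    zero when it changes nothing, negative when it hurts.
    """
    if len(probs_with_thought) != len(probs_without_thought):
        raise ValueError("both inputs must score the same target tokens")
    return sequence_log_likelihood(probs_with_thought) - sequence_log_likelihood(
        probs_without_thought
    )


# Hypothetical probabilities for the same two target tokens.
reward = information_gain_reward([0.5, 0.5], [0.25, 0.25])
```

Because the reward is computed entirely from the model's own predictive probabilities, no external grader is needed. A passage the model already predicts well leaves little room for the difference to be positive, which is the "two sides of one coin" point.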

There's a sharp boundary condition lurking here, though. Work on knowledge priming finds that whether a fact actually sticks after gradient updates is predictable from its *pre-learning* keyword probability, with a threshold around 10^-3 separating contexts where learning takes hold from those where it doesn't (Can we predict keyword priming before learning happens?). The implication is double-edged: content that's too far below the model's current reach may not be learnable from a few exposures, while content already above a probability ceiling is already mastered and yields little. The sweet spot for reinforcement is the band in between: surprising enough to be informative, reachable enough to be absorbed.
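The band reading of this paragraph can be sketched as a three-way classifier over a pre-learning keyword probability. Several things here are assumptions: the source note only says the ~10^-3 threshold separates two regimes, so treating it as the lower edge of the band follows this paragraph's interpretation; the ceiling value of 0.5 is purely hypothetical; and the function name and labels are invented for illustration.

```python
def classify_by_prior_probability(
    keyword_prob: float,
    floor: float = 1e-3,   # threshold from the priming note, used as lower edge (assumption)
    ceiling: float = 0.5,  # hypothetical "already mastered" cutoff, not from the source
) -> str:
    """Label content by where its pre-learning keyword probability falls."""
    if not 0.0 <= keyword_prob <= 1.0:
        raise ValueError("keyword_prob must be a probability in [0, 1]")
    if keyword_prob < floor:
        return "out_of_reach"
    if keyword_prob > ceiling:
        return "already_mastered"
    return "in_band"


labels = [classify_by_prior_probability(p) for p in (1e-5, 0.01, 0.9)]
```

Both cutoffs are parameters so they can be swapped out if the direction or location of the threshold turns out to differ from this reading.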

This connects to a deeper debate about whether reinforcement is teaching anything new at all. Several lines argue that post-training mostly *elicits* capability already latent in the base model rather than creating it: RL steering, critique tuning, and decoding tricks all surface reasoning that was already in the activations (Do base models already contain hidden reasoning ability?). But the picture is domain-conditional: for standard reasoning, RL activates existing ability, while for complex multi-step planning it can generate genuinely novel strategies the base model can't reach even with heavy sampling (Does reinforcement learning create new reasoning abilities or activate existing ones?). "Not-yet-mastered" therefore isn't one thing: sometimes it means latent-but-dormant, sometimes truly absent.

A last twist on what's worth reinforcing: it may be the failures, not the successes. Negative reinforcement alone (suppressing incorrect trajectories) often matches or beats full RL, because positive-only reinforcement concentrates probability mass and collapses diversity (Does negative reinforcement alone outperform full reinforcement learning?). And differential trajectory processing treats successes as concrete demonstrations and failures as abstracted lessons, getting more learning per unit of context (Should successful and failed episodes be processed differently?). So the informative signal isn't only "what the model doesn't yet predict"; it's also "where the model is confidently wrong," which is the most actionable form of not-yet-mastered there is.
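A toy example can show why suppressing a wrong answer spreads probability rather than concentrating it. The sketch below applies one REINFORCE-style update to a softmax over three candidate answers, with reward -1 for an incorrect sample and 0 for a correct one. It is an illustration under assumptions, not the objective used in the cited work; the function names and the learning rate are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_only_step(
    logits: List[float], sampled: int, is_correct: bool, lr: float = 1.0
) -> List[float]:
    """One policy-gradient step that only penalizes incorrect samples.

    The gradient of log p(sampled) with respect to the logits is
    onehot(sampled) - probs. Correct samples get reward 0 (no update),
    incorrect samples get reward -1 (their log-probability is pushed down).
    """
    reward = 0.0 if is_correct else -1.0
    probs = softmax(logits)
    return [
        z + lr * reward * ((1.0 if i == sampled else 0.0) - p)
        for i, (z, p) in enumerate(zip(logits, probs))
    ]


# The model is most confident in answer 0, which turns out to be wrong.
before = [2.0, 1.0, 0.0]
after = negative_only_step(before, sampled=0, is_correct=False)
probs_before, probs_after = softmax(before), softmax(after)
```

After the step, the confidently wrong answer loses probability and both alternatives gain some. A positive-only update would instead raise one sampled answer at the expense of all the others.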


Sources (7 notes)

Can reinforcement learning improve models during general pretraining?

PretrainZero shows that RL during pretraining on Wikipedia, combined with active selection of not-yet-mastered content, outperforms standard pretraining and random reinforcement. The gain comes from *which* content is reinforced, not new data.

Can chain-of-thought reasoning be learned during pretraining itself?

RLP treats CoT as exploratory action during pretraining, using log-likelihood improvement as verifier-free reward. Applied to Qwen3-1.7B and Nemotron-Nano-12B, the method improves math and science benchmarks substantially, suggesting reasoning can be planted earlier in training.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Does reinforcement learning create new reasoning abilities or activate existing ones?

For standard reasoning tasks, RL activates latent abilities already present in base models. For complex planning requiring multi-step coordination, RL generates genuinely novel strategies inaccessible to base models even with extensive sampling.

Does negative reinforcement alone outperform full reinforcement learning?

Training with only negative samples consistently improves Pass@k across the spectrum, often matching full PPO and GRPO. Negative reinforcement suppresses incorrect trajectories while preserving diversity, whereas positive-only reinforcement degrades higher-k performance by concentrating probability mass.

Should successful and failed episodes be processed differently?

SkillRL demonstrates that treating successful episodes as concrete demonstrations and failures as abstracted lessons achieves state-of-the-art performance on complex tasks while using substantially less context than uniform approaches. The asymmetry mirrors human expert reasoning and avoids the degradation seen in uniform consolidation methods.
