Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining

Paper · arXiv 2504.07912 · Published April 10, 2025

Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models for advanced mathematical reasoning and coding. Following the success of frontier reasoning models, recent work has demonstrated that RL fine-tuning consistently improves performance, even in smaller-scale models; however, the underlying mechanisms driving these improvements are not well-understood. Understanding the effects of RL fine-tuning requires disentangling its interaction with pretraining data composition, hyperparameters, and model scale, but such problems are exacerbated by the lack of transparency regarding the training data used in many existing models. In this work, we present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch on different mixtures of fully open datasets. We investigate the effects of various RL fine-tuning algorithms (PPO, GRPO, and Expert Iteration) across models of different scales. Our study reveals that RL algorithms consistently converge towards a dominant output distribution, amplifying patterns in the pretraining data. We also find that models of different scales trained on the same data mixture will converge to distinct output distributions, suggesting that there are scale-dependent biases in model generalization. Moreover, we find that RL post-training on simpler questions can lead to performance gains on harder ones, indicating that certain reasoning capabilities generalize across tasks. Our findings show that small-scale proxies in controlled settings can elicit interesting insights regarding the role of RL in shaping language model behavior.

While RL post-training has demonstrated empirical success, the underlying mechanisms driving these improvements are being actively studied. Several hypotheses have been proposed to explain the effectiveness of RL, including its potential to encourage longer chains of thought (Wei et al., 2022; Yeo et al., 2025), facilitate backtracking behaviors (Guo et al., 2025), generalize to unseen task variants (Chu et al., 2025), and improve overall reasoning accuracy. However, a limitation of these studies is their lack of control over the pretraining data—an increasingly recognized factor in providing the proper model initialization needed for effective fine-tuning (Abdin et al., 2024; Allal et al., 2025; Petty et al., 2024; Penedo et al., 2024). This gap is especially salient given that most existing reproductions and analyses begin from base models whose pretraining datasets are either proprietary or insufficiently documented.

In this work, we seek to clarify the relationship between pretraining data and RL-based post-training. Specifically, we ask the following: how does the composition of pretraining data affect the efficacy of RL fine-tuning? And how does this interaction depend on the choice of RL algorithm, the choice of hyperparameters, and model scale? To answer these questions, we construct a controlled experimental setting that allows us to systematically examine these factors, providing a clearer picture of how pretraining and RL jointly shape model behavior.

To isolate the effects of RL fine-tuning, we pretrain language models from scratch on curated mixtures of open-source datasets, including both document-style corpora and synthetic instruction datasets with diverse characteristics. This setup gives us full control over what the model is exposed to during pretraining and allows us to track the influence of specific instruction datasets. We then fine-tune these models using reinforcement learning on mathematical question-answering tasks. This controlled setting enables us to monitor both quantitative and qualitative shifts in the model’s generations across different stages of training, offering a clearer view into the mechanisms by which RL fine-tuning interacts with pretraining data.
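As a concrete illustration of this setup, the snippet below is a minimal sketch of drawing a pretraining mixture from several open datasets with fixed sampling weights, tagging each document with its source so that generations can later be traced back to a source format. The dataset names, weights, and function names here are illustrative placeholders, not the paper's actual mixture or released code.

```python
# Minimal sketch (not the authors' released code) of assembling a pretraining
# mixture from several fully open datasets with fixed sampling weights.
import random

def sample_mixture(sources: dict[str, list[str]], weights: dict[str, float],
                   num_docs: int, seed: int = 0) -> list[dict]:
    """Draw num_docs documents, picking a source dataset for each draw in
    proportion to its weight, and record the source for later tracking."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    docs = []
    for _ in range(num_docs):
        name = rng.choices(names, weights=probs, k=1)[0]
        docs.append({"source": name, "text": rng.choice(sources[name])})
    return docs

# Illustrative usage: a document-style corpus mixed with two instruction-style
# math QA sets with different answer formats.
mixture = sample_mixture(
    sources={"web_docs": ["<doc>"], "gsm_style_qa": ["<qa>"], "code_style_qa": ["<qa>"]},
    weights={"web_docs": 0.8, "gsm_style_qa": 0.1, "code_style_qa": 0.1},
    num_docs=1_000,
)
```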

Our main contributions are as follows:

• We conduct a principled investigation of RL fine-tuning starting from models of various scales that we have pretrained from scratch on mixtures of fully open datasets (Section 2).

• We find that RL fine-tuning consistently drives models to converge on generating outputs in the format of a single pretraining distribution (Section 3.1), often yielding improved pass@1 accuracy (see the estimator sketch after this list) but reduced diversity. Despite occasional failure cases (Section 3.2), the preferred distribution is typically the most performant one, as measured by the base model's accuracy restricted to that distribution. Qualitative properties within the preferred distribution are also further refined during RL fine-tuning (Section 3.3).

• The preferred distribution reveals a scale-dependent bias: smaller models favor simpler, code-like formats, while larger models shift toward natural language outputs (Section 3.4).

• We provide evidence of positive transfer from RL fine-tuning, showing that models improve on evaluation datasets not seen during post-training (Section 4).
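For reference, the pass@1 numbers above can be read as the output of the standard unbiased pass@k estimator with k = 1; the sketch below is an illustration of that metric under this assumption, not code taken from the paper. With k = 1 it reduces to the fraction of sampled completions that are correct.

```python
# Illustrative sketch of the standard unbiased pass@k estimator (assumed
# metric definition; not code from this paper). n = samples per question,
# c = number of correct samples, k = budget. For k = 1 this is simply c / n.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 sampled solutions per question, 6 of them correct.
print(pass_at_k(n=16, c=6, k=1))  # 0.375
```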

We begin by highlighting a striking pattern consistently observed during RL fine-tuning across all pretraining data mixtures: the model rapidly converges to producing outputs that follow the format of a single data distribution seen during pretraining, suppressing the others.

The model quickly shifts toward generating answers in the format of one distribution—TinyGSM in this case—within the first epoch (note the log-scaled x-axis). This transition coincides with the largest gain in overall pass@1 accuracy.
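One simple way to make such a shift measurable is a heuristic format classifier applied to sampled generations at each RL step. The sketch below is our own illustration, not the paper's classifier; the function names and the regex are assumptions, treating TinyGSM-style answers as short Python programs and everything else as natural-language chain of thought.

```python
# Heuristic sketch (illustrative, not from the paper) for tracking the share
# of generations that follow a code-style format such as TinyGSM (Python
# function solutions) versus natural-language reasoning, over RL training.
import re

def looks_like_tinygsm(generation: str) -> bool:
    """Heuristic: TinyGSM-style answers define a Python function and return a value."""
    return bool(re.search(r"def\s+\w+\s*\(", generation)) and "return" in generation

def code_format_share(generations: list[str]) -> float:
    """Fraction of sampled generations classified as code-style."""
    if not generations:
        return 0.0
    return sum(looks_like_tinygsm(g) for g in generations) / len(generations)
```

Logged at each RL step, a rapid rise of this fraction toward 1.0 would correspond to the collapse onto the TinyGSM format described above.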

3.2 RL doesn’t always favor the most performant, nor the most common distribution

In the previous section, we observed that RL fine-tuning amplifies generations coming from one distribution while down-weighting the others. This raises a natural question: does the model consistently favor the distribution that yields the best performance, or the distribution with the highest proportion of generations at initialization? We find that the answer is nuanced and can depend on the pretraining data mixture.

• RL fine-tuning amplifies a specific mode from the pretraining mixture while collapsing the others.

• The mode that gets amplified depends on the scale of the model, and the degree of amplification depends on the hyperparameters, namely the coefficient of the KL penalty (see the sketch after this list).

• RL post-training on simpler datasets such as GSM8K gives a performance boost on harder mathematical datasets such as MATH, and to a lesser extent on AIME.

• Small-scale proxies can offer valuable insights into the scientific aspects of RL fine-tuning in LLMs.
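To make the role of the KL coefficient concrete, the snippet below is a minimal sketch of standard KL-regularized reward shaping as used in PPO/GRPO-style fine-tuning; this is an assumption about the objective, and the function and argument names are illustrative. A larger coefficient beta keeps the policy closer to the pretrained reference model and therefore weakens the collapse onto a single mode.

```python
# Minimal sketch of KL-regularized reward shaping (assumed PPO/GRPO-style
# objective; not code from this paper). beta is the KL penalty coefficient:
# larger values keep the policy close to the pretrained reference model.
import torch

def kl_shaped_reward(task_reward: torch.Tensor,      # [batch], e.g. 1.0 if the answer is correct
                     logprobs_policy: torch.Tensor,  # [batch, seq_len] under the current policy
                     logprobs_ref: torch.Tensor,     # [batch, seq_len] under the reference model
                     beta: float = 0.05) -> torch.Tensor:
    # Monte Carlo estimate of the sequence-level KL from the sampled tokens.
    kl_estimate = (logprobs_policy - logprobs_ref).sum(dim=-1)
    return task_reward - beta * kl_estimate
```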

Our work opens up several exciting research directions towards understanding RL post-training and extracting more performance from these models. One natural question is how our results extend to more complex data mixtures, for example ones that include multilingual data. Moreover, is there a notion of an optimal pretraining mixture that would lead to the best downstream reasoning performance, and how does this mixture differ across model scales?

Crucially, we believe that one major confounder in the existing literature is the reliance on already-pretrained base models. While several reasoning models are openly available, their pretraining datasets are not public, even though pretraining data is a critical determinant of base-model performance on reasoning tasks (Yang et al., 2024; Grattafiori et al., 2024). Naturally, this discrepancy gets amplified in downstream fine-tuning and evaluation, leading to spurious conclusions about the abilities and behaviors of these models. We believe that studying LLM fine-tuning in controlled settings, starting from scratch, is a necessary and underexplored avenue for research, amenable to exploration in academic settings using the small-scale proxies introduced in this manuscript.