Reinforcement Learning for LLMs · LLM Reasoning and Architecture

Does RL training collapse format diversity in pretrained models?

Exploring whether RL fine-tuning systematically selects one output format from pretraining while suppressing others, and how this selection mechanism drives performance gains.

Note · 2026-02-22 · sourced from Reasoning Critiques

A study with full pretraining transparency (models pretrained from scratch on known open datasets) reveals a striking structural pattern: RL fine-tuning does not simply improve reasoning — it systematically selects for and amplifies a single format from the pretraining mixture while collapsing all others.
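
A minimal sketch of what that transparency buys, under loudly hypothetical assumptions (the format tags and mixture weights below are illustrative, not the study's actual datasets or proportions): when every pretraining document carries a known format label, the post-RL question becomes measurable.

```python
# Illustrative sketch only: hypothetical format tags and mixture
# weights, not the actual corpora or proportions from the study.
import random

# A pretraining corpus viewed as a mixture of format sub-distributions.
PRETRAIN_MIXTURE = {
    "code_like": 0.3,         # code-style reasoning traces
    "natural_language": 0.5,  # prose-style reasoning
    "structured_qa": 0.2,     # terse question/answer pairs
}

def sample_format(rng: random.Random) -> str:
    """Draw a format tag according to the mixture weights."""
    formats, weights = zip(*PRETRAIN_MIXTURE.items())
    return rng.choices(formats, weights=weights, k=1)[0]

rng = random.Random(0)
tags = [sample_format(rng) for _ in range(10_000)]
# With known tags, the post-RL question is concrete: which single
# tag does the policy converge to, and how quickly?
```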

The mechanism: early in RL training (within the first epoch), the model shifts toward generating outputs in the format of one specific distribution — code-like formats for smaller models, natural-language formats for larger models. This transition coincides with the largest accuracy gain, suggesting that the selection of a dominant format, not a gradual improvement across all formats, is what drives the gain.
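
A hedged sketch of how such a transition could be monitored. The keyword classifier here is a crude stand-in I am assuming for illustration; a real study would use provenance tags from pretraining or a trained format classifier.

```python
from collections import Counter

def classify(text: str) -> str:
    """Crude keyword heuristic standing in for a real format
    classifier; only for illustration."""
    if "def " in text or "print(" in text:
        return "code_like"
    return "natural_language"

def format_shares(outputs: list[str]) -> dict[str, float]:
    """Fraction of sampled outputs falling into each format bucket."""
    counts = Counter(classify(o) for o in outputs)
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()}

# During RL, log format_shares(policy_samples) every few steps: the
# finding predicts one bucket's share jumps toward 1.0 within the
# first epoch, coinciding with the largest accuracy gain.
```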

Key findings:

- RL selects a single format from the pretraining mixture rather than improving all formats uniformly.
- The shift happens early, within the first epoch of RL training.
- Smaller models converge on code-like formats; larger models converge on natural-language formats.
- The format transition coincides with the largest accuracy gain.

This is distinct from "Does policy entropy collapse limit reasoning performance in RL?" in an important way. Entropy collapse describes diversity reduction within a single output distribution. This echo-chamber finding describes distribution selection: RL picks one pretraining distribution and amplifies it at the expense of all others. It is a format-level convergence, not just a diversity-level collapse.
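
A small numerical sketch of the distinction, with made-up probabilities: within-format entropy and across-format mixture entropy are separate quantities, and the echo-chamber finding concerns the second.

```python
import math

def entropy(p: list[float]) -> float:
    """Shannon entropy in nats; skips zero entries."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Diversity *within* one format's output distribution (entropy collapse):
within_format_before = entropy([0.25, 0.25, 0.25, 0.25])   # ~1.39 nats
within_format_after  = entropy([0.97, 0.01, 0.01, 0.01])   # ~0.17 nats

# Share of outputs *across* formats (distribution selection):
across_formats_before = entropy([0.3, 0.5, 0.2])    # mixed formats
across_formats_after  = entropy([0.98, 0.01, 0.01]) # one format dominates

# The two can move independently: a model can keep high within-format
# entropy while the across-format mixture collapses to a single mode,
# which is the format-level convergence described above.
```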

The implication for practitioners: RL fine-tuning results depend on what the pretraining data mixture looks like, but this dependence is largely hidden when starting from existing pretrained models whose training data is proprietary. The performance gains attributed to RL algorithms may partially reflect which pretraining distribution was selected, not algorithmic superiority.


Source: Reasoning Critiques

Original note title: "rl post-training converges on a single dominant pretraining distribution format, suppressing all others"