INQUIRING LINE

How does representational convergence differ from policy entropy collapse in iterative training?

This explores two things that both look like 'the model narrowing down' during repeated training rounds, but aren't the same: policy entropy collapse is the action distribution losing its spread of choices, while representational convergence is the model settling onto one internal format or representation among several it could have used.


These are two failure-adjacent dynamics that both look like 'the model narrowing' during iterative training, but they operate on different layers. Policy entropy collapse is about *behavior*: the distribution over what the model does. As RL training proceeds, the policy concentrates on a few reward-maximizing moves and stops exploring alternatives. The corpus pins this down with an unusually clean empirical law, R = -a·exp(H) + b, where R is performance and H is policy entropy: performance saturates as entropy approaches zero, and the corpus frames this as the primary ceiling on RL scaling for reasoning (Does policy entropy collapse limit reasoning performance in RL?). The same squeeze shows up beyond reasoning: search agents lose behavioral diversity through the same entropy-collapse mechanism, converging on narrow strategies (Does reinforcement learning squeeze exploration diversity in search agents?).
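
To make the law concrete, here is a minimal sketch that computes the Shannon entropy of a discrete action distribution and evaluates R = -a·exp(H) + b. The coefficients `A` and `B` and both toy distributions are made-up values for illustration, not fitted numbers from the corpus; the function names are hypothetical.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predicted_performance(entropy, a, b):
    """Evaluate the empirical law R = -a * exp(H) + b."""
    return -a * math.exp(entropy) + b

# Assumed, illustrative coefficients (not taken from the source).
A, B = 0.1, 0.9

# Toy action distributions: spread out early, concentrated late.
early = [0.25, 0.25, 0.25, 0.25]
late = [0.97, 0.01, 0.01, 0.01]

h_early = policy_entropy(early)
h_late = policy_entropy(late)

r_early = predicted_performance(h_early, A, B)
r_late = predicted_performance(h_late, A, B)

# At H = 0 the law gives R = b - a: the ceiling that entropy collapse runs into.
ceiling = predicted_performance(0.0, A, B)
```

Under this form, every gain is paid for in entropy, and once H nears zero there is nothing left to spend: `r_late` sits just under `ceiling` while `r_early` is well below it.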

Representational convergence is about *form*: which of several available internal styles or output formats the model commits to. Here the striking corpus result is that RL doesn't invent a new format. Within the first epoch it amplifies one format distribution already present from pretraining and suppresses the alternatives, and the winner depends on model scale, not necessarily on which format performs best (Does RL training collapse format diversity in pretrained models?). So the convergence is a selection among pre-existing representations, not a loss of exploratory probability mass. That distinction matters: entropy collapse is a continuous narrowing you can measure and counteract (Clip-Cov, KL-Cov, and GPPO all manage the rate of entropy reduction); format convergence is closer to a winner-take-all tipping point that is settled early.
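
One way to watch for format convergence separately from entropy is to track the share of sampled outputs that fall into each format at successive checkpoints. The sketch below assumes each sample has already been tagged with a format label by some classifier; the labels ("prose", "code", "latex") and the counts are invented for illustration and are not results from the corpus.

```python
from collections import Counter

def format_shares(labels):
    """Fraction of samples in each format."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()}

def dominant_format(labels):
    """Return (format, share) for the most common format."""
    shares = format_shares(labels)
    fmt = max(shares, key=shares.get)
    return fmt, shares[fmt]

# Toy labels: a mixed base model, then a checkpoint after one epoch of RL.
base_samples = ["code"] * 40 + ["prose"] * 35 + ["latex"] * 25
epoch1_samples = ["code"] * 94 + ["prose"] * 4 + ["latex"] * 2

base_winner = dominant_format(base_samples)
epoch1_winner = dominant_format(epoch1_samples)
```

A jump in the dominant share between two adjacent checkpoints is the tipping-point signature described above, as opposed to the gradual decline an entropy curve would show.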

The two also differ in what is reversible. Entropy collapse is partly a training-dynamics problem: SFT on diverse demonstrations preserves the exploration breadth that RL would otherwise squeeze out (Does reinforcement learning squeeze exploration diversity in search agents?), and keeping the policy close to its base distribution (low KL drift) preserves the model's plasticity, so it keeps learning new tasks instead of stalling when the domain shifts (Does staying close to the base model preserve learning ability?). Representational structure, by contrast, is laid down in how the network organizes itself: networks learn dense activations for familiar data and stay sparse for the unfamiliar (Is representational sparsity learned or intrinsic to neural networks?), and they decompose tasks into modular subnetworks that pretraining makes more consistent (Do neural networks naturally learn modular compositional structure?). That representational scaffolding is what gets *selected from* when a format wins: it's the substrate, not the behavioral knob.
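
"KL drift" can be read as the KL divergence from the trained policy to the base model over the same set of choices. A minimal sketch, assuming discrete distributions over a shared support; the three distributions are toy values and the function name is hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

# Toy next-action distributions over the same four choices.
base = [0.40, 0.30, 0.20, 0.10]
anchored = [0.50, 0.30, 0.15, 0.05]  # stays near the base model
drifted = [0.97, 0.01, 0.01, 0.01]   # has collapsed onto one choice

drift_anchored = kl_divergence(anchored, base)
drift_drifted = kl_divergence(drifted, base)
```

The direction KL(policy || base) is the one commonly used as an RL penalty term; whether the cited note measures drift this way is an assumption here.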

What ties them together is that iterative training has a phase structure, and the two phenomena dominate at different moments. RL training moves through a first phase where execution correctness drives learning and a second where strategic planning becomes the bottleneck. Tellingly, planning-token entropy *rises* while execution entropy stabilizes (Does RL training follow a predictable two-phase learning sequence?). So 'entropy collapse' isn't a uniform fate across the whole model; some channels narrow and stabilize (execution) while others need to stay open (planning). The takeaway: collapse and convergence aren't synonyms for the same decay. One is the policy spending its exploration budget; the other is the model committing to one of several inherited ways of representing the problem. The interventions that address the first (entropy regularizers, SFT refresh, KL anchoring) don't touch the second.
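
Because entropy is not uniform across the model's output, it helps to average it per token role rather than over the whole sequence. The sketch below assumes each generated token has already been tagged as "planning" or "execution" (the tagging method is outside its scope) and comes with its next-token distribution; the trace is toy data.

```python
import math
from collections import defaultdict

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy_by_role(tokens):
    """tokens: iterable of (role, next_token_probs) pairs."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for role, probs in tokens:
        totals[role] += token_entropy(probs)
        counts[role] += 1
    return {role: totals[role] / counts[role] for role in totals}

# Toy trace: planning tokens stay uncertain, execution tokens are near-deterministic.
trace = [
    ("planning", [0.40, 0.30, 0.30]),
    ("execution", [0.98, 0.01, 0.01]),
    ("planning", [0.50, 0.25, 0.25]),
    ("execution", [0.99, 0.005, 0.005]),
]

by_role = mean_entropy_by_role(trace)
```

Tracking the two averages separately across checkpoints would show whether a falling overall entropy hides a rising planning-token entropy.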


Sources 7 notes

Does policy entropy collapse limit reasoning performance in RL?

Empirical law R = -a·exp(H) + b shows performance saturates when policy entropy approaches zero. Interventions like Clip-Cov, KL-Cov, and GPPO preserve exploratory capacity by managing entropy reduction during training.

Does reinforcement learning squeeze exploration diversity in search agents?

RL training compresses behavioral diversity in search agents through the same entropy collapse mechanism documented in reasoning—policies converge on narrow reward-maximizing strategies. SFT on diverse demonstrations preserves exploration breadth, suggesting diversity-preservation techniques are essential for RL search scaling.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Does staying close to the base model preserve learning ability?

FST-trained models stay up to 70% closer to their base distribution than parameter-only RL, and this reduced drift preserves the model's ability to learn subsequent tasks effectively. Parameter-only approaches stall when task domains change, while low KL drift enables sustained adaptation.

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Does RL training follow a predictable two-phase learning sequence?

Across eight models, RL training consistently shows a first phase where execution correctness drives learning, followed by a second phase where strategic planning becomes the bottleneck. Planning token entropy increases while execution entropy stabilizes, and concentration of optimization on planning tokens yields significant performance gains.
