Reinforcement Learning for LLMs

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Note · 2026-02-22 · sourced from Reinforcement Learning

Most RL research uses shallow architectures (2-5 layers). Scaling network depth to 1024 layers in self-supervised RL yields 2x-50x performance improvements, and the gains arrive not gradually but in pronounced jumps at critical depth thresholds that vary by environment: depth 4 produces rudimentary policies (falling, throwing toward the target), depth 16 enables walking upright, depth 64 navigates simple mazes, and depth 256 produces entirely novel behaviors (leveraging body position to propel over walls, shifting into seated postures to worm through obstacles).

The mechanism is a synergy between exploration and expressivity. A controlled experiment separates these factors: deep and shallow "learner" networks train on data collected by a separate "collector" network. When the collector is deep (rich exploration data), the deep learner substantially outperforms the shallow one — expressivity matters. When the collector is shallow (poor exploration data), both learners perform equally poorly — exploration constrains everything. Neither factor alone explains the gains; scaling depth enhances both simultaneously.
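The decoupling logic can be illustrated with a toy regression analogy (not the paper's RL setup): let collector quality determine how widely the training inputs cover the space, and let learner "depth" determine model expressivity. All names, sizes, and functions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
target = lambda x: np.sin(3 * x)  # stand-in for the true value landscape

def collect(kind, n=200):
    # A "deep" collector explores widely; a "shallow" one stays near the origin.
    width = 2.0 if kind == "deep" else 0.2
    x = rng.uniform(-width, width, n)
    return x, target(x)

def fit_predict(kind, x, y, x_test):
    # A "deep" learner is expressive (degree-9 polynomial); "shallow" is linear.
    deg = 9 if kind == "deep" else 1
    feats = lambda z: np.vander(z, deg + 1)
    coef, *_ = np.linalg.lstsq(feats(x), y, rcond=None)
    return feats(x_test) @ coef

results = {}
x_test = np.linspace(-2, 2, 400)
for col in ("deep", "shallow"):
    for lrn in ("deep", "shallow"):
        x, y = collect(col)
        pred = fit_predict(lrn, x, y, x_test)
        results[(col, lrn)] = np.mean((pred - target(x_test)) ** 2)
        print(f"collector={col:7s} learner={lrn:7s} mse={results[(col, lrn)]:.3g}")
```

Only the wide-coverage data combined with the expressive model achieves low error over the full range; no learner, however expressive, overcomes narrow coverage. This mirrors the qualitative 2x2 pattern of the collector/learner experiment, though the analogy differs in detail (here the expressive learner can even extrapolate worse from narrow data).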

The experiments are conducted in unsupervised goal-conditioned settings with no demonstrations and no rewards: the agent must explore from scratch and learn to reach commanded goals. The self-supervised contrastive RL algorithm provides the learning framework. Stabilizing training at these depths requires residual connections, layer normalization, and Swish activations.
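The stabilization recipe can be sketched in numpy as a pre-norm residual MLP block (the block structure, widths, and initialization scales here are assumptions for illustration, not the paper's exact architecture): layer normalization keeps each block's input at unit scale, Swish provides a smooth nonlinearity, and the residual connection lets the signal pass through many blocks without exploding.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def residual_block(x, w1, w2):
    # Pre-norm residual MLP block: x + W2 * swish(W1 * LayerNorm(x)).
    return x + swish(layer_norm(x) @ w1) @ w2

rng = np.random.default_rng(0)
d, width, depth = 64, 256, 128            # illustrative sizes, not the paper's
x = rng.standard_normal((8, d))           # batch of 8 feature vectors
for _ in range(depth):
    w1 = rng.standard_normal((d, width)) * np.sqrt(2.0 / d)
    w2 = rng.standard_normal((width, d)) * 0.01  # small output scale per block
    x = residual_block(x, w1, w2)
print(x.shape, bool(np.isfinite(x).all()))
```

In this sketch, the pre-norm placement means the MLP always sees a unit-scale input no matter how the residual stream grows, so activations stay bounded even through 128 stacked blocks.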

The finding challenges the conventional wisdom that RL provides too few bits of feedback to train large networks. In self-supervised RL specifically, the ratio of feedback to parameters becomes less constraining because the agent generates its own training signal. Alongside the related question of why parallel reasoning outperforms single-chain thinking, the depth-scaling result offers a complementary axis: scaling depth may be as important as scaling parallel breadth for unlocking qualitatively new capabilities.


Source: Reinforcement Learning

Original note title: network depth above critical thresholds causes qualitative behavioral jumps in self-supervised RL