Reinforcement Learning for LLMs

Does network depth unlock qualitatively new behaviors in RL?

Can scaling neural network depth from shallow (2-5 layers) to very deep (1000 layers) produce fundamental shifts in what self-supervised RL agents can learn, rather than just incremental improvements? This matters because it challenges assumptions about feedback constraints in RL.

Note · 2026-02-22 · sourced from Reinforcement Learning

Most RL research uses shallow architectures (2-5 layers). Scaling network depth to 1024 layers in self-supervised RL yields 2x-50x performance improvements, and the gains arrive not gradually but in pronounced jumps at critical depth thresholds that vary by environment: depth 4 produces rudimentary policies (falling, throwing toward the target), depth 16 enables walking upright, depth 64 navigates simple mazes, and depth 256 produces entirely novel behaviors (leveraging body position to propel over walls, shifting into seated postures to worm through obstacles).

The mechanism is a synergy between exploration and expressivity. A controlled experiment separates these factors: deep and shallow "learner" networks train on data collected by a separate "collector" network. When the collector is deep (rich exploration data), the deep learner substantially outperforms the shallow one — expressivity matters. When the collector is shallow (poor exploration data), both learners perform equally poorly — exploration constrains everything. Neither factor alone explains the gains; scaling depth enhances both simultaneously.
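The decoupling logic can be illustrated with a toy regression analogy (not the paper's RL setup): let collector quality determine how widely the training inputs cover the space, and let learner "depth" determine model expressivity. All names, sizes, and functions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
target = lambda x: np.sin(3 * x)  # stand-in for the true value landscape

def collect(kind, n=200):
    # A "deep" collector explores widely; a "shallow" one stays near the origin.
    width = 2.0 if kind == "deep" else 0.2
    x = rng.uniform(-width, width, n)
    return x, target(x)

def fit_predict(kind, x, y, x_test):
    # A "deep" learner is expressive (degree-9 polynomial); "shallow" is linear.
    deg = 9 if kind == "deep" else 1
    feats = lambda z: np.vander(z, deg + 1)
    coef, *_ = np.linalg.lstsq(feats(x), y, rcond=None)
    return feats(x_test) @ coef

results = {}
x_test = np.linspace(-2, 2, 400)
for col in ("deep", "shallow"):
    for lrn in ("deep", "shallow"):
        x, y = collect(col)
        pred = fit_predict(lrn, x, y, x_test)
        results[(col, lrn)] = np.mean((pred - target(x_test)) ** 2)
        print(f"collector={col:7s} learner={lrn:7s} mse={results[(col, lrn)]:.3g}")
```

Only the wide-coverage data combined with the expressive model achieves low error over the full range; no learner, however expressive, overcomes narrow coverage. This mirrors the qualitative 2x2 pattern of the collector/learner experiment, though the analogy differs in detail (here the expressive learner can even extrapolate worse from narrow data).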

The experiments are conducted in unsupervised goal-conditioned settings with no demonstrations and no rewards: the agent must explore from scratch and learn to reach commanded goals. The self-supervised contrastive RL algorithm provides the learning framework. Stabilizing training at these depths requires residual connections, layer normalization, and Swish activations.
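The stabilization recipe can be sketched in numpy as a pre-norm residual MLP block (the block structure, widths, and initialization scales here are assumptions for illustration, not the paper's exact architecture): layer normalization keeps each block's input at unit scale, Swish provides a smooth nonlinearity, and the residual connection lets the signal pass through many blocks without exploding.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def swish(x):
    # Swish / SiLU activation: x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def residual_block(x, w1, w2):
    # Pre-norm residual MLP block: x + W2 * swish(W1 * LayerNorm(x)).
    return x + swish(layer_norm(x) @ w1) @ w2

rng = np.random.default_rng(0)
d, width, depth = 64, 256, 128            # illustrative sizes, not the paper's
x = rng.standard_normal((8, d))           # batch of 8 feature vectors
for _ in range(depth):
    w1 = rng.standard_normal((d, width)) * np.sqrt(2.0 / d)
    w2 = rng.standard_normal((width, d)) * 0.01  # small output scale per block
    x = residual_block(x, w1, w2)
print(x.shape, bool(np.isfinite(x).all()))
```

In this sketch, the pre-norm placement means the MLP always sees a unit-scale input no matter how the residual stream grows, so activations stay bounded even through 128 stacked blocks.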

The finding challenges the conventional wisdom that RL provides too few bits of feedback to train large networks. In self-supervised RL specifically, the ratio of feedback to parameters becomes less constraining because the agent generates its own training signal. Alongside the related question of why parallel reasoning outperforms single-chain thinking, the depth-scaling result offers a complementary axis: scaling depth may be as important as scaling parallel breadth for unlocking qualitatively new capabilities.


Source: Reinforcement Learning

Original note title: network depth above critical thresholds causes qualitative behavioral jumps in self-supervised RL