How do gradients flowing through both branches simultaneously reshape each component's role?
This explores what happens when two coupled components of a model are trained together end-to-end — whether shared gradients push each toward a specialized role or blur the line between them.
The paired components might be a planner and an executor, or overlapping sets of task-specific weights; the question is whether letting gradients flow through both at once cleanly carves out distinct roles or muddies them. The corpus has a sharp tension on exactly this. The strongest case for keeping the branches apart comes from work showing that separating a decomposer from a solver outperforms a single monolithic model: when planning and execution are trained as one undifferentiated blob, gradients from one task interfere with the other, and notably the decomposition skill transfers across domains while the solving skill does not (Does separating planning from execution improve reasoning accuracy?). That asymmetry is the key clue: joint gradients don't reshape two branches symmetrically; they tend to let one role generalize while pinning the other to the domain it was trained on.
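One way to make "gradients from one task interfere with the other" concrete is to compare the two tasks' gradients on a shared weight vector. The sketch below is illustrative only: the quadratic losses, the target vectors, and the names `planner_target` and `executor_target` are assumptions, and cosine similarity is one common diagnostic rather than the measure used in the cited work.

```python
import math

def grad_quadratic(w, target):
    """Gradient of 0.5 * ||w - target||^2 with respect to w."""
    return [wi - ti for wi, ti in zip(w, target)]

def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical shared weights that both branches read and update.
shared_w = [0.0, 0.0]

# Hypothetical optima: each task would like the shared weights somewhere else.
planner_target = [1.0, 0.0]
executor_target = [-1.0, 0.5]

g_plan = grad_quadratic(shared_w, planner_target)
g_exec = grad_quadratic(shared_w, executor_target)

# A negative value means a step that helps one task hurts the other.
conflict = cosine(g_plan, g_exec)
print(f"gradient cosine similarity: {conflict:.3f}")
```

With these made-up targets the two gradients point in largely opposite directions, which is the situation that architectural separation avoids by giving each role its own weights.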
The interference story shows up again at the parameter level. Training multiple tasks together causes their gradient updates to collide in shared weight regions; the fix is to identify each task's 'core' parameters and freeze them while only merging the non-core remainder, which beats naive joint fine-tuning (Can isolating task-specific parameters prevent multi-task fine-tuning interference?). The lesson cuts against the question's premise: gradients flowing freely through everything simultaneously degrade roles rather than refine them, unless you protect each component's territory.
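The freeze-core, merge-the-rest idea can be sketched on flat weight lists. This is a toy under stated assumptions, not the cited method: "core" is taken to mean the `k` parameters that moved most from the base weights, merging is a plain average (the source describes a geometric merge), and the protection is applied after fine-tuning rather than inside a training loop. All names and numbers are hypothetical.

```python
def core_indices(base, tuned, k):
    """Indices of the k parameters that moved most during fine-tuning."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(base)), key=lambda i: deltas[i], reverse=True)
    return set(ranked[:k])

def merge_with_protected_cores(base, tuned_by_task, k):
    """Keep each task's core values intact; average everything else."""
    cores = {name: core_indices(base, w, k) for name, w in tuned_by_task.items()}
    merged = []
    for i in range(len(base)):
        owners = [name for name, idx in cores.items() if i in idx]
        if len(owners) == 1:
            # Core of exactly one task: copy that task's value untouched.
            merged.append(tuned_by_task[owners[0]][i])
        else:
            # Non-core (or contested) position: simple average across tasks.
            values = [w[i] for w in tuned_by_task.values()]
            merged.append(sum(values) / len(values))
    return merged, cores

base = [0.0, 0.0, 0.0, 0.0]
tuned = {
    "task_a": [2.0, 0.1, 0.0, 0.2],
    "task_b": [0.1, 0.0, -3.0, 0.4],
}
merged, cores = merge_with_protected_cores(base, tuned, k=1)
print(cores)
print(merged)
```

Averaging every position instead would halve each task's largest update, which is the kind of overwrite that parameter isolation is meant to prevent.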
But there's a counter-current worth knowing about. Neural networks left to train end-to-end do spontaneously sort themselves into modular subnetworks: pruning reveals isolated subroutines where ablating one affects only its corresponding function, and pretraining makes this self-organization more reliable (Do neural networks naturally learn modular compositional structure?). So shared gradients can carve roles on their own, no architectural separation required. The catch is what those self-formed roles actually learn: transformers often reduce 'compositional reasoning' to memorized subgraph matching rather than genuine rule-following, which collapses on novel combinations (Do transformers actually learn systematic compositional reasoning?). Emergent modularity is real but can be hollow.
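The ablation test behind the modularity claim can be shown on a hand-wired toy. Nothing here is learned: the two hidden units and their assignment to outputs `A` and `B` are assumptions chosen so that the network is perfectly modular, purely to show what "ablating one affects only its corresponding function" looks like as a measurement.

```python
def forward(x, mask):
    """Toy network with two hidden units; mask[i] = 0.0 ablates unit i."""
    h_double = mask[0] * (2.0 * x)   # hypothetical subroutine feeding output A
    h_negate = mask[1] * (-1.0 * x)  # hypothetical subroutine feeding output B
    return {"A": h_double, "B": h_negate}

def ablation_effect(x, unit):
    """Absolute change in each output when one hidden unit is zeroed."""
    mask = [1.0, 1.0]
    full = forward(x, mask)
    mask[unit] = 0.0
    ablated = forward(x, mask)
    return {name: abs(full[name] - ablated[name]) for name in full}

effect_0 = ablation_effect(3.0, unit=0)
effect_1 = ablation_effect(3.0, unit=1)
print(effect_0)
print(effect_1)
```

In a trained network the same measurement would be run on subnetworks found by pruning, and a clean result shows only that the function is isolated, not that it generalizes to novel compositions.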
There's also a darker reshaping that joint gradients perform quietly. In RL post-training, gradients flowing through the whole policy don't just tune behavior: they amplify one dominant format from pretraining within the first epoch and suppress the alternatives, with the winner determined by model scale and not necessarily by performance (Does RL training collapse format diversity in pretrained models?). Relatedly, training different domains together produces opposing entropy dynamics, where structured tasks sharpen output while creative tasks need to stay loose, and the order in which you feed them changes the final balance (Does training order reshape how models handle different task types?). The component's 'role' isn't only what it computes; it's also how much diversity it retains, and unscalarized vector rewards turn out to be a way to keep a diversity axis alive instead of letting joint optimization flatten it (Can reward vectors be the hidden source of solution diversity?).
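The difference between a scalarized reward and a vector reward can be seen in a few lines. This is a minimal sketch with invented candidates and two invented reward dimensions; it only contrasts "pick the highest sum" with "keep everything on the Pareto frontier" and is not the training procedure from the cited note.

```python
def dominates(a, b):
    """True if a is at least as good as b on every axis and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Names of candidates whose reward vector no other candidate dominates."""
    return [
        name
        for name, reward in candidates.items()
        if not any(
            dominates(other, reward)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]

# Hypothetical solutions scored on two hypothetical criteria.
candidates = {
    "terse": (0.9, 0.2),
    "balanced": (0.6, 0.6),
    "creative": (0.2, 0.9),
    "weak": (0.1, 0.1),
}

# Scalarizing by summation leaves a single winner.
scalar_winner = max(candidates, key=lambda name: sum(candidates[name]))

# Keeping the vector preserves every non-dominated trade-off.
front = pareto_front(candidates)
print(scalar_winner, front)
```

Only the dominated candidate drops out of the frontier, while the scalar objective discards two competent but differently shaped solutions along with it.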
The thread tying these together: simultaneous gradients are a force that wants to collapse coupled components toward a single dominant mode — one format, one memorized shortcut, one task's parameters overwriting another's. The systems that get clean role differentiation are the ones that fight that pull, whether by architectural separation, parameter isolation, or reward structures that explicitly preserve the dimensions you don't want flattened.
Sources (7 notes)
Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.
Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.
Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.
Research shows transformers succeed on in-distribution tasks by memorizing computation subgraphs from training data, not by learning systematic rules. They fail drastically on novel compositions, with errors compounding across reasoning steps.
Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.
Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.
Vector Policy Optimization shows that rewards decomposed per test-case, criterion, or persona provide an inherent diversity structure. Training solutions to span the Pareto frontier across these dimensions produces competent diversity grounded in real task trade-offs rather than external regularizers.