What performance trade-offs emerge when composing multiple independently trained model capabilities?

This explores what you give up — and what you gain — when you stitch together capabilities that were each trained on their own, rather than training one model to do everything at once.

When you combine separately trained skills into one system, the corpus suggests composition is real and often beneficial, but it runs into three recurring taxes: interference, collapse, and hidden convergence. The optimistic baseline is that capabilities really do live in modular, recombinable pieces. Pruning experiments show that neural networks naturally split compositional tasks into isolated subnetworks, where ablating one subnetwork breaks only its matching function (see "Do neural networks naturally learn modular compositional structure?"). You can also exploit this at inference: Transformer² tunes only the singular values of weight matrices to produce 'expert vectors' that mix on the fly without stepping on each other, beating LoRA with fewer parameters (see "Can models dynamically activate expert skills at inference time?"). So composition isn't a hack; it works with the grain of how models already organize skills.
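To make the singular-value idea concrete, here is a minimal sketch, not the Transformer² implementation. It assumes a weight matrix is already available in factored form W = U diag(sigma) V^T, that an 'expert vector' z rescales the singular values elementwise, and that experts are mixed by a simple weighted sum. The names `compose`, `mix`, `EXPERTS`, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def rotation(theta):
    """2x2 rotation matrix, used here as a stand-in orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def compose(u, sigma, vt, z):
    """Rebuild W' = U diag(sigma * z) V^T for one expert vector z (assumed form)."""
    n = len(sigma)
    scaled = [[sigma[i] * z[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(matmul(u, scaled), vt)


def mix(experts, weights):
    """Weighted sum of expert vectors; the mixing rule is an assumption."""
    dim = len(next(iter(experts.values())))
    return [sum(weights[name] * experts[name][i] for name in experts) for i in range(dim)]


# Hypothetical factored weight matrix and two hypothetical expert vectors.
U = rotation(0.3)
VT = transpose(rotation(-0.7))
SIGMA = [2.0, 0.5]
EXPERTS = {"math": [1.2, 0.9], "code": [0.8, 1.5]}

z_mixed = mix(EXPERTS, {"math": 0.25, "code": 0.75})
W_mixed = compose(U, SIGMA, VT, z_mixed)
```

Because each expert touches only the singular values, an expert costs one scalar per singular value rather than a full low-rank update, and U and V stay shared across experts. That is one plausible reading of why such vectors can mix without overwriting each other.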

The first tax is interference: trained skills fight when forced to share parameters. Naively fine-tuning on multiple tasks degrades them, and the fix is structural: identify each task's 'core' parameter regions, freeze them, and merge only the non-core remainder. Notably, scheduling tasks in a clever order isn't enough on its own; you need explicit isolation (see "Can isolating task-specific parameters prevent multi-task fine-tuning interference?"). Function calling tells the cooperative version of the same story: breaking the capability into seven granular subtasks and multi-task training across them generalizes better than one umbrella dataset (see "Can breaking function calling into subtasks improve model generalization?"). The lesson across both: how you carve the boundary between capabilities determines whether they reinforce or corrupt each other.
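The freeze-core, merge-the-rest recipe can be sketched on flat parameter lists. This is an illustration under loud assumptions, not the cited method: a task's 'core' region is taken to be the k parameters that moved most from the base model, and non-core parameters are merged with a plain arithmetic mean as a stand-in for the geometric merging the source note mentions. Overlapping cores are simply averaged among the tasks that claim them. The function names and toy values are hypothetical.

```python
def core_indices(base, tuned, k):
    """Indices of the k parameters that moved most during fine-tuning (assumed criterion)."""
    deltas = [abs(t - b) for b, t in zip(base, tuned)]
    ranked = sorted(range(len(base)), key=lambda i: deltas[i], reverse=True)
    return set(ranked[:k])


def merge_with_isolation(base, tuned_by_task, k):
    """Keep each task's core parameters intact and average everything else."""
    cores = {task: core_indices(base, tuned, k) for task, tuned in tuned_by_task.items()}
    merged = []
    for i in range(len(base)):
        # Tasks that claim index i as core; if none do, every task contributes.
        owners = [task for task, core in cores.items() if i in core]
        donors = owners if owners else list(tuned_by_task)
        merged.append(sum(tuned_by_task[t][i] for t in donors) / len(donors))
    return merged, cores


# Toy example: four parameters, two hypothetical tasks, one core parameter each.
base = [0.0, 0.0, 0.0, 0.0]
tuned_by_task = {
    "A": [1.0, 0.1, 0.0, 0.2],
    "B": [0.1, 0.0, 2.0, 0.4],
}
merged, cores = merge_with_isolation(base, tuned_by_task, k=1)
```

In this toy case task A keeps parameter 0 and task B keeps parameter 2 untouched, while the two remaining parameters are averaged. The point of the sketch is the boundary: averaging everything would have diluted exactly the parameters each task depends on most.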

The second tax is collapse, and it's the one most invisible to standard metrics. Models can hold multiple complete tasks in superposition during inference, but autoregressive decoding forces a winner after the very first token, so you can't actually emit two task behaviors at once (see "Can LLMs handle multiple tasks at once during inference?"). Worse, the order in which you compose mechanically reshapes capability: structured tasks drive output entropy down while creative tasks drive it up, so training structured tasks first prevents entropy collapse from damaging open-ended skills, a reported 6.2% gain over joint training (see "Does training order reshape how models handle different task types?"). The same domain dependence shows up in preference tuning, which reduces diversity in code but increases it in creative writing (see "Does preference tuning always reduce diversity the same way?"). Composition isn't order-neutral; sequencing is a performance knob.
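For readers who want the entropy claim pinned down, 'output entropy' can be read as the Shannon entropy of the model's next-token distribution. The sketch below only computes that quantity for two hypothetical distributions; it does not reproduce the scheduling method from the cited note, and the probability values are invented for illustration.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical next-token distributions over a four-token vocabulary.
structured_like = [0.97, 0.01, 0.01, 0.01]  # one answer dominates
creative_like = [0.25, 0.25, 0.25, 0.25]    # many continuations are plausible

h_structured = shannon_entropy(structured_like)
h_creative = shannon_entropy(creative_like)
```

A peaked distribution has entropy near zero and a uniform one reaches the maximum of log2 of the vocabulary size. A training stage that pushes every distribution toward the peaked shape leaves less probability mass for the varied continuations that open-ended tasks need, which is the collapse the paragraph above describes.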

The sharpest trade-off, though, is one you'd never see on a benchmark. A model can carry every linearly decodable feature a task needs, and score perfect accuracy, while its internal organization is fractured, leaving it fragile to perturbation and distribution shift (see "Can models be smart without organized internal structure?"). So two capabilities can compose to a clean score while quietly sharing a brittle substrate. And the classic reason to compose, ensemble diversity, may be partly illusory: across 70+ models on 26K open-ended queries, independently trained models converge on near-identical outputs (an 'Artificial Hivemind') because they share training data and alignment recipes (see "Do different AI models actually produce diverse outputs?"). The less obvious takeaway: 'independently trained' often isn't independent enough to pay off, so the real frontier isn't adding more experts. It's carving boundaries (parameter regions, task order, decoding) so that composed skills stay genuinely distinct instead of either colliding or collapsing into one.
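One cheap way to check whether an ensemble is actually diverse is to measure how similar its members' answers to the same prompt are. The sketch below uses word-level Jaccard overlap as a crude stand-in; it is not the metric used in the cited study, and the function names and example strings are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-set overlap between two strings, from 0.0 (disjoint) to 1.0 (identical sets)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_similarity(outputs):
    """Average Jaccard overlap across every pair of outputs for one prompt."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical answers from three models to the same open-ended prompt.
converged = ["time is a river", "Time is a river", "time is a river"]
distinct = ["time is a river", "clocks measure decay", "memory bends duration"]

sim_converged = mean_pairwise_similarity(converged)
sim_distinct = mean_pairwise_similarity(distinct)
```

A score near 1.0 across many prompts suggests the ensemble members are adding cost without adding variety. Lexical overlap misses paraphrases, so a real audit would likely need a semantic similarity measure; this sketch only shows the shape of the check.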

Sources 9 notes

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can LLMs handle multiple tasks at once during inference?

Large language models represent multiple complete, computationally distinct tasks simultaneously during inference—a macroscopic phenomenon separate from feature-level superposition. However, autoregressive decoding forces convergence to a single task after the first token, preventing practical multi-task generation.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does preference tuning always reduce diversity the same way?

RLHF reduces lexical-syntactic diversity in code generation but increases it in creative writing. The direction depends on what each domain incentivizes: code rewards convergence toward correct solutions, while creative writing rewards stylistic distinctiveness.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Do different AI models actually produce diverse outputs?

INFINITY-CHAT analyzed 70+ models across 26K open-ended queries and found an "Artificial Hivemind" effect: models independently generate strikingly similar or identical responses due to overlapping training data and alignment procedures, undermining the diversity benefits of model ensembles.
