INQUIRING LINE

Can we predict which tasks will decompose into modular subnetworks?

This explores whether we can know in advance which tasks a neural network will internally split into clean, separable modules — and what 'modular' even means once you look past the accuracy score.


This reads the question two ways at once: whether networks *naturally* form modular subnetworks for certain tasks, and whether that modularity is something we can predict rather than discover after the fact. The corpus has a sharp anchor for the first part. Pruning and ablation experiments show that networks really do carve compositional tasks into isolated subnetworks: knock out one subnetwork and only its corresponding function breaks, leaving the rest intact (Do neural networks naturally learn modular compositional structure?). The strongest predictor isn't the task in isolation but the training history: pretraining sharply increases how consistently and reliably this modular structure forms across architectures and domains. So a partial answer to 'can we predict?' is yes: compositional structure plus pretraining is the recipe that makes modularity likely.
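To make the ablation logic concrete, here is a minimal Python sketch. It is a hand-built toy, not a trained network and not the procedure used in the pruning experiments: the names `forward` and `mean_errors`, the four hidden units, and the two target functions (`2 * x` and `-y`) are all assumptions chosen for illustration. It only shows what a clean modular result would look like, where masking one group of units breaks the function that group implements and nothing else.

```python
# Toy illustration only: a hand-wired "network" whose modularity is built in,
# so the ablation test has a known right answer. Nothing here is trained.

def forward(x, y, mask):
    """Compute two outputs from four hidden units; mask[i] = 0 ablates unit i."""
    # Units 0-1 read only x; units 2-3 read only y.
    hidden = [x, x, y, y]
    hidden = [h * m for h, m in zip(hidden, mask)]
    out_a = hidden[0] + hidden[1]                # function A: 2 * x
    out_b = -0.5 * hidden[2] - 0.5 * hidden[3]   # function B: -y
    return out_a, out_b

def mean_errors(mask, inputs):
    """Mean absolute error of each output against its target function."""
    err_a = sum(abs(forward(x, y, mask)[0] - 2 * x) for x, y in inputs) / len(inputs)
    err_b = sum(abs(forward(x, y, mask)[1] - (-y)) for x, y in inputs) / len(inputs)
    return err_a, err_b

inputs = [(1.0, 2.0), (3.0, -1.0), (-2.0, 4.0)]
intact = mean_errors([1, 1, 1, 1], inputs)
ablate_a = mean_errors([0, 0, 1, 1], inputs)   # knock out the x-subnetwork
ablate_b = mean_errors([1, 1, 0, 0], inputs)   # knock out the y-subnetwork

print("intact:  ", intact)
print("ablate A:", ablate_a)
print("ablate B:", ablate_b)
```

In a real network the informative case is the opposite outcome: if ablating a candidate subnetwork degrades several functions at once, the structure is entangled rather than modular.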

But the corpus also plants a warning flag right next to that hope. A model can hold every linearly-decodable feature it needs for a task while its internal organization is fundamentally fractured, and standard accuracy metrics are blind to the difference (Can models be smart without organized internal structure?). That means 'this task decomposed cleanly' and 'this model scores well' are *not* the same claim. The same skepticism shows up elsewhere: instruction tuning can look like task understanding while really just teaching output format (Does instruction tuning teach task understanding or output format?), and chain-of-thought can imitate the *form* of decomposed reasoning while the underlying logic falls apart off-distribution (Does chain-of-thought reasoning actually generalize beyond training data?). So predicting genuine modular decomposition means predicting internal structure, not surface behavior, and the surface lies.

Here's the turn you might not expect: much of the collection sidesteps the prediction problem by *imposing* decomposition instead of waiting for it to emerge. If you can't reliably forecast which tasks split themselves, you architect the split. Separating a 'decomposer' model from a 'solver' model beats a monolithic LLM, and intriguingly the decomposition skill transfers across domains while the solving skill does not (Does separating planning from execution improve reasoning accuracy?). LLM Programs wrap models in explicit algorithms that feed each call only its step-relevant context, turning a tangled task into debuggable sub-tasks (Can algorithms control LLM reasoning better than LLMs alone?). At the extreme, breaking a task into minimal voted micro-steps lets even small non-reasoning models run a million steps error-free (Can extreme task decomposition enable reliable execution at million-step scale?).
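A little arithmetic shows why voting on tiny steps can matter so much at that scale. The sketch below is an assumption-laden simplification, not the voting scheme of the million-step work itself: it uses plain majority voting over `k` independent samples, treats every wrong answer as agreeing with every other wrong answer (which is pessimistic), and ignores the correlated errors that the source note says must be flagged separately. The names `majority_error` and `run_success` and the 1% per-step error rate are hypothetical.

```python
from math import comb

def majority_error(p, k):
    """Probability that more than half of k independent votes are wrong,
    when each vote is wrong with probability p (k should be odd)."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

def run_success(step_error, n_steps):
    """Probability that every one of n_steps independent steps is correct."""
    return (1 - step_error) ** n_steps

p = 0.01            # assumed error rate of a single model call on one micro-step
n = 1_000_000       # number of micro-steps in the task

print("no voting:     ", run_success(p, n))
print("11-way voting: ", run_success(majority_error(p, 11), n))
```

Under these assumptions a 1% per-step error makes an unvoted million-step run essentially hopeless, while an 11-way majority vote pushes the per-step error low enough that the whole run usually succeeds. The independence assumption is doing the heavy lifting, which is why correlated errors need their own handling.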

That externally-imposed work gives a back-door answer to the prediction question. Function calling, for instance, was found to cleanly factor into seven recurring subtasks (nested calls, chaining, parallel functions, and so on), and training each explicitly generalized better than one umbrella dataset (Can breaking function calling into subtasks improve model generalization?). Agents can even *discover* their own reusable sub-task routines from past experience and compound them hierarchically, with the payoff growing as the gap between training and test widens (Can agents learn reusable sub-task routines from past experience?). The pattern across these: tasks with recurring, reusable substructure are the ones that decompose, and that recurrence is something you can often spot or mine, not just hope for.
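As a rough picture of what 'mining' recurrence could mean, the sketch below counts short action sequences that show up in more than one task trace. This is a deliberately naive stand-in, not the induction method used by the agent work cited above: the function `recurring_routines`, the fixed subsequence length, and the example traces with their action names are all hypothetical.

```python
from collections import Counter

def recurring_routines(traces, length=2, min_traces=2):
    """Return contiguous action subsequences of the given length that
    appear in at least min_traces different traces."""
    counts = Counter()
    for trace in traces:
        # Use a set so each subsequence is counted once per trace.
        seen = {tuple(trace[i:i + length])
                for i in range(len(trace) - length + 1)}
        counts.update(seen)
    return {seq: n for seq, n in counts.items() if n >= min_traces}

# Hypothetical traces of agent actions on three different web tasks.
traces = [
    ["open_site", "login", "search", "add_to_cart", "checkout"],
    ["open_site", "login", "open_orders", "download_invoice"],
    ["open_site", "search", "add_to_cart", "logout"],
]

routines = recurring_routines(traces)
for seq, n in sorted(routines.items()):
    print(seq, "appears in", n, "traces")
```

Here the pairs `open_site, login` and `search, add_to_cart` each recur in two traces, which marks them as candidate reusable routines. A task whose traces yield no such repeats is a weaker candidate for this kind of decomposition.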

The thing you may not have known you wanted to know: 'decomposability' isn't really a property of a task sitting in isolation. It's a three-way bet between the task's compositional structure, the training regime (pretraining makes natural modules reliable), and whether you're measuring real internal organization or just letting a confident accuracy number fool you. The honest frontier in the corpus is that we can *engineer* clean decomposition far more reliably than we can *predict* spontaneous decomposition — and telling the two apart requires looking inside the network, because the output won't tell you.


Sources (9 notes)

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (reported figures: 43% vs. a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Does chain-of-thought reasoning actually generalize beyond training data?

DataAlchemy experiments show CoT fails systematically under distributional shifts in task, length, and format. Models produce fluent but logically inconsistent reasoning — imitating reasoning form without valid underlying logic.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can algorithms control LLM reasoning better than LLMs alone?

LLM Programs embed LLMs within explicit algorithms that manage control flow and state, presenting only step-specific context to each LLM call. This information hiding addresses capability and context window limits while treating complex reasoning as modular, debuggable sub-tasks.

Can extreme task decomposition enable reliable execution at million-step scale?

MAKER solves million-step tasks with zero errors by decomposing into minimal subtasks, applying voting at each step, and flagging correlated errors. Surprisingly, small non-reasoning models suffice when decomposition is extreme enough, inverting the standard approach to hard problems.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces 24.6% relative gain on Mind2Web and 51.1% on WebArena, with larger gains as train-test gaps widen.

Next inquiring lines