Can modular expert decomposition extend beyond time into other causal dimensions?

This note explores whether the trick behind time-sliced expert models (carving a model into specialists along the time axis and routing causally between them) generalizes to slicing along other causal or structural dimensions. The starting point is TiMoE (see "Can routing mask future experts to prevent knowledge leakage?"), which trains separate experts on disjoint two-year windows and masks any expert whose window postdates the query. The key insight there isn't really about time: it's that you can turn a *causal constraint* (no peeking at the future) into an *architectural* property rather than a retrieval band-aid. Once you frame it that way, time is just one axis along which the decomposition happens to run.
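The masking idea can be sketched in a few lines. This is an illustrative toy, not TiMoE's actual implementation: the `Expert` class, the `route` function, the router scores, and the rule that an expert is masked if any part of its window lies after the query year are all assumptions made for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Expert:
    name: str
    start_year: int  # first year of the expert's training window
    end_year: int    # last year of the window (inclusive)


def causal_mask(experts, query_year):
    # Assumption: an expert "postdates" the query if its window extends
    # past the query year, so only fully-past windows are allowed.
    return [e.end_year <= query_year for e in experts]


def masked_softmax(scores, mask):
    # Softmax over allowed experts only; masked experts get exactly 0 weight.
    allowed = [s for s, m in zip(scores, mask) if m]
    if not allowed:
        raise ValueError("no expert is causally valid for this query")
    peak = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [x / total for x in exps]


def route(experts, scores, query_year):
    # Returns a mapping from expert name to routing weight.
    mask = causal_mask(experts, query_year)
    weights = masked_softmax(scores, mask)
    return {e.name: w for e, w in zip(experts, weights)}


experts = [
    Expert("e2013", 2013, 2014),
    Expert("e2015", 2015, 2016),
    Expert("e2017", 2017, 2018),
]
# The router prefers the newest expert (score 5.0), but the mask overrides it.
weights = route(experts, scores=[0.0, 0.0, 5.0], query_year=2016)
```

The point of the sketch is that the constraint is enforced by construction: however strongly the router scores a future expert, its weight is zero, so no downstream filtering is needed.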

The corpus suggests the answer is yes, and that the axis of decomposition can be functional rather than temporal. The cleanest evidence is that separating *what kind of work* a module does, rather than *when* its knowledge applies, already outperforms monolithic models. Splitting a reasoner into a decomposer that plans and a solver that executes (see "Does separating planning from execution improve reasoning accuracy?") improves accuracy. Notably, the planning skill transfers across domains while the solving skill doesn't, a sign that these are genuinely distinct causal faculties worth isolating. Forecasting gets decomposed the same way, into contextualization, macro/micro outlook, and synthesis stages (see "Can decomposing forecasting into stages unlock numerical and contextual reasoning?"), specifically so that numerical extrapolation and event-driven contextual reasoning don't have to share one set of weights.

There's also evidence that this modular structure is something networks *want* to do on their own. Pruning experiments show neural networks spontaneously implement compositional subroutines in isolated subnetworks (see "Do neural networks naturally learn modular compositional structure?"), with ablations cleanly knocking out one function at a time, and pretraining makes that modularity more reliable. So the question isn't really whether decomposition along non-temporal axes is *possible*; it's whether you route between the modules deliberately (as TiMoE does) or let them emerge. Transformer² (see "Can models dynamically activate expert skills at inference time?") sits in between: it composes task-specific expert vectors dynamically at inference, mixing skill-shaped experts on the fly rather than pre-slicing by a fixed dimension.

Where the analogy gets interesting is whether "causal dimension" means the *causal structure of the world* rather than just the causal *ordering* of information. Here the corpus offers a different and arguably deeper form of decomposition: pulling the formal causal model out of the LLM entirely. Causal Reflection (see "Can separating causal models from language models improve reasoning?") and the scientist-and-subject SCM approach (see "Can structural causal models automate social science with language models?") both relegate the language model to translation and inference while a separate symbolic structure carries the causal logic. That's modular decomposition along the most fundamental axis of all, separating *what causes what* from *how to say it*. It's motivated by the finding that LLMs reason causally better than temporally (see "Why do LLMs handle causal reasoning better than temporal reasoning?") yet still inherit human causal biases like Markov violations (see "Do large language models make the same causal reasoning mistakes as humans?").
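A minimal sketch of that split, under loud assumptions: the toy rain/sprinkler model, the `solve` and `render` functions, and the `do` argument are all hypothetical and are not taken from Causal Reflection or the SCM-based social science work. The symbolic part owns the causal logic, including interventions, and the stand-in for the language model only verbalizes the result.

```python
# Structural equations, listed in topological order (parents before children).
# Each equation reads only already-computed values.
EQUATIONS = {
    "rain": lambda v: v["cloudy"],
    "sprinkler": lambda v: not v["cloudy"],
    "wet_grass": lambda v: v["rain"] or v["sprinkler"],
}


def solve(exogenous, do=None):
    """Evaluate the causal model; `do` overrides a variable's equation."""
    values = dict(exogenous)
    do = do or {}
    for name, equation in EQUATIONS.items():
        # An intervention cuts the variable off from its usual causes.
        values[name] = do[name] if name in do else equation(values)
    return values


def render(values):
    """Stand-in for the language model: it only turns results into text."""
    state = "wet" if values["wet_grass"] else "dry"
    return f"The grass is {state}."


observed = solve({"cloudy": True})
intervened = solve({"cloudy": True}, do={"rain": False})
answer = render(intervened)
```

Here the counterfactual question "what if it had not rained?" is answered entirely by `solve`; swapping `render` for a real model would change the wording but not the causal conclusion.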

The honest caveat the corpus raises: decomposition only buys you as much as the axis you chose captures. Causal belief networks, for all their structural auditability (see "Can we extract causal belief networks from interview conversations?"), still can't represent associative, analogical, or emotion-driven reasoning (see "Can causal models alone capture how humans actually reason?"). So yes, modular expert decomposition extends well beyond time, into functional, compositional, and formally causal dimensions. But each slicing axis encodes a theory of what the important boundaries are, and a clean cut along one dimension leaves everything orthogonal to it uncaptured.

Sources (11 notes)

Can routing mask future experts to prevent knowledge leakage?

TiMoE pre-trains experts on disjoint two-year slices and masks experts whose windows postdate the query, cutting future-knowledge errors by ~15% while guaranteeing strict causal validity. This shows temporal grounding can be an architectural property, not just a retrieval patch.

Does separating planning from execution improve reasoning accuracy?

Modular architectures with separate decomposer and solver models outperform monolithic LLMs, with decomposition ability transferring across domains while solving ability does not. The separation prevents planning-execution interference and produces more generalizable skills.

Can decomposing forecasting into stages unlock numerical and contextual reasoning?

Nexus outperforms pure TSFM and LLM baselines on real-world datasets by decomposing forecasting into contextualization, dual-resolution macro/micro outlook, and synthesis stages. Separating numerical extrapolation from event-driven contextual reasoning avoids forcing one model to handle both simultaneously.

Do neural networks naturally learn modular compositional structure?

Pruning experiments reveal that neural networks implement compositional subroutines in isolated subnetworks, with ablations affecting only their corresponding function. Pretraining substantially increases the consistency and reliability of this modular structure across architectures and domains.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can separating causal models from language models improve reasoning?

Causal Reflection separates causal reasoning into a formal dynamic model with a Reflect mechanism for revision, relegating the LLM to structured inference and language rendering. This architecture sidesteps asking LLMs to perform causal reasoning directly, addressing both spurious-correlation failures and RL's explanation gap.

Can structural causal models automate social science with language models?

LLMs guided by structural causal models can propose and test causal hypotheses across negotiation, bail, interview, and auction scenarios. Simulations reveal effect directions reliably but not magnitudes, making them useful for directional social science.

Why do LLMs handle causal reasoning better than temporal reasoning?

ChatGPT excels at causal relations but struggles with temporal ordering because causal connectives are explicit and frequent in training data, while temporal order is often implicit and must be inferred contextually.

Do large language models make the same causal reasoning mistakes as humans?

LLMs show weak explaining away and Markov violations in collider networks, matching human error patterns exactly. This suggests shared mechanisms rooted in training data statistics rather than categorical reasoning inferiority.
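For reference, the normative benchmark behind "explaining away" can be computed by brute-force enumeration on a collider A → C ← B. This is a sketch of the textbook calculation, not the setup used in the study: the priors, the noisy-OR strength of 0.8, and the `prob` helper are assumptions chosen for illustration.

```python
from itertools import product

P_A = 0.3  # assumed prior probability of cause A
P_B = 0.3  # assumed prior probability of cause B


def p_c_given(a, b):
    # Noisy-OR: each present cause independently produces C with prob 0.8.
    return 1 - (1 - 0.8 * a) * (1 - 0.8 * b)


def joint(a, b, c):
    # Collider factorization: P(A) * P(B) * P(C | A, B).
    pa = P_A if a else 1 - P_A
    pb = P_B if b else 1 - P_B
    pc = p_c_given(a, b)
    return pa * pb * (pc if c else 1 - pc)


def prob(query, evidence):
    # P(query | evidence) by summing the joint over all eight worlds.
    num = den = 0.0
    for a, b, c in product((0, 1), repeat=3):
        world = {"A": a, "B": b, "C": c}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(a, b, c)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den


p_a_given_c = prob({"A": 1}, {"C": 1})            # about 0.60
p_a_given_cb = prob({"A": 1}, {"C": 1, "B": 1})   # about 0.34
p_a_given_b = prob({"A": 1}, {"B": 1})            # equals the prior, 0.3
```

Under these assumptions, learning that B is present should sharply lower belief in A once C is known (explaining away), while B alone should tell you nothing about A (the Markov condition for a collider). "Weak explaining away" means the first drop is smaller than this benchmark, and a Markov violation means the second independence fails.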

Can we extract causal belief networks from interview conversations?

A three-step pipeline—extracting causal motifs from QA, composing belief graphs, and applying do-calculus interventions—successfully models how individuals update beliefs in response to hypothetical policy changes. The approach provides structural auditability that opaque persona prompting cannot.

Can causal models alone capture how humans actually reason?

Causal belief networks excel at modeling causal reasoning but cannot represent associative links, analogical mappings, or emotion-driven belief shifts. The GenMinds framework itself acknowledges this as a tractable starting point rather than a complete theory.
