Why does Branch-Train-Merge fail without learned routing between experts?

This explores why Branch-Train-Merge — training expert models separately and stitching them together — depends on a *learned router* to pick which expert handles each token, and what breaks when you skip that routing step.

The short version from the corpus: parallel expert training is the easy part; the hard part is *selection*, and a static merge with no router throws away the one thing that makes the experts worth training separately.

The most direct evidence is Branch-Train-MiX itself (see "Can asynchronous expert training beat synchronized distributed LLM training?"). It trains domain experts in parallel with no synchronization, then merges their feed-forward layers into a Mixture-of-Experts block. The crucial move is that it *also learns token-level routing* over those merged experts. The note reports that this achieves better accuracy-efficiency tradeoffs than both synchronized training and "routing-free merging." That last phrase is your answer in miniature: when you merge expert parameters without a router, you're effectively averaging or naively combining specialists into a generalist mush, and you lose the per-token ability to say "this token is code, send it to the code expert." The experts were only valuable *because* they diverged; a routing-free merge collapses that divergence back into an average.
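To make the contrast concrete, here is a toy sketch, not Branch-Train-MiX's actual code. It assumes two hypothetical "experts" that are plain 2x2 linear maps, and a hand-set router matrix standing in for the one that would be learned. Every name and number is invented for illustration; real experts are nonlinear feed-forward layers.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Apply a weight matrix (list of rows) to vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Hypothetical experts that diverged during separate training:
# one responds only to feature 0 ("code-like"), the other only to feature 1.
EXPERT_CODE = [[2.0, 0.0], [0.0, 0.0]]
EXPERT_PROSE = [[0.0, 0.0], [0.0, 3.0]]
EXPERTS = [EXPERT_CODE, EXPERT_PROSE]

# Stand-in for a learned router (assumption: hand-set, not trained).
ROUTER = [[4.0, -4.0], [-4.0, 4.0]]

def merged_average(x):
    """Routing-free merge: average the parameters once, blind to the input."""
    n = len(EXPERTS)
    avg = [[sum(e[i][j] for e in EXPERTS) / n for j in range(2)] for i in range(2)]
    return linear(avg, x)

def routed_top1(x):
    """Per-token routing: score the experts, then run only the best one."""
    gates = softmax(linear(ROUTER, x))
    best = max(range(len(EXPERTS)), key=lambda k: gates[k])
    return best, linear(EXPERTS[best], x)

code_token = [1.0, 0.0]
print(merged_average(code_token))  # the averaged weights dilute the code expert
print(routed_top1(code_token))     # the router picks expert 0 and keeps its full response
```

For the code-like token, the averaged model returns half of what the code expert alone would produce, while the routed path returns the specialist's output unchanged. That is the "collapse back into an average" in miniature.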

Why selection matters this much shows up again and again under different vocabulary. Test-time ensembling routes each query to the best-fit model by semantic cluster and beats a single frontier model: ten 7B models *with routing* outran GPT-4.1, which the note frames as "selection is a stronger lever than scaling" (see "Can routing beat building one better model?"). RAG sees the same thing: StructRAG trains a router (via DPO) to send each query to the knowledge structure that fits it, and uniform retrieval without that choice underperforms (see "Can routing queries to task-matched structures improve RAG reasoning?"). The recurring lesson is that heterogeneous specialists only pay off if something *decides who answers*. Remove the deciding step and the heterogeneity becomes noise.
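As a rough illustration of routing by semantic cluster, the sketch below assigns a query embedding to its nearest centroid and looks up which model handles that cluster. It is not the method from either note. The centroids, the 2-D embeddings, and the names `model_a` and `model_b` are all hypothetical; a real system would get embeddings from an encoder and fill the lookup table from offline evaluation.

```python
import math

# Hypothetical cluster centroids in a toy 2-D embedding space.
CENTROIDS = {"math": (1.0, 0.0), "code": (0.0, 1.0)}

# Hypothetical table: which model did best on each cluster offline.
BEST_MODEL = {"math": "model_a", "code": "model_b"}

def cosine(u, v):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def pick_model(query_embedding):
    """Route a query to the model assigned to its nearest cluster."""
    cluster = max(CENTROIDS, key=lambda name: cosine(query_embedding, CENTROIDS[name]))
    return BEST_MODEL[cluster]

print(pick_model((0.9, 0.1)))
print(pick_model((0.2, 0.8)))
```

The point of the sketch is only that the deciding step is explicit: without `pick_model`, every query would hit the same model regardless of what it asks.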

There's also a structural reason routing can't be an afterthought: routing carries information that the merged weights alone can't encode. TiMoE trains experts on disjoint time slices and uses *causal routing* to mask experts whose knowledge postdates the query. Without that routing logic the merged model would happily leak future knowledge (see "Can routing mask future experts to prevent knowledge leakage?"). And Engram's U-shaped scaling result shows the routing/allocation balance is itself the lever: get the split between lookup and computation wrong and the hybrid underperforms either pure mechanism, even at identical parameters and FLOPs (see "Can lookup memory and computation work together better than either alone?"). Routing isn't plumbing between experts; it's where the conditional computation lives.
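The masking idea can be sketched in a few lines. This is an assumption-laden toy, not TiMoE's implementation: the expert windows are invented two-year slices, the logits are made up, and "postdates the query" is interpreted here as "the expert's window ends after the query year".

```python
import math

# Hypothetical expert training windows (start_year, end_year), disjoint two-year slices.
WINDOWS = [(2013, 2014), (2015, 2016), (2017, 2018), (2019, 2020)]

def causal_gates(logits, query_year):
    """Softmax over experts, giving zero weight to any expert whose
    window ends after query_year (our assumed reading of 'postdates')."""
    allowed = [end <= query_year for (_, end) in WINDOWS]
    if not any(allowed):
        raise ValueError("no expert predates the query")
    # Masked experts get -inf so they contribute exactly 0 after exp().
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    m = max(masked)
    exps = [math.exp(v - m) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Even though the later experts have the highest raw scores,
# a 2016 query can only be served by the first two.
print(causal_gates([0.0, 0.0, 5.0, 9.0], query_year=2016))
```

Note that the mask lives entirely in the routing step. Nothing in the expert weights themselves records which years they cover, which is why a routing-free merge has no way to enforce this constraint.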

The deeper takeaway you might not have come looking for: "merge the experts" and "route to the experts" are not two implementations of the same idea. A merge is a one-time, query-blind averaging of parameters; routing is a per-input decision that preserves specialization at inference time. Branch-Train-Merge without learned routing fails because it does the cheap half of the recipe (parallel training) and skips the half that actually exploits what that training produced.

Sources 5 notes

Can asynchronous expert training beat synchronized distributed LLM training?

Branch-Train-MiX trains domain experts in parallel without synchronization overhead, merges their feed-forward parameters as MoE experts, and learns token-level routing, achieving better accuracy-efficiency tradeoffs than synchronized training or routing-free merging.

Can routing beat building one better model?

Avengers-Pro achieves 7% higher accuracy than GPT-5-medium by routing queries to optimal models per semantic cluster, or matches its performance at 27% lower cost. Ten 7B models with routing previously surpassed GPT-4.1 and 4.5, suggesting selection is a stronger lever than scaling.

Can routing queries to task-matched structures improve RAG reasoning?

StructRAG demonstrates that selecting knowledge structure type based on query demands—via DPO-trained router choosing among tables, graphs, algorithms, catalogues, and chunks—improves knowledge-intensive reasoning over standard retrieval. The approach grounds this in cognitive load and cognitive fit theory from cognitive science.

Can routing mask future experts to prevent knowledge leakage?

TiMoE pre-trains experts on disjoint two-year slices and masks experts whose windows postdate the query, cutting future-knowledge errors by ~15% while guaranteeing strict causal validity. This shows temporal grounding can be an architectural property, not just a retrieval patch.

Can lookup memory and computation work together better than either alone?

Engram combines O(1) N-gram lookup with Mixture-of-Experts routing, revealing a U-shaped scaling law where balanced allocation to both mechanisms outperforms either alone. Gains appear largest in reasoning and code rather than pure retrieval.
