What makes two timescales better than one for minimizing weight movement?

This reads 'two timescales' as splitting a system into a slow-changing substrate and a fast, cheap adjustment layer, and 'minimizing weight movement' in both its senses — physically shuttling weights through memory, and disturbing learned weights during training — and asks why that split tends to win.

This explores why separating a slow component from a fast one beats a single all-purpose mechanism when the goal is to move weights as little as possible. The corpus contains no paper specifically on two-timescale optimization, but a consistent pattern runs through it: leave the expensive, slow-changing weights where they are, and do the real work in a cheaper, faster layer on top. The clearest hardware version is on-device inference, where the bottleneck is not computation but hauling weights across memory. MobileLLM shows that computing the same transformer block twice costs less latency than fetching a second block's weights, so sharing weights between adjacent blocks gains accuracy with zero extra parameters (source: Does recomputing weights cost less than moving them on mobile?). The slow substrate (the stored weights) stays put; the fast loop (recomputation) absorbs the work.
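The trade described above can be sketched with a toy model. This is not MobileLLM's code: `block_forward`, `run_layers`, the tiny matrices, and the assumption that a block's weights are fetched once and stay resident while the block repeats are all hypothetical, chosen only to show how sharing halves the fetch count at the same depth.

```python
# Illustrative only: a toy "block" and a fetch counter, not MobileLLM's code.

def block_forward(x, weights):
    """Stand-in for a transformer block: matrix-vector product plus residual."""
    projected = [sum(w * v for w, v in zip(row, x)) for row in weights]
    return [xi + pi for xi, pi in zip(x, projected)]


def run_layers(x, stored_blocks, repeats):
    """Apply each stored block `repeats` times in a row.

    Returns the output and the number of weight fetches, assuming
    (hypothetically) that a block's weights are fetched once and stay
    resident in fast memory while the block repeats.
    """
    fetches = 0
    for weights in stored_blocks:
        fetches += 1  # one trip to slow memory per stored block
        for _ in range(repeats):
            x = block_forward(x, weights)  # reuse the resident weights
    return x, fetches


W_A = [[0.1, 0.0], [0.0, 0.1]]
W_B = [[0.0, 0.2], [0.2, 0.0]]
x0 = [1.0, 2.0]

# Two distinct blocks, each applied once: depth 2, two fetches.
y_separate, fetches_separate = run_layers(x0, [W_A, W_B], repeats=1)

# One shared block applied twice: same depth, one fetch, half the parameters.
y_shared, fetches_shared = run_layers(x0, [W_A], repeats=2)

print(fetches_separate, fetches_shared)
```

Both runs perform two block applications, so the compute is the same; only the number of trips to slow memory differs. Whether that translates into lower latency depends on the hardware being memory-bound, as the source says of mobile devices.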

The same logic reappears in tuning, where 'weight movement' means corrupting what the base model already knows. Proxy-tuning never touches the base weights at all: it applies the alignment shift at decoding time and closes 88–91% of the gap while actually beating direct fine-tuning on knowledge tasks, because direct fine-tuning damages knowledge stored in the lower layers (source: Can decoding-time tuning preserve knowledge better than weight fine-tuning?). Two timescales again: a frozen slow store of knowledge, plus a fast distributional nudge that mainly affects reasoning and style. Core-parameter isolation makes the split explicit inside the weights themselves: freeze the task-critical core regions, and geometrically merge only the non-core remainder. Tellingly, that work found that scheduling tasks over time was *not* enough on its own; you need the structural separation, not just a temporal one (source: Can isolating task-specific parameters prevent multi-task fine-tuning interference?).
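The decoding-time shift that proxy-tuning relies on can be sketched as logit arithmetic. The sketch assumes the usual formulation, in which a small tuned model and its untuned counterpart supply the shift that is added to the frozen base model's logits; the function names and the three-token vocabulary are hypothetical, and the numbers are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def proxy_tuned_distribution(base_logits, tuned_logits, untuned_logits):
    """Shift the frozen base model's logits by (tuned - untuned).

    Assumption: the shift comes from a small tuned model and its untuned
    twin. No base weights are read for writing or updated here.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_logits, untuned_logits)
    ]
    return softmax(shifted)


# Made-up next-token logits over a three-token vocabulary.
base = [2.0, 1.0, 0.0]
tuned = [0.0, 1.5, 0.0]
untuned = [0.0, 0.0, 0.0]

probs = proxy_tuned_distribution(base, tuned, untuned)
best_token = probs.index(max(probs))
print(best_token)
```

When the tuned and untuned models agree, the shift is zero and the base distribution is returned unchanged, which is the sense in which the slow store of knowledge is left alone.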

There's a subtler reason the two-layer split helps: a single objective forced to do two jobs does both worse. Utility-weighted training is supposed to make a model both learn good features and make good decisions, but the asymmetric loss sharpens the incentive to choose well while starving the gradient signal that builds representations. As a result, training with a plain symmetric loss and *then* adjusting predictions afterward beats the fused approach on its own utility metric (source: Can utility-weighted training loss actually harm model performance?). Splitting 'learn slowly, decide fast' outperforms collapsing them into one update.
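One common way to adjust predictions after symmetric training is to move the decision threshold rather than the loss. The sketch below assumes a binary decision with calibrated probabilities and fixed error costs; the source does not say which post-hoc adjustment was used, so the threshold rule, function names, and cost values here are illustrative assumptions.

```python
def decision_threshold(cost_false_positive, cost_false_negative):
    """Probability above which acting is cheaper in expectation.

    Acting costs (1 - p) * cost_false_positive in expectation; not acting
    costs p * cost_false_negative. Acting wins when
    p > cost_false_positive / (cost_false_positive + cost_false_negative).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def decide(probabilities, cost_false_positive, cost_false_negative):
    """Apply the utility-aware threshold to probabilities from a model
    trained with an ordinary symmetric loss (hypothetical inputs)."""
    threshold = decision_threshold(cost_false_positive, cost_false_negative)
    return [p >= threshold for p in probabilities]


# Made-up predicted probabilities of the positive class.
predicted = [0.1, 0.25, 0.6]

# A missed positive is assumed to cost four times a false alarm.
asymmetric_choices = decide(predicted, 1.0, 4.0)

# Equal costs recover the familiar 0.5 cut-off.
symmetric_choices = decide(predicted, 1.0, 1.0)

print(asymmetric_choices, symmetric_choices)
```

The model's weights and its probabilities are identical in both calls; only the cheap decision layer changes with the utility, which is the 'learn slowly, decide fast' split in miniature.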

The flip side is worth knowing: separation isn't always about two mechanisms. Sometimes one quantity is rich enough to act at two levels at once. DRO reuses a single variance statistic as both a token-level weight and a query-level filter, getting 2–3× faster training from one signal doing double duty (source: Can one statistical measure serve dual purposes in RL training?). So the real principle isn't 'always add a second timescale' but 'match the number of timescales to the number of distinct jobs.' When moving the slow weights is the costly part, whether in memory or in knowledge, a fast, cheap layer that leaves them alone is what buys you the savings.

Sources (5 notes)

Does recomputing weights cost less than moving them on mobile?

MobileLLM shows that on memory-bound mobile hardware, sharing weights between adjacent transformer blocks by recomputing one block twice uses less latency than fetching separate weights, gaining accuracy with no parameter increase.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Can isolating task-specific parameters prevent multi-task fine-tuning interference?

Research shows that identifying core parameter regions per task, clustering overlapping tasks, and freezing core parameters while geometrically merging non-core parameters consistently outperforms standard multi-task fine-tuning. Temporal task scheduling alone proves insufficient without explicit structural parameter isolation.

Can utility-weighted training loss actually harm model performance?

Asymmetric loss functions correctly incentivize choosing but degrade representation learning by reducing gradient signals for substantive feature acquisition. Training with symmetric loss then adjusting predictions post-hoc outperforms direct utility-weighted training on the same utility objective.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.
