How should skill libraries coordinate with gradient-based weight optimization?

This explores whether a library of editable skill documents (text-based capabilities) and traditional weight training (gradient descent on model parameters) are rivals or collaborators — and what the corpus says about where each one belongs.

On whether skill libraries and gradient-based weight optimization should compete or divide labor, the corpus suggests the most interesting answer is that they optimize different *substrates* for different reasons. The cleanest bridge is SkillOpt (Can skill documents be optimized like neural network weights?), which shows a skill document can be improved the same way you train weights: a separate optimizer proposes edits and keeps only the ones that strictly improve a held-out validation score. The difference is that the thing being tuned is readable text, not a parameter matrix. That reframes the coordination question: skills give you a transparent, transferable layer (the optimized skills transferred between models) while weights stay fixed underneath.
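The accept-only-if-better loop described above can be sketched in a few lines. This is a minimal illustration, not SkillOpt's actual implementation: `optimize_skill`, `toy_propose`, and `toy_score` are hypothetical names, the proposer here is random rather than model-driven, and the toy score stands in for a real held-out validation metric.

```python
import random
from typing import Callable

def optimize_skill(
    skill: str,
    propose_edit: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    steps: int = 50,
    seed: int = 0,
) -> str:
    """Greedy, validation-gated search over a text document (hypothetical sketch)."""
    rng = random.Random(seed)
    best, best_score = skill, score(skill)
    for _ in range(steps):
        candidate = propose_edit(best, rng)
        candidate_score = score(candidate)
        # Keep an edit only if it strictly improves the held-out score.
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best

# Toy stand-ins (assumptions for illustration only).
REQUIRED_STEPS = ["read the task", "plan", "check the result"]
CANDIDATE_LINES = REQUIRED_STEPS + ["be verbose", "skip tests"]

def toy_propose(doc: str, rng: random.Random) -> str:
    # A real proposer would be a separate optimizer model; this one appends a random line.
    return doc + "\n" + rng.choice(CANDIDATE_LINES)

def toy_score(doc: str) -> float:
    # Stand-in for validation accuracy: reward covered steps, penalize length.
    lines = doc.split("\n")
    covered = sum(step in lines for step in REQUIRED_STEPS)
    return covered - 0.1 * len(lines)

START = "You are a helpful agent."
improved = optimize_skill(START, toy_propose, toy_score)
```

Because acceptance is strict, unhelpful lines such as "be verbose" never enter the document: they lower the score, so the gate rejects them. The result stays a readable text file that can be diffed and reviewed.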

Why not just put everything in the weights? Because the corpus is unusually candid about what gradient training quietly costs. Direct fine-tuning corrupts knowledge stored in lower layers, whereas decoding-time proxy-tuning leaves base weights untouched and preserves that knowledge better (Can decoding-time tuning preserve knowledge better than weight fine-tuning?). RL post-training amplifies a single dominant pretraining format within the first epoch and suppresses the alternatives, and which format wins depends on model scale, not necessarily on which performs best (Does RL training collapse format diversity in pretrained models?). Even when RL appears to teach new procedures, out-of-distribution tests show it mostly sharpens memorized templates rather than installing reasoning (Do fine-tuned language models actually learn optimization procedures?). So a sensible coordination rule emerges: let gradients do what only gradients can, and don't ask them to carry brittle, easily corrupted, or fast-changing knowledge that a skill document holds more safely.
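To make "leaves base weights untouched" concrete, here is a small sketch of decoding-time steering. The note above does not spell out the arithmetic, so this follows the commonly described formulation of proxy-tuning as an assumption: the logit difference between a small tuned model and its untuned counterpart is added to the large base model's logits at each decoding step. The function names and example logits are hypothetical.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def proxy_tuned_distribution(
    base_logits: List[float],
    tuned_small_logits: List[float],
    untuned_small_logits: List[float],
) -> List[float]:
    """Shift the base model's next-token logits by (tuned - untuned) from small models.

    No parameter of the base model is modified; only its output scores are adjusted.
    """
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)

# Toy three-token vocabulary: the small tuned model prefers token 1 slightly more.
probs = proxy_tuned_distribution([2.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0])
```

When the small tuned and untuned models agree, the shift is zero and the base distribution passes through unchanged, which is one way to see why knowledge stored in the base weights is not overwritten.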

The richer pattern is composition at inference rather than baking-in. Transformer² tunes only the singular values of weight matrices to produce expert vectors that mix dynamically at inference without interfering with each other (Can models dynamically activate expert skills at inference time?), a weight-space analogue of pulling the right skill off a shelf at runtime. Even more striking, swarms of model 'particles' can search weight space with no gradients at all and discover composed experts that answer questions every starting expert failed, using only 200 validation examples (Can language models discover new expertise through collaborative weight search?). Both point the same way as SkillOpt: a small validation set, not a giant gradient run, becomes the coordinator that decides which capability survives.
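The gradient-free swarm idea can be illustrated with a textbook particle swarm update over tiny weight vectors. This is a sketch under stated assumptions, not the cited method: `swarm_search` and its hyperparameters are hypothetical, the "experts" are two-dimensional toy vectors rather than model checkpoints, and `toy_validation_score` stands in for accuracy on a small validation set.

```python
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]

def swarm_search(
    experts: Sequence[Vector],
    validation_score: Callable[[Vector], float],
    steps: int = 30,
    inertia: float = 0.5,
    pull_personal: float = 1.0,
    pull_global: float = 1.0,
    seed: int = 0,
) -> Tuple[Vector, float]:
    """Particle-swarm search in weight space, guided only by a validation score."""
    rng = random.Random(seed)
    positions = [list(e) for e in experts]
    velocities = [[0.0] * len(e) for e in experts]
    personal_best = [list(p) for p in positions]
    personal_score = [validation_score(p) for p in positions]
    g = max(range(len(positions)), key=lambda i: personal_score[i])
    global_best, global_score = list(personal_best[g]), personal_score[g]

    for _ in range(steps):
        for i, pos in enumerate(positions):
            for d in range(len(pos)):
                r1, r2 = rng.random(), rng.random()
                # Move toward this particle's best and the swarm's best; no gradients used.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + pull_personal * r1 * (personal_best[i][d] - pos[d])
                    + pull_global * r2 * (global_best[d] - pos[d])
                )
                pos[d] += velocities[i][d]
            s = validation_score(pos)
            if s > personal_score[i]:
                personal_best[i], personal_score[i] = list(pos), s
            if s > global_score:
                global_best, global_score = list(pos), s
    return global_best, global_score

# Toy setup: no starting expert sits at the target, but a composition of them does.
TARGET = [1.0, 1.0]
STARTING_EXPERTS = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

def toy_validation_score(w: Vector) -> float:
    return -sum((a - b) ** 2 for a, b in zip(w, TARGET))

best_weights, best_score = swarm_search(STARTING_EXPERTS, toy_validation_score)
```

The validation score is the only signal the search ever sees, which mirrors the point in the text: a small held-out set decides which composed candidate survives.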

There's a deeper reason to keep some capability outside the weights: models are bad at exactly the kind of iterative refinement that gradient optimization assumes. LLMs can't actually execute iterative numerical methods in latent space; they pattern-match a memorized answer and emit plausible-but-wrong values (Do large language models actually perform iterative optimization?). A skill document can encode an explicit procedure (do this, then check, then repeat) that the weights will never reliably internalize on their own. This echoes how function-calling improves when split into seven explicit subtasks instead of one umbrella objective (Can breaking function calling into subtasks improve model generalization?), a granularity that lives naturally in a skill library.

If you do still want to train, the corpus offers guardrails for the gradient side so the two layers don't fight. Training order matters: training structured tasks first protects open-ended creativity from entropy collapse (Does training order reshape how models handle different task types?). Binary correctness rewards quietly wreck calibration unless you add a proper scoring term such as the Brier score (Does binary reward training hurt model calibration?). And you can derive dense training signal without human labels by letting tree search rank solution paths (Can tree search replace human feedback in LLM training?). The thread tying all of this together: the coordinator in every case is a cheap, inspectable validation or search signal, and the question isn't 'skills *or* weights' but 'which layer should each capability live in so it stays transparent, transferable, and uncorrupted.'
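The calibration guardrail is easy to see in code. The sketch below combines a binary correctness reward with a Brier-score penalty on the model's stated confidence. The equal weighting of the two terms and the function name `calibrated_reward` are assumptions for illustration; the note does not specify the exact form used in the cited work.

```python
def calibrated_reward(correct: bool, confidence: float) -> float:
    """Binary correctness reward minus a Brier penalty on stated confidence.

    confidence is the model's self-reported probability of being correct, in [0, 1].
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    y = 1.0 if correct else 0.0
    brier_penalty = (confidence - y) ** 2  # 0 when confidence matches the outcome
    return y - brier_penalty

# A confident wrong answer is now punished harder than a hesitant wrong answer.
confident_wrong = calibrated_reward(False, 1.0)
hesitant_wrong = calibrated_reward(False, 0.25)
confident_right = calibrated_reward(True, 1.0)
```

Under a purely binary reward, both wrong answers above would score the same, so guessing with high confidence costs nothing. The added term removes that incentive.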

Sources (11 notes)

Can skill documents be optimized like neural network weights?

SkillOpt demonstrates that skill documents can be systematically improved through a separate optimizer that proposes edits, accepting only changes that strictly improve held-out validation scores. This approach outperforms baselines across 52 experimental cells and produces skills that transfer between models.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Do fine-tuned language models actually learn optimization procedures?

Even GRPO-trained models show sharp performance drops on out-of-distribution variants (N-1 test sets) compared to in-distribution problems, indicating RL optimizes template-matching rather than genuine problem-solving procedures.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can language models discover new expertise through collaborative weight search?

PSO-inspired swarms of LLM particles moving through weight space discover composed experts with new capabilities—including answering questions all initial experts failed on—using only 200 validation examples and no gradient-based training.

Do large language models actually perform iterative optimization?

Research shows LLMs cannot perform iterative procedures in latent space. They recognize optimization problems as template-similar and emit plausible-looking but incorrect values, a failure mode that persists across model scale and training approaches.

Can breaking function calling into subtasks improve model generalization?

Granite-20B-FunctionCalling shows that explicit training across seven granular subtasks—nested calls, chaining, parallel functions, name detection, parameter detection, next-best function, and response generation—generalizes better than umbrella datasets like ToolLLM. This multi-task approach closes the performance gap with GPT, Claude, and Gemini.

Does training order reshape how models handle different task types?

Omni-Thinker shows structured domains decrease output entropy while creative domains increase it. BWT-guided scheduling—training structured tasks first—yields 6.2% gains over joint training by preventing entropy collapse from damaging open-ended capabilities.

Does binary reward training hurt model calibration?

Binary correctness rewards incentivize high-confidence guessing because they don't penalize confident wrong answers. Adding the Brier score as a second reward term mathematically guarantees joint optimization of accuracy and calibration without trade-off.

Can tree search replace human feedback in LLM training?

AlphaLLM uses tree search outcomes and three critic models to derive dense reward signals equivalent to human-labeled feedback. Tree structure naturally ranks solution paths by success, replacing the annotation oracle that standard RLHF requires.
