Can skill documents be optimized like neural network weights?

Can natural-language skill documents be treated as trainable parameters and improved through iterative optimization with validation gating, similar to how model weights are tuned in deep learning?

Note · 2026-05-28 · sourced from Action Models

SkillOpt's move is to treat the skill document — a natural-language artifact packaging procedures, heuristics, tool policies, and failure modes — as the external state of a frozen agent, trainable with the same discipline that makes weight-space optimization reproducible. A separate optimizer model turns scored rollouts into structured add/delete/replace edits on a single document, and an edit is accepted only when it strictly improves a held-out validation score. The deep-learning analogy is operational: rollout batch size controls gradient noise, a textual learning rate controls step size, the held-out gate is validation, and an epoch-wise slow/meta update acts as momentum.
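The loop described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not SkillOpt's implementation: the names `Edit`, `apply_edit`, `optimize_skill`, `propose`, `validate`, and `max_edits_per_step` are hypothetical, the skill is modeled as a plain list of lines, the optimizer model is replaced by a stand-in callable, and the "textual learning rate" is approximated by a cap on how many edits may be applied per step.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """One structured edit on the skill document (hypothetical schema)."""
    op: str                      # "add", "delete", or "replace"
    index: int                   # line position the edit targets
    text: Optional[str] = None   # new line content for add/replace


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new skill with one edit applied; the input is not mutated."""
    lines = list(skill)
    if edit.op == "add":
        lines.insert(edit.index, edit.text or "")
    elif edit.op == "delete":
        del lines[edit.index]
    elif edit.op == "replace":
        lines[edit.index] = edit.text or ""
    else:
        raise ValueError(f"unknown edit op: {edit.op!r}")
    return lines


def optimize_skill(
    skill: List[str],
    propose: Callable[[List[str], int], List[Edit]],  # stand-in for the optimizer model
    validate: Callable[[List[str]], float],           # stand-in for held-out scoring
    steps: int,
    max_edits_per_step: int,                          # assumed proxy for the textual learning rate
) -> Tuple[List[str], float]:
    best = list(skill)
    best_score = validate(best)
    for step in range(steps):
        # Edits are applied in order, so later indices refer to the partially edited text.
        candidate = best
        for edit in propose(best, step)[:max_edits_per_step]:
            candidate = apply_edit(candidate, edit)
        score = validate(candidate)
        # Validation gate: accept only on strict improvement of the held-out score.
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins so the sketch runs end to end (all values are made up).
REQUIRED = {"check units", "retry on timeout"}


def toy_validate(skill: List[str]) -> float:
    helpful = sum(line in REQUIRED for line in skill)
    harmful = sum(line == "skip verification" for line in skill)
    return float(helpful - harmful)


PROPOSALS = [
    [Edit("add", 0, "check units")],           # helpful
    [Edit("add", 0, "skip verification")],     # harmful
    [Edit("replace", 0, "retry on timeout")],  # neutral swap, not a strict gain
    [Edit("add", 1, "retry on timeout")],      # helpful
]


def toy_propose(skill: List[str], step: int) -> List[Edit]:
    return PROPOSALS[step]


if __name__ == "__main__":
    print(optimize_skill([], toy_propose, toy_validate, steps=4, max_edits_per_step=2))
```

In this toy run the harmful edit and the neutral swap are both rejected, because the gate requires a strict gain rather than a tie. In a real setting `propose` would summarize a batch of scored rollouts, and the size of that batch would play the noise-control role the analogy assigns to batch size.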

This matters because it makes procedural adaptation available for closed frontier models where weight tuning is impossible and prompts are brittle. The skill, not the weights, becomes the recurring object of adaptation — and crucially the deployed artifact (a compact 300–2,000 token best_skill.md) adds zero inference-time model calls, unlike methods that pay an optimization tax at deployment. Across six benchmarks, seven models, and three harnesses, SkillOpt is best-or-tied on all 52 cells and the learned skills transfer (a Codex-trained spreadsheet skill gains +59.7 points moving to Claude Code).

The counterpoint is that the analogy is partial: there is no true gradient, the optimizer is itself an LLM that can hallucinate edits, and "validation" is a held-out task split that can be gamed. But the held-out gate is precisely what disciplines the process: harmful proposals are rejected rather than accumulated. The insight therefore stands: skill text is a trainable parameter space, and the optimizer-plus-validation loop is what makes self-improvement reproducible rather than a source of drift.
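The difference between rejecting and accumulating harmful proposals can be illustrated with a small simulation. This is a toy model, not a result from the paper: the function `run`, the Gaussian proposal noise, and the negative mean are all assumptions, and the gate is idealized as seeing the true effect of each edit, which a real held-out split only approximates.

```python
import random
from typing import List


def run(steps: int, gated: bool, seed: int = 0) -> List[float]:
    """Simulate the skill score over time with or without a validation gate."""
    rng = random.Random(seed)
    score = 0.0
    trace = [score]
    for _ in range(steps):
        # Assumption: each proposed edit shifts the score by a noisy amount
        # that is slightly harmful on average (a fallible optimizer).
        delta = rng.gauss(-0.1, 1.0)
        # Gated: apply the edit only if it strictly improves the score.
        # Ungated: apply every edit, so harmful ones accumulate.
        if not gated or delta > 0:
            score += delta
        trace.append(score)
    return trace


if __name__ == "__main__":
    gated = run(steps=50, gated=True)
    ungated = run(steps=50, gated=False)
    print(f"gated final:   {gated[-1]:.2f}")
    print(f"ungated final: {ungated[-1]:.2f}")
```

With the same seed both runs see the same proposals, so the gated trace never decreases and always ends at or above the ungated one. If the gate's measurement were itself noisy or gameable, as the counterpoint warns, this guarantee would weaken accordingly.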


— "SkillOpt: Executive Strategy for Self-Evolving Agent Skills", https://arxiv.org/abs/2605.23904

Original note title

the agent skill document can be trained like model weights using a text-space optimizer with held-out validation gating