Can extracted skills transfer effectively across different domains and model architectures?

This explores whether skills pulled out of one model's experience — workflows, rules, expert vectors — actually carry over to new task domains and different model backbones, or whether they're locked to where they were learned.

The corpus says the answer is a qualified yes, but the qualifier matters a lot. The most direct evidence comes from work where skills are stored as natural language rather than weights. When a frozen model extracts explicit rules from its context into a reusable "skill" library, those skills lift performance without any weight update and, crucially, transfer across model backbones (Can frozen models learn better by extracting context into skills?). Because the skill is just text describing a procedure, nothing ties it to one model's parameters. Similarly, agent workflow memory induces sub-task routines at a finer grain than whole tasks and abstracts away example-specific values. Its gains grow as the gap between training and test widens, which is the signature of something that generalizes rather than memorizes (Can agents learn reusable sub-task routines from past experience?).
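A minimal sketch of why a text-form skill is portable, under stated assumptions: `SkillLibrary`, `abstract_step`, `solve`, and the two stub backbones are hypothetical names invented for illustration, not the interface of any system cited above. The only point shown is that the stored skill is a plain string with example-specific values replaced by slots, so the same library can be handed to any callable that maps a prompt to text.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Model = Callable[[str], str]  # any backbone: prompt in, text out


def abstract_step(step: str, bindings: Dict[str, str]) -> str:
    """Swap example-specific values for named slots so the step can be reused."""
    for slot, value in bindings.items():
        step = step.replace(value, "{" + slot + "}")
    return step


@dataclass
class SkillLibrary:
    """Skills kept as plain strings; nothing here refers to model parameters."""
    skills: List[str] = field(default_factory=list)

    def add(self, rule: str) -> None:
        if rule not in self.skills:
            self.skills.append(rule)

    def render(self, task: str) -> str:
        lines = ["Skills:"] + [f"- {s}" for s in self.skills] + ["", f"Task: {task}"]
        return "\n".join(lines)


def solve(model: Model, library: SkillLibrary, task: str) -> str:
    """Run a frozen model on a task with the library prepended to the prompt."""
    return model(library.render(task))


def count_skills(prompt: str) -> int:
    return sum(line.startswith("- ") for line in prompt.splitlines())


# Hypothetical stand-ins for two different backbones.
def backbone_a(prompt: str) -> str:
    return f"A saw {count_skills(prompt)} skill(s)"


def backbone_b(prompt: str) -> str:
    return f"B saw {count_skills(prompt)} skill(s)"


library = SkillLibrary()
library.add(abstract_step("Type 'Seattle' into the destination box", {"city": "Seattle"}))
```

Calling `solve(backbone_a, library, task)` and `solve(backbone_b, library, task)` sends both stubs the identical prompt. Whether a real backbone can act on that prompt is a separate question that the sketch does not address.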

The strongest cross-architecture claim comes from decoupling who learns the skills from who runs them. When a separately trained curator evolves a skill repository while the executor stays frozen, the repository drifts away from generic verbose notes toward actionable execution logic and cross-task meta-strategies, and the trained curator generalizes across different executor backbones and domains (Can a separate trained curator improve skill libraries better than frozen agents?). That separation is the design trick: keep the skill representation portable (text, routines, strategy) and the skill is no longer bound to one architecture's weights.
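The curator/executor split described above can be sketched as a loop in which the executor only reads the repository and the curator is the only component that writes to it. This is an illustrative sketch, not the cited system: `run_stream`, `stub_curator`, and both stub executors are hypothetical, and the hand-written curator rule stands in for what the source describes as a trained model.

```python
from typing import Callable, List, Sequence

Executor = Callable[[str, Sequence[str]], bool]       # (task, skills) -> succeeded?
Curator = Callable[[List[str], str, bool], List[str]]  # returns the updated repository


def run_stream(tasks: Sequence[str], executor: Executor,
               curator: Curator, repo: Sequence[str]) -> List[str]:
    """Process tasks in order; the executor is never modified, only the repo is."""
    repo = list(repo)
    for task in tasks:
        succeeded = executor(task, tuple(repo))  # read-only view of the skills
        repo = curator(repo, task, succeeded)    # the curator alone edits the repo
    return repo


def stub_curator(repo: List[str], task: str, succeeded: bool) -> List[str]:
    """Placeholder policy (assumption): add a skill for each task that failed."""
    if succeeded:
        return repo
    return repo + [f"When asked to {task}, check preconditions first."]


def stub_executor_a(task: str, skills: Sequence[str]) -> bool:
    """Toy frozen executor: succeeds only if some skill mentions the task verbatim."""
    return any(task in s for s in skills)


def stub_executor_b(task: str, skills: Sequence[str]) -> bool:
    """A second toy executor with different matching behavior (case-insensitive)."""
    return any(task.lower() in s.lower() for s in skills)


repo = run_stream(["book a flight", "book a flight"], stub_executor_a, stub_curator, [])
```

Because the repository is a list of strings produced independently of either executor, the `repo` built while running `stub_executor_a` can be passed unchanged to `stub_executor_b`.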

The contrast worth noticing is what happens when skills live in the weights instead. Composable expert vectors work by tuning only the singular values of weight matrices, letting a model mix task-specific experts at inference without interference (Can models dynamically activate expert skills at inference time?). That's elegant composition, but it's composition within one model's own weight space, not transfer to a different architecture. The same tension shows up in how domains are taught: knowledge-graph curricula build deep, compositional domain expertise (Can knowledge graphs teach models deep domain expertise?), yet a survey of adaptation methods finds every technique has a domain-conditional sweet spot, and visible performance gains often come paired with hidden degradation in reasoning faithfulness and capability transfer (How do domain training techniques actually reshape model behavior?). So weight-baked skills can specialize beautifully and still fail to travel.
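To make the weight-space point concrete, here is a toy sketch of singular-value scaling, assuming a weight matrix written as W = U diag(sigma) V^T and an expert vector z holding one scale per singular value. It is not the cited method's implementation: the 2x2 factors, the expert vectors `Z_MATH` and `Z_CODE`, and the mixing weights are all made-up illustrative values, and the matrix is defined directly by its factors so that no SVD routine is needed.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rotation(theta):
    """2x2 rotation matrix, used here as a convenient orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def rebuild(u, sigma, v, z):
    """Return U @ diag(sigma * z) @ V^T: the weight with rescaled singular values."""
    n = len(sigma)
    scaled = [[sigma[i] * z[i] if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(matmul(u, scaled), transpose(v))


def mix(experts, alphas):
    """Linearly combine expert vectors using mixing weights alphas."""
    return [sum(a * e[i] for a, e in zip(alphas, experts))
            for i in range(len(experts[0]))]


# Toy frozen weight, defined by its factors (hypothetical values).
U, V = rotation(0.3), rotation(-1.1)
SIGMA = [3.0, 1.0]
W = rebuild(U, SIGMA, V, [1.0, 1.0])

# Two hypothetical expert vectors, one scale per singular value of W.
Z_MATH = [1.2, 1.0]
Z_CODE = [1.0, 0.5]

z_mixed = mix([Z_MATH, Z_CODE], [0.25, 0.75])
W_adapted = rebuild(U, SIGMA, V, z_mixed)
```

Each entry of `z_mixed` is indexed by a singular direction of this particular `W`. A different model has different `U` and `V` (and possibly different shapes), so the same vector has no defined meaning there, which is one way to read "composition within one model's own weight space."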

The sharpest limit on text-form transfer is that prompting and instruction can only reorganize knowledge that's already present. Prompt optimization retrieves existing capability but cannot inject knowledge a model never had (Can prompt optimization teach models knowledge they lack?), and instruction tuning largely teaches the output format rather than the underlying task: semantically empty instructions perform about as well as correct ones (Does instruction tuning teach task understanding or output format?). Read together, these explain the boundary condition for transfer: an extracted skill ports cleanly when it's activating capability the receiving model already latently has. Hand a skill to a model lacking the foundational knowledge it presumes, and you hit a hard ceiling no amount of clever transfer can cross.

What you might not have expected to learn: transferability isn't really a property of the skill; it's a property of the *representation* you store it in. Skills written as portable language (rules, routines, meta-strategies) cross both domains and architectures; skills fused into weights compose powerfully but stay home. And even the portable ones only fire when the destination model is already capable enough to use them, which reframes "can skills transfer" as the more useful question of "transfer to whom."

Sources (8 notes)

Can frozen models learn better by extracting context into skills?

Extracting natural-language rules from context into reusable skills improves frozen model reasoning without weight updates. On CL-bench, this lifts GPT-4.1 from 11.1% to 16.5%, with skills transferable across model backbones.

Can agents learn reusable sub-task routines from past experience?

Agent Workflow Memory induces sub-task routines at finer granularity than full tasks, abstracts example-specific values, and compounds them hierarchically. This produces a 24.6% relative gain on Mind2Web and a 51.1% relative gain on WebArena, with larger gains as train-test gaps widen.

Can a separate trained curator improve skill libraries better than frozen agents?

SkillOS shows that separating a trainable curator from a frozen executor, grouped by task streams, causes skill repositories to shift from generic verbose additions toward actionable execution logic and cross-task meta-strategies. The trained curator generalizes across different executor backbones and domains.

Can models dynamically activate expert skills at inference time?

Transformer² demonstrates that tuning only singular values within weight matrices produces composable expert vectors that dynamically mix at inference without interference, outperforming LoRA with fewer parameters and enabling continual specialization.

Can knowledge graphs teach models deep domain expertise?

Fine-tuning a 32B model on 24,000 reasoning tasks derived from medical knowledge graph paths produces state-of-the-art performance across 15 medical domains, demonstrating that structured knowledge composition matters more than scale.

How do domain training techniques actually reshape model behavior?

A survey of adaptation methods finds that every method, from parameter-efficient tuning to knowledge graph curricula, has optimal conditions tied to specific domains. The key finding: visible benefits like performance gains often come with hidden degradation in reasoning faithfulness, capability transfer, and format flexibility.

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Does instruction tuning teach task understanding or output format?

Models trained on semantically empty or deliberately incorrect instructions perform comparably to those trained on full, correct instructions (43% versus a random baseline of 42.6%). The semantic content of instructions appears largely irrelevant; what transfers is knowledge of the output space.

Next inquiring lines