MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation
Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.
Large language model (LLM) agents are increasingly tasked with solving complex, real-world problems that involve interacting with external tools, data, and code, often spanning many steps and disparate domains. As task scope grows, raw model reasoning alone is insufficient: agents need access to reusable units of capability, namely skills, that encapsulate procedures, executable code, or domain-specific instructions and can be composed into solutions. Skills are emerging as the natural abstraction for scalable agent systems because they decouple capability from monolithic model weights, enabling modular execution and the accumulation of structured domain knowledge. The central open question is how to enable agents to continuously improve their capabilities through skills they can obtain, organize, and refine on their own, without relying on human authoring at every step.
We argue that skills should not be one-off generation outputs but long-lived, evolving assets of an agent system. A useful skill is created on demand within the agent's reasoning loop, stored with associated experience and metadata, retrieved when contextually relevant, validated through tests and runtime feedback, and continuously refined as new evidence accumulates. We formalize this perspective as a unified skill lifecycle with five stages: creation, memory, management, evaluation, and refinement. This reframing turns skills from disposable artifacts into managed, testable, and transferable infrastructure: the foundation agents need to accumulate experience across tasks, sessions, and even different agent systems. We instantiate this lifecycle in MUSE-Autoskill Agent, which tightly couples skill creation with execution through a built-in skill_create tool invoked from within the runtime loop, eliminating the mismatch between how skills are created and how they are used.
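To make the five lifecycle stages concrete, the following is a minimal Python sketch of how a skill record and its registry might be organized. It is an illustration under stated assumptions, not the MUSE-Autoskill implementation: the names Skill, SkillRegistry, create, retrieve, evaluate, and refine are hypothetical, the keyword-overlap retrieval stands in for whatever selection mechanism the framework actually uses, and the built-in skill_create tool is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

SkillFn = Callable[[str], str]
SkillTest = Callable[[SkillFn], bool]


@dataclass
class Skill:
    """Hypothetical skill record: executable code plus metadata, tests, and experience."""
    name: str
    description: str
    run: SkillFn
    tests: List[SkillTest] = field(default_factory=list)
    experience: List[str] = field(default_factory=list)  # skill-level memory (assumed form)
    version: int = 1


class SkillRegistry:
    """Hypothetical store covering creation, memory, management, evaluation, refinement."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def create(self, skill: Skill) -> None:
        # Creation + memory: register the skill so later tasks can reuse it.
        self._skills[skill.name] = skill

    def retrieve(self, query: str) -> Optional[Skill]:
        # Management: pick the skill whose description shares the most words
        # with the query (a deliberately naive stand-in for real selection).
        words = set(query.lower().split())
        best, best_score = None, 0
        for skill in self._skills.values():
            score = len(words & set(skill.description.lower().split()))
            if score > best_score:
                best, best_score = skill, score
        return best

    def evaluate(self, name: str) -> bool:
        # Evaluation: a skill passes only if every unit test returns True;
        # an exception raised by the skill counts as a failure.
        skill = self._skills[name]
        for test in skill.tests:
            try:
                if not test(skill.run):
                    return False
            except Exception:
                return False
        return True

    def refine(self, name: str, new_run: SkillFn, note: str) -> None:
        # Refinement: swap in the improved code, bump the version, and
        # record what was learned in the skill's own memory.
        skill = self._skills[name]
        skill.run = new_run
        skill.version += 1
        skill.experience.append(note)


registry = SkillRegistry()
registry.create(Skill(
    name="csv_sum",
    description="sum a comma separated list of numbers",
    run=lambda text: str(sum(int(x) for x in text.split(","))),
    tests=[
        lambda run: float(run("1,2,3")) == 6.0,
        lambda run: float(run("1.5,2.5")) == 4.0,  # fails for the int-based version
    ],
))

found = registry.retrieve("sum these numbers")
passed_before = registry.evaluate("csv_sum")
if not passed_before:
    registry.refine(
        "csv_sum",
        lambda text: str(sum(float(x) for x in text.split(","))),
        note="int parsing failed on decimal input; switched to float",
    )
passed_after = registry.evaluate("csv_sum")
```

In this sketch the first version of csv_sum fails its second test, is refined once, and then passes, with the reason for the change kept alongside the skill. The point of the design is that tests and accumulated experience travel with the skill rather than living in a single task's transcript.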
We present a skill-centric agent framework that enables agents to improve their task-solving capability by acquiring, reusing, and refining skills through a unified lifecycle. By representing all functionality as structured skills, the agent reduces redundant reasoning and improves efficiency over time. Our design integrates skill creation, evaluation, execution, memory, and management, supported by minimal built-in skills such as skill_create and web_search. Experiments on SkillsBench demonstrate that human-authored skills reliably improve task accuracy, that MUSE-Autoskill can automatically generate high-quality skills from successful trajectories (reaching 87.94% on tasks with generated skills), and that generated skills transfer to other agents with minimal accuracy loss. The framework is already deployed in production systems that expose skill creation, discovery, and lifecycle management to real users. Overall, this work provides a scalable approach toward agents with continuously evolving capabilities.