SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Summary (Overview)
- Collective Skill Evolution: Proposes SkillClaw, a framework for continuous, collective evolution of reusable skills in multi-user LLM agent ecosystems by aggregating interaction trajectories across users.
- Agentic Evolver: Employs an autonomous LLM-based evolver that analyzes aggregated session evidence to perform open-ended reasoning and decide on skill updates via refinement, creation, or skipping.
- Closed-Loop System: Establishes an automated, background loop: Interaction → Evidence Aggregation → Agentic Evolution → Validation → Skill Synchronization, requiring no user intervention.
- Empirical Gains: Demonstrates significant performance improvements on WildClawBench, with relative gains up to +88.41% across diverse task categories (e.g., Social Interaction, Creative Synthesis) after multiple evolution rounds.
- Case Study Insights: Skill evolution improves tasks by correcting procedural errors (e.g., API ports), structuring workflows, introducing robustness checks (e.g., file validation), and enabling stricter constraint verification.
Introduction and Theoretical Foundation
Large Language Model (LLM) agents like OpenClaw rely on reusable skills—structured procedures for tool interaction and task solving—as core building blocks. However, current skill ecosystems are largely static; skills are manually installed and maintained, and solutions discovered during user interactions do not persist beyond individual sessions. This leads to a critical inefficiency: similar workflows, tool usage patterns, and failure modes are repeatedly rediscovered across different users, preventing the system from accumulating knowledge and improving with collective experience.
The fundamental challenge is converting heterogeneous, cross-user experiences into reliable, generalized skill updates. Existing approaches are insufficient: memory-based methods store instance-specific trajectories but struggle to generalize; skill-based methods treat libraries as static resources; and local refinements remain isolated. SkillClaw addresses this by introducing a framework for collective skill evolution. It treats cross-user and over-time interactions as the primary signal for improvement, continuously aggregating trajectories and processing them with an autonomous agentic evolver to update a shared skill repository.
The formal goal is, given a shared skill set $\mathcal{S}$ and a set of user session trajectories $\mathcal{T} = \{\tau_1, \ldots, \tau_N\}$, to produce an updated set
$$\mathcal{S}' = \mathrm{Evolve}(\mathcal{S}, \mathcal{T})$$
such that improvements benefit all future users.
Methodology
SkillClaw operates through a centralized evolution architecture integrated into a multi-user agent deployment. The core pipeline is a closed loop:
Multi-user Interaction → Session Collection → Skill Evolution → Skill Synchronization
1. From Isolated Sessions to Shared Evidence
- Session Recording: Each agent interaction produces a structured session trajectory that preserves the full causal chain:
prompt → action → feedback → ... → agent response. This includes tool calls, errors, and user feedback, which are critical for diagnosing procedural failures.
- Evidence Aggregation: Sessions are uploaded to a central repository and grouped by referenced skills. For a skill $s$, its group $G_s$ collects all sessions that referenced $s$; sessions using no skill form a separate group $G_\emptyset$. This grouping enables cross-user comparison, revealing where a skill works or fails under diverse conditions.
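The grouping step above can be sketched in a few lines of Python. The `Session` schema and its field names are illustrative assumptions, not the paper's actual data model:

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Hypothetical session record; fields are illustrative assumptions.
@dataclass
class Session:
    user_id: str
    skills_used: list               # names of skills referenced in this session
    steps: list = field(default_factory=list)  # (prompt, action, feedback) chain
    success: bool = False

def group_by_skill(sessions):
    """Group uploaded sessions by each referenced skill; sessions that
    used no skill go into a separate "" (no-skill) group."""
    groups = defaultdict(list)
    for s in sessions:
        if not s.skills_used:
            groups[""].append(s)      # candidate evidence for Create
        for skill in s.skills_used:
            groups[skill].append(s)   # evidence for Refine / Skip
    return dict(groups)

sessions = [
    Session("u1", ["slack_summarize"], success=False),
    Session("u2", ["slack_summarize"], success=True),
    Session("u3", []),
]
groups = group_by_skill(sessions)
```

A session referencing several skills would land in several groups, which matches the paper's per-skill evidence view: each evolver run sees every trajectory that touched its skill.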
2. Agentic Skill Evolution
The core component is an agentic evolver—an LLM agent equipped with a structured harness providing the grouped evidence and current skill definitions. For each skill group $G_s$, the evolver analyzes both successful and failed executions and selects one of three actions:
- Refine: Update the skill to correct errors or improve robustness.
- Create: Introduce a new skill when evidence reveals recurring, uncaptured sub-procedures.
- Skip: Leave the skill unchanged if evidence is insufficient.
The evolver reasons jointly over successes (defining invariants to preserve) and failures (defining targets to correct), ensuring updates are cumulative and stable. The overall process is outlined in Algorithm 1.
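The per-group decision loop can be sketched as below; `propose_update` stubs in for the agentic evolver's open-ended LLM reasoning, and all names and the dict-based schema are illustrative assumptions rather than the paper's Algorithm 1 verbatim:

```python
def propose_update(skill_def, successes, failures):
    """Stub for the LLM evolver. Returns ("refine", new_def),
    ("create", new_skill), or ("skip", None). A real evolver would
    reason over successes (invariants) and failures (targets)."""
    if not failures:                  # nothing to correct: preserve the skill
        return ("skip", None)
    if skill_def is None:             # no-skill group with failures: new skill
        return ("create", {"name": "new_skill", "body": "..."})
    # Otherwise refine, keeping successful behavior and patching failures.
    return ("refine", {**skill_def, "body": skill_def["body"] + " [patched]"})

def evolve(skill_library, groups):
    """One evolution round: visit each skill group, apply the evolver's choice."""
    updates = {}
    for skill_name, sessions in groups.items():
        successes = [s for s in sessions if s["success"]]
        failures = [s for s in sessions if not s["success"]]
        current = skill_library.get(skill_name)
        action, payload = propose_update(current, successes, failures)
        if action != "skip":
            updates[skill_name or payload["name"]] = payload
    return updates  # candidates only; validation happens before deployment

library = {"summarize": {"name": "summarize", "body": "steps"}}
groups = {
    "summarize": [{"success": False}, {"success": True}],  # mixed evidence
    "": [{"success": False}],                              # no-skill failures
}
updates = evolve(library, groups)
```

In this toy run, the mixed-evidence group yields a refined `summarize` skill and the no-skill failures yield a new skill, while a purely successful group would be skipped.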
3. Skill Synchronization and Validation Loop
Candidate skill updates undergo validation in real user environments before deployment. The system executes both the original and the updated skill on relevant tasks and compares outcomes. An update is accepted only if it demonstrates better performance, enforcing monotonic improvement of the deployed skill pool. Accepted skills are merged into the shared repository and synchronized to all agents, forming a continuous evolution loop.
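The accept-only-if-better gate can be sketched as follows; `run_tasks` is a hypothetical scorer standing in for actual execution of both skill versions on relevant tasks in user environments:

```python
def run_tasks(skill, tasks):
    """Stub scorer: mean score of `skill` over `tasks`, each task
    returning a value in [0, 1]. Real validation executes the skill."""
    return sum(task(skill) for task in tasks) / len(tasks)

def validate(original, candidate, tasks):
    """Accept the candidate only if it strictly outperforms the original,
    so the deployed skill pool never regresses."""
    if run_tasks(candidate, tasks) > run_tasks(original, tasks):
        return candidate
    return original

# Toy task: rewards skills that validate their inputs (illustrative only).
tasks = [lambda skill: 1.0 if "validate inputs" in skill else 0.5]

kept = validate("naive workflow", "validate inputs, then summarize", tasks)
rejected = validate("naive workflow", "still naive", tasks)
```

Because ties keep the original, a candidate that merely matches baseline performance is discarded, which is one simple way to realize the monotonicity property the paper describes.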
Key Properties:
- Collective Evolution: Knowledge from individual interactions contributes to a shared, continuously improving ecosystem.
- Full Automation: The entire pipeline runs without manual curation, driven solely by normal user interaction.
- Agentic Adaptability: Updates are produced through open-ended reasoning, not predefined rules, enabling handling of novel failure modes.
Empirical Validation / Results
Evaluation was conducted on WildClawBench, a real-world agent benchmark with 60 complex tasks across six domains (see Table 1), executed in full Linux containers with multimodal inputs and fine-grained evaluation (see Table 2).
Experimental Setup: A 6-day simulation with 8 concurrent users. Each day had a daytime interaction phase (generating sessions) and a nighttime evolution/validation phase. The backbone model was Qwen3-Max. Only validated skill improvements were deployed.
Main Results (User-side Performance): Performance improved consistently across categories, with gains consolidating into a stable, best skill pool for daytime deployment.
| Category | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Day 6 | Abs. Gain (pp) | Rel. Gain |
|---|---|---|---|---|---|---|---|---|
| Social Interaction | 54.01% | 60.34% | 60.34% | 60.34% | 60.34% | 60.34% | +6.33 | +11.72% |
| Search & Retrieval | 22.73% | 30.00% | 30.00% | 34.55% | 34.55% | 34.55% | +11.82 | +52.00% |
| Creative Synthesis | 11.57% | 21.80% | 21.80% | 21.80% | 21.80% | 21.80% | +10.23 | +88.41% |
| Safety & Alignment | 24.00% | 24.00% | 24.00% | 24.00% | 32.00% | 32.00% | +8.00 | +33.33% |
Table 3: User-side daytime results showing performance gains over 6 evolution rounds.
Analysis of Evolution Patterns (Tables 4-7): Skill evolution followed distinct, category-specific trajectories:
- Social Interaction: Early, sharp improvement from refining workflow explicitness and execution reliability (e.g., rewriting a summarization skill into strict procedural steps).
- Search & Retrieval: Staged improvement, first resolving input/file validation, then advancing to constraint-aware retrieval planning.
- Creative Synthesis: Large early jump from fixing environment setup (workspace validation), then plateauing as later multimodal pipeline skills did not surpass the early best pool.
- Safety & Alignment: Later improvement focused on execution reliability under real-world constraints (e.g., Git authentication fallbacks, correct cloning procedures).
Controlled Validation: A controlled test on three custom queries isolating common failure modes showed an average gain of +42.1% after a single evolution round, confirming the mechanism's effectiveness for procedural corrections.
| Query | Baseline (%) | Post-Evolve (%) | Gain |
|---|---|---|---|
| basic extraction | 21.7% | 69.6% | +47.8% |
| deadline parsing | 41.1% | 48.0% | +6.9% |
| save report | 28.3% | 100.0% | +71.7% |
| Average | 30.4% | 72.5% | +42.1% |
Controlled validation results on the three custom queries.
Case Studies (Figures 2-5): Illustrate how evolution concretely improves agent behavior:
- Slack Analysis (Fig 2): Evolved skill corrected API port, added selective full-message retrieval, and specified output path, transforming a naive, error-prone workflow into a structured, reliable pipeline.
- ICCV Paper Analysis (Fig 3): Evolved skill introduced a strict "first affiliation" definition and targeted manual verification, replacing heuristic name-matching with accurate, robust counting.
- SAM3 Inference (Fig 4): Evolved skill added environment prechecks, treated missing paths as non-blocking, and enabled adaptive execution (e.g., CPU patching), improving robustness under incomplete conditions.
- Product Selection (Fig 5): Evolved skill enforced structured, constraint-aware verification against official sources and calibrated decision-making, preventing early stopping on partial matches.
Theoretical and Practical Implications
Theoretical Implications: SkillClaw represents a conceptual shift from static skill libraries to dynamic, interaction-driven skill ecosystems. It demonstrates that skills can and should evolve through real-world usage, leveraging aggregated cross-user experience as a powerful signal for system-level capability growth. The agentic evolution paradigm bridges the gap between isolated interaction-level improvements and collective, cumulative learning.
Practical Implications:
- For Deployed Agent Systems: Provides a scalable, automated pathway for continuous improvement without requiring user intervention or manual engineering. Improvements discovered by one user automatically benefit the entire community.
- For Robustness and Reliability: Evolution directly targets recurring procedural failures and environmental mismatches, leading to more robust and reliable agents in real-world, imperfect conditions.
- For Skill Design: Highlights the importance of designing skills as editable, evidence-compressing artifacts rather than fixed instructions. The framework's validation loop ensures deployment stability.
Conclusion
SkillClaw enables collective skill evolution in multi-user agent ecosystems by transforming ordinary interaction trajectories into shared evidence and employing an agentic evolver for updates. This establishes a continuous loop where interaction drives skill improvement, and improved skills enhance future interactions. The framework is fully automatic, collective, and adaptive, demonstrating significant performance gains in realistic benchmarks. This work motivates future research on self-improving agent systems that leverage cross-user experience to achieve continuous, cumulative capability growth.