SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

Summary (Overview)

Framework: SkillsVote is a comprehensive lifecycle governance framework for LLM agent skills, managing the entire process from open-source skill collection, profiling, and task-conditioned recommendation to post-execution outcome attribution and evidence-gated library evolution.
Core Mechanism: It introduces a subtask-level attribution layer to decompose raw agent trajectories into evolvable units, assigning outcomes to skill use, agent exploration, or environment. Only successful, reusable discoveries are admitted to update the skill library.
Key Results: The framework improves agent performance without model updates. Offline evolution (distilling a library from historical tasks) improved GPT-5.2 on Terminal-Bench 2.0 by up to +7.9 percentage points. Online evolution improved GPT-5.2 on SWE-Bench Pro by up to +2.6 percentage points.
Governance Benefit: Task-conditioned recommendation acts as a critical filter, preventing harmful skill exposure and negative transfer, especially in early online regimes where indiscriminate library exposure can degrade performance.
Corpus Scale: The system is built upon a million-scale open-source Agent Skill corpus collected from GitHub, which is profiled for runtime requirements, quality, and verifiability to serve as a governed experience substrate.

Introduction and Theoretical Foundation

Recent progress in LLM agents has shifted focus from single-turn generation to long-horizon systems that operate in complex environments like terminals, codebases, and web applications. These agents produce lengthy execution trajectories that contain potentially reusable experience. However, raw trajectories are a weak substrate for long-term reuse: they are noisy, environment-bound, and conflate robust strategies with incidental state.

Agent Skills provide a more structured schema, coupling executable scripts with non-executable procedural guidance in a single artifact (e.g., a SKILL.md file within a directory package). This makes experience more compact than a full trajectory while preserving more context than a natural-language summary.

The core challenge shifts from authoring individual skills to governing open ecosystems. Public skill ecosystems exhibit scale, redundancy, uneven quality, and safety risks. Furthermore, the benefit of a skill is context-dependent; weakly related or low-quality skills can degrade performance. A key failure mode is library pollution, where indiscriminate incorporation of weakly supported lessons harms future agent performance.

SkillsVote addresses this by framing skills as managed lifecycle artifacts. It connects collection, profiling, recommendation, attribution, and evolution into a closed, auditable loop to control two coupled risks: irrelevant skills distracting agents before execution, and misattributed experience polluting the library after execution.

Methodology

SkillsVote's approach consists of four main stages: corpus profiling, recommendation, attribution, and controlled evolution.

1. Open-Source Skill Corpus and Profiling

Collection: Builds a million-scale corpus from GitHub SKILL.md files, treating each skill as a directory-level package (containing SKILL.md, optional scripts/, references/, assets/).
Profiling: Each skill is profiled along three dimensions:
1. Runtime Requirements: Estimates OS assumptions, permissions, network access, required tools, API keys, etc.
2. Quality: Evaluates consistency, completeness, and task orientation.
3. Verifiability: Assesses if the skill has a low-ambiguity success condition and a reproducible sandbox environment.
Task Synthesis: For verifiable skills, the system synthesizes executable benchmark tasks (following the Harbor format) to link static descriptions to observed execution behavior.
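
As a rough illustration of what a profiled skill record might look like, the sketch below models the three profiling dimensions as plain Python dataclasses. Every class, field, and function name here (SkillProfile, RuntimeRequirements, package_layout, the quality score keys) is an assumption made for illustration; the source does not specify a schema.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class RuntimeRequirements:
    """Hypothetical container for estimated runtime assumptions."""
    os: list[str] = field(default_factory=list)
    needs_network: bool = False
    needs_elevated_permissions: bool = False
    required_tools: list[str] = field(default_factory=list)
    api_keys: list[str] = field(default_factory=list)


@dataclass
class SkillProfile:
    """Hypothetical profile covering runtime, quality, and verifiability."""
    name: str
    runtime: RuntimeRequirements
    # Assumed keys: "consistency", "completeness", "task_orientation".
    quality: dict[str, float]
    has_clear_success_condition: bool
    has_reproducible_sandbox: bool

    @property
    def verifiable(self) -> bool:
        # Verifiable only if both conditions named in the text hold.
        return self.has_clear_success_condition and self.has_reproducible_sandbox


def package_layout(skill_dir: Path) -> dict[str, bool]:
    """Report which parts of a directory-level skill package exist."""
    return {
        "SKILL.md": (skill_dir / "SKILL.md").is_file(),
        "scripts": (skill_dir / "scripts").is_dir(),
        "references": (skill_dir / "references").is_dir(),
        "assets": (skill_dir / "assets").is_dir(),
    }
```

Under this sketch, only profiles whose verifiable property is true would be candidates for task synthesis.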

2. Skill Recommendation via Agentic Library Search

Before the solver agent begins a task, a separate recommendation stage performs an agentic search over the structured skill library. This goes beyond static semantic matching or progressive disclosure of metadata.

The recommender agent "searches the local skill library, selectively reads candidate SKILL.md files and related resources, and selects skills that cover the task, fit the target environment, and provide complementary guidance."

The output is a compact set of skill names and a concise usage guide for the solver agent, not the full library. This pre-task exposure control is crucial for filtering noise and preventing negative transfer.
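
The exposure-control idea can be sketched as a small output contract: the solver sees only the recommended skill names and the usage guide, never the whole library. The Recommendation type, the solver_context function, and the dictionary-based library are hypothetical names for illustration, and the agentic search itself is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical recommender output: skill names plus a short guide."""
    skill_names: tuple[str, ...]
    usage_guide: str


def solver_context(task: str, rec: Recommendation, library: dict[str, str]) -> str:
    """Build the solver's context from recommended skills only.

    `library` maps skill name -> one-line description (an assumed shape).
    Skills that were not recommended are never included in the output.
    """
    selected = [name for name in rec.skill_names if name in library]
    lines = [f"Task: {task}", "Recommended skills:"]
    lines += [f"- {name}: {library[name]}" for name in selected]
    lines.append(f"Usage guide: {rec.usage_guide}")
    return "\n".join(lines)
```

The design choice illustrated is that filtering happens before the solver starts, so unselected skills cannot distract it during execution.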

3. Distilling Execution Traces into Evolvable Units (Attribution)

Effective skill evolution faces a granularity gap: task-level success signals are too coarse, while individual tool calls are too fragmented. SkillsVote bridges this gap by inserting a subtask-level attribution layer.

A subtask is defined as the smallest semantically complete unit that can support library evolution. It has:

One standalone objective.
One primary evaluation signal (environment, human, or unknown).
At most one associated skill context.

Trajectories are split when any of these three boundaries changes. For each subtask, attribution compresses evidence along three axes:

Outcome Evidence: Records the type of judgment signal.
Responsibility Assignment: Assigns the final state and its main cause using a detailed attribution taxonomy (see Table 4).
Reusable Delta: Localizes the portions of skill knowledge actually used and extracts only reusable discoveries (missing procedures, preconditions, recovery patterns).
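
The splitting rule can be sketched as grouping consecutive steps that share the same objective, evaluation signal, and skill context. This is a minimal sketch that assumes each step has already been labeled with those three fields (how the labels are produced is outside the sketch); Step, boundary_key, and split_into_subtasks are hypothetical names.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Optional


@dataclass(frozen=True)
class Step:
    """One labeled trajectory step (labels are assumed to be given)."""
    action: str
    objective: str
    signal: str              # "environment", "human", or "unknown"
    skill: Optional[str]     # at most one associated skill context


def boundary_key(step: Step) -> tuple[str, str, Optional[str]]:
    # The three boundaries whose change triggers a split.
    return (step.objective, step.signal, step.skill)


def split_into_subtasks(trajectory: list[Step]) -> list[list[Step]]:
    """Start a new subtask whenever any of the three boundaries changes.

    itertools.groupby groups only consecutive items with equal keys,
    which matches a split-on-change rule over an ordered trajectory.
    """
    return [list(group) for _, group in groupby(trajectory, key=boundary_key)]
```

Each resulting group would then receive its own outcome evidence, responsibility assignment, and reusable delta.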

4. Evidence-Based Controlled Skill Evolution

The attribution layer produces evolvable units, but library evolution requires explicit control. SkillsVote uses evidence-gated update construction:

Admissibility: A unit is admissible only if it is successful and contains reusable exploration. Failed or uncertain evidence is skipped for evolution (but may be kept for diagnosis).
Aggregation: Admissible units supporting the same reusable change are merged into a single proposed update.
Routing: Aggregated evidence is routed to an update action:
- Edit Skill (error_fix, knowledge_addition, prerequisite_addition): If evidence extends a skill that shaped execution.
- Create Skill: If evidence reflects an independent reusable capability outside the current skill boundary.
- Skip: If evidence is weak, redundant, or misaligned.

This process ensures evolution is conservative; every library change must be supported by attributed evidence, localized to the relevant skill, and expressed as reusable procedural knowledge.
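
The gate described above (admissibility, aggregation, routing) can be sketched as three small functions. This is an illustrative sketch, not the system's implementation: the Unit fields, the change_key used to identify "the same reusable change", and the rule that disagreement on the skill context counts as misaligned evidence are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

# Edit types named in the routing description.
EDIT_KINDS = {"error_fix", "knowledge_addition", "prerequisite_addition"}


@dataclass(frozen=True)
class Unit:
    """Hypothetical evolvable unit produced by attribution."""
    subtask_id: str
    successful: bool
    reusable_delta: Optional[str]   # None when nothing reusable was found
    skill: Optional[str]            # skill that shaped execution, if any
    change_key: str                 # assumed id for "same reusable change"
    edit_kind: Optional[str] = None


def admissible(unit: Unit) -> bool:
    # Only successful units with a reusable discovery pass the gate.
    return unit.successful and bool(unit.reusable_delta)


def aggregate(units: list[Unit]) -> dict[str, list[Unit]]:
    """Merge admissible units that support the same reusable change."""
    merged: dict[str, list[Unit]] = defaultdict(list)
    for unit in units:
        if admissible(unit):
            merged[unit.change_key].append(unit)
    return dict(merged)


def route(group: list[Unit]) -> tuple[str, Optional[str]]:
    """Map one aggregated group to edit_skill, create_skill, or skip."""
    skills = {u.skill for u in group}
    if len(skills) != 1:
        return ("skip", None)            # empty or misaligned evidence
    (skill,) = skills
    if skill is None:
        return ("create_skill", None)    # outside any existing skill
    if {u.edit_kind for u in group} <= EDIT_KINDS:
        return ("edit_skill", skill)
    return ("skip", None)


def plan_updates(units: list[Unit]) -> dict[str, tuple[str, Optional[str]]]:
    return {key: route(group) for key, group in aggregate(units).items()}
```

Failed or uncertain units never reach route in this sketch, so they cannot trigger a library change, though a caller could still log them for diagnosis.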

Empirical Validation / Results

The framework was evaluated on Terminal-Bench 2.0 (89 terminal tasks) and SWE-Bench Pro public (731 software-engineering tasks), using Codex with GPT-5.2 or GPT-5.4 mini backbones. Experiments assessed three control points: offline evolution, online evolution, and recommendation.

Main Performance Results

Table 1: Main results on Terminal-Bench 2.0 (avg@5 Accuracy)

Model / Setting	Overall (89)	Easy (4)	Medium (55)	Hard (30)
GPT-5.2
Medium (no-skill)	51.0	75.0	54.9	40.7
Online	53.7 ↑2.7	75.0	62.9 ↑8.0	34.0 ↓6.7
Offline	58.9 ↑7.9	90.0 ↑15.0	65.1 ↑10.2	43.3 ↑2.7
GPT-5.4 mini
Medium (no-skill)	51.7	75.0	61.8	30.0
Online	52.8 ↑1.1	75.0	63.6 ↑1.8	30.0
Offline	57.5 ↑5.8	65.0 ↓10.0	64.7 ↑2.9	43.3 ↑13.3
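
As a consistency check on Table 1, the Overall column can be recomputed from the per-difficulty columns. The sketch below assumes Overall is the task-count-weighted mean of the Easy, Medium, and Hard accuracies (the source does not state this explicitly, but the reported numbers agree with it to one decimal place). The names SUBSET_SIZES, ROWS, and weighted_overall are illustrative.

```python
# Number of tasks per difficulty subset, from the Table 1 header.
SUBSET_SIZES = {"easy": 4, "medium": 55, "hard": 30}

# (model, setting) -> (per-subset accuracy, reported overall), from Table 1.
ROWS = {
    ("GPT-5.2", "no-skill"): ({"easy": 75.0, "medium": 54.9, "hard": 40.7}, 51.0),
    ("GPT-5.2", "online"): ({"easy": 75.0, "medium": 62.9, "hard": 34.0}, 53.7),
    ("GPT-5.2", "offline"): ({"easy": 90.0, "medium": 65.1, "hard": 43.3}, 58.9),
    ("GPT-5.4 mini", "no-skill"): ({"easy": 75.0, "medium": 61.8, "hard": 30.0}, 51.7),
    ("GPT-5.4 mini", "online"): ({"easy": 75.0, "medium": 63.6, "hard": 30.0}, 52.8),
    ("GPT-5.4 mini", "offline"): ({"easy": 65.0, "medium": 64.7, "hard": 43.3}, 57.5),
}


def weighted_overall(subset_acc: dict[str, float]) -> float:
    """Task-count-weighted mean accuracy across difficulty subsets."""
    total = sum(SUBSET_SIZES.values())  # 89 tasks
    return sum(SUBSET_SIZES[k] * subset_acc[k] for k in SUBSET_SIZES) / total


if __name__ == "__main__":
    for (model, setting), (acc, reported) in ROWS.items():
        print(f"{model:13s} {setting:9s} recomputed={weighted_overall(acc):.1f} reported={reported}")
```

The weighting also explains why the Medium subset (55 of 89 tasks) dominates the overall figure: for GPT-5.2 in the online setting, the Medium gain outweighs the Hard drop.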

Table 3: Main results on SWE-Bench Pro public (avg@1 Resolve Rate)

Model / Setting	Overall (731)	ansib. (96)	openl. (91)	quteb. (79)	flipt (85)	telep. (76)	vuls (62)	navid. (57)	webcl. (65)	eleme. (56)	nodeb. (44)	tutan. (20)
GPT-5.2
Medium (no-skill)	47.6	49.0	64.8	62.0	32.9	34.2	54.8	49.1	43.1	50.0	47.7	0.0
Online	50.2 ↑2.6	56.2	63.7	68.4	32.9	35.5	56.5	45.6	38.5	50.0	72.7	0.0
GPT-5.4 mini
Medium (no-skill)	46.9	52.1	55.0	64.6	31.8	35.5	50.0	50.9	38.5	46.4	61.4	0.0
Online	49.0 ↑2.1	51.0	59.3	68.4	32.9	38.2	56.5	49.1	38.5	51.8	61.4	0.0

Offline Evolution: Provides the strongest gains. A library distilled from 48 Terminal-Bench Pro tasks transferred well to unseen Terminal-Bench 2.0 tasks, improving GPT-5.2 by +7.9 pp and GPT-5.4 mini by +5.8 pp.
Online Evolution: Yields positive but smaller gains. Starting from an empty library, it improved SWE-Bench Pro performance by +2.6 pp (GPT-5.2) and +2.1 pp (GPT-5.4 mini).

Analysis of Key Mechanisms

Recommendation as a Noise Filter: Ablation studies on the Terminal-Bench 2.0 Hard subset show that, without recommendation, exposing the online library directly makes the negative task-level deltas outweigh the positive ones (mean contribution: +3.3 / -6.7). With recommendation, the positive and negative contributions balance (+6.0 / -6.0). For the offline library, recommendation increases the mean positive contribution from +11.3 to +15.3 and reduces the negative contribution.
Offline Evolution Accumulates Transferable Procedures: The offline library growth (Figure 6) shows both new skill creation and edits to existing skills. Performance on the source benchmark (Terminal-Bench Pro) fluctuates, but the transfer performance to Terminal-Bench 2.0 Hard improves consistently, indicating the library accumulates reusable operational procedures rather than overfitting source-task artifacts.
Case Study of Transfer: Figure 7 presents a representative case. A skill (ubuntu-apache-vhost), evolved from an Apache website task in Terminal-Bench Pro, distilled knowledge about persistent service setup and end-to-end validation. On an unseen Git-server deployment task in Terminal-Bench 2.0, the agent using this skill reused the operational pattern (deploy with a stable Apache server, connect a Git hook, validate via curl), whereas the baseline built a less reliable lightweight Node server and skipped validation.

Theoretical and Practical Implications

Lifecycle View of Agent Skills: SkillsVote demonstrates the necessity of treating skill ecosystems as managed artifacts requiring coupled governance across collection, recommendation, attribution, and evolution. This lifecycle view is critical for scalable and safe experience reuse.
Attribution as a Critical Layer: The subtask-level attribution framework provides a principled method for credit assignment in long-horizon agent trajectories. It bridges the gap between sparse task-level signals and fragmented step-level data, enabling the distillation of reusable procedural knowledge.
Conservative Evolution for Ecosystem Health: The evidence-gated, conservative update policy directly addresses the risk of library pollution. Because every update must be attributable, successful, and reusable, weakly supported lessons are kept out of the skill library as it grows.
Improving Frozen Agents: The results show that governed external skill libraries can improve frozen agent models without any parameter updates, by up to +7.9 pp on Terminal-Bench 2.0 (offline evolution) and +2.6 pp on SWE-Bench Pro (online evolution). This positions skill libraries as a practical substrate for agent improvement.
Recommendation is Essential: The ablation analysis indicates that simply exposing a skill library to an agent is not neutral and can be harmful. Task-conditioned recommendation is a necessary control point for filtering out noise and enabling positive skill transfer.

Conclusion

SkillsVote presents a comprehensive lifecycle framework for governing Agent Skills. By constructing and profiling a million-scale open-source corpus, performing agentic library search for recommendation, introducing a subtask-level attribution layer, and enforcing evidence-gated controlled evolution, it turns execution traces into conservative updates for persistent skill libraries.

The experimental results indicate that this approach can improve frozen LLM agents on challenging terminal and software-engineering benchmarks. Offline evolution builds libraries that transfer strongly to unseen tasks, while online evolution yields smaller but positive cumulative gains. Crucially, the framework highlights that governance, meaning control over skill exposure before execution and over evidence admission after execution, is essential for harnessing open skill ecosystems without succumbing to noise and pollution.

Future directions include extending the framework to broader task domains, studying multi-agent skill evolution, and further refining the attribution models to handle more ambiguous outcome signals.