Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

Summary (Overview)

  • Unified Co-evolution Framework: Skill1 is a framework that trains a single policy to co-evolve three core capabilities of skill-augmented agents—skill selection, utilization, and distillation—simultaneously towards a shared task-outcome objective.
  • Single-Signal Credit Assignment: It derives all learning signals from a single binary task-success reward $r(\tau) \in \{0, 1\}$, decomposing it into a low-frequency trend (for skill selection) and high-frequency variation (for skill distillation), eliminating the need for auxiliary or conflicting reward sources.
  • State-of-the-Art Performance: Skill1 achieves a 97.5% average success rate on ALFWorld and strong performance on WebShop, outperforming prior skill-based and reinforcement learning baselines.
  • Mutually Reinforcing Dynamics: Empirical analysis confirms that the three capabilities improve simultaneously under the unified signal, and ablations show that removing any stage's credit signal degrades all capabilities, demonstrating their mutual dependence.
  • Efficient Library Management: The framework includes mechanisms for skill library maintenance (admission, retirement) and distillation, which compresses experience into concise skills, controlling computational overhead and improving library quality.

Introduction and Theoretical Foundation

Training Large Language Model (LLM) agents via Reinforcement Learning (RL) typically treats each task as an isolated episode, where successful strategies are only implicitly absorbed into policy parameters. A promising solution is to augment agents with a persistent skill library that accumulates reusable strategies, allowing the agent to draw on past successes. The workflow of such skill-augmented agents involves three coupled stages:

  1. Skill Selection: Choosing a relevant skill from the library for the current task.
  2. Skill Utilization: Executing the task while being guided by the selected skill.
  3. Skill Distillation: Deriving new reusable skills from the experience.

Existing methods often optimize these capabilities in isolation or with separate reward sources, leading to partial evolution and conflicting optimization pressures. This paper addresses two fundamental open questions:

  1. How can an agent evolve all three capabilities simultaneously?
  2. How can they co-evolve toward a shared objective?

The authors propose Skill1, a framework that achieves unified evolution by training a single policy to perform and optimize all three stages. The key innovation is credit assignment on a single task-outcome signal $r(\tau)$, decomposing it to provide targeted learning for each capability without auxiliary models.

The problem is formulated as a Partially Observable Markov Decision Process (POMDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega, R, \gamma)$. A state $S = (x, e, \mathcal{B})$ includes a task instruction $x$, environment state $e$, and a persistent skill library $\mathcal{B} = \{s_1, s_2, \dots\}$. A skill $s \in \mathcal{B}$ consists of a natural-language strategy $s.\text{strat}$ (how to act) and a scenario description $s.\text{desc}$ (when it applies). The overall training objective is:

$$\max_{\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, \tau \sim \pi_\theta(\cdot|x)} \left[ r(\tau) \right],$$

where $\pi_\theta$ is the policy optimized with RL algorithms like GRPO.
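
To make the formulation concrete, the following is a minimal Python sketch of the skill and library structures and the binary outcome signal described above. The class and field names (Skill, SkillLibrary, utility, n_selected) are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    strat: str              # s.strat: natural-language strategy (how to act)
    desc: str               # s.desc: scenario description (when the skill applies)
    utility: float = 0.0    # U(s): EMA utility, used later for re-ranking credit
    n_selected: int = 0     # n(s): selection count, used for retirement

@dataclass
class SkillLibrary:
    skills: list = field(default_factory=list)   # persistent library B
    capacity: int = 5000                          # N_max (value from the experimental setup)

def task_outcome(success: bool) -> float:
    """Binary task-success reward r(tau) in {0, 1}: the single learning signal."""
    return 1.0 if success else 0.0
```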

Methodology

Skill1 trains a single policy $\pi_\theta$ to perform a sequential three-stage workflow for each task $x \sim \mathcal{D}$, generating a complete trajectory $\tau = (q, z, a_1, o_1, \dots, a_T, o_T, s_{\text{new}})$.

3.1 Agent Workflow

  1. Skill Selection:

    • Query Generation: The policy generates a natural-language query $q \sim \pi_\theta(\cdot \mid x)$.
    • Retrieval: A frozen encoder $E$ retrieves the top-$K$ candidates by semantic similarity: $\mathcal{B}_K = \operatorname{top\text{-}}K_{s \in \mathcal{B}} \, \text{sim}\left( E(q), E(s.\text{desc}) \right)$.
    • Re-ranking: The policy re-ranks these candidates by generating a permutation $\sigma \sim \pi_\theta(\cdot \mid x, \mathcal{B}_K)$, and the top-ranked skill $z$ is selected for utilization.
  2. Skill Utilization: The policy interacts with the environment for up to $T$ turns conditioned on the selected skill's strategy: $\tau \sim \pi_\theta(\cdot \mid x, z.\text{strat}, o_{\leq t})$. For each task, $G$ independent rollouts are sampled.

  3. Skill Distillation: After each rollout, the policy reflects on the trajectory to produce a new skill $s_{\text{new}}$:

    • A reusable strategy $s_{\text{new}}.\text{strat} \sim \pi_\theta(\cdot \mid x, \tau)$.
    • A scenario description $s_{\text{new}}.\text{desc} \sim \pi_\theta(\cdot \mid x, \tau)$. A new skill is admitted to $\mathcal{B}$ only if $r(\tau) = 1$. When the library reaches capacity $|\mathcal{B}| = N_{\max}$, the skill with the lowest retirement score $U(s) \cdot \log n(s)$ is removed, where $n(s)$ is the selection count (see the retrieval and retirement sketch after this list).
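
Below is a sketch of skill retrieval and library maintenance following the rules above, reusing the Skill/SkillLibrary structures sketched earlier. The cosine helper, the embed callable, and the +1 guard inside the logarithm are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)

def retrieve_top_k(query_emb, library, embed, k=5):
    """B_K: the K skills whose scenario description s.desc is most similar to the query."""
    ranked = sorted(library.skills,
                    key=lambda s: cosine(query_emb, embed(s.desc)),
                    reverse=True)
    return ranked[:k]

def admit(library, new_skill, r_tau):
    """Admit a distilled skill only on success (r(tau) = 1); retire at capacity."""
    if r_tau != 1.0:
        return
    if len(library.skills) >= library.capacity:
        # Retirement score U(s) * log n(s); the +1 guards log(0) in this sketch.
        worst = min(library.skills,
                    key=lambda s: s.utility * math.log(s.n_selected + 1))
        library.skills.remove(worst)
    library.skills.append(new_skill)
```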

3.2 Reward Assignment

The core challenge is assigning credit from the single task-outcome $r(\tau)$ to capabilities operating at different temporal scopes (episode, cross-episode, library improvement).

  • Crediting Utilization: The outcome directly measures execution quality:

    $R_i^{\text{util}} = r(\tau_i).$
  • Crediting Selection:

    • The query $q$ receives gradients through the utilization objective.
    • For re-ranking, a per-skill utility score $U(s)$ is maintained as a low-frequency trend, updated via exponential moving average after each rollout: $U(s) \leftarrow (1 - \alpha) \, U(s) + \alpha \, r(\tau_i), \ \forall s \in \mathcal{B}_K^i$. The best available utility $\hat{U}_i = \max_{s \in \mathcal{B}_K^i} U(s)$ serves as the library baseline.
    • The trend supervises re-ranking by rewarding permutations $\sigma_i$ that agree with the utility ordering, using Normalized Discounted Cumulative Gain (NDCG): $R_i^{\text{rerank}} = \text{NDCG}\left( \sigma_i, \operatorname{argsort}(-U(\mathcal{B}_K^i)) \right).$
  • Crediting Distillation: The credit is the high-frequency variation of the current outcome relative to the library's trend, approximating whether the new skill improves future performance:

    $R_i^{\text{distill}} = r(\tau_i) - \hat{U}_i.$

    A positive variation indicates the experience surpasses the current library boundary. (A sketch of all three credit signals follows this list.)
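
The following sketch shows how the three credit signals can be computed from a single rollout, following the equations above. The NDCG helper and the placement of the utility update relative to computing the distillation credit are assumptions for illustration.

```python
import math

def update_utilities(candidates, r_tau, alpha=0.1):
    """Low-frequency trend: U(s) <- (1 - alpha) * U(s) + alpha * r(tau_i) for all retrieved skills."""
    for s in candidates:
        s.utility = (1 - alpha) * s.utility + alpha * r_tau

def ndcg_against_utilities(utils_in_ranked_order):
    """NDCG of the policy's permutation, using utilities U(s) as relevance scores."""
    dcg = sum(u / math.log2(i + 2) for i, u in enumerate(utils_in_ranked_order))
    idcg = sum(u / math.log2(i + 2)
               for i, u in enumerate(sorted(utils_in_ranked_order, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0

def credit_signals(candidates, permutation, r_tau):
    """Returns (R_util, R_rerank, R_distill) for one rollout i."""
    r_util = r_tau                                             # episode-level outcome
    r_rerank = ndcg_against_utilities(
        [candidates[j].utility for j in permutation])          # agreement with the trend
    u_hat = max(s.utility for s in candidates)                 # library baseline U_hat_i
    r_distill = r_tau - u_hat                                  # high-frequency variation
    return r_util, r_rerank, r_distill
```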

3.3 Joint Optimization

Each rollout $\tau_i$ concatenates four generation segments (query $q_i$, permutation $\sigma_i$, actions $a_{1:T}$, distilled skill $s_{\text{new},i}$). They are optimized jointly in a single gradient step using GRPO, with a combined objective (a sketch follows below):

$$J(\theta) = J^{\text{util}}(\theta) + \lambda_1 J^{\text{rerank}}(\theta) + \lambda_2 J^{\text{distill}}(\theta).$$
  • Utilization & Query: $J^{\text{util}}(\theta)$ uses group-relative advantages computed from $R_i^{\text{util}}$.
  • Re-ranking: Optimized independently per rollout with a REINFORCE-style objective: $J^{\text{rerank}}(\theta) = \frac{1}{N \cdot G} \sum_i R_i^{\text{rerank}} \cdot \log \pi_\theta(\sigma_i \mid x_i, \mathcal{B}_K^i).$
  • Distillation: $J^{\text{distill}}(\theta)$ uses group-relative advantages computed from $R_i^{\text{distill}}$.

The full procedure is summarized in Algorithm 1.
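
A minimal PyTorch-style sketch of the combined objective under the stated decomposition, assuming per-segment log-probabilities have already been gathered for each rollout. The tensor names and the simplified advantage normalization (no clipping or KL terms) are assumptions, not the paper's exact GRPO implementation.

```python
import torch

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize rewards within a group of G rollouts of the same task."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def combined_loss(logp_actions, logp_rerank, logp_distill,
                  r_util, r_rerank, r_distill,
                  lam1=1.0, lam2=1.0):
    """J(theta) = J_util + lambda1 * J_rerank + lambda2 * J_distill, negated for gradient descent."""
    # Utilization & query: group-relative advantages on the task outcome R_i^util.
    j_util = (group_relative_advantage(r_util) * logp_actions).mean()
    # Re-ranking: REINFORCE-style term weighted by the NDCG reward R_i^rerank.
    j_rerank = (r_rerank * logp_rerank).mean()
    # Distillation: group-relative advantages on the variation R_i^distill = r(tau_i) - U_hat_i.
    j_distill = (group_relative_advantage(r_distill) * logp_distill).mean()
    return -(j_util + lam1 * j_rerank + lam2 * j_distill)
```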

Empirical Validation / Results

Experimental Setup: Evaluated on ALFWorld (text-based household environment) and WebShop (online shopping simulator). The base policy is Qwen2.5-7B-Instruct, trained with GRPO ($G=16$). The skill library is initialized empty with a capacity of 5,000.

4.2 Main Results

Table 1: Main results on ALFWorld and WebShop (Success Rate, %). Bold denotes best results in the original table; ↑ indicates improvement over the previous best. “Avg.” is the average success rate across ALFWorld task types and “Succ.” is the WebShop success rate.

| Method | Pick | Look | Clean | Heat | Cool | Pick2 | Avg. | WebShop Score | WebShop Succ. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| w/o Training |  |  |  |  |  |  |  |  |  |
| Zero-Shot | 33.4 | 21.6 | 19.3 | 6.9 | 2.8 | 3.2 | 14.8 | 26.4 | 7.8 |
| ReAct | 48.5 | 35.4 | 34.3 | 13.2 | 18.2 | 17.6 | 31.2 | 46.2 | 19.5 |
| Reflexion | 62.0 | 41.6 | 44.9 | 30.9 | 36.3 | 23.8 | 42.7 | 58.1 | 28.8 |
| RL-Trained w/o Skills |  |  |  |  |  |  |  |  |  |
| PPO | 92.3 | 64.0 | 92.5 | 89.5 | 80.3 | 68.8 | 80.4 | 81.4 | 68.7 |
| GRPO | 90.8 | 66.1 | 89.3 | 74.7 | 72.5 | 64.7 | 77.6 | 79.3 | 66.1 |
| GiGPO | 97.7 | 82.7 | 98.8 | 83.7 | 89.3 | 79.2 | 90.8 | 84.4 | 72.8 |
| RL-Trained w/ Skills |  |  |  |  |  |  |  |  |  |
| SkillRL | 97.9 | 71.4 | 90.0 | 90.0 | 95.5 | 87.5 | 89.9 | 85.2 | 72.7 |
| RetroAgent | 97.9 | 90.9 | 99.2 | 92.9 | 85.3 | 91.0 | 94.9 | 88.9 | 82.3 |
| Skill1 (Ours) | 100.0 ↑2.1 | 98.6 ↑7.7 | 97.3 | 99.2 ↑6.3 | 96.1 ↑0.6 | 96.0 ↑5.0 | 97.5 ↑2.6 | 89.7 | 82.9 |

  • Skill1 achieves a 97.5% average success rate on ALFWorld, surpassing the previous best (RetroAgent, 94.9%) by 2.6 points and ranking first on 5 out of 6 task types. It also demonstrates the best performance on WebShop.
  • Skill1 outperforms the strongest RL-only method (GiGPO, 90.8%) by 6.7 points, demonstrating the value of an explicit, reusable skill library.
  • Performance increases with the degree of co-evolution, as Skill1 optimizes all three stages while prior methods leave at least one unoptimized or use separate rewards.

4.3 Analysis

Ablation Study (Table 2):

Table 2: Ablation study on ALFWorld (Success Rate %). Upper block ablates workflow components; lower block ablates training objectives.

| Method | Pick | Look | Clean | Heat | Cool | Pick2 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Skill1 | 100.0 | 98.6 | 97.3 | 99.2 | 96.1 | 96.0 | 97.5 |
| w/o Selection | 96.9 | 90.3 | 98.0 | 90.4 | 86.5 | 85.3 | 91.8 |
| w/o Distillation | 97.4 | 88.5 | 98.1 | 96.1 | 87.6 | 89.5 | 92.4 |
| w/o Library | 96.7 | 71.5 | 94.9 | 70.7 | 71.5 | 65.5 | 80.9 |
| w/ $\lambda_1=0$ | 99.5 | 80.5 | 98.8 | 100.0 | 90.6 | 84.9 | 94.0 |
| w/ $\lambda_2=0$ | 100.0 | 85.4 | 95.5 | 96.4 | 91.0 | 96.2 | 94.9 |
| w/ $\lambda_1=\lambda_2=0$ | 98.1 | 74.9 | 95.6 | 95.6 | 79.5 | 87.2 | 90.2 |

  • The skill library is foundational; removing it causes the largest drop (16.6 points).
  • Removing distillation or selection also significantly harms performance (5.1 and 5.7 point drops, respectively).
  • The auxiliary objectives ($\lambda_1$, $\lambda_2$) are complementary; removing both causes a sharper decline than removing each individually.

Co-evolution Dynamics:

  • Under unified training, the three capabilities (selection precision, utilization success rate, distillation positive rate) exhibit mutual reinforcement and converge simultaneously.
  • Ablating any credit signal slows all three capabilities, evidencing their mutual dependence.

Evolution of Skill Management:

  • The policy learns to generate increasingly precise selection queries (task-skill similarity improves from 0.51 to 0.60).
  • The library ceiling $\hat{U}$ rises (from ~0.5 to 0.91), indicating the policy distills increasingly effective skills due to the variation signal.

Skill Library Diversity:

  • Skill1 activates a broader set of skills more frequently and covers a diverse strategy space, unlike ablations where skill usage concentrates on a few popular skills.

Computational Overhead (Table 3):

Table 3: Computational cost on ALFWorld training. We report wall-clock time per step (seconds) and library size (number of skills) at three checkpoints.

| Method | Time / Step (s): Step 20 | Step 60 | Step 100 | Library Size: Step 20 | Step 60 | Step 100 |
| --- | --- | --- | --- | --- | --- | --- |
| GRPO (no library) | 301.3 | 274.1 | 296.7 | – | – | – |
| SkillRL | 368.1 | 319.0 | 326.6 | 60 | 71 | 83 |
| Skill1 | 386.6 | 444.3 | 493.8 | 915 | 3,899 | 5,000 |
| w/o Distill. Step | 508.8 | 750.1 | 738.4 | 2,212 | 5,000 | 5,000 |

  • Skill1 adds moderate overhead (1.3-1.7x slower than GRPO) due to the growing library context.
  • Distillation is crucial for controlling cost; without it, raw trajectories inflate the library, making selection slower (69% slower by step 60).

Theoretical and Practical Implications

  • Theoretical: Skill1 provides a principled framework for unified credit assignment in multi-stage, skill-based RL. It demonstrates how a single objective signal can be decomposed to guide optimization across different temporal scopes (episodic, cross-episodic, library-level).
  • Practical: The framework enables the development of more sample-efficient and capable LLM agents that can autonomously build and leverage a knowledge base of reusable strategies. It reduces the need for manually engineering separate reward functions or training pipelines for different agent capabilities.
  • Empirical: The results establish a new state-of-the-art for skill-augmented agents on benchmark environments and provide clear evidence that co-evolving all stages of the skill lifecycle is superior to optimizing them in isolation.

Conclusion

Skill1 presents a framework for the unified evolution of skill-augmented agents by training a single policy to co-evolve skill selection, utilization, and distillation toward a shared task-outcome objective. By decomposing the reward signal into low-frequency trend and high-frequency variation components, it achieves targeted credit assignment without auxiliary rewards. Experiments confirm significant performance gains, coupled evolution of capabilities, and the necessity of each credit signal.

Limitations:

  • Environment Coverage: Evaluation is limited to two text-based environments; generalization to visual or deeper search environments is unexplored.
  • Scalability: The fixed-size skill library (5,000 entries) may become a bottleneck for highly diverse tasks, requiring more sophisticated eviction or hierarchical organization strategies.
  • Safety & Oversight: Autonomous skill accumulation proceeds without human review, so incorrect or undesirable strategies admitted to the library could be reused and reinforced.