SkillOpt: Executive Strategy for Self-Evolving Agent Skills - Summary

Summary (Overview)

Core Innovation: SkillOpt is the first systematic, controllable text-space optimizer for agent skills, treating the skill document as an external, trainable state for a frozen agent, analogous to weight-space optimization in deep learning.
Key Mechanism: A separate optimizer model converts scored execution rollouts into bounded add/delete/replace edits on a skill document. Edits are accepted only if they strictly improve a held-out validation score, ensuring stable, controlled optimization.
Empirical Dominance: Evaluated across 6 benchmarks, 7 target models, and 3 execution harnesses, SkillOpt is the best or tied-best method on all 52 evaluated (model, benchmark, harness) cells, significantly outperforming all baselines.
Strong Performance Gains: On GPT-5.5, SkillOpt raises average accuracy over the no-skill baseline by 23.5 points in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.
Transferable Artifacts: Optimized skills retain value when transferred across model scales, between different execution environments (Codex/Claude Code), and to nearby math benchmarks without further optimization.

Introduction and Theoretical Foundation

Agent skills are natural-language artifacts that package procedures, domain heuristics, and tool policies, allowing a frozen agent to adapt through external text. Current methods for skill creation (hand-crafting, one-shot generation, or loosely controlled self-revision) lack the discipline and reliability of a deep-learning optimizer.

The paper argues that skills should be trained as the external state of a frozen agent, applying optimization principles (batches, learning rates, validation) to the text space. The core analogy is operational:

Skill document ↔ Model parameters
Trajectory-derived edit direction ↔ Gradient
Edit budget ↔ Learning rate
Held-out selection gate ↔ Validation check

SkillOpt formalizes this by introducing a harness-agnostic optimizer with rollout batches, reflection minibatches, structured edits, textual learning rates, validation gating, rejected-edit buffers, and epoch-wise slow/meta updates. The output is a compact, reusable best_skill.md file that adds zero inference-time cost.

Methodology

The problem is defined over a frozen target model $M$, a harness $h$, a task $x$, and a skill $s$. Execution produces a trajectory $\tau$ and a scalar score $r$:

$$(\tau(s), r(s)) = h(M, x, s), \quad r(s) \in [0, 1]$$

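Given training, selection, and test splits $D_{tr}, D_{sel}, D_{test}$, the goal is to find the skill $s^\star_{sel}$ that scores best on the selection split among the candidates $\mathcal{C}(D_{tr})$ generated from the training split, and then to report its performance on the test split. Inside each sum, $r(s)$ is the score obtained on the task $x$ being summed over: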

$$s^\star_{sel} = \arg \max_{s \in \mathcal{C}(D_{tr})} \frac{1}{|D_{sel}|} \sum_{x \in D_{sel}} r(s)$$

$$\text{Test}(s^\star_{sel}) = \frac{1}{|D_{test}|} \sum_{x \in D_{test}} r(s^\star_{sel})$$
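
As a rough illustration of the two formulas above, the sketch below computes a mean split score and picks the arg-max candidate. The `Scorer` callable stands in for one harness execution $h(M, x, s)$ and is an assumption, as are the function names; this summary does not show the paper's actual implementation.

```python
from statistics import mean
from typing import Callable, Iterable, Sequence

# Hypothetical scorer: runs the harness on (task, skill) and returns r in [0, 1].
Scorer = Callable[[str, str], float]


def mean_score(score: Scorer, tasks: Sequence[str], skill: str) -> float:
    """Average per-task score of one skill over a split."""
    return mean(score(x, skill) for x in tasks)


def select_skill(score: Scorer, candidates: Iterable[str],
                 d_sel: Sequence[str]) -> str:
    """Return the candidate with the highest mean score on the selection split."""
    return max(candidates, key=lambda s: mean_score(score, d_sel, s))
```

Final reporting would then call `mean_score` once more on the test split with the selected skill, so the test tasks never influence which candidate is chosen.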

The SkillOpt pipeline consists of several key components:

Forward Pass (Rollout Evidence): The target model executes a rollout batch from $D_{tr}$ with the current skill, generating trajectories.
Backward Pass (Minibatch Reflection): The optimizer model analyzes successes and failures in minibatches, proposing structured add/delete/replace edits.
Bounded Text Updates: An edit budget $L_t$ (the textual learning rate) limits the number of edits applied per step. Edits are ranked and clipped.
Validation Gate & Rejected-Edit Buffer: Every candidate skill is evaluated on $D_{sel}$. It is accepted only if it strictly improves the selection score; rejected edits become negative feedback for future updates.
Epoch-Wise Slow/Meta Update: At epoch boundaries, the optimizer compares performance across epochs and writes longitudinal guidance into a protected section of the skill, capturing durable lessons.
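
The components above can be sketched as a single gated loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Edit` record, the `propose` callback (which folds the rollout and reflection steps into one hypothetical call), and the `evaluate` callback (mean score on $D_{sel}$) are invented names, and the epoch-wise slow/meta update is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Edit:
    op: str          # "add", "delete", or "replace"
    target: str      # existing line to delete/replace ("" for add)
    text: str        # new line for add/replace ("" for delete)
    priority: float  # hypothetical ranking score assigned by the optimizer


def clip_edits(edits: Sequence[Edit], budget: int) -> List[Edit]:
    """Rank edits by priority and keep at most `budget` of them."""
    return sorted(edits, key=lambda e: e.priority, reverse=True)[:budget]


def apply_edits(skill: Sequence[str], edits: Sequence[Edit]) -> List[str]:
    """Apply edits to a copy of the skill lines; unmatched targets are skipped."""
    lines = list(skill)
    for e in edits:
        if e.op == "add":
            lines.append(e.text)
        elif e.op == "delete" and e.target in lines:
            lines.remove(e.target)
        elif e.op == "replace" and e.target in lines:
            lines[lines.index(e.target)] = e.text
    return lines


def optimize(
    skill: Sequence[str],
    propose: Callable[[List[str], List[Edit]], List[Edit]],
    evaluate: Callable[[List[str]], float],
    steps: int,
    budget: int,
) -> Tuple[List[str], float, List[Edit]]:
    """Gated loop: keep a candidate only if it strictly beats the best score."""
    best, best_score = list(skill), evaluate(list(skill))
    rejected: List[Edit] = []  # negative feedback passed back to `propose`
    for _ in range(steps):
        kept = clip_edits(propose(best, rejected), budget)
        candidate = apply_edits(best, kept)
        score = evaluate(candidate)
        if score > best_score:      # strict improvement on the selection split
            best, best_score = candidate, score
        else:
            rejected.extend(kept)   # remember what did not help
    return best, best_score, rejected
```

Because the comparison is strict, a step that proposes nothing or merely ties the current score leaves the skill unchanged, which is one way to read the stability claim made for the validation gate.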

The method is harness-agnostic through a lightweight adapter interface.
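
The summary does not describe the adapter interface, so the following is only a guess at its shape, assuming an adapter needs to do no more than map a task and a skill to a trajectory and a score, mirroring $h(M, x, s)$. `HarnessAdapter`, `KeywordChatAdapter`, and `rollout` are hypothetical names.

```python
from typing import List, Protocol, Sequence, Tuple


class HarnessAdapter(Protocol):
    """Assumed minimal contract: one task plus one skill in, (trajectory, score) out."""

    def run(self, task: str, skill: str) -> Tuple[str, float]:
        ...


class KeywordChatAdapter:
    """Toy stand-in for a direct-chat harness; it does not call any model."""

    def run(self, task: str, skill: str) -> Tuple[str, float]:
        trajectory = f"skill={skill!r} task={task!r}"
        # Toy scoring rule: success if the skill text mentions the task keyword.
        score = 1.0 if task.lower() in skill.lower() else 0.0
        return trajectory, score


def rollout(adapter: HarnessAdapter, tasks: Sequence[str],
            skill: str) -> List[Tuple[str, float]]:
    """Run every task through the adapter with the same skill."""
    return [adapter.run(x, skill) for x in tasks]
```

Under this reading, supporting another harness such as Codex or Claude Code would mean writing one more class with the same `run` method, while the optimizer code stays unchanged.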

Empirical Validation / Results

The evaluation answers four questions across six benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), seven target models (GPT-5.5, GPT-5.4, GPT-5.4-mini, GPT-5.4-nano, GPT-5.2, Qwen3.5-4B, Qwen3.6-35B-A3B), and three execution modes (direct chat, Codex harness, Claude Code harness).

Main Results

Table 1 presents the comprehensive results. Key findings:

SkillOpt is best or tied-best on all 52 evaluated cells.
On GPT-5.5 direct chat, it achieves an average gain of +23.5 points over no skill, lifting scores from 58.8 to 82.3.
It outperforms an oracle that picks the best per-cell baseline (human, LLM, Trace2Skill, TextGrad, GEPA, EvoSkill) by +5.4 points on average.
Gains are substantial across model scales, with smaller models (e.g., GPT-5.4-nano) showing the largest relative improvements (e.g., ALFWorld 34.3 → 69.4).
The method remains effective inside tool-backed execution loops: on GPT-5.5, the average gain over no skill is +24.8 points in Codex and +19.1 points in Claude Code.
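
To make the oracle comparison concrete, the sketch below computes the average margin over a per-cell oracle, assuming the reported figure is the mean over cells of the method's score minus the best baseline score in that cell. The function name is hypothetical, and the numbers in the usage note are invented for illustration; they are not results from the paper.

```python
from typing import Dict


def oracle_margin(method: Dict[str, float],
                  baselines: Dict[str, Dict[str, float]]) -> float:
    """Mean over cells of (method score - best baseline score in that cell)."""
    gaps = [method[cell] - max(baselines[cell].values()) for cell in method]
    return sum(gaps) / len(gaps)
```

For example, with made-up scores `{"cell_a": 80.0, "cell_b": 60.0}` for the method and baselines `{"cell_a": {"human": 70.0, "llm": 75.0}, "cell_b": {"human": 58.0, "llm": 50.0}}`, the oracle picks 75.0 and 58.0, so the margin is (5.0 + 2.0) / 2 = 3.5 points. The oracle is a deliberately strong comparison because it may choose a different baseline in every cell.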

Ablations and Analysis

Tables 2 and 3 and Figure 3 analyze design choices:

Gains are relatively insensitive to exact batch sizes but require sufficient evidence.
Bounded textual learning rates are crucial: the "without lr" ablation performs worse than the full method.
The validation gate, rejected-edit buffer, and epoch-wise slow/meta update are critical stabilizers. Removing the slow/meta update dropped SpreadsheetBench performance by 22.5 points.
Performance evolves stably across epochs, with validation selection aligning with test generalization.

Transfer Experiments

Table 4 demonstrates the portability of optimized skill artifacts:

Cross-Model Transfer: Skills trained on a source model (e.g., GPT-5.4) provide positive gains when deployed on smaller target models (e.g., GPT-5.4-mini, nano).
Cross-Harness Transfer: A SpreadsheetBench skill trained in Codex transfers to Claude Code with a +59.7 point gain.
Cross-Benchmark Transfer: A skill trained on OlympiadBench provides positive gains on the related Omni-MATH benchmark.

Optimizer Strength and Skill Characteristics

Table 5 shows that a stronger frontier optimizer (GPT-5.5) yields larger gains, but a target-matched optimizer still recovers 56–74% of that gain, which suggests that the optimization loop itself, and not only optimizer strength, contributes to the improvement.

Table 6 and Figure 4 characterize the learned artifacts:

Compact: Final skills range from 379 to 1,995 tokens.
Edit-Economical: Gains come from only 1–4 accepted edits per benchmark.
Procedural & Inspectable: Learned rules encode generalizable procedures (e.g., "Inspect workbook structure and formulas, then write evaluated static values...").

Theoretical and Practical Implications

Skill as a Trainable State: The paper establishes a new paradigm where the skill document itself is the primary object of optimization, enabling controlled, reproducible adaptation for frozen agents.
Harness-Agnostic Optimization: By separating the optimizer from the execution harness, SkillOpt provides a general-purpose adaptation layer applicable across diverse agent environments.
Cost-Effective Deployment: The one-time training cost produces a compact, static artifact (best_skill.md) that adds zero overhead at deployment and can be audited, edited, and reused.
Empirical Superiority: The comprehensive results demonstrate that systematic, validation-gated text-space optimization outperforms existing methods for skill/prompt improvement, especially on procedural tasks.

Conclusion

SkillOpt introduces a controlled text-space optimization loop for agent skills, applying deep-learning-style discipline (batches, learning rates, validation) to a natural-language artifact. It delivers state-of-the-art performance across a broad range of models, benchmarks, and execution environments. The resulting skills are compact, interpretable, and transferable, positioning them as a practical, weight-free domain adaptation layer for frontier agents.

Future directions include extending to skill libraries, preference-driven gates for open-ended tasks, and self-distillation of skills back into model weights.