SkillOpt: Executive Strategy for Self-Evolving Agent Skills - Summary
Summary (Overview)
- Core Innovation: SkillOpt is the first systematic, controllable text-space optimizer for agent skills, treating the skill document as an external, trainable state for a frozen agent, analogous to weight-space optimization in deep learning.
- Key Mechanism: A separate optimizer model converts scored execution rollouts into bounded add/delete/replace edits on a skill document. Edits are accepted only if they strictly improve a held-out validation score, ensuring stable, controlled optimization.
- Empirical Dominance: Evaluated across 6 benchmarks, 7 target models, and 3 execution harnesses, SkillOpt is the best or tied-best method on all 52 evaluated (model, benchmark, harness) cells, significantly outperforming all baselines.
- Strong Performance Gains: On GPT-5.5, it raises average accuracy over the no-skill baseline by 23.5 points in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.
- Transferable Artifacts: Optimized skills retain value when transferred across model scales, between different execution environments (Codex/Claude Code), and to nearby math benchmarks without further optimization.
Introduction and Theoretical Foundation
Agent skills are natural-language artifacts that package procedures, domain heuristics, and tool policies, allowing a frozen agent to adapt through external text. Current methods for skill creation—hand-crafting, one-shot generation, or loosely controlled self-revision—lack the discipline and reliability of a deep-learning optimizer.
The paper argues that skills should be trained as the external state of a frozen agent, applying optimization principles (batches, learning rates, validation) to the text space. The core analogy is operational:
- Skill document ↔ Model parameters
- Trajectory-derived edit direction ↔ Gradient
- Edit budget ↔ Learning rate
- Held-out selection gate ↔ Validation check
SkillOpt formalizes this by introducing a harness-agnostic optimizer with rollout batches, reflection minibatches, structured edits, textual learning rates, validation gating, rejected-edit buffers, and epoch-wise slow/meta updates. The output is a compact, reusable best_skill.md file that adds zero inference-time cost.
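To make the edit-budget analogy concrete, here is a minimal Python sketch of bounded add/delete/replace edits on a skill document. The `Edit` type, its `priority` field, and `apply_edits` are hypothetical names chosen for illustration. The summary does not say how edits are represented or ranked, so treating the skill as a list of lines and ranking by a single priority score are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Edit:
    """One structured edit on a skill document (hypothetical representation)."""
    op: str                       # "add", "delete", or "replace"
    target: Optional[str] = None  # existing line to delete or replace
    text: Optional[str] = None    # new line for add or replace
    priority: float = 0.0         # assumed ranking score from the optimizer


def apply_edits(skill_lines: List[str], edits: List[Edit], budget: int) -> List[str]:
    """Apply at most `budget` edits, highest priority first."""
    ranked = sorted(edits, key=lambda e: e.priority, reverse=True)
    lines = list(skill_lines)  # copy, so the caller's skill is not mutated
    for edit in ranked[:budget]:  # clipping plays the role of a learning rate
        if edit.op == "add" and edit.text is not None:
            lines.append(edit.text)
        elif edit.op == "delete" and edit.target in lines:
            lines.remove(edit.target)
        elif edit.op == "replace" and edit.target in lines and edit.text is not None:
            lines[lines.index(edit.target)] = edit.text
    return lines


skill = ["Read the task carefully.", "Answer quickly."]
proposed = [
    Edit("replace", target="Answer quickly.",
         text="Verify the answer before replying.", priority=0.9),
    Edit("add", text="Cite the source document.", priority=0.5),
    Edit("delete", target="Read the task carefully.", priority=0.1),
]
# With a budget of 2, the low-priority delete is clipped and never applied.
updated = apply_edits(skill, proposed, budget=2)
```

A small budget keeps each step close to the previous skill, which is the sense in which the edit budget behaves like a learning rate.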
Methodology
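The problem is defined with a frozen target model $M$, a harness $H$, a task $x$, and a skill $s$. Execution produces a trajectory $\tau$ and a scalar score $r$:

$$(\tau, r) = H(M, x, s)$$

Given splits $D_{\text{train}}$, $D_{\text{sel}}$, and $D_{\text{test}}$, the goal is to find the best skill on the selection split and report final performance on the test split:

$$s^{*} = \arg\max_{s} \frac{1}{|D_{\text{sel}}|} \sum_{x \in D_{\text{sel}}} r(x, s), \qquad \text{Test}(s^{*}) = \frac{1}{|D_{\text{test}}|} \sum_{x \in D_{\text{test}}} r(x, s^{*})$$

where $r(x, s)$ is the score component of $H(M, x, s)$.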
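The select-then-report protocol can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's code: `RunFn`, `mean_score`, `select_and_report`, and the toy scoring rule are all hypothetical, and averaging per-task scores is assumed as the aggregate.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: run(task, skill) -> scalar score for one rollout.
RunFn = Callable[[str, str], float]


def mean_score(skill: str, tasks: Sequence[str], run: RunFn) -> float:
    """Average score of one skill over a split (assumed aggregate)."""
    return mean(run(task, skill) for task in tasks)


def select_and_report(
    candidates: List[str],
    selection_tasks: Sequence[str],
    test_tasks: Sequence[str],
    run: RunFn,
) -> Tuple[str, float]:
    """Pick the best skill on the selection split, then score it once on test."""
    best = max(candidates, key=lambda s: mean_score(s, selection_tasks, run))
    return best, mean_score(best, test_tasks, run)


def toy_run(task: str, skill: str) -> float:
    # Stand-in for a real rollout: succeeds if the skill mentions the task keyword.
    return 1.0 if task in skill else 0.0


candidates = ["", "check units", "check units and formulas"]
best, test_score = select_and_report(
    candidates,
    selection_tasks=["units", "formulas"],
    test_tasks=["formulas", "dates"],
    run=toy_run,
)
```

Keeping the test split out of the `max` call is the point of the design: the test score is computed only for the skill that the selection split already chose.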
The SkillOpt pipeline consists of several key components:
- Forward Pass (Rollout Evidence): The target model executes a rollout batch from the training split with the current skill, generating trajectories.
- Backward Pass (Minibatch Reflection): The optimizer model analyzes successes and failures in minibatches, proposing structured add/delete/replace edits.
- Bounded Text Updates: An edit budget (the textual learning rate) limits the number of edits applied per step. Edits are ranked and clipped.
- Validation Gate & Rejected-Edit Buffer: Every candidate skill is evaluated on the selection split. It is accepted only if it strictly improves the selection score. Rejected edits become negative feedback for future updates.
- Epoch-Wise Slow/Meta Update: At epoch boundaries, the optimizer compares performance across epochs and writes longitudinal guidance into a protected section of the skill, capturing durable lessons.
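The validation gate and rejected-edit buffer from the steps above can be sketched as a short loop. This is a simplified illustration, not the authors' implementation: rollouts and reflection are folded into a hypothetical `propose` callback, the epoch-wise slow/meta update is omitted, and every name here (`optimize_skill`, `toy_propose`, `toy_selection_score`) is an assumption.

```python
from typing import Callable, List, Tuple

Skill = List[str]


def optimize_skill(
    skill: Skill,
    propose: Callable[[Skill, List[str]], Tuple[str, Skill]],
    selection_score: Callable[[Skill], float],
    steps: int,
) -> Tuple[Skill, float, List[str]]:
    """Accept a candidate only if it strictly improves the selection score."""
    best_skill, best_score = list(skill), selection_score(skill)
    rejected: List[str] = []  # fed back to the proposer as negative feedback
    for _ in range(steps):
        description, candidate = propose(best_skill, rejected)
        score = selection_score(candidate)
        if score > best_score:  # strict improvement gate
            best_skill, best_score = candidate, score
        else:
            rejected.append(description)
    return best_skill, best_score, rejected


RULES = ["guess when unsure", "inspect the workbook first", "write static values"]
USEFUL = {"inspect the workbook first", "write static values"}


def toy_propose(skill: Skill, rejected: List[str]) -> Tuple[str, Skill]:
    # Stand-in for the optimizer model: suggest the first rule not yet tried.
    for rule in RULES:
        if rule not in skill and rule not in rejected:
            return rule, skill + [rule]
    return "no-op", list(skill)


def toy_selection_score(skill: Skill) -> float:
    # Stand-in for held-out evaluation: useful rules help, others hurt.
    return sum(1.0 if rule in USEFUL else -1.0 for rule in skill)


final_skill, final_score, rejected = optimize_skill(
    [], toy_propose, toy_selection_score, steps=4
)
```

In this toy run the harmful rule is rejected once and never proposed again, because the proposer sees the rejected buffer. The final no-op step is also rejected, since a tie does not pass a strict gate.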
The method is harness-agnostic through a lightweight adapter interface.
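One plausible shape for such an adapter is a single-method interface that hides how a task is executed. The sketch below is an assumption, since the summary does not describe the actual interface: `HarnessAdapter`, `Rollout`, `EchoChatAdapter`, and `collect_rollouts` are hypothetical names, and the toy adapter calls no real model.

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Rollout:
    """Evidence from one execution: the trajectory and its scalar score."""
    trajectory: List[str]
    score: float


class HarnessAdapter(Protocol):
    """Hypothetical interface: anything that can run a task with a skill."""

    def run(self, task: str, skill: str) -> Rollout: ...


class EchoChatAdapter:
    """Toy adapter standing in for a direct-chat harness (no model call)."""

    def run(self, task: str, skill: str) -> Rollout:
        prompt = f"{skill}\n\nTask: {task}"
        # Placeholder scoring: a real adapter would grade the model's output.
        return Rollout(trajectory=[prompt], score=1.0 if skill else 0.0)


def collect_rollouts(adapter: HarnessAdapter, tasks: List[str], skill: str) -> List[Rollout]:
    """The optimizer side only sees Rollout objects, never harness details."""
    return [adapter.run(task, skill) for task in tasks]


rollouts = collect_rollouts(EchoChatAdapter(), ["t1", "t2"], "Check units.")
```

Under this design, supporting another harness means writing one more class with a `run` method, while the optimization loop stays unchanged.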
Empirical Validation / Results
The evaluation answers four questions across six benchmarks (SearchQA, SpreadsheetBench, OfficeQA, DocVQA, LiveMathematicianBench, ALFWorld), seven target models (GPT-5.5, 5.4, 5.4-mini, 5.4-nano, 5.2, Qwen3.5-4B, Qwen3.6-35B-A3B), and three execution modes (direct chat, Codex harness, Claude Code harness).
Main Results
Table 1 presents the comprehensive results. Key findings:
- SkillOpt is best or tied-best on all 52 evaluated cells.
- On GPT-5.5 direct chat, it achieves an average gain of +23.5 points over no skill, lifting scores from 58.8 to 82.3.
- It outperforms an oracle that picks the best per-cell baseline (human, LLM, Trace2Skill, TextGrad, GEPA, EvoSkill) by +5.4 points on average.
- Gains are substantial across model scales, with smaller models (e.g., GPT-5.4-nano) showing the largest relative improvements (e.g., ALFWorld 34.3 → 69.4).
- The method is equally effective inside tool-backed execution loops (Codex, Claude Code).
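The headline figures above can be checked with simple arithmetic. The helper names below are illustrative; the inputs are the scores quoted in this summary, and the relative gain is derived from them rather than reported by the paper.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute gain in accuracy points, rounded to one decimal."""
    return round(after - before, 1)


def relative_gain_pct(before: float, after: float) -> float:
    """Gain as a percentage of the starting score, rounded to one decimal."""
    return round(100 * (after - before) / before, 1)


gpt55_chat_gain = point_gain(58.8, 82.3)      # 23.5, matching the reported +23.5
alfworld_gain = point_gain(34.3, 69.4)        # 35.1 points
alfworld_rel = relative_gain_pct(34.3, 69.4)  # 102.3, i.e. the score roughly doubles
```

The ALFWorld example shows why relative improvement is the more telling measure for small models: a 35.1-point gain from a 34.3 starting score is about a doubling.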
Ablations and Analysis
Tables 2 and 3 and Figure 3 analyze design choices:
- Gains are relatively insensitive to exact batch sizes but require sufficient evidence.
- Bounded textual learning rates are crucial. The "without lr" ablation performs worse.
- The validation gate, rejected-edit buffer, and epoch-wise slow/meta update are critical stabilizers. Removing the slow/meta update lowered SpreadsheetBench performance by 22.5 points.
- Performance evolves stably across epochs, with validation selection aligning with test generalization.
Transfer Experiments
Table 4 demonstrates the portability of optimized skill artifacts:
- Cross-Model Transfer: Skills trained on a source model (e.g., GPT-5.4) provide positive gains when deployed on smaller target models (e.g., GPT-5.4-mini and GPT-5.4-nano).
- Cross-Harness Transfer: A SpreadsheetBench skill trained in Codex transfers to Claude Code with a +59.7 point gain.
- Cross-Benchmark Transfer: A skill trained on OlympiadBench provides positive gains on the related Omni-MATH benchmark.
Optimizer Strength and Skill Characteristics
Table 5 shows that a stronger frontier optimizer (GPT-5.5) yields larger gains, but a target-matched optimizer still recovers 56–74% of the gain, confirming the loop's intrinsic value.
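One natural reading of "recovers 56–74% of the gain" is the ratio of the target-matched optimizer's gain to the frontier optimizer's gain, both measured against the no-skill baseline. The sketch below assumes that definition, which the summary does not spell out. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def gain_recovered(no_skill: float, matched: float, frontier: float) -> float:
    """Fraction of the frontier-optimizer gain kept by a target-matched optimizer."""
    frontier_gain = frontier - no_skill
    if frontier_gain <= 0:
        raise ValueError("frontier optimizer must improve on the no-skill baseline")
    return (matched - no_skill) / frontier_gain


# Hypothetical scores for illustration only: 50 -> 62 (matched) vs 50 -> 70 (frontier).
fraction = gain_recovered(no_skill=50.0, matched=62.0, frontier=70.0)  # 0.6
```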
Table 6 and Figure 4 characterize the learned artifacts:
- Compact: Final skills range from 379 to 1,995 tokens.
- Edit-Economical: Gains come from only 1–4 accepted edits per benchmark.
- Procedural & Inspectable: Learned rules encode generalizable procedures (e.g., "Inspect workbook structure and formulas, then write evaluated static values...").
Theoretical and Practical Implications
- Skill as a Trainable State: The paper establishes a new paradigm where the skill document itself is the primary object of optimization, enabling controlled, reproducible adaptation for frozen agents.
- Harness-Agnostic Optimization: By separating the optimizer from the execution harness, SkillOpt provides a general-purpose adaptation layer applicable across diverse agent environments.
- Cost-Effective Deployment: The one-time training cost produces a compact, static artifact (best_skill.md) that adds zero overhead at deployment and can be audited, edited, and reused.
- Empirical Superiority: The comprehensive results demonstrate that systematic, validation-gated text-space optimization outperforms existing methods for skill/prompt improvement, especially on procedural tasks.
Conclusion
SkillOpt introduces a controlled text-space optimization loop for agent skills, applying deep-learning-style discipline (batches, learning rates, validation) to a natural-language artifact. It delivers state-of-the-art performance across a broad range of models, benchmarks, and execution environments. The resulting skills are compact, interpretable, and transferable, positioning them as a practical, weight-free domain adaptation layer for frontier agents.
Future directions include extending to skill libraries, preference-driven gates for open-ended tasks, and self-distillation of skills back into model weights.