Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Summary (Overview)

  • Holistic Skill Evolution: Introduces Trace2Skill, a framework that distills lessons from a diverse pool of agent execution trajectories into a single, comprehensive, and declarative skill document. This mirrors human expert methodology by analyzing broad experience before consolidation, contrasting with sequential online updates.
  • Parallel, Conflict-Free Consolidation: Employs a parallel fleet of sub-agents to analyze trajectories and propose skill patches. These patches are then hierarchically merged via inductive reasoning into a unified skill update, ensuring conflict resolution and generalization beyond specific trajectories.
  • Transferable and Generalizable Skills: Demonstrates that skills evolved from one model's trajectories significantly improve performance across different LLM scales (e.g., Qwen3.5-35B to Qwen3.5-122B) and out-of-distribution (OOD) task domains (e.g., from spreadsheet editing to Wikipedia table QA).
  • Effective Across Domains: Validates the framework's effectiveness in challenging domains including spreadsheet manipulation, mathematical reasoning, and visual question answering, showing consistent performance gains.
  • Open-Source and Efficient: Shows robust skill evolution using open-source models as small as 35B parameters, without requiring parameter updates or external retrieval modules at inference time.

Introduction and Theoretical Foundation

Equipping LLM agents with domain-specific skills is crucial for complex tasks, but manual skill authoring creates a scalability bottleneck. Automated skill generation often yields poor results, either relying on shallow parametric knowledge or sequentially overfitting to lessons from individual trajectories, leading to fragmented skill collections.

Existing online skill evolution paradigms diverge from human expert methodology in two key ways: 1) Skill Fragmentation vs. Consolidation: They often create new, narrow skills per lesson, whereas humans craft a single comprehensive guide per domain. 2) Sequential vs. Holistic Updates: Skills are updated reactively per incoming trajectory, whereas humans build broad domain understanding before authoring.

Motivated by these observations, Trace2Skill is designed to simulate the human, holistic approach. Instead of sequential updates, it analyzes a wide range of trajectory-local lessons in parallel and distills common patterns into a single, comprehensive agent skill. The core hypothesis is that inductive reasoning over a broad set of experience can extract generalizable Standard Operating Procedures (SOPs) that transfer across models and tasks, challenging the assumption that experience is inherently model- and task-specific.

Methodology

Trace2Skill operates via a three-stage pipeline. The skill evolution problem is formalized as follows.

2.1. Skill and Problem Formalization

A skill $S$ is a structured knowledge directory:

$$S = (M, R), \quad R = \{\text{scripts},\ \text{references},\ \text{assets}\}$$

where $M$ (e.g., SKILL.md) encodes procedural knowledge in natural language, and $R$ provides auxiliary resources.
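As a concrete illustration, the $(M, R)$ pair maps naturally onto a small data structure. The sketch below is a hypothetical rendering of that layout, not code from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    """A skill S = (M, R): a natural-language manual plus auxiliary resources."""
    manual: str                                  # M, e.g. the contents of SKILL.md
    scripts: dict = field(default_factory=dict)      # R: helper scripts by filename
    references: dict = field(default_factory=dict)   # R: supplementary reference docs
    assets: dict = field(default_factory=dict)       # R: other auxiliary files

# An initial skill S0 with only procedural text and no resources yet.
s0 = Skill(manual="# Spreadsheet Skill\nAlways verify writes by reading back.")
```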

Let $\pi_\theta$ denote an LLM-based agent with fixed parameters $\theta$, equipped with a skill $S$. The success rate on a task set $D$ is:

$$P(S; \pi_\theta, D) = \frac{1}{|D|} \sum_{t \in D} \mathbb{1}[\pi_\theta(t; S) = y^*_t]$$

where $y^*_t$ is the ground-truth answer. The objective is to construct an improved skill $S^*$ from trajectories on an evolving set $D_{evolve}$, without updating $\theta$, such that:

$$S^* = \mathcal{E}(S_0, D_{evolve}; \pi_\theta), \quad P(S^*; \pi_\theta, D_{test}) > P(S_0; \pi_\theta, D_{test})$$

Two initializations for $S_0$ are studied: a human-expert-written skill (Deepening mode) and an LLM-generated draft from parametric knowledge alone (Creation mode).
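The success rate $P$ is simply an empirical mean of correctness indicators. A minimal sketch, where the agent and tasks are toy stand-ins rather than anything from the paper:

```python
def success_rate(agent, skill, tasks):
    """P(S; pi_theta, D): fraction of tasks the skill-equipped agent answers correctly."""
    hits = sum(1 for task, gold in tasks if agent(task, skill) == gold)
    return hits / len(tasks)

# Toy agent: succeeds only when the skill text mentions the task's topic.
toy_agent = lambda task, skill: "ok" if task in skill else "fail"
tasks = [("sum", "ok"), ("sort", "ok"), ("pivot", "ok")]

p = success_rate(toy_agent, "covers sum and sort", tasks)  # 2 of 3 tasks solved
```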

2.2. Stage 1: Trajectory Generation

An agent $\pi_\theta$ (using a ReAct harness) runs in parallel on each task $t_i \in D_{evolve}$ with an initial skill $S_0$, producing a trajectory $\tau_i$:

$$\tau_i = \pi_\theta(q_i; S_0) = \langle q_i, (r^{(i)}_1, a^{(i)}_1, o^{(i)}_1), \ldots, (r^{(i)}_{T_i}, a^{(i)}_{T_i}, o^{(i)}_{T_i}), y_i \rangle$$

where $r^{(i)}_k$ is a reasoning step, $a^{(i)}_k$ is a tool call, $o^{(i)}_k$ is an observation, and $y_i \in \{0,1\}$ is correctness. The corpus $T = \{\tau_1, \ldots, \tau_N\}$ is partitioned into failures $T^-$ and successes $T^+$.
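Partitioning the corpus by the correctness flag $y_i$ is mechanically simple. A hedged sketch with a minimal trajectory record (field names are illustrative, not the paper's):

```python
from typing import NamedTuple

class Trajectory(NamedTuple):
    query: str
    steps: list    # (reasoning, action, observation) triples from the ReAct loop
    correct: bool  # y_i: whether the final answer matched the ground truth

def partition(corpus):
    """Split the trajectory corpus T into failures T- and successes T+."""
    failures  = [t for t in corpus if not t.correct]
    successes = [t for t in corpus if t.correct]
    return failures, successes

corpus = [Trajectory("q1", [], True), Trajectory("q2", [], False)]
t_minus, t_plus = partition(corpus)
```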

2.3. Stage 2: Parallel Multi-Agent Patch Proposal

A fleet of specialized analyst sub-agents is dispatched concurrently; each independently proposes edits (skill patches) based on a single trajectory.

  • Success Analyst ($\mathcal{A}^+$): A single-pass workflow that identifies generalizable behavior patterns in successful trajectories.
  • Error Analyst ($\mathcal{A}^-$): A ReAct-style multi-turn agentic loop that iteratively diagnoses root causes of failures by inspecting traces and artifacts, proposing a patch only after verifying its causal analysis.

Each analyst outputs a skill patch $p_i$:

$$p_i = \begin{cases} \mathcal{A}^-(S_0, \tau_i), & \tau_i \in T^- \\ \mathcal{A}^+(S_0, \tau_i), & \tau_i \in T^+ \end{cases}$$

All analysts operate on a frozen copy of $S_0$ with no inter-agent visibility, preserving diversity.

2.4. Stage 3: Conflict-Free Patch Consolidation

The full patch pool $P = P^- \cup P^+$ is consolidated into a single coherent update $p^*$ via hierarchical merging with programmatic conflict prevention. Patches are merged in $L = \lceil \log_{B_{merge}} |P| \rceil$ levels. At each level $\ell$, groups of up to $B_{merge}$ patches are synthesized:

$$p^{(\ell+1)} = \mathcal{M}\left(\pi_\theta, S_0, \{p^{(\ell)}_1, \ldots, p^{(\ell)}_{B_{merge}}\}\right)$$
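The level-by-level structure of this merge can be sketched as follows. Here `merge_group` is a hypothetical placeholder for the LLM-backed operator $\mathcal{M}$; string concatenation stands in for the actual synthesis:

```python
def merge_group(patches):
    """Stand-in for M: in the framework this is an LLM call that deduplicates,
    resolves conflicts, and keeps edits that recur across independent patches."""
    return "+".join(patches)  # placeholder, not real synthesis

def consolidate(patch_pool, b_merge=4):
    """Reduce |P| patches to one update p* in ceil(log_b |P|) merge levels."""
    level = list(patch_pool)
    while len(level) > 1:
        # Group up to b_merge patches and synthesize each group into one patch.
        level = [merge_group(level[i:i + b_merge])
                 for i in range(0, len(level), b_merge)]
    return level[0]

pool = [f"p{i}" for i in range(9)]
p_star = consolidate(pool, b_merge=4)  # 9 patches -> 3 -> 1, i.e. 2 levels
```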

The merge operator $\mathcal{M}$ deduplicates, resolves conflicts, and preserves unique insights. Crucially, it performs **inductive reasoning**: edits that appear consistently across independent patches are treated as more likely to generalize, while idiosyncratic ones are discarded. The final $p^*$ is applied programmatically to $S_0$ to produce the evolved skill $S^*$.

Empirical Validation / Results

Experiments were conducted in spreadsheet, math reasoning, and visual question answering domains using Qwen3.5 models (122B and 35B parameters).

3.2. Main Results (Spreadsheet)

The primary evaluation uses SpreadsheetBench-Verified (in-distribution) and WikiTableQuestions (OOD). Skills are evaluated as deltas against baselines: Deepening vs. Human-Written, Creation vs. Parametric/No Skill. Key findings are summarized in Table 1.

**Table 1: Main results shown as deltas (pp). Skill Author = model that evolved the skill; Skill User = model at inference.**

| Condition | Skill User: 122B (SprBench Vrf / OOD WikiTQ) | Skill User: 35B (SprBench Vrf / OOD WikiTQ) | Avg ↑ |
| :--- | :--- | :--- | :--- |
| **Reference (absolute scores)** | | | |
| No Skill | 27.67 / 21.50 | 19.00 / 13.33 | 18.35 |
| Human-Written | 48.33 / 74.68 | 9.67 / 9.02 | 31.57 |
| Parametric | 26.17 / 23.73 | 20.17 / 20.14 | 20.80 |
| **Skill Author: 122B (Deepening)** | | | |
| +Error | **+17.50** / +1.62 | **+27.00** / +9.26 | +9.18 |
| +Combined | **+21.50** / +4.56 | **+21.16** / +6.64 | +9.19 |
| **Skill Author: 122B (Creation)** | | | |
| +Error | **+22.83** / +7.89 | +8.66 / +2.06 | +7.04 |
| +Success | +15.33 / **+23.70** | **+12.83** / **+30.36** | **+17.62** |
| **Skill Author: 35B (Creation)** | | | |
| +Error | +1.00 / **+57.65** | +3.83 / +12.66 | **+18.26** |

  • **Human-Written skills are strong but not portable**: they help the 122B model but harm the 35B model. **Parametric skills are weak**, scoring close to No Skill.
  • **Deepening reliably strengthens** human-written skills on in-distribution tasks and shows positive cross-model transfer.
  • **Creation substantially outperforms** the weak parametric baseline. In some settings (e.g., the 35B-authored +Error skill used by 122B on WikiTQ), it even surpasses Human-Written performance (**+57.65 pp delta**, reaching 81.38%).
  • **Skills transfer across model scales and OOD domains.** For example, a skill evolved by Qwen3.5-35B on its own trajectories improved a Qwen3.5-122B agent by up to 57.65 pp on WikiTQ.
  • **+Combined** (using both error and success analysts) is the most consistently strong configuration, while **+Error** is the most reliable signal.

3.3. Math Reasoning

Applying Trace2Skill (Creation +Error) to mathematical reasoning (DAPO-Math, AIME 2026) yields consistent gains across models and benchmarks (Table 2).

**Table 2: Math reasoning results (deltas from No Skill baseline, pp).**

| Condition | Skill User: 122B (D-Test / AIME) | Skill User: 35B (D-Test / AIME) |
| :--- | :--- | :--- |
| No Skill | 92.0 / 90.4 | 89.0 / 83.3 |
| 122B-Authored +Error | **+3.0** / **+2.9** | **+5.0** / **+5.0** |
| 35B-Authored +Error | **+2.0** / +1.3 | **+4.0** / +0.5 |

3.4. Visual Question Answering (VQA)

Applying Trace2Skill to DocVQA presents a more nuanced picture (Table 3). While the 35B model outperforms the 122B model on the task without skills, the **122B model is the superior skill author**. The 122B-authored skill provides large gains for both models, whereas the 35B-authored skill offers negligible or negative gains, suggesting that skill authoring (inductive reasoning) is a distinct capability from task execution.
**Table 3: DocVQA results (deltas from No Skill baseline).**

| Condition | Skill User: 122B (ANLS / Acc) | Skill User: 35B (ANLS / Acc) |
| :--- | :--- | :--- |
| No Skill | 0.6424 / 71.2 | 0.6843 / 75.2 |
| 122B-Authored +Error | **+0.1639** / **+15.3** | **+0.1554** / **+13.6** |
| 35B-Authored +Error | +0.0093 / +0.9 | **-0.0620** / **-6.2** |

Theoretical and Practical Implications

  • **Paradigm Shift in Experience Utilization:** Agent experience can be effectively distilled into **transferable, declarative skills** rather than managed through task-specific episodic retrieval. This challenges the common assumption that experience is inherently non-generalizable.
  • **Efficiency and Scalability:** Parallel, holistic consolidation is computationally more efficient than sequential online updates (3 min vs. 60 min for 70 trajectories) and avoids the sequential drift problem.
  • **Superiority over Retrieval-Based Methods:** A single portable skill folder outperforms retrieval-based experience banks (ReasoningBank) by large margins (e.g., +13.8 pp on SprBench-Vrf for 122B), since it is not sensitive to query-similarity mismatches and integrates knowledge directly into the system prompt.
  • **Quality of Analysis Matters:** Agentic error analysis (with iterative diagnosis and validation) produces more transferable patches than single-call analysis, winning in average performance across settings.
  • **Accessibility:** Robust skill evolution is achievable with open-source models as small as 35B parameters, removing dependency on proprietary LLMs.

**Generalizable SOPs Learned:** Inspection of evolved skills reveals prevalent, transferable patterns, such as:

1. **Formula recalculation and write-back verification** (the most common error mode).
2. **Tool selection: `openpyxl` over `pandas.to_excel()`** to preserve formulas.
3. **Explicit read-back verification** after writing.
4. **Structural-edit safety** (e.g., delete rows in descending order).
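The structural-edit-safety SOP generalizes beyond spreadsheets: deleting by index in ascending order shifts the positions of later targets, while descending order leaves them stable. A minimal stdlib illustration of the pattern (the same logic applies when calling `openpyxl`'s `ws.delete_rows` repeatedly):

```python
# Rows of a toy table; indices 1 and 3 ("r1" and "r3") are marked for deletion.
rows = ["header", "r1", "r2", "r3", "r4"]
to_delete = [1, 3]

# Deleting in ascending order would remove "r1" first, shifting "r3" to
# index 2, so index 3 would then wrongly delete "r4". Descending is safe:
for i in sorted(to_delete, reverse=True):
    del rows[i]

print(rows)  # ['header', 'r2', 'r4']
```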
Niche quirks are automatically routed to supplementary reference files, mirroring the hierarchy of human skill design.

Conclusion

Trace2Skill introduces a framework for automatic skill creation and adaptation that mirrors human expert methodology by holistically analyzing broad execution experience before distilling it into a concise, declarative artifact. The parallel, inductive consolidation process yields skills that are not only high-quality but also remarkably transferable across LLM scales and out-of-distribution tasks. The findings demonstrate that complex agent experience can be packaged into portable skills requiring no parameter updates and no external retrieval, using accessible open-source models. This offers a scalable path to equipping LLM agents with specialized, robust capabilities.

**Limitations and Future Work:**

1. **Causal effect quantification of editing patches:** Isolating the marginal contribution of individual patches is difficult under holistic consolidation.
2. **Tracing utility of specific skill sections:** Future work will focus on fine-grained attribution to determine the exact utility of different skill components (e.g., specific checklist items vs. reference files), enabling automated pruning.