Summary of "From Context to Skills: Can Language Models Learn from Context Skillfully?"

Summary (Overview)

  • Ctx2Skill is a novel, self-evolving framework that autonomously discovers, refines, and selects natural-language skills from complex contexts without human annotation or external feedback.
  • The core mechanism is a multi-agent self-play loop where a Challenger and a Reasoner co-evolve their respective skill sets through failure-driven textual edits, guided by a neutral Judge.
  • A key innovation is the Cross-Time Replay mechanism, which prevents adversarial collapse by selecting the most generalizable skill set from all iterations, balancing performance on easy and hard probe tasks.
  • Evaluated on the challenging CL-bench for context learning, Ctx2Skill consistently improves task-solving rates across multiple backbone LMs (e.g., lifting GPT-4.1 from 11.1% to 16.5% and GPT-5.1 from 21.1% to 25.8%).
  • The resulting skills are transferable across models and provide interpretable, reusable procedural knowledge that can be plugged into any LM at inference time.

Introduction and Theoretical Foundation

Current language models (LMs) excel at tasks whose knowledge was present during pre-training but struggle with context learning, which requires reasoning over complex, previously unseen contexts (e.g., technical manuals, experimental data). An intuitive paradigm is inference-time skill augmentation—extracting rules and procedures from context into natural-language skills. However, this faces two fundamental challenges in context learning scenarios:

  1. Prohibitive Cost for Manual Skill Annotation: Contexts are long, technically dense, and domain-specific, making human curation economically infeasible.
  2. Lack of External Feedback for Automated Skill Construction: Unlike verifiable tasks (e.g., coding), there is no automatic feedback signal (execution, ground truth) to evaluate whether extracted skills are faithful or useful given only the context.

Existing automated skill construction methods rely on such external feedback or require parameter updates, rendering them inapplicable. Ctx2Skill is designed to overcome both challenges by autonomously discovering skills directly from the context alone via a skill-optimized self-play loop.

Methodology

Problem Formulation

A context learning task consists of a context $C$, a set of tasks $T = \{t_j\}$, and binary rubrics $R_j = \{r_{j,k}\}$. Given an LM $\pi$, a task is solved only if every rubric passes. The solving indicator $y_j(\pi; C)$ for task $t_j$ is defined as:

$$y_j(\pi; C) = \prod_k \mathbb{I}\left[ r_{j,k}(a_j) = \text{pass} \right], \quad a_j \sim \pi(\cdot \mid C, t_j)$$

The goal is to enable the LM $\pi$ to solve tasks over an unseen context $C$ without parameter updates. To this end, a natural-language skill set $S$ is prepended to the system prompt:

$$a_j \sim \pi(\cdot \mid S, C, t_j)$$

During Ctx2Skill's construction, $S$ is instantiated as two separate skill sets: $S_R$ for the Reasoner and $S_C$ for the Challenger. At inference, only $S_R$ is deployed.
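A minimal sketch of this all-or-nothing scoring; `judge_rubric` below is a hypothetical toy stand-in for the paper's LM Judge, not its actual implementation:

```python
def judge_rubric(rubric: str, answer: str) -> bool:
    """Hypothetical per-rubric binary verdict (toy proxy for an LM Judge)."""
    return rubric.lower() in answer.lower()

def solving_indicator(rubrics: list[str], answer: str) -> int:
    """y = prod_k I[r_k(a) = pass]: 1 only if every rubric passes, else 0."""
    return int(all(judge_rubric(r, answer) for r in rubrics))

rubrics = ["cites the manual", "uses SI units"]
print(solving_indicator(rubrics, "Answer cites the manual and uses SI units."))  # 1
print(solving_indicator(rubrics, "Answer cites the manual only."))               # 0
```

Because the product is taken over all rubrics, a single failed rubric zeroes out the task, which matches CL-bench's strict all-or-nothing scoring.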

The Ctx2Skill Framework

The framework runs a multi-agent self-play loop for $N$ iterations. Five frozen-LM agent roles are involved:

  1. Challenger: Given context $C$ and its skill set $S_C^{i-1}$, generates a batch of $M$ tasks $\{t_m\}_{m=1}^{M}$ with corresponding rubrics $R_m$.
  2. Reasoner: Given $C$, a task $t_m$, and its skill set $S_R^{i-1}$, produces an answer $a_m$.
  3. Judge: Evaluates the answer against each rubric, returning a per-rubric binary verdict $z_{m,k} = \mathbb{I}[r_{m,k}(a_m) = \text{pass}]$ and a solving indicator $y_m = \prod_k z_{m,k}$.
  4. Proposer (one per side): Analyzes all routed cases (failed for the Reasoner, solved for the Challenger) against the current skill set and synthesizes a high-level diagnosis specifying an action (add/merge), target skill name, description, and justification.
  5. Generator (one per side): Materializes the Proposer's diagnosis into a complete replacement skill set ($S_R^i$ or $S_C^i$).

At iteration $i$, the Judge partitions the batch into failed cases $F_i = \{m : y_m = 0\}$ and solved cases $P_i = \{m : y_m = 1\}$. Failed cases are routed to the Reasoner side for skill updates, while solved cases are routed to the Challenger side to tighten its probing strategies.
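One iteration of the loop can be sketched as follows; the agent callables on `agents` (`challenger`, `reasoner`, `judge`, `propose`, `generate`) are hypothetical stand-ins for the five frozen-LM roles, not the paper's actual prompts:

```python
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    rubrics: list  # binary rubrics checked by the Judge

def self_play_iteration(context, S_C, S_R, M, agents):
    # 1. Challenger generates a batch of M tasks under its skill set S_C.
    tasks = [agents.challenger(context, S_C) for _ in range(M)]
    failed, solved = [], []
    for task in tasks:
        # 2. Reasoner answers under its skill set S_R.
        answer = agents.reasoner(context, task, S_R)
        # 3. Judge returns a per-rubric binary verdict; solved iff all pass.
        verdicts = [agents.judge(r, answer) for r in task.rubrics]
        (solved if all(verdicts) else failed).append((task, answer, verdicts))
    # 4-5. Failed cases refine the Reasoner's skills; solved cases tighten the
    # Challenger's probing strategies (Proposer diagnosis, then Generator).
    S_R_new = agents.generate(agents.propose(failed, S_R))
    S_C_new = agents.generate(agents.propose(solved, S_C))
    return S_C_new, S_R_new, failed, solved
```

Note that both skill sets are replaced wholesale each iteration (the Generator emits a complete replacement set), which is what makes per-iteration candidates available for Cross-Time Replay later.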

Cross-Time Replay Mechanism

A key risk in the self-play loop is adversarial collapse: the Challenger may generate increasingly extreme tasks while the Reasoner's skills over-specialize, degrading generalization. To address this, Cross-Time Replay selects the most generalizable skill set from the candidates $\{S_R^1, \ldots, S_R^N\}$.

Two probe sets are curated incrementally during self-play:

  • Hard probe set $Q_h$: At each iteration, the failed task with the lowest rubric pass rate is added.
  • Easy probe set $Q_e$: At each iteration, the solved task with the fewest rubrics is added.

After the loop terminates, the Laplace-smoothed solving rates $\rho_h(i)$ and $\rho_e(i)$ are computed for each candidate skill set $S_R^i$:

$$\rho_h(i) = \frac{\sum_{q \in Q_h} y_q(\pi_R; C, S_R^i) + 1}{|Q_h| + 1}, \quad \rho_e(i) = \frac{\sum_{q \in Q_e} y_q(\pi_R; C, S_R^i) + 1}{|Q_e| + 1}$$

The selected skill set maximizes the product of these rates:

$$S_R^{*} = S_R^{i^{*}}, \quad i^{*} = \arg\max_i \left( \rho_h(i) \cdot \rho_e(i) \right)$$

The multiplicative form ensures a balanced skill set that does not sacrifice easy-task performance for hard-task gains.
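A small sketch of the selection rule, assuming a hypothetical `solve(q, S)` callable that returns 1 if the Reasoner solves probe task `q` under skill set `S` and 0 otherwise:

```python
def smoothed_rate(probes, S, solve):
    """Laplace-smoothed solving rate: (sum_q y_q + 1) / (|Q| + 1)."""
    return (sum(solve(q, S) for q in probes) + 1) / (len(probes) + 1)

def cross_time_replay(candidates, Q_h, Q_e, solve):
    """Return the candidate skill set maximizing rho_h(i) * rho_e(i)."""
    scores = [smoothed_rate(Q_h, S, solve) * smoothed_rate(Q_e, S, solve)
              for S in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```

The add-one smoothing keeps both factors strictly positive, so the product never degenerates to zero and still penalizes a candidate that fails an entire probe set.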

Empirical Validation / Results

Evaluation is conducted on CL-bench, a benchmark comprising 500 complex contexts, 1,899 tasks, and 31,607 rubrics across four categories: Domain Knowledge Reasoning, Rule System Application, Procedural Task Execution, and Empirical Discovery & Simulation. Scoring is strict all-or-nothing.

Baselines: Frontier LMs (GPT-4.1, GPT-5.1, etc.), Prompting (single-pass skill generation), and AutoSkill4Doc (a variant of AutoSkill for document contexts).

Main Results

Ctx2Skill consistently and substantially improves solving rates across all backbones and categories.

Table 1: Main results on CL-bench (task-solving rate in %).

| Model | Overall | Domain Knowledge Reasoning | Rule System Application | Procedural Task Execution | Empirical Discovery & Simulation |
|---|---|---|---|---|---|
| Frontier LMs (without skills) | | | | | |
| GPT-5.1 | 21.1 | 22.4 | 21.0 | 22.8 | 13.6 |
| GPT-4.1 | 11.1 | 10.6 | 14.8 | 10.4 | 4.6 |
| GPT-4.1-based Methods | | | | | |
| Prompting | 12.3 (+1.2) | 12.4 (+1.8) | 12.3 (-2.5) | 13.9 (+3.5) | 8.2 (+3.6) |
| AutoSkill4Doc | 13.2 (+2.1) | 13.3 (+2.7) | 13.1 (-1.7) | 15.0 (+4.6) | 8.7 (+4.1) |
| Ctx2Skill | 16.5 (+5.4) | 16.8 (+6.2) | 17.6 (+2.8) | 17.6 (+7.2) | 9.7 (+5.1) |
| GPT-5.1-based Methods | | | | | |
| Prompting | 22.1 (+1.0) | 24.7 (+2.3) | 21.1 (+0.1) | 22.4 (-0.4) | 15.5 (+1.9) |
| AutoSkill4Doc | 22.7 (+1.6) | 25.3 (+2.9) | 21.5 (+0.5) | 23.1 (+0.3) | 16.0 (+2.4) |
| Ctx2Skill | 25.8 (+4.7) | 27.9 (+5.5) | 24.9 (+3.9) | 26.9 (+4.1) | 19.1 (+5.5) |

Table 2: Skill quality evaluation (scores 0-100).

| Model | Conciseness | Faithfulness | Clarity | Effectiveness | Reusability | Avg. |
|---|---|---|---|---|---|---|
| GPT-4.1-based Methods | | | | | | |
| Prompting | 81.2 | 79.7 | 80.0 | 83.3 | 84.7 | 81.8 |
| AutoSkill4Doc | 81.3 | 81.4 | 92.4 | 88.7 | 87.2 | 86.2 |
| Ctx2Skill | 85.2 | 84.8 | 96.2 | 90.5 | 92.5 | 89.8 |

Analysis and Ablations

Ablation studies (Table 3) confirm the importance of each component:

  • Removing the Challenger's skill evolution causes the largest drop, confirming that sustained adversarial pressure is essential.
  • Removing the Cross-Time Replay mechanism is the second most impactful ablation, validating its role in preventing adversarial collapse.
  • Variant testing shows that the multiplicative scoring in Eq. (4) outperforms an additive alternative.
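A toy numeric check of the scoring ablation (the rates below are hypothetical, not taken from the paper): when one candidate is balanced across probe sets and another has collapsed on the easy set, an additive score can prefer the collapsed candidate while the multiplicative score prefers the balanced one.

```python
# Hypothetical probe solving rates for two candidate iterations
# (illustrative numbers only, not results from the paper).
balanced  = {"rho_h": 0.5, "rho_e": 0.60}  # decent on both probe sets
collapsed = {"rho_h": 0.9, "rho_e": 0.25}  # hard-task gains, easy-task collapse

additive       = lambda c: c["rho_h"] + c["rho_e"]   # 1.10 vs 1.15
multiplicative = lambda c: c["rho_h"] * c["rho_e"]   # 0.30 vs 0.225

# Additive scoring would pick the collapsed candidate; the multiplicative
# score prefers the balanced one.
print(additive(collapsed) > additive(balanced))              # True
print(multiplicative(balanced) > multiplicative(collapsed))  # True
```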

Effect of Cross-Time Replay: Performance monotonically decreases from Iter-1 to Iter-5 when using fixed-iteration skills, confirming adversarial collapse. Cross-Time Replay adaptively selects the best iteration per context, outperforming any fixed iteration.

Skill Transferability: Skills generated by stronger models (GPT-5.1) transfer well to weaker models (GPT-4.1), but not vice versa, showing an asymmetry in discovered knowledge.

Case Studies: Qualitative case studies across the four CL-bench categories (Figures 5-8) demonstrate that Ctx2Skill skills lead to more structured, rubric-compliant responses by enforcing explicit constraint verification and procedural adherence.

Theoretical and Practical Implications

  • Theoretical: Ctx2Skill introduces a novel paradigm for unsupervised skill discovery in feedback-free environments. The skill-optimized self-play and Cross-Time Replay mechanisms provide a general framework for co-evolutionary learning without parameter updates.
  • Practical: The framework is scalable and cost-effective. Skills are constructed once per context and amortized over all downstream tasks. The resulting skills are interpretable (natural language), reusable, and transferable, enhancing the context learning capability of any LM without fine-tuning.
  • Broader Impact: Ctx2Skill addresses a critical gap in enabling LMs to reason over complex, domain-specific real-world contexts, with applications in technical support, scientific discovery, legal analysis, and more.

Conclusion

Ctx2Skill is a self-evolving framework that autonomously constructs context-specific skills from complex contexts without human supervision or external feedback. Through a skill-optimized self-play loop and a Cross-Time Replay mechanism, it produces generalizable, high-quality skills that substantially improve LM performance on challenging context learning tasks. The work provides a practical and scalable paradigm for equipping LMs with the ability to learn skillfully from unseen contexts, bridging a key capability gap for real-world applications. Future work may explore extending the framework to verifiable domains and scaling the self-play iterations.