Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Summary (Overview)

Problem: Small Computer-Use Agents (CUAs) are practical for deployment but suffer from a performance gap compared to large models, especially in domain-specific software environments. Naively fine-tuning them on large-scale synthetic data yields marginal improvements.
Core Solution: Introduces LearnWeak, a fully automated, two-stage domain specialization framework. It uses a stronger teacher agent to identify the student's specific weaknesses in a target domain and then generates targeted training data and applies specialized supervision to correct them.
Key Innovations:
1. LearnWeak-GEN: An iterative, annotation-free data generation pipeline that synthesizes new tasks conditioned on a weakness report derived from comparing teacher and student failures.
2. LearnWeak-DPO: An error-aware Direct Preference Optimization (DPO) training objective that distinguishes between planning and execution errors and applies selective, behaviorally precise updates.
Main Results: LearnWeak achieves average success-rate gains of +11.6 percentage points (EvoCUA-8B) and +11.1 percentage points (OpenCUA-7B) across eight OSWorld domains. The specialized small agents even surpass the teacher model in several domains.
Implication: Highlights that efficient domain specialization depends on student-aware targeting of weaknesses, not just scaling data volume. This provides a principled path for closing the performance gap for small, deployable CUAs.

Introduction and Theoretical Foundation

Computer-use agents (CUAs) that interact with GUI environments have advanced rapidly. Two paradigms exist: large proprietary models (e.g., Claude, GPT) and fine-tuned small open models (e.g., EvoCUA, OpenCUA). The latter is compelling for real-world deployment due to cost efficiency, privacy, and edge-device compatibility. However, a substantial performance gap persists, especially in domain-specific software with unique conventions.

Domain specialization—fine-tuning an agent for a single target domain—is a promising approach to close this gap for small CUAs. It improves sample efficiency by focusing on domain-specific patterns and avoids issues like catastrophic forgetting from heterogeneous training. The core challenge lies in the two stages of specialization:

Dataset Generation: Collecting human trajectories is costly. Existing autonomous generation strategies are weakness-agnostic; they do not consider the student's specific deficiencies, leading to inefficient training.
Agent Training: Naive fine-tuning can distort the student's own learned reasoning patterns. Furthermore, failures are heterogeneous (planning vs. execution errors), calling for tailored training objectives.

The paper's foundational insight is that for efficient specialization, the most useful supervision targets the student's actual weaknesses rather than providing broad domain coverage. LearnWeak is built on this principle of student-awareness in both data synthesis and training.

Methodology

LearnWeak decomposes domain specialization into two coupled stages: Weakness-Aware Data Generation (LearnWeak-GEN) and Agent Training via Error-Aware Preference Optimization (LearnWeak-DPO).

1. Weakness-Aware Data Generation (LearnWeak-GEN)

The goal is to autonomously generate a domain-specific dataset $D^d$ starting from a small set of seed queries $Q^d_0$, without human annotation. The process is iterative.

Seed Setup: Initialize with a small set ($K$) of executable environment configurations and seed tasks.
Weakness Discovery (Iteration $i$):
- For each task $q \in Q^d_i$, execute both the teacher policy $\pi_T$ and the (fixed, pre-adaptation) student policy $\pi_S$, yielding trajectories $\tau^T_q$ and $\tau^S_q$.
- Use a verifier $V$ to get success outcomes $(v^T_q, v^S_q)$ and rationales $(r^T_q, r^S_q)$.
- Collect the failure set where the teacher succeeds but the student fails: $F^d_i = \{ q \in Q^d_i \ | \ v^T_q = 1, v^S_q = 0 \}.$
- Summarize the failure rationales into a weakness report $R^d_i$.
Screenshot-Guided Query Generation:
- Construct a representative screenshot set $S^d_i$ from trajectories via clustering and VLM reranking.
- Generate new queries for the next iteration using a generator $G$, conditioned on previous tasks $Q^d_i$, the weakness report $R^d_i$, screenshots $S^d_i$, and domain metadata $M^d$. Two strategies are used:
  - Weakness-focused: $Q^{\text{weak}}_{i+1} = G(Q^d_i, R^d_i, S^d_i, M^d)$
  - Exploration-focused: $Q^{\text{explore}}_{i+1} = G(Q^d_i, \emptyset, S^d_i, M^d)$
- $Q^d_{i+1} = Q^{\text{weak}}_{i+1} \cup Q^{\text{explore}}_{i+1}$.
Iteration & Final Dataset: Repeat for $N$ iterations. Aggregate all failure sets $F^d = \bigcup_{i=0}^{N-1} F^d_i$. The final training dataset, written with an explicit argument $\pi_S$ because the failure set depends on the student being specialized, is: $D^d(\pi_S) = \{ (q, \tau^T_q, \tau^S_q) \ | \ q \in F^d(\pi_S) \}.$
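The iterative loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Rollout`, `learnweak_gen`, `run_teacher`, `run_student`, `summarize`, and `generate` are hypothetical, each rollout is assumed to arrive already scored by the verifier, and the screenshot set and domain metadata are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Rollout:
    """One verified rollout (hypothetical container, not from the paper)."""
    trajectory: list   # opaque sequence of steps
    success: bool      # verifier outcome v_q
    rationale: str     # verifier rationale r_q


def learnweak_gen(
    seed_tasks: List[str],
    run_teacher: Callable[[str], Rollout],
    run_student: Callable[[str], Rollout],
    summarize: Callable[[List[str]], str],
    generate: Callable[[List[str], Optional[str]], List[str]],
    n_iters: int,
) -> List[Tuple[str, list, list]]:
    """Collect (task, teacher trajectory, student trajectory) triples."""
    tasks = list(seed_tasks)
    dataset, seen = [], set()
    for _ in range(n_iters):
        # Failure set F_i: the teacher succeeds and the student fails.
        failures = []
        for q in tasks:
            t, s = run_teacher(q), run_student(q)
            if t.success and not s.success:
                failures.append((q, t, s))
        # Union of failure sets across iterations, without duplicates.
        for q, t, s in failures:
            if q not in seen:
                seen.add(q)
                dataset.append((q, t.trajectory, s.trajectory))
        # Weakness report R_i summarizes the student's failure rationales.
        report = summarize([s.rationale for _, _, s in failures])
        # Weakness-focused queries see the report; exploration queries do not.
        weak = generate(tasks, report)
        explore = generate(tasks, None)
        tasks = list(dict.fromkeys(weak + explore))  # ordered union
    return dataset
```

Passing `None` in place of the report mirrors the $\emptyset$ argument of the exploration-focused strategy; in a fuller version the generator would also receive the screenshots and metadata.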

2. Agent Training for Domain Specialization (LearnWeak-DPO)

The goal is to train a specialized student $\hat{\pi}_{S,d}$ using the generated dataset $D^d$, preserving pretrained capabilities while correcting weaknesses.

Teacher-Replay Preference Construction:
- For each failed task $q \in F^d$, replay the teacher trajectory step-by-step.
- At each step $t$, query the student policy $\pi_S$ with the teacher's context $c^T_t = (q, o^T_t, h^T_t)$ to get a replayed student response $\hat{a}^S_t \sim \pi_S(\cdot | c^T_t)$.
- If the tool executions differ ($e^T_t \neq \hat{e}^S_t$), build a preference tuple: $(c^T_t, a^+_t, a^-_t) = (c^T_t, a^T_t, \hat{a}^S_t)$.
- Aggregate these into a step-level preference dataset $D^d_{\text{pref}}$.
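A minimal sketch of the teacher-replay construction follows, assuming each action is stored as a small record with reasoning, description, tool type, and parameters. The names `Action`, `Step`, and `build_preferences` are hypothetical and chosen only for illustration; the student policy is assumed to be any callable that maps a teacher context to a replayed action.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Action:
    reasoning: str     # r_t
    description: str   # s_t
    tool: str          # f_t, the action type
    params: tuple      # p_t, the action parameters

    @property
    def execution(self) -> tuple:
        """Tool execution e_t = (f_t, p_t)."""
        return (self.tool, self.params)


@dataclass(frozen=True)
class Step:
    context: tuple     # teacher context c_t = (q, o_t, h_t)
    action: Action     # teacher action a_t


def build_preferences(
    teacher_steps: List[Step],
    student_policy: Callable[[tuple], Action],
) -> List[Tuple[tuple, Action, Action]]:
    """Return (context, chosen, rejected) tuples for diverging steps."""
    prefs = []
    for step in teacher_steps:
        # Query the student on the teacher's context, not on its own rollout.
        replayed = student_policy(step.context)
        # Keep the step only when the tool executions differ.
        if replayed.execution != step.action.execution:
            prefs.append((step.context, step.action, replayed))
    return prefs
```

Because the student is always queried on the teacher's context, every rejected action is paired with a chosen action taken from exactly the same state.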
Error-Aware Preference Optimization:
- Failure Type Categorization: Decompose the tool execution as $e_t = (f_t, p_t)$, where $f_t$ is the action type and $p_t$ the action parameters.
  - Planning error ($\epsilon_{\text{PLAN}}$): $f^T_t \neq \hat{f}^S_t$ (wrong action type).
  - Execution error ($\epsilon_{\text{EXEC}}$): $f^T_t = \hat{f}^S_t$ but $p^T_t \neq \hat{p}^S_t$ (wrong action parameters).
- Adaptive Token Masking: Define a binary mask $m(j)$ over token positions $j$ in the action $a_t = (r_t, s_t, e_t)$: $m(j) = \begin{cases} 0 & \text{if } a^{(j)}_t \in r_t \text{ (reasoning)}, \\ g(t) & \text{if } a^{(j)}_t \in s_t \text{ (action description)}, \\ 1 & \text{if } a^{(j)}_t \in e_t \text{ (tool execution)}, \end{cases}$ where $g(t) = 1$ if $\epsilon_t = \epsilon_{\text{PLAN}}$, and $0$ otherwise. This mask ensures updates are focused on the behaviorally relevant span (execution for all errors, plus description for planning errors).
- Masked Action Score: For a context $c$ and action $a_t$, the masked score is: $s_\theta(c, a_t; m) = \sum_{j=1}^{|a_t|} m(j) \log \pi_\theta(a^{(j)}_t \ | \ c, a^{(<j)}_t).$
- DPO Objective: The final optimization loss is: $\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(c_t, a^+_t, a^-_t) \sim D^d_{\text{pref}}} \left[ \log \sigma \left( \beta \left( s_\theta(c_t, a^+_t; m) - s_\theta(c_t, a^-_t; m) - s_{\text{ref}}(c_t, a^+_t; m) + s_{\text{ref}}(c_t, a^-_t; m) \right) \right) \right],$ where $\sigma$ is the sigmoid, $\beta$ is a temperature, and $s_{\text{ref}}$ is the same masked score computed under $\pi_{\text{ref}}$, a frozen reference policy (the base student).
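The error categorization, token mask, masked score, and per-tuple loss above can be sketched in pure Python. This is an illustrative sketch only: the function names and the span labels `"r"`, `"s"`, `"e"` are assumptions, and per-token log-probabilities are supplied as plain lists, whereas real training would compute them with a tensor library and average the loss over a batch.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical span labels for reasoning, action description, tool execution.
REASONING, DESCRIPTION, EXECUTION = "r", "s", "e"


def classify_error(teacher_exec: Tuple, student_exec: Tuple) -> Optional[str]:
    """Compare e = (f, p): differing f is PLAN, differing p only is EXEC."""
    (f_t, p_t), (f_s, p_s) = teacher_exec, student_exec
    if f_t != f_s:
        return "PLAN"
    if p_t != p_s:
        return "EXEC"
    return None  # identical executions produce no preference tuple


def token_mask(spans: Sequence[str], error: str) -> List[float]:
    """m(j): 0 on reasoning, g(t) on description, 1 on tool execution."""
    g = 1.0 if error == "PLAN" else 0.0
    weight = {REASONING: 0.0, DESCRIPTION: g, EXECUTION: 1.0}
    return [weight[span] for span in spans]


def masked_score(logprobs: Sequence[float], mask: Sequence[float]) -> float:
    """s(c, a; m) = sum_j m(j) * log pi(a_j | c, a_<j)."""
    return sum(m * lp for m, lp in zip(mask, logprobs))


def dpo_loss(pol_pos: float, pol_neg: float,
             ref_pos: float, ref_neg: float, beta: float) -> float:
    """Loss for one tuple: -log sigmoid(beta * (policy gap - reference gap))."""
    margin = beta * ((pol_pos - pol_neg) - (ref_pos - ref_neg))
    # -log(sigmoid(x)) equals softplus(-x); written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Since the mask zeroes the reasoning tokens, any difference in how teacher and student phrase their reasoning contributes nothing to the score, so the gradient acts only on the spans that determined the executed action.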
Domain Scalability: Specialization is achieved via modular LoRA adapters. The base student policy $\pi_S$ is frozen, and only a domain-specific LoRA adapter $\Delta_d$ is updated:
$\hat{\pi}_{S,d} = \pi_S \oplus \Delta_d.$
This allows a library of per-domain adapters to be activated at inference time.
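As a rough illustration of the adapter library, the sketch below assumes the standard LoRA form, in which a frozen weight matrix $W$ is combined with a low-rank update $BA$. Whether the paper's $\oplus$ operator is exactly this form is an assumption, and the names `specialize` and `library` are hypothetical. Plain nested lists stand in for tensors.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Adapter = Tuple[Matrix, Matrix]  # (B, A): B is d_out x r, A is r x d_in


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def specialize(base_weight: Matrix, library: Dict[str, Adapter],
               domain: str, scale: float = 1.0) -> Matrix:
    """Return W + scale * (B @ A) for the chosen domain; W is not modified."""
    b, a = library[domain]
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(base_weight, delta)]
```

Keeping one `(B, A)` pair per domain means that switching domains only swaps a small update while the shared base weights stay untouched.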

Empirical Validation / Results

Setup: Evaluation on 8 domains of the OSWorld benchmark (Gimp, Calc, Impress, Writer, OS, Thunderbird, VLC, VSCode). Student models: EvoCUA-8B and OpenCUA-7B. Teacher model: EvoCUA-32B.

Main Results: Domain Specialization Performance

Table 1: Domain specialization results on OSWorld. Each entry reports mean success rate (%). In the original table, yellow and blue shading denote the teacher policy and the student specialized with LearnWeak, respectively; here those rows are labeled "(Teacher)" and "+ Ours".

Model	Gimp	Calc	Impress	Writer	OS	Thunderbird	VLC	VSCode	Avg.
Generalized Models
Kimi K2.6	73.08	80.85	82.19	73.91	79.17	80.00	75.71	91.30	79.53
Claude Sonnet 4.6	69.23	74.47	70.21	86.83	91.67	66.67	81.41	72.73	76.65
CUA Models
EvoCUA-32B (Teacher)	76.29	51.06	52.98	65.22	75.00	60.00	64.65	65.22	63.80
EvoCUA-8B	66.15	28.07	37.66	50.43	60.83	65.33	45.71	51.30	50.69
EvoCUA-8B + Ours	82.05	41.13	50.35	55.07	66.67	73.33	56.86	72.46	62.24
∆	+15.9	+13.1	+12.7	+4.6	+5.8	+8.0	+11.2	+21.2	+11.6
OpenCUA-7B	48.46	11.91	31.49	30.43	40.00	54.67	32.94	51.30	37.65
OpenCUA-7B + Ours	57.69	19.15	36.88	40.58	59.42	66.67	47.06	62.32	48.72
∆	+9.2	+7.2	+5.4	+10.2	+19.4	+12.0	+14.1	+11.0	+11.1

Consistent Gains: LearnWeak improves both small CUAs in every one of the eight domains, with per-domain gains ranging from +4.6 to +21.2 percentage points.
Surpassing the Teacher: The specialized EvoCUA-8B outperforms the 32B teacher in Gimp, Thunderbird, and VSCode, demonstrating that corrective supervision can lead to mastery beyond imitation.
Student-Dependent Gains: The domains with the largest improvement vary per student model, indicating specialization effectiveness is tied to the student's specific adaptation needs.
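The EvoCUA-8B figures in Table 1 can be checked with a few lines of arithmetic. The snippet below copies the teacher, base student, and specialized student rows from the table and recomputes the per-domain gains, the average gain, and the domains where the specialized student exceeds the teacher. The variable names are illustrative only.

```python
DOMAINS = ["Gimp", "Calc", "Impress", "Writer", "OS", "Thunderbird", "VLC", "VSCode"]

# Rows copied from Table 1 (mean success rate, %).
teacher = [76.29, 51.06, 52.98, 65.22, 75.00, 60.00, 64.65, 65.22]  # EvoCUA-32B
base = [66.15, 28.07, 37.66, 50.43, 60.83, 65.33, 45.71, 51.30]     # EvoCUA-8B
ours = [82.05, 41.13, 50.35, 55.07, 66.67, 73.33, 56.86, 72.46]     # EvoCUA-8B + Ours

# Per-domain gain of the specialized student over its base, in percentage points.
gains = {d: o - b for d, o, b in zip(DOMAINS, ours, base)}

# Average gain across the eight domains.
avg_gain = sum(ours) / len(ours) - sum(base) / len(base)

# Domains where the specialized 8B student exceeds the 32B teacher.
beats_teacher = [d for d, o, t in zip(DOMAINS, ours, teacher) if o > t]

for d in DOMAINS:
    print(f"{d:12s} gain {gains[d]:+.2f}")
print(f"average gain {avg_gain:+.2f}; surpasses teacher in {beats_teacher}")
```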

Comparison with Dataset Construction Baselines

Under a matched training budget, LearnWeak-GEN is compared against various data construction pipelines.

Table 2: Comparison with data-construction baselines on four OSWorld domains. Mean success rate (%) under a matched budget.

Method	Calc	Impress	VLC	VSCode	Avg.
Zero-shot (EvoCUA-8B)	28.07	37.66	45.71	51.30	40.69
Existing Data
AgentNet (Full)	34.04	39.01	49.01	69.57	47.91
AgentNet (N-sampled)	32.62	40.43	49.02	63.77	46.46
Minimal Human Annotation
Trajectory Boosting	30.50	19.88	45.10	49.28	36.19
Zero Human Annotation
AgentSynth	31.21	39.01	39.22	71.01	45.11
OS-Genesis	31.91	37.59	45.10	68.12	45.68
ZeroGUI	36.17	40.43	48.86	62.30	46.94
WebSTAR (Filtering)	31.21	40.43	52.94	73.91	49.62
LearnWeak (Ours)	41.13	50.35	56.86	72.46	55.20

LearnWeak achieves the best average performance, outperforming the next best (WebSTAR) by 5.58 percentage points.
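Weakness-agnostic generation methods (AgentSynth, OS-Genesis, ZeroGUI) average 45.11 to 46.94, comparable to training on existing human data (AgentNet, 46.46 to 47.91), which suggests that data volume or exploration alone is insufficient.
The advantage of LearnWeak is most evident where basic exploration leaves room for improvement: it leads the best baseline by 4.96 points in Calc, 9.92 in Impress, and 3.92 in VLC, while WebSTAR is slightly ahead in VSCode (73.91 vs. 72.46).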
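The comparison in Table 2 can be reproduced from the table rows alone. The sketch below recomputes each method's average, finds the strongest baseline by average, and measures LearnWeak's margin over the best baseline in each domain. All numbers are copied from Table 2; the variable names are illustrative.

```python
COLUMNS = ["Calc", "Impress", "VLC", "VSCode"]

# Rows copied from Table 2 (mean success rate, %), excluding LearnWeak.
baselines = {
    "Zero-shot (EvoCUA-8B)": [28.07, 37.66, 45.71, 51.30],
    "AgentNet (Full)": [34.04, 39.01, 49.01, 69.57],
    "AgentNet (N-sampled)": [32.62, 40.43, 49.02, 63.77],
    "Trajectory Boosting": [30.50, 19.88, 45.10, 49.28],
    "AgentSynth": [31.21, 39.01, 39.22, 71.01],
    "OS-Genesis": [31.91, 37.59, 45.10, 68.12],
    "ZeroGUI": [36.17, 40.43, 48.86, 62.30],
    "WebSTAR (Filtering)": [31.21, 40.43, 52.94, 73.91],
}
learnweak = [41.13, 50.35, 56.86, 72.46]


def mean(values):
    return sum(values) / len(values)


# Strongest baseline by average, and LearnWeak's margin over it.
best_name = max(baselines, key=lambda name: mean(baselines[name]))
avg_margin = mean(learnweak) - mean(baselines[best_name])

# Margin over the best baseline in each individual domain.
domain_margin = {
    col: learnweak[i] - max(row[i] for row in baselines.values())
    for i, col in enumerate(COLUMNS)
}

print(f"best baseline: {best_name}, average margin {avg_margin:+.2f}")
for col in COLUMNS:
    print(f"{col:8s} margin over best baseline {domain_margin[col]:+.2f}")
```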

Theoretical and Practical Implications

Student-Awareness is Key: The paper establishes that effective domain specialization requires targeting the student's specific weaknesses, not just generating domain data. This is a shift from weakness-agnostic data scaling.
Efficient Specialization Pathway: LearnWeak provides a fully automated, annotation-free framework that makes specializing small CUAs for diverse software domains practical and scalable.
Beyond Imitation: The results show that specialized small agents can surpass their teacher in certain domains. This indicates that the framework facilitates learning and correction, not mere copying.
Modular Deployment: The LoRA-based design enables a scalable multi-application deployment scenario, where a shared base model is coupled with lightweight, domain-specific adapters activated on-demand.
Broader Applicability: While focused on CUAs, the core principles of iterative weakness identification and error-aware preference optimization could inform specialization strategies for other types of interactive AI agents.

Conclusion

LearnWeak presents a principled framework for automated domain specialization of small computer-use agents. Its core innovation is a student-aware approach that identifies and repairs model-specific weaknesses through:

Iterative, weakness-focused data generation (LearnWeak-GEN).
Error-aware preference optimization with selective updates (LearnWeak-DPO).

Empirical results demonstrate substantial and consistent gains across diverse software domains, enabling small specialized agents to narrow the performance gap with larger models and even surpass them in some cases. This work points toward a more efficient and targeted path for deploying performant, small CUAs in real-world software environments.

Future Directions: Systematic study of multi-adapter routing across many domains, exploration of the framework's applicability to agents lacking foundational GUI skills, and investigating the limits of teacher reliability.