Learn from Weaknesses: Automated Domain Specialization for Small Computer-Use Agents

Summary (Overview)

  • Problem: Small Computer-Use Agents (CUAs) are practical for deployment but suffer from a performance gap compared to large models, especially in domain-specific software environments. Naively fine-tuning them on large-scale synthetic data yields marginal improvements.
  • Core Solution: Introduces LearnWeak, a fully automated, two-stage domain specialization framework. It uses a stronger teacher agent to identify the student's specific weaknesses in a target domain and then generates targeted training data and applies specialized supervision to correct them.
  • Key Innovations:
    1. LearnWeak-GEN: An iterative, annotation-free data generation pipeline that synthesizes new tasks conditioned on a weakness report derived from comparing teacher and student failures.
    2. LearnWeak-DPO: An error-aware Direct Preference Optimization (DPO) training objective that distinguishes between planning and execution errors and applies selective, behaviorally precise updates.
  • Main Results: LearnWeak achieves average success-rate gains of +11.6 percentage points (EvoCUA-8B) and +11.1 percentage points (OpenCUA-7B) across eight OSWorld domains. The specialized small agents even surpass the teacher model in several domains.
  • Implication: Highlights that efficient domain specialization depends on student-aware targeting of weaknesses, not just scaling data volume. This provides a principled path for closing the performance gap for small, deployable CUAs.

Introduction and Theoretical Foundation

Computer-use agents (CUAs) that interact with GUI environments have advanced rapidly. Two paradigms exist: large proprietary models (e.g., Claude, GPT) and fine-tuned small open models (e.g., EvoCUA, OpenCUA). The latter is compelling for real-world deployment due to cost efficiency, privacy, and edge-device compatibility. However, a substantial performance gap persists, especially in domain-specific software with unique conventions.

Domain specialization—fine-tuning an agent for a single target domain—is a promising approach to close this gap for small CUAs. It improves sample efficiency by focusing on domain-specific patterns and avoids issues like catastrophic forgetting from heterogeneous training. The core challenge lies in the two stages of specialization:

  1. Dataset Generation: Collecting human trajectories is costly. Existing autonomous generation strategies are weakness-agnostic; they do not consider the student's specific deficiencies, leading to inefficient training.
  2. Agent Training: Naive fine-tuning can distort the student's own learned reasoning patterns. Furthermore, failures are heterogeneous (planning vs. execution errors), calling for tailored training objectives.

The paper's foundational insight is that for efficient specialization, the most useful supervision targets the student's actual weaknesses rather than providing broad domain coverage. LearnWeak is built on this principle of student-awareness in both data synthesis and training.

Methodology

LearnWeak decomposes domain specialization into two coupled stages: Weakness-Aware Data Generation (LearnWeak-GEN) and Agent Training via Error-Aware Preference Optimization (LearnWeak-DPO).

1. Weakness-Aware Data Generation (LearnWeak-GEN)

The goal is to autonomously generate a domain-specific dataset $D^d$ starting from a small set of seed queries $Q^d_0$, without human annotation. The process is iterative.

  • Seed Setup: Initialize with a small set ($K$) of executable environment configurations and seed tasks.
  • Weakness Discovery (Iteration $i$):
    • For each task $q \in Q^d_i$, execute both the teacher policy $\pi_T$ and the (fixed, pre-adaptation) student policy $\pi_S$, yielding trajectories $\tau^T_q$ and $\tau^S_q$.
    • Use a verifier $V$ to get success outcomes $(v^T_q, v^S_q)$ and rationales $(r^T_q, r^S_q)$.
    • Collect the failure set where the teacher succeeds but the student fails: $F^d_i = \{ q \in Q^d_i \mid v^T_q = 1, v^S_q = 0 \}$.
    • Summarize the failure rationales into a weakness report $R^d_i$.
  • Screenshot-Guided Query Generation:
    • Construct a representative screenshot set $S^d_i$ from trajectories via clustering and VLM reranking.
    • Generate new queries for the next iteration using a generator $G$, conditioned on previous tasks $Q^d_i$, the weakness report $R^d_i$, screenshots $S^d_i$, and domain metadata $M^d$. Two strategies are used:
      • Weakness-focused: $Q^{\text{weak}}_{i+1} = G(Q^d_i, R^d_i, S^d_i, M^d)$
      • Exploration-focused: $Q^{\text{explore}}_{i+1} = G(Q^d_i, \emptyset, S^d_i, M^d)$
    • The next query set is the union of the two: $Q^d_{i+1} = Q^{\text{weak}}_{i+1} \cup Q^{\text{explore}}_{i+1}$.
  • Iteration & Final Dataset: Repeat for $N$ iterations. Aggregate all failure sets $F^d = \bigcup_{i=0}^{N-1} F^d_i$. The final training dataset is: $D^d(\pi_S) = \{ (q, \tau^T_q, \tau^S_q) \mid q \in F^d(\pi_S) \}$.
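The generation loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_teacher`, `run_student`, `verify`, `summarize`, and `generate` are hypothetical callables standing in for the teacher policy, the student policy, the verifier $V$, the report summarizer, and the generator $G$. The screenshot set and domain metadata are omitted for brevity, and pairing the teacher and student rationales as input to the report is also an assumption.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = List[str]                        # assumed: a trajectory is a list of step records
Verdict = Tuple[int, str]                     # (success flag v in {0, 1}, rationale r)
Example = Tuple[str, Trajectory, Trajectory]  # (q, teacher trajectory, student trajectory)


def discover_weaknesses(
    queries: List[str],
    run_teacher: Callable[[str], Trajectory],
    run_student: Callable[[str], Trajectory],
    verify: Callable[[str, Trajectory], Verdict],
) -> Tuple[List[Example], List[Tuple[str, str]]]:
    """Return the failure set F_i (teacher succeeds, student fails) and its rationales."""
    failures: List[Example] = []
    rationales: List[Tuple[str, str]] = []
    for q in queries:
        tau_t, tau_s = run_teacher(q), run_student(q)
        v_t, r_t = verify(q, tau_t)
        v_s, r_s = verify(q, tau_s)
        if v_t == 1 and v_s == 0:  # keep only tasks the teacher solves and the student does not
            failures.append((q, tau_t, tau_s))
            rationales.append((r_t, r_s))
    return failures, rationales


def learnweak_gen(
    seed_queries: List[str],
    run_teacher: Callable[[str], Trajectory],
    run_student: Callable[[str], Trajectory],
    verify: Callable[[str, Trajectory], Verdict],
    summarize: Callable[[List[Tuple[str, str]]], str],
    generate: Callable[[List[str], Optional[str]], List[str]],
    n_iters: int,
) -> List[Example]:
    """Iterate weakness discovery and query generation; return the aggregated dataset."""
    queries = list(seed_queries)
    dataset: List[Example] = []
    for _ in range(n_iters):
        failures, rationales = discover_weaknesses(queries, run_teacher, run_student, verify)
        dataset.extend(failures)                 # aggregate the failure sets across iterations
        report = summarize(rationales)           # weakness report R_i
        weak = generate(queries, report)         # weakness-focused queries
        explore = generate(queries, None)        # exploration-focused queries (no report)
        queries = list(dict.fromkeys(weak + explore))  # union, preserving order
    return dataset
```

Note that the student policy is never updated inside the loop, matching the description of a fixed, pre-adaptation student; only the query set changes between iterations.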

2. Agent Training for Domain Specialization (LearnWeak-DPO)

The goal is to train a specialized student $\hat{\pi}_{S,d}$ using the generated dataset $D^d$, preserving pretrained capabilities while correcting weaknesses.

  • Teacher-Replay Preference Construction:

    • For each failed task $q \in F^d$, replay the teacher trajectory step-by-step.
    • At each step $t$, query the student policy $\pi_S$ with the teacher's context $c^T_t = (q, o^T_t, h^T_t)$ to get a replayed student response $\hat{a}^S_t \sim \pi_S(\cdot \mid c^T_t)$.
    • If the tool executions differ ($e^T_t \neq \hat{e}^S_t$), build a preference tuple: $(c^T_t, a^+_t, a^-_t) = (c^T_t, a^T_t, \hat{a}^S_t)$.
    • Aggregate these into a step-level preference dataset $D^d_{\text{pref}}$.
  • Error-Aware Preference Optimization:

    • Failure Type Categorization: Decompose tool execution $e_t = (f_t, p_t)$.
      • Planning error ($\epsilon_{\text{PLAN}}$): $f^T_t \neq \hat{f}^S_t$ (wrong action type).
      • Execution error ($\epsilon_{\text{EXEC}}$): $f^T_t = \hat{f}^S_t$ but $p^T_t \neq \hat{p}^S_t$ (wrong action parameters).
    • Adaptive Token Masking: Define a binary mask $m(j)$ over token positions $j$ in the action $a_t = (r_t, s_t, e_t)$: $$ m(j) = \begin{cases} 0 & \text{if } a^{(j)}_t \in r_t \text{ (reasoning)}, \\ g(t) & \text{if } a^{(j)}_t \in s_t \text{ (action description)}, \\ 1 & \text{if } a^{(j)}_t \in e_t \text{ (tool execution)}, \end{cases} $$ where $g(t) = 1$ if $\epsilon_t = \epsilon_{\text{PLAN}}$, and $0$ otherwise. This mask ensures updates are focused on the behaviorally relevant span (execution for all errors, plus description for planning errors).
    • Masked Action Score: For a context $c$ and action $a_t$, the masked score is: $$ s_\theta(c, a_t; m) = \sum_{j=1}^{|a_t|} m(j) \log \pi_\theta(a^{(j)}_t \mid c, a^{(<j)}_t). $$
    • DPO Objective: The final optimization loss is: $$ \mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(c_t, a^+_t, a^-_t) \sim D^d_{\text{pref}}} \left[ \log \sigma \left( \beta \left( s_\theta(c_t, a^+_t; m) - s_\theta(c_t, a^-_t; m) - s_{\text{ref}}(c_t, a^+_t; m) + s_{\text{ref}}(c_t, a^-_t; m) \right) \right) \right], $$ where $\sigma$ is the sigmoid, $\beta$ is a temperature, and $\pi_{\text{ref}}$ is a frozen reference policy (the base student).
  • Domain Scalability: Specialization is achieved via modular LoRA adapters. The base student policy $\pi_S$ is frozen, and only a domain-specific LoRA adapter $\Delta_d$ is updated:

    $$ \hat{\pi}_{S,d} = \pi_S \oplus \Delta_d. $$

    This allows a library of per-domain adapters to be activated at inference time.
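To make the training stage concrete, the following Python sketch walks through teacher-replay pair construction, error categorization, the adaptive token mask, the masked action score, and the per-pair DPO loss. It is an illustration under assumptions rather than the paper's code: the `Action` container, the pre-tokenized spans, treating the tool execution as one action-type token plus its parameter tokens, the `student_policy` callable, and the default `beta=0.1` are all hypothetical choices made for the example.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One response a_t = (r_t, s_t, e_t) with e_t = (f_t, p_t), pre-split into tokens."""
    reasoning: List[str]     # r_t
    description: List[str]   # s_t
    func: str                # f_t, the action type
    params: List[str]        # p_t, the action parameters


def error_type(teacher: Action, student: Action) -> Optional[str]:
    """Classify the student's deviation from the teacher at one replayed step."""
    if teacher.func != student.func:
        return "PLAN"        # wrong action type
    if teacher.params != student.params:
        return "EXEC"        # right action type, wrong parameters
    return None              # identical tool execution: no preference pair


def build_pairs(task: str, teacher_steps: list, student_policy: Callable) -> List[Tuple]:
    """Replay teacher steps (observation, history, teacher action) and keep disagreements."""
    pairs = []
    for obs, hist, a_teacher in teacher_steps:
        ctx = (task, obs, hist)                  # teacher context c_t^T
        a_student = student_policy(ctx)          # replayed student response
        err = error_type(a_teacher, a_student)
        if err is not None:
            pairs.append((ctx, a_teacher, a_student, err))  # (c, a+, a-, error type)
    return pairs


def token_mask(action: Action, err: str) -> List[int]:
    """m(j): 0 on reasoning, g(t) on description, 1 on tool-execution tokens."""
    g = 1 if err == "PLAN" else 0
    n_exec = 1 + len(action.params)              # assumed: one token for f_t plus p_t tokens
    return [0] * len(action.reasoning) + [g] * len(action.description) + [1] * n_exec


def masked_score(logprobs: List[float], mask: List[int]) -> float:
    """Sum of per-token log-probabilities over the unmasked positions."""
    return sum(m * lp for m, lp in zip(mask, logprobs))


def dpo_loss(s_pos: float, s_neg: float, ref_pos: float, ref_neg: float,
             beta: float = 0.1) -> float:
    """Per-pair loss -log(sigmoid(margin)), computed in a numerically stable form."""
    margin = beta * (s_pos - s_neg - ref_pos + ref_neg)
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch the same error type drives the mask for both the chosen and the rejected action, so an execution error contributes gradient only through tool-execution tokens, while a planning error also covers the action description. When the policy and reference scores are equal the margin is zero and the loss is $\log 2$.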

Empirical Validation / Results

Setup: Evaluation covers eight domains of the OSWorld benchmark (Gimp, Calc, Impress, Writer, OS, Thunderbird, VLC, VSCode). Student models: EvoCUA-8B and OpenCUA-7B. Teacher model: EvoCUA-32B.

Main Results: Domain Specialization Performance

Table 1: Domain specialization results on OSWorld. Each entry reports mean success rate (%). Yellow and blue denote the teacher policy and specialized student with LearnWeak, respectively.

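| Model | Gimp | Calc | Impress | Writer | OS | Thunderbird | VLC | VSCode | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Generalized Models | | | | | | | | | |
| Kimi K2.6 | 73.08 | 80.85 | 82.19 | 73.91 | 79.17 | 80.00 | 75.71 | 91.30 | 79.53 |
| Claude Sonnet 4.6 | 69.23 | 74.47 | 70.21 | 86.83 | 91.67 | 66.67 | 81.41 | 72.73 | 76.65 |
| CUA Models | | | | | | | | | |
| EvoCUA-32B (Teacher) | 76.29 | 51.06 | 52.98 | 65.22 | 75.00 | 60.00 | 64.65 | 65.22 | 63.80 |
| EvoCUA-8B | 66.15 | 28.07 | 37.66 | 50.43 | 60.83 | 65.33 | 45.71 | 51.30 | 50.69 |
| EvoCUA-8B + Ours | 82.05 | 41.13 | 50.35 | 55.07 | 66.67 | 73.33 | 56.86 | 72.46 | 62.24 |
| (gain) | +15.9 | +13.1 | +12.7 | +4.6 | +5.8 | +8.0 | +11.2 | +21.2 | +11.6 |
| OpenCUA-7B | 48.46 | 11.91 | 31.49 | 30.43 | 40.00 | 54.67 | 32.94 | 51.30 | 37.65 |
| OpenCUA-7B + Ours | 57.69 | 19.15 | 36.88 | 40.58 | 59.42 | 66.67 | 47.06 | 62.32 | 48.72 |
| (gain) | +9.2 | +7.2 | +5.4 | +10.2 | +19.4 | +12.0 | +14.1 | +11.0 | +11.1 |
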
  • Consistent Gains: LearnWeak yields significant improvements for both small CUAs across all eight domains.
  • Surpassing the Teacher: The specialized EvoCUA-8B outperforms the 32B teacher in Gimp, Thunderbird, and VSCode, demonstrating that corrective supervision can lead to mastery beyond imitation.
  • Student-Dependent Gains: The domains with the largest improvement vary per student model, indicating specialization effectiveness is tied to the student's specific adaptation needs.
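The three observations above can be checked directly from the per-domain success rates in Table 1. The short Python sketch below recomputes the gains, lists the domains where the specialized EvoCUA-8B exceeds the teacher, and finds the domain with the largest gain for each student. The variable names are illustrative; the numbers are copied from the table.

```python
DOMAINS = ["Gimp", "Calc", "Impress", "Writer", "OS", "Thunderbird", "VLC", "VSCode"]

# Mean success rates (%) copied from Table 1.
teacher   = [76.29, 51.06, 52.98, 65.22, 75.00, 60.00, 64.65, 65.22]  # EvoCUA-32B
evo_base  = [66.15, 28.07, 37.66, 50.43, 60.83, 65.33, 45.71, 51.30]  # EvoCUA-8B
evo_ours  = [82.05, 41.13, 50.35, 55.07, 66.67, 73.33, 56.86, 72.46]  # EvoCUA-8B + Ours
open_base = [48.46, 11.91, 31.49, 30.43, 40.00, 54.67, 32.94, 51.30]  # OpenCUA-7B
open_ours = [57.69, 19.15, 36.88, 40.58, 59.42, 66.67, 47.06, 62.32]  # OpenCUA-7B + Ours


def gains(base, ours):
    """Per-domain gain in percentage points."""
    return {d: o - b for d, b, o in zip(DOMAINS, base, ours)}


def mean(values):
    return sum(values) / len(values)


evo_gains = gains(evo_base, evo_ours)
open_gains = gains(open_base, open_ours)

# Consistent gains: every domain improves for both students.
all_positive = all(g > 0 for g in list(evo_gains.values()) + list(open_gains.values()))

# Surpassing the teacher: domains where the specialized 8B student beats the 32B teacher.
beats_teacher = [d for d, t, o in zip(DOMAINS, teacher, evo_ours) if o > t]

# Student-dependent gains: the domain with the largest gain differs per student.
evo_top = max(evo_gains, key=evo_gains.get)
open_top = max(open_gains, key=open_gains.get)

# Average gain, computed from the per-domain values.
evo_avg_gain = mean(evo_ours) - mean(evo_base)
open_avg_gain = mean(open_ours) - mean(open_base)

print(all_positive, beats_teacher, evo_top, open_top)
```

Run as written, the list of teacher-beating domains is Gimp, Thunderbird, and VSCode, and the largest gains fall on VSCode for EvoCUA-8B and on OS for OpenCUA-7B.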

Comparison with Dataset Construction Baselines

Under a matched training budget, LearnWeak-GEN is compared against various data construction pipelines.

Table 2: Comparison with data-construction baselines on four OSWorld domains. Mean success rate (%) under a matched budget.

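| Method | Calc | Impress | VLC | VSCode | Avg. |
|---|---|---|---|---|---|
| Zero-shot (EvoCUA-8B) | 28.07 | 37.66 | 45.71 | 51.30 | 40.69 |
| Existing Data | | | | | |
| AgentNet (Full) | 34.04 | 39.01 | 49.01 | 69.57 | 47.91 |
| AgentNet (N-sampled) | 32.62 | 40.43 | 49.02 | 63.77 | 46.46 |
| Minimal Human Annotation | | | | | |
| Trajectory Boosting | 30.50 | 19.88 | 45.10 | 49.28 | 36.19 |
| Zero Human Annotation | | | | | |
| AgentSynth | 31.21 | 39.01 | 39.22 | 71.01 | 45.11 |
| OS-Genesis | 31.91 | 37.59 | 45.10 | 68.12 | 45.68 |
| ZeroGUI | 36.17 | 40.43 | 48.86 | 62.30 | 46.94 |
| WebSTAR (Filtering) | 31.21 | 40.43 | 52.94 | 73.91 | 49.62 |
| LearnWeak (Ours) | 41.13 | 50.35 | 56.86 | 72.46 | 55.20 |
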
  • LearnWeak achieves the best average performance, outperforming the next best (WebSTAR) by 5.58 percentage points.
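  • Weakness-agnostic generation methods (AgentSynth, OS-Genesis, ZeroGUI) average 45.11 to 46.94, comparable to training on existing human data (AgentNet, 46.46 to 47.91), which suggests that data volume or exploration alone is insufficient.
  • The advantage of LearnWeak is most evident in domains where there is room for improvement beyond basic exploration: it leads the best baseline by about 5 points on Calc (41.13 vs. 36.17) and about 10 points on Impress (50.35 vs. 40.43), while WebSTAR is slightly ahead on VSCode (73.91 vs. 72.46).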
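For readers who want to reproduce the comparison, the sketch below ranks the Table 2 methods by their average over the four domains and computes LearnWeak's lead over the strongest baseline, both on average and per domain. The dictionary layout and names are illustrative; the values are copied from the table.

```python
COLUMNS = ["Calc", "Impress", "VLC", "VSCode"]

# Mean success rates (%) copied from Table 2.
results = {
    "Zero-shot (EvoCUA-8B)": [28.07, 37.66, 45.71, 51.30],
    "AgentNet (Full)":       [34.04, 39.01, 49.01, 69.57],
    "AgentNet (N-sampled)":  [32.62, 40.43, 49.02, 63.77],
    "Trajectory Boosting":   [30.50, 19.88, 45.10, 49.28],
    "AgentSynth":            [31.21, 39.01, 39.22, 71.01],
    "OS-Genesis":            [31.91, 37.59, 45.10, 68.12],
    "ZeroGUI":               [36.17, 40.43, 48.86, 62.30],
    "WebSTAR (Filtering)":   [31.21, 40.43, 52.94, 73.91],
    "LearnWeak (Ours)":      [41.13, 50.35, 56.86, 72.46],
}

OURS = "LearnWeak (Ours)"

# Rank all methods by their four-domain average.
averages = {name: sum(vals) / len(vals) for name, vals in results.items()}
ranking = sorted(averages, key=averages.get, reverse=True)
runner_up = ranking[1] if ranking[0] == OURS else ranking[0]
avg_margin = averages[OURS] - averages[runner_up]

# Per-domain lead over the best baseline (negative means a baseline is ahead).
per_domain_lead = {}
for i, col in enumerate(COLUMNS):
    best_baseline = max(vals[i] for name, vals in results.items() if name != OURS)
    per_domain_lead[col] = results[OURS][i] - best_baseline

print(ranking[0], runner_up, round(avg_margin, 2))
```

The average margin recomputed this way agrees with the 5.58-point figure quoted above, and the per-domain leads show where that margin comes from.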

Theoretical and Practical Implications

  • Student-Awareness is Key: The paper establishes that effective domain specialization requires targeting the student's specific weaknesses, not just generating domain data. This is a shift from weakness-agnostic data scaling.
  • Efficient Specialization Pathway: LearnWeak provides a fully automated, annotation-free framework that makes specializing small CUAs for diverse software domains practical and scalable.
  • Beyond Imitation: The results show that specialized small agents can surpass their teacher in certain domains. This indicates that the framework facilitates learning and correction, not mere copying.
  • Modular Deployment: The LoRA-based design enables a scalable multi-application deployment scenario, where a shared base model is coupled with lightweight, domain-specific adapters activated on-demand.
  • Broader Applicability: While focused on CUAs, the core principles of iterative weakness identification and error-aware preference optimization could inform specialization strategies for other types of interactive AI agents.
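The modular deployment point can be illustrated with a toy adapter library. The sketch below assumes the standard LoRA parameterization, in which an adapter is a low-rank pair $(B, A)$ and the adapted weight is $W + \text{scale} \cdot BA$. A single small weight matrix stands in for the whole base model, and the `AdapterLibrary` class and its method names are hypothetical, not an API from the paper or from any library.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, sufficient for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w: Matrix, b: Matrix, a: Matrix, scale: float) -> Matrix:
    """Return W + scale * (B @ A) without modifying W."""
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


class AdapterLibrary:
    """Frozen shared base weight plus per-domain low-rank adapters."""

    def __init__(self, base_weight: Matrix) -> None:
        self.base = base_weight
        self.adapters: Dict[str, Tuple[Matrix, Matrix, float]] = {}

    def register(self, domain: str, b: Matrix, a: Matrix, scale: float = 1.0) -> None:
        """Store the adapter for one domain; the base weight is left untouched."""
        self.adapters[domain] = (b, a, scale)

    def weights_for(self, domain: str) -> Matrix:
        """Activate the domain's adapter, or fall back to a copy of the base weight."""
        if domain not in self.adapters:
            return [row[:] for row in self.base]
        b, a, scale = self.adapters[domain]
        return merge_lora(self.base, b, a, scale)


# Usage: one shared base, one rank-1 adapter for a hypothetical "calc" domain.
library = AdapterLibrary([[1.0, 0.0], [0.0, 1.0]])
library.register("calc", b=[[1.0], [0.0]], a=[[0.0, 2.0]], scale=0.5)
calc_weights = library.weights_for("calc")
other_weights = library.weights_for("vlc")
```

Because the base weight is never mutated, adding a new domain only means registering another small adapter, which is the property the deployment scenario relies on.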

Conclusion

LearnWeak presents a principled framework for automated domain specialization of small computer-use agents. Its core innovation is a student-aware approach that identifies and repairs model-specific weaknesses through:

  1. Iterative, weakness-focused data generation (LearnWeak-GEN).
  2. Error-aware preference optimization with selective updates (LearnWeak-DPO).

Empirical results demonstrate substantial and consistent gains across diverse software domains, enabling small specialized agents to narrow the performance gap with larger models and even surpass them in some cases. This work points toward a more efficient and targeted path for deploying performant, small CUAs in real-world software environments.

Future Directions: Systematic study of multi-adapter routing across many domains, exploration of the framework's applicability to agents lacking foundational GUI skills, and investigation of the limits of teacher reliability.