ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents - Summary

Summary (Overview)

  • Unified Open-Source Framework: ClawGUI integrates scalable online RL training (ClawGUI-RL), fully standardized evaluation (ClawGUI-Eval), and real-device deployment (ClawGUI-Agent) into a single, coherent pipeline for GUI agent development.
  • First Open-Source GUI Agent RL Infrastructure: ClawGUI-RL supports training on both parallel virtual environments and real physical devices, integrating the GiGPO algorithm with a Process Reward Model (PRM) for dense step-level supervision.
  • High-Fidelity Reproducible Evaluation: ClawGUI-Eval enforces a strict Infer → Judge → Metric pipeline across 6 benchmarks and 11+ models, achieving a 95.8% reproduction rate against official baselines, addressing critical reproducibility issues in the field.
  • Deployment to Real Users: ClawGUI-Agent enables deployment to Android, HarmonyOS, and iOS through 12+ chat platforms, featuring hybrid CLI-GUI control and a persistent personalized memory system.
  • Empirical Validation: The model ClawGUI-2B, trained end-to-end within the framework, achieves a 17.1% Success Rate on the MobileWorld GUI-Only benchmark, outperforming the same-scale MAI-UI-2B baseline (11.1%) by 6.0% absolute (54% relative).

Introduction and Theoretical Foundation

Graphical User Interface (GUI) agents, which interact with software via visual perception and low-level actions (tap, swipe, type), promise universal digital automation. However, progress is bottlenecked by a lack of coherent full-stack infrastructure rather than model capacity. The paper identifies three critical gaps:

  1. Closed Training Ecosystem: Strong online RL results are reported, but the underlying infrastructure is not released, and training on real physical devices remains unexplored.
  2. Misaligned Evaluation: Reported numbers across papers are not directly comparable due to undocumented differences in prompts, resolution, and normalization conventions.
  3. Broken Deployment Loop: Trained agents rarely reach end-users. CLI-based agents have limited coverage, while GUI-based agents lack integration into daily workflows.

ClawGUI is proposed to close all three gaps within a single open-source system, providing a unified foundation for the community to build, evaluate, and deploy GUI agents.

Methodology

ClawGUI consists of three integrated modules.

3.2 ClawGUI-RL: Scalable Online RL Training

Environment Manager: Abstracts device backends (virtual emulators, real devices) behind a unified interface. It features:

  • Virtual Environment: Parallel Docker-based Android emulators with lifecycle management (task reset, evaluation, spare server rotation, teardown).
  • Real Device Training: Support for physical Android/cloud phones with human-authored tasks and MLLM-based evaluation.
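The unified interface described above can be sketched as a small abstract backend class; the class and method names here are illustrative assumptions, not ClawGUI's actual API:

```python
from abc import ABC, abstractmethod

class DeviceBackend(ABC):
    """Unified interface over virtual emulators and real devices
    (names are illustrative, not the paper's API)."""

    @abstractmethod
    def reset(self, task_id: str) -> dict:
        """Reset the device to the task's initial state; return first observation."""

    @abstractmethod
    def step(self, action: dict) -> dict:
        """Execute one low-level action (tap, swipe, type); return next observation."""

    @abstractmethod
    def teardown(self) -> None:
        """Release the device (e.g., stop the emulator container)."""

class VirtualEmulator(DeviceBackend):
    """Stand-in for a Docker-based Android emulator with lifecycle management."""
    def __init__(self):
        self.alive = False
    def reset(self, task_id):
        self.alive = True
        return {"task": task_id, "screen": None}
    def step(self, action):
        return {"screen": None, "done": False}
    def teardown(self):
        self.alive = False
```

A real-device backend would implement the same three methods, which is what lets the trainer treat emulators and physical phones interchangeably.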

Reward Design: A two-level formulation to address reward sparsity in long-horizon tasks.

  • Binary Outcome Reward: R_outcome is 1 for task success and 0 for failure.
  • Dense Step-Level Reward via PRM: A Process Reward Model judges each action's contribution, producing a per-step score R_step.
  • Combined Reward: The total reward is defined as R = R_outcome + R_step.
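A minimal sketch of the two-level reward: PRM scores supply dense per-step signal, and the binary outcome reward is added on top. Attaching the outcome reward to the final step is an assumption about how the two terms are combined per trajectory:

```python
def combined_reward(step_scores, success):
    """Combine PRM per-step scores (R_step) with a binary outcome
    reward (R_outcome): 1.0 on task success, 0.0 on failure.
    Assumption: the outcome term is credited at the final step."""
    rewards = list(step_scores)          # copy; don't mutate the input
    rewards[-1] += 1.0 if success else 0.0
    return rewards
```

For example, `combined_reward([0.2, 0.5], True)` yields `[0.2, 1.5]`, so a successful episode's dense signal is preserved while the terminal step carries the outcome bonus.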

RL Trainer: Built upon verl and verl-agent, supporting algorithms like Reinforce++, PPO, GSPO, GRPO, and GiGPO.

  • GRPO (Group Relative Policy Optimization): Estimates advantages by normalizing returns within a group of rollouts sharing the same task. It assigns a uniform episode-level advantage, which is coarse for multi-step GUI tasks.
  • GiGPO (Group-in-Group Policy Optimization): Employs a two-level hierarchical advantage estimation:
    1. Macro-level: Retains relative advantage across complete trajectories.
    2. Micro-level: Introduces anchor-state grouping. Steps that encounter the same intermediate state across different rollouts are clustered, and micro relative advantages are estimated within each sub-group via discounted return normalization. This enables fine-grained per-step credit assignment.

3.3 ClawGUI-Eval: Reproducible GUI Evaluation

Benchmark and Model Coverage: Covers 6 benchmarks: ScreenSpot-Pro, ScreenSpot-V2, UI-Vision, MMBench-GUI, OSWorld-G, AndroidControl. Supports 11+ models including Qwen3-VL, UI-TARS, MAI-UI, Gemini, Seed.

Pipeline Architecture: A strict three-stage, decoupled pipeline:

  1. Infer: Generates raw predictions via local GPU (transformers) or remote API inference, with multi-GPU parallelism and shard-level checkpointing.
  2. Judge: Applies benchmark-specific judges (e.g., point-in-box, polygon-aware) to parse outputs and produce per-sample correctness labels.
  3. Metric: Aggregates labels into final accuracy scores with fine-grained breakdowns.
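The Judge and Metric stages above can be illustrated with a point-in-box judge, the simplest benchmark-specific rule mentioned in the pipeline (a sketch; the exact coordinate conventions are an assumption):

```python
def point_in_box_judge(pred_point, gt_box):
    """Judge stage (sketch): a predicted click point (x, y) is correct
    if it falls inside the ground-truth box (x1, y1, x2, y2)."""
    x, y = pred_point
    x1, y1, x2, y2 = gt_box
    return x1 <= x <= x2 and y1 <= y <= y2

def accuracy(labels):
    """Metric stage (sketch): aggregate per-sample correctness labels
    into a final accuracy score."""
    return sum(labels) / len(labels) if labels else 0.0
```

Decoupling these stages means the expensive Infer stage runs once, while judges and metrics can be re-run cheaply when a parsing rule changes.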

3.4 ClawGUI-Agent: Personal GUI Assistant

Hybrid Device Control: Combines the efficiency of CLI for supported operations with the universal coverage of GUI, addressing the limitations of each paradigm alone.

Personalized Memory: Automatically extracts and stores structured facts (contacts, app usage, preferences) as vector embeddings. The top-k most similar memories are retrieved for subsequent tasks, enabling adaptation to individual users.
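Top-k retrieval over stored facts can be sketched with cosine similarity; the embedding model, storage format, and example facts below are all assumptions for illustration:

```python
import math

def top_k_memories(query_vec, memories, k=3):
    """Retrieve the k stored facts most similar to the query embedding
    (cosine similarity; a minimal sketch of vector-based memory lookup).

    memories: list of {"fact": str, "vec": list[float]} records.
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    ranked = sorted(memories, key=lambda m: cos(query_vec, m["vec"]), reverse=True)
    return [m["fact"] for m in ranked[:k]]
```

In practice a vector database would replace the linear scan, but the interface — embed the task, return the k nearest facts — stays the same.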

Deployment Modes:

  • Remote Control: Users issue tasks via 12+ chat platforms (Feishu, Telegram, etc.) to control a target phone remotely.
  • Local Control: The agent takes over the local device from a chat app running on the same phone.

ClawGUI-Eval as a Skill: The entire evaluation pipeline can be triggered via a single natural language command through the agent interface.

Empirical Validation / Results

4.2 Main Results

Table 1: Comparison of models on MobileWorld GUI-Only (117 tasks) benchmark.

| Model | MobileWorld SR (GUI-Only) | Agentic Framework |
|---|---|---|
| Claude-4.5-Sonnet + UI-Ins-7B | 47.8 | Yes |
| Gemini-3-Pro + UI-Ins-7B | 55.6 | Yes |
| GPT-5 + UI-Ins-7B | 54.0 | Yes |
| GUI-Owl-7B | 7.7 | No |
| UI-Venus-72B | 16.4 | No |
| Qwen3-VL-32B | 11.9 | No |
| Doubao-1.5-UI-TARS | 26.3 | No |
| MAI-UI-2B | 11.1 | No |
| MAI-UI-8B | 19.7 | No |
| ClawGUI-2B | 17.1 | No |

Key Observations:

  1. Infrastructure drives policy quality: ClawGUI-2B (17.1% SR) outperforms the same-scale MAI-UI-2B (11.1% SR) by 6.0% absolute, validating the effectiveness of the ClawGUI-RL infrastructure.
  2. Small well-trained models outperform larger untrained ones: ClawGUI-2B surpasses larger models like Qwen3-VL-32B (11.9%) and UI-Venus-72B (16.4%).
  3. Agentic frameworks remain a separate regime: Proprietary frameworks achieve higher scores but rely on closed-source planners.

4.3 Every Step Counts: Dense Reward Unlocks Better GUI Policies

Table 2: Ablation on reward design on MobileWorld GUI-Only.

| Method | Reward Type | SR (%) |
|---|---|---|
| GRPO | Binary (episode-level) | 14.5 |
| GiGPO | Dense (episode- & step-level) | 17.1 |

Replacing episode-level GRPO with step-level GiGPO yields a 2.6% absolute improvement (17.9% relative), confirming the critical value of fine-grained credit assignment via dense step-level supervision.

4.4 Benchmarking the Benchmarks: Can We Trust Published GUI Numbers?

Table 3: Reproduction results across GUI grounding benchmarks. (Excerpt showing key rows)

| Model | SS-Pro Off. | SS-Pro Ours | SS-V2 Off. | SS-V2 Ours | ... | Repro. Status |
|---|---|---|---|---|---|---|
| GUI-G 2 | 47.50 | 47.75 | 93.30 | 93.32 | ... | |
| Qwen3-VL-2B | 48.50 | 43.90 | - | 88.92 | ... | |
| UI-Venus-7B | 50.80 | 50.47 | 94.10 | 94.03 | ... | |
| MAI-UI-2B | 57.40 | 57.94 | 92.50 | 92.30 | ... | |
| Gemini 3.0 Pro (Zoom) | 72.70 | 75.08 | - | - | ... | |

  • Overall Reproduction Rate: 95.8% (46/48 cells with official baselines).
  • Open-source models: 95.7% reproduction rate.
  • Failure Cases: Two failures (Qwen3-VL-2B, UI-TARS 1.5-7B on SS-Pro) are attributed to undisclosed official evaluation configurations.
  • Closed-source models: Evaluated via a Zoom paradigm (two-stage crop-then-ground strategy), successfully recovering official performance.
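The two-stage Zoom paradigm can be sketched as follows: a first pass locates a coarse region, the screenshot is cropped around it, a second pass grounds precisely inside the crop, and the crop-local point is mapped back to full-image coordinates. The function signatures here are illustrative assumptions:

```python
def zoom_ground(coarse_locate, fine_ground, image_size, crop_size):
    """Crop-then-ground ('Zoom') sketch for high-resolution grounding.

    coarse_locate(): stage 1, returns an approximate (x, y) center.
    fine_ground((x0, y0, size)): stage 2, returns a point within the crop.
    Returns the grounded point in full-image coordinates.
    """
    W, H = image_size
    cx, cy = coarse_locate()                       # stage 1: coarse center
    half = crop_size // 2
    x0 = min(max(cx - half, 0), W - crop_size)     # clamp crop to image
    y0 = min(max(cy - half, 0), H - crop_size)
    lx, ly = fine_ground((x0, y0, crop_size))      # stage 2: point in crop
    return (x0 + lx, y0 + ly)                      # map back to global coords
```

Because the second pass sees a much smaller image at full detail, this recovers precision that a single full-screenshot query loses on dense, high-resolution UIs.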

Theoretical and Practical Implications

  • Infrastructure as a Catalyst: The work demonstrates that the lack of well-engineered, open-source infrastructure is a primary bottleneck for GUI agents, and that providing it is a powerful catalyst for advancing agent capabilities, even at modest model scale.
  • Reproducibility Standard: ClawGUI-Eval provides a community-wide standard for comparable evaluation, showing that discrepancies are an infrastructure problem, not a fundamental limitation.
  • Path to Real-World Impact: ClawGUI-Agent bridges the research-to-user gap, demonstrating a viable path for deploying trained agents into real user workflows with personalization.
  • Convergence of Paradigms: The hybrid CLI-GUI approach and discussion point toward a future unified agentic harness where CLI, GUI, and API calls are interchangeable actions.
  • Foundational for Future Directions: The framework provides the substrate for scaling RL beyond emulators (e.g., via mock apps or on-device training), developing persistent on-device system agents, and training GUI-specific world models for predictive planning.

Conclusion

ClawGUI is a unified open-source framework that integrates online RL training, standardized evaluation, and real-device deployment into a single pipeline for GUI agents. Its components—ClawGUI-RL, ClawGUI-Eval, and ClawGUI-Agent—address critical gaps in the field. The end-to-end trained ClawGUI-2B model validates the framework's effectiveness. The authors hope ClawGUI serves as a foundational platform for the community to build, evaluate, and deploy the next generation of GUI agents, paving the way toward on-device, always-present system intelligence.