ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents - Summary
Overview
- Unified Open-Source Framework: ClawGUI integrates scalable online RL training (ClawGUI-RL), fully standardized evaluation (ClawGUI-Eval), and real-device deployment (ClawGUI-Agent) into a single, coherent pipeline for GUI agent development.
- First Open-Source GUI Agent RL Infrastructure: ClawGUI-RL supports training on both parallel virtual environments and real physical devices, integrating the GiGPO algorithm with a Process Reward Model (PRM) for dense step-level supervision.
- High-Fidelity Reproducible Evaluation: ClawGUI-Eval enforces a strict Infer → Judge → Metric pipeline across 6 benchmarks and 11+ models, achieving a 95.8% reproduction rate against official baselines, addressing critical reproducibility issues in the field.
- Deployment to Real Users: ClawGUI-Agent enables deployment to Android, HarmonyOS, and iOS through 12+ chat platforms, featuring hybrid CLI-GUI control and a persistent personalized memory system.
- Empirical Validation: The model ClawGUI-2B, trained end-to-end within the framework, achieves a 17.1% Success Rate on the MobileWorld GUI-Only benchmark, outperforming the same-scale MAI-UI-2B baseline (11.1%) by 6.0% absolute (54% relative).
Introduction and Theoretical Foundation
Graphical User Interface (GUI) agents, which interact with software via visual perception and low-level actions (tap, swipe, type), promise universal digital automation. However, progress is bottlenecked by a lack of coherent full-stack infrastructure rather than model capacity. The paper identifies three critical gaps:
- Closed Training Ecosystem: Strong online RL results are reported but underlying infrastructure is not released, and training on real physical devices remains unexplored.
- Misaligned Evaluation: Reported numbers across papers are not directly comparable due to undocumented differences in prompts, resolution, and normalization conventions.
- Broken Deployment Loop: Trained agents rarely reach end-users. CLI-based agents have limited coverage, while GUI-based agents lack integration into daily workflows.
ClawGUI is proposed to close all three gaps within a single open-source system, providing a unified foundation for the community to build, evaluate, and deploy GUI agents.
Methodology
ClawGUI consists of three integrated modules.
3.2 ClawGUI-RL: Scalable Online RL Training
Environment Manager: Abstracts device backends (virtual emulators, real devices) behind a unified interface. It features:
- Virtual Environment: Parallel Docker-based Android emulators with lifecycle management (task reset, evaluation, spare server rotation, teardown).
- Real Device Training: Support for physical Android/cloud phones with human-authored tasks and MLLM-based evaluation.
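The unified backend interface described above can be sketched as follows. This is a minimal illustration of the reset/step/teardown contract, assuming hypothetical class and method names (`GUIEnv`, `Observation`, `MockEmulatorEnv`) rather than ClawGUI-RL's actual API:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes
    task_done: bool = False

class GUIEnv:
    """Unified device-backend interface (names hypothetical).

    Both virtual emulators and real devices implement the same
    reset/step/teardown contract so the RL trainer is backend-agnostic.
    """
    def reset(self, task_id: str) -> Observation:
        raise NotImplementedError
    def step(self, action: dict) -> Observation:
        raise NotImplementedError
    def teardown(self) -> None:
        raise NotImplementedError

class MockEmulatorEnv(GUIEnv):
    """Stand-in for a Docker-based Android emulator backend."""
    def __init__(self):
        self.steps = 0
    def reset(self, task_id):
        self.steps = 0
        return Observation(screenshot=b"")
    def step(self, action):
        self.steps += 1
        # Pretend the task completes after three actions.
        return Observation(screenshot=b"", task_done=self.steps >= 3)
    def teardown(self):
        pass

env = MockEmulatorEnv()
obs = env.reset("open_settings")
while not obs.task_done:
    obs = env.step({"type": "tap", "x": 100, "y": 200})
env.teardown()
```

Because the trainer only sees the `GUIEnv` interface, swapping a Docker emulator for a physical phone changes the backend implementation, not the training loop.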
Reward Design: A two-level formulation to address reward sparsity in long-horizon tasks.
- Binary Outcome Reward: r_outcome ∈ {0, 1}, equal to 1 for task success and 0 for failure, assigned at episode end.
- Dense Step-Level Reward via PRM: A Process Reward Model judges each action's contribution to task completion, producing a per-step score r_t^PRM.
- Combined Reward: The total reward augments the sparse outcome signal with the dense step-level scores, taking the form r_t = 1[t = T] · r_outcome + λ · r_t^PRM, where λ weights the PRM term against the outcome reward.
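A minimal sketch of how the two reward signals could be combined per step. The specific weighting (a single scalar λ, with the outcome reward added only at the final step) is an assumption for illustration, not the paper's exact formula:

```python
def combined_rewards(outcome: float, prm_scores: list[float], lam: float = 0.5) -> list[float]:
    """Per-step rewards: a weighted PRM score at every step, with the
    sparse success/failure signal added at the episode's last step.

    NOTE: the weighting scheme here is an illustrative assumption.
    """
    rewards = [lam * s for s in prm_scores]
    rewards[-1] += outcome  # binary outcome lands on the terminal step
    return rewards

# A successful 3-step episode with PRM scores for each action:
r = combined_rewards(outcome=1.0, prm_scores=[0.2, 0.8, 0.6], lam=0.5)
```

Without the PRM term, only the last entry would be nonzero, which is exactly the reward-sparsity problem in long-horizon GUI tasks.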
RL Trainer: Built upon verl and verl-agent, supporting algorithms like Reinforce++, PPO, GSPO, GRPO, and GiGPO.
- GRPO (Group Relative Policy Optimization): Estimates advantages by normalizing returns within a group of rollouts sharing the same task. It assigns a uniform episode-level advantage, which is coarse for multi-step GUI tasks.
- GiGPO (Group-in-Group Policy Optimization): Employs a two-level hierarchical advantage estimation:
- Macro-level: Retains relative advantage across complete trajectories.
- Micro-level: Introduces anchor-state grouping. Steps that encounter the same intermediate state across different rollouts are clustered, and micro relative advantages are estimated within each sub-group via discounted return normalization. This enables fine-grained per-step credit assignment.
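The micro-level step above can be sketched concretely. This simplified version groups steps from different rollouts by a shared anchor-state key and normalizes discounted returns within each group (state hashing, the macro-level term, and GiGPO's full objective are omitted; function names are illustrative):

```python
from collections import defaultdict
import statistics

def discounted_returns(rewards, gamma=0.95):
    """Discounted return G_t for each step of one trajectory."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def micro_advantages(rollouts, gamma=0.95):
    """rollouts: list of trajectories, each a list of (state_key, reward).

    Steps that share the same anchor state across rollouts are clustered,
    and each step's return is normalized against its cluster's statistics.
    """
    groups = defaultdict(list)  # anchor state -> [(traj_idx, step_idx, return)]
    for i, traj in enumerate(rollouts):
        rets = discounted_returns([r for _, r in traj], gamma)
        for t, (state, _) in enumerate(traj):
            groups[state].append((i, t, rets[t]))
    adv = {}
    for state, members in groups.items():
        rets = [g for _, _, g in members]
        mu = statistics.fmean(rets)
        sd = statistics.pstdev(rets) or 1.0  # singleton groups get sd=1
        for i, t, g in members:
            adv[(i, t)] = (g - mu) / sd
    return adv

rollouts = [
    [("home", 0.0), ("settings", 1.0)],  # rollout that reaches the goal
    [("home", 0.0), ("browser", 0.0)],   # rollout that wanders off
]
adv = micro_advantages(rollouts)
```

Both rollouts pass through the `"home"` anchor state, so the first step of each is compared directly: the step that led toward success gets a positive micro advantage, the one that led astray a negative one. That is the per-step credit assignment that a uniform episode-level advantage (as in GRPO) cannot provide.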
3.3 ClawGUI-Eval: Reproducible GUI Evaluation
Benchmark and Model Coverage: Covers 6 benchmarks: ScreenSpot-Pro, ScreenSpot-V2, UI-Vision, MMBench-GUI, OSWorld-G, AndroidControl. Supports 11+ models including Qwen3-VL, UI-TARS, MAI-UI, Gemini, Seed.
Pipeline Architecture: A strict three-stage, decoupled pipeline:
- Infer: Generates raw predictions via local GPU (`transformers`) or remote API inference, with multi-GPU parallelism and shard-level checkpointing.
- Judge: Applies benchmark-specific judges (e.g., point-in-box, polygon-aware) to parse outputs and produce per-sample correctness labels.
- Metric: Aggregates labels into final accuracy scores with fine-grained breakdowns.
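The Judge and Metric stages can be illustrated with the simplest judge mentioned above, point-in-box. Function names are illustrative, not ClawGUI-Eval's actual API:

```python
def point_in_box(pred_xy, bbox):
    """Judge stage: is the predicted click inside the ground-truth bbox?

    bbox = (x1, y1, x2, y2), in the same coordinate space as the prediction.
    """
    x, y = pred_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def accuracy(samples):
    """Metric stage: aggregate per-sample correctness labels into a score.

    samples: list of (pred_xy, bbox) pairs.
    """
    labels = [point_in_box(p, b) for p, b in samples]
    return sum(labels) / len(labels)

samples = [
    ((50, 40), (0, 0, 100, 80)),   # hit
    ((150, 40), (0, 0, 100, 80)),  # miss: outside on x
]
acc = accuracy(samples)  # 0.5
```

Keeping the three stages decoupled means a judging convention (e.g., coordinate normalization) can be changed and re-run against cached Infer outputs without re-running inference, which is what makes the pipeline auditable and reproducible.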
3.4 ClawGUI-Agent: Personal GUI Assistant
Hybrid Device Control: Combines the efficiency of CLI for supported operations with the universal coverage of GUI, addressing the limitations of each paradigm alone.
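A hypothetical sketch of the hybrid routing decision: use a CLI path (e.g., an adb-style shell command) when the operation has one, and fall back to GUI actions otherwise. The operation table, command templates, and returned action specs are all assumptions for illustration:

```python
# Operations with a known CLI path (templates are illustrative adb-style commands).
CLI_COMMANDS = {
    "open_app": "am start -n {package}/.MainActivity",
    "uninstall": "pm uninstall {package}",
}

def plan_action(operation: str, **kwargs) -> dict:
    """Route to CLI when a command template exists; otherwise emit a GUI action.

    CLI is fast and reliable where supported; GUI covers everything else.
    """
    if operation in CLI_COMMANDS:
        return {"mode": "cli", "cmd": CLI_COMMANDS[operation].format(**kwargs)}
    # No CLI coverage: the agent must perceive the screen and act via taps.
    return {"mode": "gui", "op": operation, "args": kwargs}

a1 = plan_action("open_app", package="com.example.mail")
a2 = plan_action("toggle_airplane_mode")
```

Here `a1` resolves to a one-shot shell command while `a2` falls back to visual GUI control, capturing the complementarity of the two paradigms.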
Personalized Memory: Automatically extracts and stores structured facts (contacts, app usage, preferences) as vector embeddings. The top-k most similar memories are retrieved for subsequent tasks, enabling adaptation to individual users.
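The retrieval step can be sketched as top-k cosine similarity over stored embeddings. The embedding model and vector store are placeholders; real embeddings are high-dimensional, and the 2-D vectors below exist only to make the example self-contained:

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, memories, k=2):
    """memories: list of (fact_text, embedding). Returns the k most similar facts."""
    scored = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [fact for fact, _ in scored[:k]]

memories = [
    ("prefers dark mode", [1.0, 0.0]),
    ("mom's contact is Alice", [0.0, 1.0]),
    ("often orders coffee at 9am", [0.9, 0.1]),
]
hits = top_k([1.0, 0.0], memories, k=2)
```

The retrieved facts are then prepended to the agent's context for the new task, which is how earlier interactions shape later behavior without retraining.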
Deployment Modes:
- Remote Control: Users issue tasks via 12+ chat platforms (Feishu, Telegram, etc.) to control a target phone remotely.
- Local Control: The agent takes over the local device from a chat app running on the same phone.
ClawGUI-Eval as a Skill: The entire evaluation pipeline can be triggered via a single natural language command through the agent interface.
Empirical Validation / Results
4.2 Main Results
Table 1: Comparison of models on MobileWorld GUI-Only (117 tasks) benchmark.
| Model | MobileWorld SR (GUI-Only) | Agentic Framework |
|---|---|---|
| Claude-4.5-Sonnet + UI-Ins-7B | 47.8 | Yes |
| Gemini-3-Pro + UI-Ins-7B | 55.6 | Yes |
| GPT-5 + UI-Ins-7B | 54.0 | Yes |
| GUI-Owl-7B | 7.7 | No |
| UI-Venus-72B | 16.4 | No |
| Qwen3-VL-32B | 11.9 | No |
| Doubao-1.5-UI-TARS | 26.3 | No |
| MAI-UI-2B | 11.1 | No |
| MAI-UI-8B | 19.7 | No |
| ClawGUI-2B | 17.1 | No |
Key Observations:
- Infrastructure drives policy quality: ClawGUI-2B (17.1% SR) outperforms the same-scale MAI-UI-2B (11.1% SR) by 6.0% absolute, validating the effectiveness of the ClawGUI-RL infrastructure.
- Small well-trained models outperform larger untrained ones: ClawGUI-2B surpasses larger models like Qwen3-VL-32B (11.9%) and UI-Venus-72B (16.4%).
- Agentic frameworks remain a separate regime: Proprietary frameworks achieve higher scores but rely on closed-source planners.
4.3 Every Step Counts: Dense Reward Unlocks Better GUI Policies
Table 2: Ablation on reward design on MobileWorld GUI-Only.
| Method | Reward Type | SR (%) |
|---|---|---|
| GRPO | Binary (episode-level) | 14.5 |
| GiGPO | Dense (episode- & step-level) | 17.1 |
Replacing episode-level GRPO with step-level GiGPO yields a 2.6% absolute improvement (17.9% relative), confirming the critical value of fine-grained credit assignment via dense step-level supervision.
4.4 Benchmarking the Benchmarks: Can We Trust Published GUI Numbers?
Table 3: Reproduction results across GUI grounding benchmarks. (Excerpt showing key rows)
| Model | SS-Pro Off. | SS-Pro Ours | SS-V2 Off. | SS-V2 Ours | ... | Repro. Status |
|---|---|---|---|---|---|---|
| GUI-G 2 | 47.50 | 47.75 | 93.30 | 93.32 | ... | ✓ |
| Qwen3-VL-2B | 48.50 | 43.90 | - | 88.92 | ... | ✗ |
| UI-Venus-7B | 50.80 | 50.47 | 94.10 | 94.03 | ... | ✓ |
| MAI-UI-2B | 57.40 | 57.94 | 92.50 | 92.30 | ... | ✓ |
| Gemini 3.0 Pro (Zoom) | 72.70 | 75.08 | - | - | ... | ✓ |
- Overall Reproduction Rate: 95.8% (46/48 cells with official baselines).
- Open-source models: 95.7% reproduction rate.
- Failure Cases: Two failures (Qwen3-VL-2B, UI-TARS 1.5-7B on SS-Pro) are attributed to undisclosed official evaluation configurations.
- Closed-source models: Evaluated via a Zoom paradigm (two-stage crop-then-ground strategy), successfully recovering official performance.
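The coordinate bookkeeping behind such a crop-then-ground strategy can be sketched simply: stage one selects a region, stage two grounds inside the crop, and the local hit must be mapped back to full-image space. The helper below is an assumption about how such a pipeline is typically wired, not ClawGUI-Eval's implementation:

```python
def remap_to_full(local_xy, crop_box):
    """Map a point predicted inside a crop back to full-image coordinates.

    crop_box = (x1, y1, x2, y2): the crop's position in the full image.
    """
    lx, ly = local_xy
    x1, y1, _, _ = crop_box
    return (x1 + lx, y1 + ly)

# Stage 1 zoomed into the lower-right quadrant of a 1000x1000 screen;
# stage 2 grounded the target at (120, 80) within that crop.
full_xy = remap_to_full((120, 80), (500, 500, 1000, 1000))
# full_xy == (620, 580)
```

The judge then scores `full_xy` against the ground-truth box exactly as it would a single-stage prediction, so the zoom paradigm changes inference, not evaluation.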
Theoretical and Practical Implications
- Infrastructure as a Catalyst: The work demonstrates that well-engineered, open-source infrastructure is a primary bottleneck and a powerful catalyst for advancing GUI agent capabilities, even at modest model scale.
- Reproducibility Standard: ClawGUI-Eval provides a community-wide standard for comparable evaluation, showing that discrepancies are an infrastructure problem, not a fundamental limitation.
- Path to Real-World Impact: ClawGUI-Agent bridges the research-to-user gap, demonstrating a viable path for deploying trained agents into real user workflows with personalization.
- Convergence of Paradigms: The hybrid CLI-GUI approach and discussion point toward a future unified agentic harness where CLI, GUI, and API calls are interchangeable actions.
- Foundational for Future Directions: The framework provides the substrate for scaling RL beyond emulators (e.g., via mock apps or on-device training), developing persistent on-device system agents, and training GUI-specific world models for predictive planning.
Conclusion
ClawGUI is a unified open-source framework that integrates online RL training, standardized evaluation, and real-device deployment into a single pipeline for GUI agents. Its components—ClawGUI-RL, ClawGUI-Eval, and ClawGUI-Agent—address critical gaps in the field. The end-to-end trained ClawGUI-2B model validates the framework's effectiveness. The authors hope ClawGUI serves as a foundational platform for the community to build, evaluate, and deploy the next generation of GUI agents, paving the way toward on-device, always-present system intelligence.