ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents - Summary
Overview
- Unified Open-Source Framework: ClawGUI integrates scalable online RL training (ClawGUI-RL), fully standardized evaluation (ClawGUI-Eval), and real-device deployment (ClawGUI-Agent) into a single, coherent pipeline for GUI agent development.
- First Open-Source GUI Agent RL Infrastructure: ClawGUI-RL supports training on both parallel virtual environments and real physical devices, integrating the GiGPO algorithm with a Process Reward Model (PRM) for dense step-level supervision.
- High-Fidelity Reproducible Evaluation: ClawGUI-Eval enforces a strict Infer → Judge → Metric pipeline across 6 benchmarks and 11+ models, achieving a 95.8% reproduction rate against official baselines, addressing critical reproducibility issues in the field.
- Deployment to Real Users: ClawGUI-Agent enables deployment to Android, HarmonyOS, and iOS through 12+ chat platforms, featuring hybrid CLI-GUI control and a persistent personalized memory system.
- Empirical Validation: The model ClawGUI-2B, trained end-to-end within the framework, achieves a 17.1% Success Rate on the MobileWorld GUI-Only benchmark, outperforming the same-scale MAI-UI-2B baseline (11.1%) by 6.0% absolute (54% relative).
Introduction and Theoretical Foundation
Graphical User Interface (GUI) agents, which interact with software via visual perception and low-level actions (tap, swipe, type), promise universal digital automation. However, progress is bottlenecked by a lack of coherent full-stack infrastructure rather than model capacity. The paper identifies three critical gaps:
- Closed Training Ecosystem: Strong online RL results are reported but underlying infrastructure is not released, and training on real physical devices remains unexplored.
- Misaligned Evaluation: Reported numbers across papers are not directly comparable due to undocumented differences in prompts, resolution, and normalization conventions.
- Broken Deployment Loop: Trained agents rarely reach end-users. CLI-based agents have limited coverage, while GUI-based agents lack integration into daily workflows.
ClawGUI is proposed to close all three gaps within a single open-source system, providing a unified foundation for the community to build, evaluate, and deploy GUI agents.
Methodology
ClawGUI consists of three integrated modules.
3.2 ClawGUI-RL: Scalable Online RL Training
Environment Manager: Abstracts device backends (virtual emulators, real devices) behind a unified interface. It features:
- Virtual Environment: Parallel Docker-based Android emulators with lifecycle management (task reset, evaluation, spare server rotation, teardown).
- Real Device Training: Support for physical Android/cloud phones with human-authored tasks and MLLM-based evaluation.
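The unified backend interface described above can be sketched as follows. This is a minimal illustration of the reset/step/teardown contract, assuming hypothetical class and method names (`GUIEnv`, `Observation`, `MockEmulatorEnv`) rather than ClawGUI-RL's actual API:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot: bytes
    task_done: bool = False

class GUIEnv:
    """Unified device-backend interface (names hypothetical).

    Both virtual emulators and real devices implement the same
    reset/step/teardown contract so the RL trainer is backend-agnostic.
    """
    def reset(self, task_id: str) -> Observation:
        raise NotImplementedError
    def step(self, action: dict) -> Observation:
        raise NotImplementedError
    def teardown(self) -> None:
        raise NotImplementedError

class MockEmulatorEnv(GUIEnv):
    """Stand-in for a Docker-based Android emulator backend."""
    def __init__(self):
        self.steps = 0
    def reset(self, task_id):
        self.steps = 0
        return Observation(screenshot=b"")
    def step(self, action):
        self.steps += 1
        # Pretend the task completes after three actions.
        return Observation(screenshot=b"", task_done=self.steps >= 3)
    def teardown(self):
        pass

env = MockEmulatorEnv()
obs = env.reset("open_settings")
while not obs.task_done:
    obs = env.step({"type": "tap", "x": 100, "y": 200})
env.teardown()
```

Because the trainer only sees the `GUIEnv` interface, swapping a Docker emulator for a physical phone changes the backend implementation, not the training loop.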
Reward Design: A two-level formulation to address reward sparsity in long-horizon tasks.
- Binary Outcome Reward: r_outcome ∈ {0, 1}, equal to 1 for task success and 0 for failure, assigned at episode end.
- Dense Step-Level Reward via PRM: A Process Reward Model judges each action's contribution to task completion, producing a per-step score r_t^PRM.
- Combined Reward: The total reward augments the sparse outcome signal with the dense step-level scores, taking the form r_t = 1[t = T] · r_outcome + λ · r_t^PRM, where λ weights the PRM term against the outcome reward.
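A minimal sketch of how the two reward signals could be combined per step. The specific weighting (a single scalar λ, with the outcome reward added only at the final step) is an assumption for illustration, not the paper's exact formula:

```python
def combined_rewards(outcome: float, prm_scores: list[float], lam: float = 0.5) -> list[float]:
    """Per-step rewards: a weighted PRM score at every step, with the
    sparse success/failure signal added at the episode's last step.

    NOTE: the weighting scheme here is an illustrative assumption.
    """
    rewards = [lam * s for s in prm_scores]
    rewards[-1] += outcome  # binary outcome lands on the terminal step
    return rewards

# A successful 3-step episode with PRM scores for each action:
r = combined_rewards(outcome=1.0, prm_scores=[0.2, 0.8, 0.6], lam=0.5)
```

Without the PRM term, only the last entry would be nonzero, which is exactly the reward-sparsity problem in long-horizon GUI tasks.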
RL Trainer: Built upon verl and verl-agent, supporting algorithms like Reinforce++, PPO, GSPO, GRPO, and GiGPO.
- GRPO (Group Relative Policy Optimization): Estimates advantages by normalizing returns within a group of rollouts sharing the same task. It assigns a uniform episode-level advantage, which is coarse for multi-step GUI tasks.
- GiGPO (Group-in-Group Policy Optimization): Employs a two-level hierarchical advantage estimation:
- Macro-level: Retains relative advantage across complete trajectories.
- Micro-level: Introduces anchor-state grouping. Steps that encounter the same intermediate state across different rollouts are clustered, and micro relative advantages are estimated within each sub-group via discounted return normalization. This enables fine-grained per-step credit assignment.
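The micro-level step above can be sketched concretely. This simplified version groups steps from different rollouts by a shared anchor-state key and normalizes discounted returns within each group (state hashing, the macro-level term, and GiGPO's full objective are omitted; function names are illustrative):

```python
from collections import defaultdict
import statistics

def discounted_returns(rewards, gamma=0.95):
    """Discounted return G_t for each step of one trajectory."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return out[::-1]

def micro_advantages(rollouts, gamma=0.95):
    """rollouts: list of trajectories, each a list of (state_key, reward).

    Steps that share the same anchor state across rollouts are clustered,
    and each step's return is normalized against its cluster's statistics.
    """
    groups = defaultdict(list)  # anchor state -> [(traj_idx, step_idx, return)]
    for i, traj in enumerate(rollouts):
        rets = discounted_returns([r for _, r in traj], gamma)
        for t, (state, _) in enumerate(traj):
            groups[state].append((i, t, rets[t]))
    adv = {}
    for state, members in groups.items():
        rets = [g for _, _, g in members]
        mu = statistics.fmean(rets)
        sd = statistics.pstdev(rets) or 1.0  # singleton groups get sd=1
        for i, t, g in members:
            adv[(i, t)] = (g - mu) / sd
    return adv

rollouts = [
    [("home", 0.0), ("settings", 1.0)],  # rollout that reaches the goal
    [("home", 0.0), ("browser", 0.0)],   # rollout that wanders off
]
adv = micro_advantages(rollouts)
```

Both rollouts pass through the `"home"` anchor state, so the first step of each is compared directly: the step that led toward success gets a positive micro advantage, the one that led astray a negative one. That is the per-step credit assignment that a uniform episode-level advantage (as in GRPO) cannot provide.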
3.3 ClawGUI-Eval: Reproducible GUI Evaluation
Benchmark and Model Coverage: Covers 6 benchmarks: ScreenSpot-Pro, ScreenSpot-V2, UI-Vision, MMBench-GUI, OSWorld-G, AndroidControl. Supports 11+ models including Qwen3-VL, UI-TARS, MAI-UI, Gemini, Seed.
Pipeline Architecture: A strict three-stage, decoupled pipeline:
- Infer: Generates raw predictions via local GPU (`transformers`) or remote API inference, with multi-GPU parallelism and shard-level checkpointing.
- Judge: Applies benchmark-specific judges (e.g., point-in-box, polygon-aware) to parse outputs and produce per-sample correctness labels.
- Metric: Aggregates labels into final accuracy scores with fine-grained breakdowns.
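The Judge and Metric stages can be illustrated with the simplest judge mentioned above, point-in-box. Function names are illustrative, not ClawGUI-Eval's actual API:

```python
def point_in_box(pred_xy, bbox):
    """Judge stage: is the predicted click inside the ground-truth bbox?

    bbox = (x1, y1, x2, y2), in the same coordinate space as the prediction.
    """
    x, y = pred_xy
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def accuracy(samples):
    """Metric stage: aggregate per-sample correctness labels into a score.

    samples: list of (pred_xy, bbox) pairs.
    """
    labels = [point_in_box(p, b) for p, b in samples]
    return sum(labels) / len(labels)

samples = [
    ((50, 40), (0, 0, 100, 80)),   # hit
    ((150, 40), (0, 0, 100, 80)),  # miss: outside on x
]
acc = accuracy(samples)  # 0.5
```

Keeping the three stages decoupled means a judging convention (e.g., coordinate normalization) can be changed and re-run against cached Infer outputs without re-running inference, which is what makes the pipeline auditable and reproducible.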
3.4 ClawGUI-Agent: Personal GUI Assistant
Hybrid Device Control: Combines the efficiency of CLI for supported operations with the universal coverage of GUI, addressing the limitations of each paradigm alone.
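A hypothetical sketch of the hybrid routing decision: use a CLI path (e.g., an adb-style shell command) when the operation has one, and fall back to GUI actions otherwise. The operation table, command templates, and returned action specs are all assumptions for illustration:

```python
# Operations with a known CLI path (templates are illustrative adb-style commands).
CLI_COMMANDS = {
    "open_app": "am start -n {package}/.MainActivity",
    "uninstall": "pm uninstall {package}",
}

def plan_action(operation: str, **kwargs) -> dict:
    """Route to CLI when a command template exists; otherwise emit a GUI action.

    CLI is fast and reliable where supported; GUI covers everything else.
    """
    if operation in CLI_COMMANDS:
        return {"mode": "cli", "cmd": CLI_COMMANDS[operation].format(**kwargs)}
    # No CLI coverage: the agent must perceive the screen and act via taps.
    return {"mode": "gui", "op": operation, "args": kwargs}

a1 = plan_action("open_app", package="com.example.mail")
a2 = plan_action("toggle_airplane_mode")
```

Here `a1` resolves to a one-shot shell command while `a2` falls back to visual GUI control, capturing the complementarity of the two paradigms.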
Personalized Memory: Automatically extracts and stores structured facts (contacts, app usage, preferences) as vector embeddings. The top-k most similar memories are retrieved for subsequent tasks, enabling adaptation to individual users.
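The retrieval step can be sketched as top-k cosine similarity over stored embeddings. The embedding model and vector store are placeholders; real embeddings are high-dimensional, and the 2-D vectors below exist only to make the example self-contained:

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, memories, k=2):
    """memories: list of (fact_text, embedding). Returns the k most similar facts."""
    scored = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [fact for fact, _ in scored[:k]]

memories = [
    ("prefers dark mode", [1.0, 0.0]),
    ("mom's contact is Alice", [0.0, 1.0]),
    ("often orders coffee at 9am", [0.9, 0.1]),
]
hits = top_k([1.0, 0.0], memories, k=2)
```

The retrieved facts are then prepended to the agent's context for the new task, which is how earlier interactions shape later behavior without retraining.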
Deployment Modes:
- Remote Control: Users issue tasks via 12+ chat platforms (Feishu, Telegram, etc.) to control a target phone remotely.
- Local Control: The agent takes over the local device from a chat app running on the same phone.
ClawGUI-Eval as a Skill: The entire evaluation pipeline can be triggered via a single natural language command through the agent interface.
Empirical Validation / Results
4.2 Main Results
Table 1: Comparison of models on MobileWorld GUI-Only (117 tasks) benchmark.
| Model | MobileWorld SR (GUI-Only) | Agentic Framework |
|---|---|---|
| Claude-4.5-Sonnet + UI-Ins-7B | 47.8 | Yes |
| Gemini-3-Pro + UI-Ins-7B | 55.6 | Yes |
| GPT-5 + UI-Ins-7B | 54.0 | Yes |
| GUI-Owl-7B | 7.7 | No |
| UI-Venus-72B | 16.4 | No |
| Qwen3-VL-32B | 11.9 | No |
| Doubao-1.5-UI-TARS | 26.3 | No |
| MAI-UI-2B | 11.1 | No |
| MAI-UI-8B | 19.7 | No |
| ClawGUI-2B | 17.1 | No |
Key Observations:
- Infrastructure drives policy quality: ClawGUI-2B (17.1% SR) outperforms the same-scale MAI-UI-2B (11.1% SR) by 6.0% absolute, validating the effectiveness of the ClawGUI-RL infrastructure.
- Small well-trained models outperform larger untrained ones: ClawGUI-2B surpasses larger models like Qwen3-VL-32B (11.9%) and UI-Venus-72B (16.4%).
- Agentic frameworks remain a separate regime: Proprietary frameworks achieve higher scores but rely on closed-source planners.
4.3 Every Step Counts: Dense Reward Unlocks Better GUI Policies
Table 2: Ablation on reward design on MobileWorld GUI-Only.
| Method | Reward Type | SR (%) |
|---|---|---|
| GRPO | Binary (episode-level) | 14.5 |
| GiGPO | Dense (episode- & step-level) | 17.1 |
Replacing episode-level GRPO with step-level GiGPO yields a 2.6% absolute improvement (17.9% relative), confirming the critical value of fine-grained credit assignment via dense step-level supervision.
4.4 Benchmarking the Benchmarks: Can We Trust Published GUI Numbers?
Table 3: Reproduction results across GUI grounding benchmarks. (Excerpt showing key rows)
| Model | SS-Pro Off. | SS-Pro Ours | SS-V2 Off. | SS-V2 Ours | ... | Repro. Status |
|---|---|---|---|---|---|---|
| GUI-G 2 | 47.50 | 47.75 | 93.30 | 93.32 | ... | ✓ |
| Qwen3-VL-2B | 48.50 | 43.90 | - | 88.92 | ... | ✗ |
| UI-Venus-7B | 50.80 | 50.47 | 94.10 | 94.03 | ... | ✓ |
| MAI-UI-2B | 57.40 | 57.94 | 92.50 | 92.30 | ... | ✓ |
| Gemini 3.0 Pro (Zoom) | 72.70 | 75.08 | - | - | ... | ✓ |
- Overall Reproduction Rate: 95.8% (46/48 cells with official baselines).
- Open-source models: 95.7% reproduction rate.
- Failure Cases: Two failures (Qwen3-VL-2B, UI-TARS 1.5-7B on SS-Pro) are attributed to undisclosed official evaluation configurations.
- Closed-source models: Evaluated via a Zoom paradigm (two-stage crop-then-ground strategy), successfully recovering official performance.
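The coordinate bookkeeping behind such a crop-then-ground strategy can be sketched simply: stage one selects a region, stage two grounds inside the crop, and the local hit must be mapped back to full-image space. The helper below is an assumption about how such a pipeline is typically wired, not ClawGUI-Eval's implementation:

```python
def remap_to_full(local_xy, crop_box):
    """Map a point predicted inside a crop back to full-image coordinates.

    crop_box = (x1, y1, x2, y2): the crop's position in the full image.
    """
    lx, ly = local_xy
    x1, y1, _, _ = crop_box
    return (x1 + lx, y1 + ly)

# Stage 1 zoomed into the lower-right quadrant of a 1000x1000 screen;
# stage 2 grounded the target at (120, 80) within that crop.
full_xy = remap_to_full((120, 80), (500, 500, 1000, 1000))
# full_xy == (620, 580)
```

The judge then scores `full_xy` against the ground-truth box exactly as it would a single-stage prediction, so the zoom paradigm changes inference, not evaluation.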
Theoretical and Practical Implications
- Infrastructure as a Catalyst: The work demonstrates that well-engineered, open-source infrastructure is a primary bottleneck and a powerful catalyst for advancing GUI agent capabilities, even at modest model scale.
- Reproducibility Standard: ClawGUI-Eval provides a community-wide standard for comparable evaluation, showing that discrepancies are an infrastructure problem, not a fundamental limitation.
- Path to Real-World Impact: ClawGUI-Agent bridges the research-to-user gap, demonstrating a viable path for deploying trained agents into real user workflows with personalization.
- Convergence of Paradigms: The hybrid CLI-GUI approach and discussion point toward a future unified agentic harness where CLI, GUI, and API calls are interchangeable actions.
- Foundational for Future Directions: The framework provides the substrate for scaling RL beyond emulators (e.g., via mock apps or on-device training), developing persistent on-device system agents, and training GUI-specific world models for predictive planning.
Conclusion
ClawGUI is a unified open-source framework that integrates online RL training, standardized evaluation, and real-device deployment into a single pipeline for GUI agents. Its components—ClawGUI-RL, ClawGUI-Eval, and ClawGUI-Agent—address critical gaps in the field. The end-to-end trained ClawGUI-2B model validates the framework's effectiveness. The authors hope ClawGUI serves as a foundational platform for the community to build, evaluate, and deploy the next generation of GUI agents, paving the way toward on-device, always-present system intelligence.