Summary (Overview)
- MobileForge is an annotation-free adaptation system for mobile GUI agents that requires no human-written tasks, demonstrations, or reward labels.
- It introduces MobileGym, a unified substrate that grounds task generation, exploration, rollout execution, and hierarchical evaluation in real target-app interaction.
- It proposes HiFPO (Hierarchical Feedback-Guided Policy Optimization), which converts trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates.
- Key results: Using only automatically generated data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld (close to the closed-data GUI-Owl-1.5-8B at 69.0%). The adapted ForgeOwl-8B reaches 77.6% Pass@3 on AndroidWorld and 41.0% on out-of-domain MobileWorld GUI-only, establishing the strongest open-data mobile GUI agent in their evaluation.
- The system scales with generated tasks, improves both generalist and GUI-specialized agents, and transfers cross-domain without any MobileWorld training data.
Introduction and Theoretical Foundation
Background
MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but real deployment requires adaptation beyond fixed benchmarks. Mobile apps are numerous and fast-changing, making human-written tasks, expert demonstrations, and manual reward labels costly and quickly stale.
Motivation
Existing annotation-free GUI learning work (TongUI, MobileA3gent, OS-Genesis, GUI-explorer, ZeroGUI, MobileGUI-RL, etc.) reduces manual supervision but suffers from two key bottlenecks:
- Lack of a unified mobile substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback, so generated tasks may be weakly grounded and evaluator feedback detached from policy learning.
- Isolated rollouts with sparse reward – even with step-level assessment, current loops rarely combine outcomes, process feedback, and corrective hints to accumulate reusable experience beyond the initial policy’s capability boundary.
Research Question
"Can we build an annotation-free adaptation system for mobile GUI agents that grounds tasks in target-app interaction, generates fine-grained feedback, and converts self-collected experience into policy-improvement signals without human-written tasks, demonstrations, or reward labels?"
Methodology
Problem Setup
Mobile GUI control is modeled as sequential decision making. At each step of a given attempt on a task, the policy receives a decision state made up of the current screenshot, the interaction history, and the corrective hint context carried over from earlier attempts. The policy emits a structured GUI action consisting of an action type and its arguments.
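As a minimal sketch of the setup above, the decision state and the structured action can be written as two small records. The class names, field names, and the `build_prompt` helper are illustrative assumptions, not the paper's notation or code; including the task instruction in the state is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DecisionState:
    """What the policy sees at one step (field names are hypothetical)."""
    instruction: str                                   # task instruction (assumed to be part of the state)
    screenshot: bytes                                  # current screen image
    history: list[str] = field(default_factory=list)   # earlier actions in this attempt
    hint_context: str = ""                             # corrective hints from earlier attempts


@dataclass
class GuiAction:
    """Structured GUI action: an action type plus type-specific arguments."""
    action_type: str
    arguments: dict[str, Any] = field(default_factory=dict)


def build_prompt(state: DecisionState) -> str:
    """Render the textual part of the state, placing hints right after the instruction."""
    parts = [f"Task: {state.instruction}"]
    if state.hint_context:
        parts.append(f"Hints from earlier attempts: {state.hint_context}")
    if state.history:
        parts.append("History: " + " | ".join(state.history))
    return "\n".join(parts)
```

On a first attempt `hint_context` is empty, so the prompt reduces to the instruction plus any history.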
MobileGym: Building an Adaptation Substrate
- Target-App Exploration: Function-aware exploration (inspired by GUI-explorer) using app-level structural anchors (APK activities) combined with screenshots to generate goal-oriented tasks via depth-first traversal. Exploration produces transition records that form an evidence pool.
- MobileGym-Curriculum: Converts exploration evidence into executable tasks, each specifying an instruction, a step budget, a core functionality, a variation type, and prerequisites.
- Hierarchical Rollout Evaluation (MobileGym-Critic): An agentic evaluator that converts execution logs into visual action traces → produces structured JSON verdict with:
- Trajectory outcome label (success/failure)
- Step-level process labels, each pairing a reasonable/unreasonable flag for the step with a rationale
- Corrective hint summarizing key mistakes and suggested alternatives
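The three-part verdict described above might be consumed as follows. This is a sketch only: the JSON keys (`outcome`, `steps`, `reasonable`, `rationale`, `hint`) and the example payload are assumptions made for illustration, since the summary does not give the actual schema.

```python
import json
from dataclasses import dataclass


@dataclass
class StepLabel:
    reasonable: bool   # process label for one step
    rationale: str     # evaluator's explanation


@dataclass
class Verdict:
    success: bool            # trajectory outcome
    steps: list[StepLabel]   # step-level process labels
    hint: str                # corrective hint for the next attempt


def parse_verdict(raw: str) -> Verdict:
    """Parse a critic verdict from JSON (hypothetical key names)."""
    data = json.loads(raw)
    steps = [StepLabel(bool(s["reasonable"]), s["rationale"]) for s in data["steps"]]
    return Verdict(data["outcome"] == "success", steps, data.get("hint", ""))


def reasonable_step_indices(verdict: Verdict) -> list[int]:
    """Indices of steps the evaluator judged reasonable."""
    return [i for i, s in enumerate(verdict.steps) if s.reasonable]


# Made-up example payload, not real evaluator output.
RAW = json.dumps({
    "outcome": "failure",
    "steps": [
        {"reasonable": True, "rationale": "Opened the target app"},
        {"reasonable": False, "rationale": "Tapped the wrong tab"},
    ],
    "hint": "Use the Lists tab before adding an item.",
})
verdict = parse_verdict(RAW)
```

Keeping the three feedback levels in separate fields lets a downstream training loop use each one independently.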
HiFPO: Feedback-Guided Policy Optimization
- Hint-Guided Multi-Attempt Rollout: For each task, run several attempts in series so that feedback from earlier attempts conditions later ones. The hint context is appended to the task instruction before the next attempt.
- Task Filtering: Compute each task's empirical success rate across its attempts. Remove tasks the policy has already mastered; retain all-fail and mixed tasks.
- Trajectory and Step Selection: For each retained task, select one informative attempt (the best successful attempt with the cleanest process feedback, or otherwise the best failure). Keep only the steps labeled reasonable as the training set.
- Hint-Contextualized Step-Level GRPO: For each training step, sample a group of candidate responses from the hint-contextualized prompt and score each with an adaptive GUI action reward. The reward checks action-type correctness, and checks argument correctness only when the type is correct. Rewards are normalized within each group to compute advantages, and the loss uses clipped importance ratios with KL regularization.
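The reward gating and group normalization in the last step can be sketched in a few lines. The equal weighting of the two checks, exact-match comparison of arguments, and the use of the retained trajectory step as the reference action are all assumptions; the normalization is the standard group-relative form (reward minus group mean, divided by group standard deviation) and may differ in detail from the paper's.

```python
from statistics import mean, pstdev


def action_reward(pred, ref):
    """Score a candidate (action_type, arguments) pair against a reference pair.

    Arguments are only checked when the action type is correct.
    The 1.0 + 1.0 weighting is an illustrative assumption.
    """
    pred_type, pred_args = pred
    ref_type, ref_args = ref
    if pred_type != ref_type:
        return 0.0
    return 1.0 + (1.0 if pred_args == ref_args else 0.0)


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled candidates for a single training step (made-up values).
reference = ("click", {"target": "Save"})
candidates = [
    ("click", {"target": "Save"}),    # type and arguments correct
    ("click", {"target": "Cancel"}),  # type correct, arguments wrong
    ("scroll", {"direction": "down"}),  # wrong type, arguments not checked
    ("click", {"target": "Delete"}),  # type correct, arguments wrong
]
rewards = [action_reward(c, reference) for c in candidates]
advantages = group_advantages(rewards)
```

When every candidate in a group earns the same reward, all advantages are zero and that step contributes no gradient signal.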
Empirical Validation / Results
Experimental Protocol
- In-domain: AndroidWorld (116 tasks, Pass@1/2/3)
- Out-of-domain: MobileWorld GUI-only (117 tasks, no training data used)
- Base agents: Qwen3-VL-8B (generalist), GUI-Owl-1.5-8B (GUI-specialized)
- Training data: 3,249 candidate tasks from 20 apps; main results use 900-task subsets
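For readers unfamiliar with the Pass@1/2/3 columns, the sketch below shows one common way to compute them. It assumes a task counts as passed at k when any of its first k attempts succeeds; the summary does not spell out the benchmark's exact definition, and the task names and outcomes are made up.

```python
def pass_at_k(attempts_by_task, k):
    """Fraction of tasks with at least one success among the first k attempts."""
    passed = sum(any(outcomes[:k]) for outcomes in attempts_by_task.values())
    return passed / len(attempts_by_task)


# Made-up per-task outcomes for three serialized attempts each.
example = {
    "task_a": [False, True, False],
    "task_b": [False, False, False],
    "task_c": [True, False, False],
    "task_d": [False, False, True],
}
scores = {k: pass_at_k(example, k) for k in (1, 2, 3)}
```

Under this definition Pass@k can only stay flat or rise as k grows, which matches the pattern in the results table.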
In-Domain Adaptation (AndroidWorld)
| Agent | Tasks | Pass@1 | Pass@2 | Pass@3 | Easy | Medium | Hard |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B | 0 | 40.5% | 49.1% | 55.2% | 44.8% | 35.2% | 19.3% |
| ForgeQwen3-8B | 900 | 50.9% | 60.3% | 67.2% | 61.2% | 41.7% | 17.5% |
| GUI-Owl-1.5-8B | 0 | 56.0% | 68.1% | 69.0% | 66.7% | 50.0% | 19.3% |
| ForgeOwl-8B | 900 | 67.2% | 75.0% | 77.6% | 73.2% | 57.4% | 29.8% |
Table 1 summary: ForgeOwl-8B achieves 77.6% Pass@3, a 12.5% relative gain (8.6 pp) over its GUI-Owl-1.5-8B base at 69.0%. ForgeQwen3-8B narrows the gap to the GUI-Owl-1.5-8B base (67.2% vs 69.0%).
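Because the summary quotes a relative gain while the table lists absolute percentages, a short check of the arithmetic may help. The helper name `gains` is hypothetical; the inputs are the Pass@3 values from the table.

```python
def gains(base, adapted):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = adapted - base
    relative_pct = 100.0 * absolute_pp / base
    return absolute_pp, relative_pct


# Pass@3 values taken from the table above.
owl_abs, owl_rel = gains(69.0, 77.6)     # GUI-Owl-1.5-8B -> ForgeOwl-8B
qwen_abs, qwen_rel = gains(55.2, 67.2)   # Qwen3-VL-8B -> ForgeQwen3-8B
```

Rounded to one decimal, the GUI-Owl pair gives 8.6 pp absolute and 12.5% relative, and the Qwen3-VL pair gives 12.0 pp absolute and 21.7% relative.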
Cross-Domain Generalization (MobileWorld GUI-only)
| Agent | Success Rate |
|---|---|
| GUI-Owl-1.5-32B | 43.9% |
| ForgeOwl-8B (Ours) | 41.0% |
| GUI-Owl-1.5-8B | 37.6% |
| MAI-UI-8B | 27.5% |
| ForgeQwen3-8B (Ours) | 10.3% (+35.5% relative over Qwen3-VL-8B's 7.6%) |
ForgeOwl-8B surpasses all open-data mobile GUI agents and approaches much larger closed-data models.
Key Ablations
Corrective Hints (Table 3) – Removing hints reduces overall rollout success from 77.0% to 52.0% (25 pp drop) and Pass@3 from 72.5% to 49.0%.
Training Objective (Table 4) – Hint-contextualized GRPO outperforms both no-hint SFT and hint SFT (47.4% vs 45.7% vs 34.5% Pass@1 at 200 tasks; 50.9% vs 47.4% vs 44.0% at 900 tasks).
Task Filtering (Table 5) – Retaining all-fail and mixed tasks ([0.0, 0.9]) gives best combined AndroidWorld + MobileWorld result (48.3% Pass@1, 15/117 MobileWorld).
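A minimal sketch of this filter, assuming the [0.0, 0.9] range is a closed interval on each task's empirical success rate over its rollout attempts. The function name, the data layout, and the example outcomes are hypothetical.

```python
def filter_tasks(outcomes_by_task, max_success_rate=0.9):
    """Keep tasks whose empirical success rate lies in [0.0, max_success_rate].

    Returns a dict mapping each retained task id to its success rate.
    Tasks with no recorded attempts are skipped.
    """
    kept = {}
    for task_id, outcomes in outcomes_by_task.items():
        if not outcomes:
            continue
        rate = sum(outcomes) / len(outcomes)
        if rate <= max_success_rate:
            kept[task_id] = rate
    return kept


# Made-up rollout outcomes: mastered, mixed, and all-fail tasks.
rollouts = {
    "mastered": [True, True, True],
    "mixed": [False, True, True],
    "all_fail": [False, False, False],
}
retained = filter_tasks(rollouts)
```

Here the mastered task is dropped, while the mixed and all-fail tasks stay in the training pool.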
Evaluator Model (Table 6) – Gemini 2.5 Pro gives best results, but Qwen3-VL-8B as evaluator still improves base policy (44.8% Pass@1 vs 40.5%).
Curriculum Grounding (Table 7) – MobileGym-Curriculum covers broader functions (shopping lists, cooking assistant, meal planner) vs landing-screen baseline which over-concentrates on recipe creation/deletion.
Error Analysis
The largest failure reductions occur in verification (-38.1 pp), followed by search, complex UI, screen reading, and repetition. Hard cases remain unsolved: game-playing, multi-app tasks, and memorization/math-counting.
Theoretical and Practical Implications
Theoretical Contributions
- Hierarchical feedback framework: Separates trajectory outcomes, step-level process labels, and corrective hints – each serving different roles in policy learning.
- Hint-contextualized GRPO: Extends group-relative policy optimization to be conditioned on accumulated corrective hints, making step-level advantages depend on reusable feedback rather than just the current state.
- Step-level supervision from trajectories: Demonstrates that long-horizon mobile trajectories can be converted into dense step-level training signals using process feedback, without requiring a learned reward model.
Practical Implications
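- Annotation-free adaptation is viable: Using only automatically generated data, the system comes close to or exceeds a closed-data GUI-specialized agent of the same size. ForgeQwen3-8B reaches 67.2% Pass@3 on AndroidWorld against GUI-Owl-1.5-8B's 69.0%, and ForgeOwl-8B reaches 77.6%.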
- Cross-domain transfer: Adaptation on AndroidWorld generalizes to MobileWorld without any target-platform data, suggesting the approach captures generalizable GUI skills.
- Open-source release: Code, data, and models will be released at mobile-forge.github.io, enabling further research.
Conclusion
MobileForge addresses two key bottlenecks in annotation-free GUI learning: the lack of a unified adaptation substrate and the weakness of isolated rollouts with coarse feedback. MobileGym grounds task generation and evaluation in real target-app interaction, while HiFPO performs hint-guided multi-attempt rollout and transforms hierarchical feedback into step-level GRPO updates.
The strongest checkpoint, ForgeOwl-8B, achieves 77.6% Pass@3 on AndroidWorld and 41.0% on MobileWorld GUI-only – the strongest open-data mobile GUI agent in their evaluation. The system improves both generalist VLMs and GUI-specialized models.
Future work should extend to broader app ecosystems, longer multi-app workflows, and explicit safety constraints for real user devices.