Summary (Overview)

  • MobileForge is an annotation-free adaptation system for mobile GUI agents that requires no human-written tasks, demonstrations, or reward labels.
  • It introduces MobileGym, a unified substrate that grounds task generation, exploration, rollout execution, and hierarchical evaluation in real target-app interaction.
  • It proposes HiFPO (Hierarchical Feedback-Guided Policy Optimization), which converts trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates.
  • Key results: Using only automatically generated data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld (close to the closed-data GUI-Owl-1.5-8B at 69.0%). The adapted ForgeOwl-8B reaches 77.6% Pass@3 on AndroidWorld and 41.0% on out-of-domain MobileWorld GUI-only, establishing the strongest open-data mobile GUI agent in their evaluation.
  • The system scales with generated tasks, improves both generalist and GUI-specialized agents, and transfers cross-domain without any MobileWorld training data.

Introduction and Theoretical Foundation

Background

MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but real deployment requires adaptation beyond fixed benchmarks. Mobile apps are numerous and fast-changing, making human-written tasks, expert demonstrations, and manual reward labels costly and quickly stale.

Motivation

Existing annotation-free GUI learning work (TongUI, MobileA3gent, OS-Genesis, GUI-explorer, ZeroGUI, MobileGUI-RL, etc.) reduces manual supervision but suffers from two key bottlenecks:

  1. Lack of a unified mobile substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback. As a result, generated tasks may be weakly grounded and evaluator feedback detached from policy learning.
  2. Isolated rollouts with sparse reward – even with step-level assessment, current loops rarely combine outcomes, process feedback, and corrective hints to accumulate reusable experience beyond the initial policy’s capability boundary.

Research Question

"Can we build an annotation-free adaptation system for mobile GUI agents that grounds tasks in target-app interaction, generates fine-grained feedback, and converts self-collected experience into policy-improvement signals without human-written tasks, demonstrations, or reward labels?"

Methodology

Problem Setup

Mobile GUI control is modeled as sequential decision making. For attempt $k$ on task $x$ and step $t$, the policy receives a decision state:

$$s_k^{(t)} = (x, I_k^{(t)}, H_k^{(t)}, \eta_{<k})$$

where $I_k^{(t)}$ is the screenshot, $H_k^{(t)}$ is the interaction history, and $\eta_{<k}$ is the corrective hint context from earlier attempts. The policy emits a structured GUI action $a_k^{(t)} = (\alpha_k^{(t)}, \psi_k^{(t)}) \sim \pi_\theta(\cdot \mid s_k^{(t)})$ with action type $\alpha_k^{(t)}$ and arguments $\psi_k^{(t)}$.
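
As a rough illustration of this setup, the decision state and structured action can be written as plain Python data types. This is a minimal sketch, not the authors' implementation: the class names, field names, and the text prompt layout in `build_text_prompt` are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass(frozen=True)
class GuiAction:
    """Structured action a = (alpha, psi)."""
    action_type: str                                          # alpha, e.g. "click" (illustrative value)
    arguments: Dict[str, Any] = field(default_factory=dict)   # psi


@dataclass(frozen=True)
class DecisionState:
    """Decision state s_k^(t) = (x, I, H, eta_<k)."""
    task: str                               # x: task instruction
    screenshot: bytes                       # I_k^(t): raw screenshot bytes
    history: Tuple[GuiAction, ...] = ()     # H_k^(t): actions taken so far in this attempt
    hint_context: str = ""                  # eta_<k: hints aggregated from earlier attempts


def build_text_prompt(state: DecisionState) -> str:
    """Render the textual part of the state (hypothetical layout; the screenshot is passed separately)."""
    lines = [f"Task: {state.task}"]
    if state.hint_context:
        lines.append(f"Hints from earlier attempts: {state.hint_context}")
    for i, action in enumerate(state.history, start=1):
        lines.append(f"Step {i}: {action.action_type} {action.arguments}")
    return "\n".join(lines)


state = DecisionState(
    task="Add milk to the shopping list",
    screenshot=b"",
    history=(GuiAction("click", {"x": 120, "y": 640}),),
    hint_context="Open the list before typing.",
)
print(build_text_prompt(state))
```

The point of the sketch is only that the hint context is part of the state itself, so the same screenshot and history can yield different prompts on later attempts.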

MobileGym: Building an Adaptation Substrate

  1. Target-App Exploration: Function-aware exploration (inspired by GUI-explorer) using app-level structural anchors (APK activities) combined with screenshots to generate goal-oriented tasks via depth-first traversal. Produces transition records forming evidence pool $Z$.
  2. MobileGym-Curriculum: Converts exploration evidence into executable tasks $x = (\iota, B, c, v, p)$, where $\iota$ is the instruction, $B$ the step budget, $c$ the core functionality, $v$ the variation type, and $p$ the prerequisites.
  3. Hierarchical Rollout Evaluation (MobileGym-Critic): An agentic evaluator that converts execution logs into visual action traces and produces a structured JSON verdict with:
    • Trajectory outcome label $z_k \in \{0,1\}$ (success/failure)
    • Step-level process labels $\ell_k^{(t)} = (v_k^{(t)}, e_k^{(t)})$, where $v_k^{(t)} \in \{0,1\}$ indicates whether the step is reasonable and $e_k^{(t)}$ is the rationale
    • Corrective hint $h_k$ summarizing key mistakes and suggested alternatives
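
To make the three feedback levels concrete, the sketch below parses a verdict of this shape into typed records. The JSON key names (`success`, `steps`, `reasonable`, `rationale`, `hint`) and the example payload are hypothetical; the summary only states that the critic emits a structured JSON verdict containing these three kinds of information.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class StepLabel:
    reasonable: bool   # v_k^(t)
    rationale: str     # e_k^(t)


@dataclass
class Verdict:
    success: bool            # z_k: trajectory outcome
    steps: List[StepLabel]   # step-level process labels
    hint: str                # h_k: corrective hint


def parse_verdict(raw: str) -> Verdict:
    """Parse a critic verdict; the key names are assumed for this example."""
    data = json.loads(raw)
    steps = [StepLabel(bool(s["reasonable"]), str(s["rationale"])) for s in data["steps"]]
    return Verdict(bool(data["success"]), steps, str(data.get("hint", "")))


# Made-up payload, for illustration only.
example = """
{
  "success": false,
  "steps": [
    {"reasonable": true,  "rationale": "Opened the target app."},
    {"reasonable": false, "rationale": "Tapped Delete instead of Edit."}
  ],
  "hint": "Use the Edit button on the item row."
}
"""
verdict = parse_verdict(example)
print(verdict.success, [s.reasonable for s in verdict.steps], verdict.hint)
```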

HiFPO: Feedback-Guided Policy Optimization

  1. Hint-Guided Multi-Attempt Rollout: For each task, run $K$ attempts serialized so earlier feedback conditions later attempts. Hint context $\eta_{<k} = \text{Aggregate}(h_1, \dots, h_{k-1})$ is appended to the task instruction before the next attempt.

  2. Task Filtering: Compute the empirical success rate $\mathrm{SR}(x) = \frac{1}{K}\sum_{k=1}^{K} z_k$. Remove tasks with $\mathrm{SR}(x) = 1$ (already mastered); retain all-fail and mixed tasks.

  3. Trajectory and Step Selection: For each retained task, select one informative attempt (best successful with cleanest process feedback, or best failure). Keep only reasonable steps (where $v_k^{(t)} = 1$) as training set $D$.

  4. Hint-Contextualized Step-Level GRPO: For each training step $j$, sample $G$ candidate responses $\hat{o}_{j,g} \sim \pi_{\theta_{\text{old}}}(\cdot \mid \tilde{s}_j)$, where $\tilde{s}_j$ is the hint-contextualized prompt. Compute the adaptive GUI action reward:

    $$R_{j,g} = \lambda_{\text{type}} r_{j,g}^{\text{type}} + \lambda_{\text{arg}} r_{j,g}^{\text{arg}}$$

    where $r^{\text{type}}$ checks action type correctness and $r^{\text{arg}}$ checks argument correctness only when the type is correct.

    Rewards are normalized within each group, using the group mean $\mu_j$ and standard deviation $\sigma_j$, to compute the advantage:

    $$A_{j,g} = \frac{R_{j,g} - \mu_j}{\sigma_j + \epsilon_{\text{std}}}$$

    The loss combines importance ratios $\rho_{j,g}$, their clipped counterparts $\bar{\rho}_{j,g}$, and KL regularization:

    $$\mathcal{L}_{\text{HiFPO}}(\theta) = -\mathbb{E}_{j,g}\left[\min\left(\rho_{j,g}A_{j,g},\ \bar{\rho}_{j,g}A_{j,g}\right)\right] + \beta\,\mathbb{E}_j\left[D_{\text{KL}}^{j}(\theta)\right]$$
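
The reward, advantage, and loss in step 4 can be traced numerically with a small pure-Python sketch. Several details here are assumptions rather than facts from the summary: the weights (0.5 each), exact-match comparison of arguments as a stand-in for the adaptive reward, the population standard deviation, the PPO-style clip range `1 ± clip_eps` for the clipped ratio, and the value of `beta`.

```python
import math
from typing import Any, Dict, List, Tuple

Action = Tuple[str, Dict[str, Any]]  # (action type, arguments)


def action_reward(pred: Action, ref: Action, w_type: float = 0.5, w_arg: float = 0.5) -> float:
    """R = w_type * r_type + w_arg * r_arg, with r_arg checked only if the type matches."""
    r_type = 1.0 if pred[0] == ref[0] else 0.0
    # Exact argument match is a simplification of the adaptive reward.
    r_arg = 1.0 if r_type == 1.0 and pred[1] == ref[1] else 0.0
    return w_type * r_type + w_arg * r_arg


def group_advantages(rewards: List[float], eps_std: float = 1e-6) -> List[float]:
    """A_g = (R_g - mu) / (sigma + eps_std), computed within one group of G samples."""
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (sigma + eps_std) for r in rewards]


def clipped_term(ratio: float, adv: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho) * A), assuming clip(rho) restricts rho to [1 - eps, 1 + eps]."""
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * adv, clipped * adv)


def hifpo_loss(ratios: List[float], advs: List[float], kl: float, beta: float = 0.01) -> float:
    """Negative mean clipped surrogate plus beta times a KL estimate supplied by the caller."""
    surrogate = sum(clipped_term(r, a) for r, a in zip(ratios, advs)) / len(advs)
    return -surrogate + beta * kl


ref = ("click", {"x": 10, "y": 20})
group = [
    ("click", {"x": 10, "y": 20}),   # type and arguments correct
    ("click", {"x": 99, "y": 20}),   # type correct, arguments wrong
    ("swipe", {"x": 10, "y": 20}),   # wrong type, so arguments are not credited
    ("click", {"x": 0, "y": 0}),     # type correct, arguments wrong
]
rewards = [action_reward(a, ref) for a in group]
advantages = group_advantages(rewards)
print(rewards)
print(hifpo_loss([1.0] * len(group), advantages, kl=0.0))
```

Because advantages are centered within each group, a group whose candidates all receive the same reward contributes zero advantage, so only steps where sampled actions disagree in quality drive the update.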

Empirical Validation / Results

Experimental Protocol

  • In-domain: AndroidWorld (116 tasks, Pass@1/2/3)
  • Out-of-domain: MobileWorld GUI-only (117 tasks, no training data used)
  • Base agents: Qwen3-VL-8B (generalist), GUI-Owl-1.5-8B (GUI-specialized)
  • Training data: 3,249 candidate tasks from 20 apps; main results use 900-task subsets

In-Domain Adaptation (AndroidWorld)

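| Agent | Training tasks | Pass@1 | Pass@2 | Pass@3 | Easy | Medium | Hard |
|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B | 0 | 40.5% | 49.1% | 55.2% | 44.8% | 35.2% | 19.3% |
| ForgeQwen3-8B | 900 | 50.9% | 60.3% | 67.2% | 61.2% | 41.7% | 17.5% |
| GUI-Owl-1.5-8B | 0 | 56.0% | 68.1% | 69.0% | 66.7% | 50.0% | 19.3% |
| ForgeOwl-8B | 900 | 67.2% | 75.0% | 77.6% | 73.2% | 57.4% | 29.8% |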

Table 1 summary: ForgeOwl-8B achieves 77.6% Pass@3, up 8.6 pp from its GUI-Owl-1.5-8B base (a 12.5% relative gain). ForgeQwen3-8B narrows the gap to the GUI-Owl-1.5-8B base (67.2% vs 69.0%).

Cross-Domain Generalization (MobileWorld GUI-only)

| Agent | Success rate |
|---|---|
| GUI-Owl-1.5-32B | 43.9% |
| ForgeOwl-8B (Ours) | 41.0% |
| GUI-Owl-1.5-8B | 37.6% |
| MAI-UI-8B | 27.5% |
| ForgeQwen3-8B (Ours) | 10.3% (+35.5% relative over Qwen3-VL-8B's 7.6%) |

ForgeOwl-8B surpasses all open-data mobile GUI agents and approaches much larger closed-data models.
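
The results above mix absolute differences in percentage points with relative gains, so it helps to see both computed from the same reported numbers. The snippet below is a small arithmetic check; the helper names are made up for the example.

```python
def absolute_gain_pp(base: float, adapted: float) -> float:
    """Difference in percentage points between two success rates given in percent."""
    return adapted - base


def relative_gain_pct(base: float, adapted: float) -> float:
    """Relative improvement over the base rate, in percent."""
    return 100.0 * (adapted - base) / base


# AndroidWorld Pass@3: GUI-Owl-1.5-8B (69.0%) -> ForgeOwl-8B (77.6%)
print(round(absolute_gain_pp(69.0, 77.6), 1))   # 8.6
print(round(relative_gain_pct(69.0, 77.6), 1))  # 12.5

# MobileWorld GUI-only: Qwen3-VL-8B (7.6%) -> ForgeQwen3-8B (10.3%)
print(round(absolute_gain_pp(7.6, 10.3), 1))    # 2.7
print(round(relative_gain_pct(7.6, 10.3), 1))   # 35.5
```

A large relative gain on a low base rate, as in the MobileWorld case, corresponds to only a few percentage points in absolute terms.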

Key Ablations

Corrective Hints (Table 3) – Removing hints reduces overall rollout success from 77.0% to 52.0% (25 pp drop) and Pass@3 from 72.5% to 49.0%.
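
For reference, the mechanism this ablation switches off, serialized attempts in which earlier hints are appended to the instruction, can be sketched as a short loop. Everything concrete here is an assumption: the numbered-list form of `aggregate`, the prompt layout, and the toy `toy_attempt` function that stands in for running the agent and the critic. The toy outcome illustrates control flow only and says nothing about the reported numbers.

```python
from typing import Callable, List, Tuple

# Stand-in for "run one attempt, then critique it": returns (success z_k, hint h_k).
AttemptFn = Callable[[str], Tuple[bool, str]]


def aggregate(hints: List[str]) -> str:
    """Assumed aggregation: number the non-empty hints in the order they were produced."""
    kept = [h for h in hints if h]
    return "\n".join(f"{i}. {h}" for i, h in enumerate(kept, start=1))


def multi_attempt_rollout(
    instruction: str,
    run_attempt: AttemptFn,
    num_attempts: int = 3,
    use_hints: bool = True,
) -> List[bool]:
    """Run attempts one after another; each sees hints from all earlier attempts."""
    outcomes: List[bool] = []
    hints: List[str] = []
    for _ in range(num_attempts):
        context = aggregate(hints) if use_hints else ""
        prompt = instruction
        if context:
            prompt = f"{instruction}\n\nHints from earlier attempts:\n{context}"
        success, hint = run_attempt(prompt)
        outcomes.append(success)
        hints.append(hint)
    return outcomes


def toy_attempt(prompt: str) -> Tuple[bool, str]:
    """Toy agent: succeeds only once the prompt carries the needed hint."""
    if "open the menu first" in prompt:
        return True, ""
    return False, "open the menu first"


print(multi_attempt_rollout("Rename the note", toy_attempt))                   # [False, True, True]
print(multi_attempt_rollout("Rename the note", toy_attempt, use_hints=False))  # [False, False, False]
```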

Training Objective (Table 4) – Hint-contextualized GRPO outperforms both no-hint SFT and hint SFT (47.4% vs 45.7% vs 34.5% Pass@1 at 200 tasks; 50.9% vs 47.4% vs 44.0% at 900 tasks).

Task Filtering (Table 5) – Retaining all-fail and mixed tasks ([0.0, 0.9]) gives best combined AndroidWorld + MobileWorld result (48.3% Pass@1, 15/117 MobileWorld).
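
The filtering rule can be expressed as a range check on the empirical success rate. The sketch below assumes that `[0.0, 0.9]` denotes a closed interval on the success rate over a task's attempts; the function names and the toy rollout data are illustrative.

```python
from typing import Dict, List


def success_rate(outcomes: List[bool]) -> float:
    """Empirical success rate: mean of the outcome labels z_k over the K attempts."""
    return sum(outcomes) / len(outcomes)


def filter_tasks(rollouts: Dict[str, List[bool]], lo: float = 0.0, hi: float = 0.9) -> List[str]:
    """Keep tasks whose success rate lies in [lo, hi]; a rate of 1.0 (already mastered) is dropped."""
    return [task for task, outcomes in rollouts.items() if lo <= success_rate(outcomes) <= hi]


rollouts = {
    "create_recipe": [True, True, True],      # mastered, removed
    "plan_meals": [False, False, False],      # all-fail, retained
    "edit_shopping_list": [True, False, True] # mixed, retained
}
print(filter_tasks(rollouts))  # ['plan_meals', 'edit_shopping_list']
```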

Evaluator Model (Table 6) – Gemini 2.5 Pro gives best results, but Qwen3-VL-8B as evaluator still improves base policy (44.8% Pass@1 vs 40.5%).

Curriculum Grounding (Table 7) – MobileGym-Curriculum covers broader functions (shopping lists, cooking assistant, meal planner) vs landing-screen baseline which over-concentrates on recipe creation/deletion.

Error Analysis

The largest failure reductions occur in verification (-38.1 pp), search, complex UI, screen reading, and repetition. Game-playing, multi-app tasks, and memorization/math-counting remain unsolved hard cases.

Theoretical and Practical Implications

Theoretical Contributions

  1. Hierarchical feedback framework: Separates trajectory outcomes, step-level process labels, and corrective hints – each serving different roles in policy learning.
  2. Hint-contextualized GRPO: Extends group-relative policy optimization to be conditioned on accumulated corrective hints, making step-level advantages depend on reusable feedback rather than just the current state.
  3. Step-level supervision from trajectories: Demonstrates that long-horizon mobile trajectories can be converted into dense step-level training signals using process feedback, without requiring a learned reward model.

Practical Implications

  • Annotation-free adaptation is viable: Using only automatically generated data, ForgeQwen3-8B comes within 1.8 pp of the closed-data GUI-Owl-1.5-8B on AndroidWorld Pass@3 (67.2% vs 69.0%), and ForgeOwl-8B exceeds it (77.6%).
  • Cross-domain transfer: Adaptation on AndroidWorld generalizes to MobileWorld without any target-platform data, suggesting the approach captures generalizable GUI skills.
  • Open-source release: Code, data, and models will be released at mobile-forge.github.io, enabling further research.

Conclusion

MobileForge addresses two key bottlenecks in annotation-free GUI learning: the lack of a unified adaptation substrate and the weakness of isolated rollouts with coarse feedback. MobileGym grounds task generation and evaluation in real target-app interaction, while HiFPO performs hint-guided multi-attempt rollout and transforms hierarchical feedback into step-level GRPO updates.

The strongest checkpoint, ForgeOwl-8B, achieves 77.6% Pass@3 on AndroidWorld and 41.0% on MobileWorld GUI-only – the strongest open-data mobile GUI agent in their evaluation. The system improves both generalist VLMs and GUI-specialized models.

Future work should extend to broader app ecosystems, longer multi-app workflows, and explicit safety constraints for real user devices.
