Visual Summary | MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

Summary (Overview)

MobileForge is an annotation-free adaptation system for mobile GUI agents that requires no human-written tasks, demonstrations, or reward labels.
It introduces MobileGym, a unified substrate that grounds task generation, exploration, rollout execution, and hierarchical evaluation in real target-app interaction.
It proposes HiFPO (Hierarchical Feedback-Guided Policy Optimization), which converts trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates.
Key results: Using only automatically generated data, MobileForge adapts Qwen3-VL-8B to 67.2% Pass@3 on AndroidWorld (close to the closed-data GUI-Owl-1.5-8B at 69.0%). The adapted ForgeOwl-8B reaches 77.6% Pass@3 on AndroidWorld and 41.0% on out-of-domain MobileWorld GUI-only, establishing the strongest open-data mobile GUI agent in their evaluation.
The system scales with generated tasks, improves both generalist and GUI-specialized agents, and transfers cross-domain without any MobileWorld training data.

Introduction and Theoretical Foundation

Background

MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but real deployment requires adaptation beyond fixed benchmarks. Mobile apps are numerous and fast-changing, making human-written tasks, expert demonstrations, and manual reward labels costly and quickly stale.

Motivation

Existing annotation-free GUI learning work (TongUI, MobileA3gent, OS-Genesis, GUI-explorer, ZeroGUI, MobileGUI-RL, etc.) reduces manual supervision but suffers from two key bottlenecks:

Lack of a unified mobile substrate connecting target-app exploration, curriculum mining, rollout execution, and feedback. As a result, generated tasks may be weakly grounded and evaluator feedback may be detached from policy learning.
Isolated rollouts with sparse reward. Even with step-level assessment, current loops rarely combine outcomes, process feedback, and corrective hints to accumulate reusable experience beyond the initial policy's capability boundary.

Research Question

"Can we build an annotation-free adaptation system for mobile GUI agents that grounds tasks in target-app interaction, generates fine-grained feedback, and converts self-collected experience into policy-improvement signals without human-written tasks, demonstrations, or reward labels?"

Methodology

Problem Setup

Mobile GUI control is modeled as sequential decision making. For attempt $k$ on task $x$ at step $t$, the policy receives a decision state:

$s_k^{(t)} = (x, I_k^{(t)}, H_k^{(t)}, \eta_{<k})$

where $I_k^{(t)}$ is the screenshot, $H_k^{(t)}$ is the interaction history, and $\eta_{<k}$ is the corrective hint context from earlier attempts. The policy emits a structured GUI action $a_k^{(t)} = (\alpha_k^{(t)}, \psi_k^{(t)}) \sim \pi_\theta(\cdot|s_k^{(t)})$ with action type $\alpha$ and arguments $\psi$.
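As a rough illustration of the decision state and action defined above, the following sketch models them as plain data containers. The class names, field names, the `text_prompt` helper, and the example task and hint strings are hypothetical and are not taken from MobileForge.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class GuiAction:
    """Structured action a = (alpha, psi)."""
    action_type: str                                          # alpha, e.g. "click"
    arguments: dict[str, Any] = field(default_factory=dict)   # psi


@dataclass(frozen=True)
class DecisionState:
    """Decision state s_k^(t) = (x, I, H, eta_<k)."""
    task: str                        # x: task instruction
    screenshot: bytes                # I_k^(t): raw screenshot bytes
    history: tuple[GuiAction, ...]   # H_k^(t): actions taken so far
    hint_context: str = ""           # eta_<k: empty on the first attempt


def text_prompt(state: DecisionState) -> str:
    """Render the textual part of the state (the screenshot is passed separately)."""
    lines = [f"Task: {state.task}"]
    if state.hint_context:
        lines.append(f"Hints from earlier attempts: {state.hint_context}")
    for i, act in enumerate(state.history, start=1):
        lines.append(f"Step {i}: {act.action_type} {act.arguments}")
    return "\n".join(lines)


# Hypothetical first attempt (no hints) and a later attempt that carries a hint.
first = DecisionState(task="Add milk to the shopping list", screenshot=b"", history=())
retry = DecisionState(
    task="Add milk to the shopping list",
    screenshot=b"",
    history=(GuiAction("click", {"x": 120, "y": 640}),),
    hint_context="Open the list tab before typing.",
)
```

Keeping the hint context as a separate field makes it explicit that the same task and screen can yield different prompts across attempts.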

MobileGym: Building an Adaptation Substrate

Target-App Exploration: Function-aware exploration (inspired by GUI-explorer) uses app-level structural anchors (APK activities) combined with screenshots to generate goal-oriented tasks via depth-first traversal. It produces transition records that form the evidence pool $Z$.
MobileGym-Curriculum: Converts exploration evidence into executable tasks $x = (\iota, B, c, v, p)$ where $\iota$ is the instruction, $B$ the step budget, $c$ the core functionality, $v$ the variation type, and $p$ the prerequisites.
Hierarchical Rollout Evaluation (MobileGym-Critic): An agentic evaluator that converts execution logs into visual action traces and produces a structured JSON verdict with:
- Trajectory outcome label $z_k \in \{0,1\}$ (1 = success, 0 = failure)
- Step-level process labels $\ell_k^{(t)} = (v_k^{(t)}, e_k^{(t)})$ where $v_k^{(t)} \in \{0,1\}$ marks the step as reasonable (1) or unreasonable (0) and $e_k^{(t)}$ is the rationale
- Corrective hint $h_k$ summarizing key mistakes and suggested alternatives
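The three verdict components listed above could be parsed from the evaluator's JSON roughly as follows. The source does not give the JSON schema, so the key names (`success`, `steps`, `reasonable`, `rationale`, `hint`) and the example content are assumptions made for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class StepLabel:
    reasonable: bool   # v_k^(t)
    rationale: str     # e_k^(t)


@dataclass
class Verdict:
    success: bool            # z_k
    steps: list[StepLabel]   # one label per executed step
    hint: str                # h_k


def parse_verdict(raw: str) -> Verdict:
    """Parse a verdict from JSON text using the assumed key names."""
    data = json.loads(raw)
    steps = [StepLabel(bool(s["reasonable"]), s["rationale"]) for s in data["steps"]]
    return Verdict(bool(data["success"]), steps, data["hint"])


# Hypothetical evaluator output for a failed two-step attempt.
raw_verdict = json.dumps({
    "success": False,
    "steps": [
        {"reasonable": True, "rationale": "Opened the correct app."},
        {"reasonable": False, "rationale": "Tapped the wrong tab."},
    ],
    "hint": "Open the list tab before typing.",
})
verdict = parse_verdict(raw_verdict)
```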

HiFPO: Feedback-Guided Policy Optimization

Hint-Guided Multi-Attempt Rollout: For each task, run $K$ attempts, serialized so that earlier feedback conditions later attempts. The hint context $\eta_{<k} = \text{Aggregate}(h_1, \dots, h_{k-1})$ is appended to the task instruction before the next attempt.
Task Filtering: Compute the empirical success rate $SR(x) = \frac{1}{K}\sum_{k=1}^K z_k$. Remove tasks with $SR(x)=1$ (already mastered); retain all-fail and mixed tasks.
Trajectory and Step Selection: For each retained task, select one informative attempt (the best successful attempt with the cleanest process feedback, or otherwise the best failure). Keep only reasonable steps (where $v_k^{(t)}=1$) as the training set $D$.
Hint-Contextualized Step-Level GRPO: For each training step, sample $G$ candidate responses $\hat{o}_{j,g} \sim \pi_{\theta_{\text{old}}}(\cdot|\tilde{s}_j)$ where $\tilde{s}_j$ is the hint-contextualized prompt. Compute the adaptive GUI action reward:
$R_{j,g} = \lambda_{\text{type}} r_{j,g}^{\text{type}} + \lambda_{\text{arg}} r_{j,g}^{\text{arg}}$
where $r^{\text{type}}$ checks action type correctness and $r^{\text{arg}}$ checks argument correctness only when type is correct.
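The rollout, filtering, and selection steps in the list above can be sketched as follows. This is a minimal sketch under stated assumptions: `Aggregate` is replaced by a newline join, "cleanest process feedback" is approximated by the fraction of reasonable steps, the same proxy ranks failures, and `fake_run` is a toy stand-in for a real policy plus evaluator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Attempt:
    success: bool         # z_k
    step_ok: list[bool]   # v_k^(t) for each step
    hint: str             # h_k


def aggregate_hints(hints: list[str]) -> str:
    """Stand-in for Aggregate(h_1, ..., h_{k-1}): join the non-empty hints."""
    return "\n".join(h for h in hints if h)


def run_attempts(run_once: Callable[[str], Attempt], k: int) -> list[Attempt]:
    """Run k serialized attempts; each one sees hints from all earlier ones."""
    attempts: list[Attempt] = []
    for _ in range(k):
        context = aggregate_hints([a.hint for a in attempts])
        attempts.append(run_once(context))
    return attempts


def success_rate(attempts: list[Attempt]) -> float:
    """SR(x): mean of the outcome labels over the K attempts."""
    return sum(a.success for a in attempts) / len(attempts)


def keep_task(attempts: list[Attempt]) -> bool:
    """Drop tasks solved on every attempt; keep all-fail and mixed tasks."""
    return success_rate(attempts) < 1.0


def select_attempt(attempts: list[Attempt]) -> Attempt:
    """Prefer successes, then the highest fraction of reasonable steps (assumed proxy)."""
    def cleanliness(a: Attempt) -> float:
        return sum(a.step_ok) / max(len(a.step_ok), 1)
    return max(attempts, key=lambda a: (a.success, cleanliness(a)))


def training_step_indices(attempt: Attempt) -> list[int]:
    """Indices of steps labeled reasonable; only these enter the training set D."""
    return [t for t, ok in enumerate(attempt.step_ok) if ok]


def fake_run(context: str) -> Attempt:
    """Toy rollout: fails without a hint, succeeds once a hint is present."""
    if context:
        return Attempt(success=True, step_ok=[True, True], hint="")
    return Attempt(success=False, step_ok=[True, False], hint="open the menu first")


attempts = run_attempts(fake_run, k=3)
if keep_task(attempts):
    chosen = select_attempt(attempts)
    kept_steps = training_step_indices(chosen)
```

In this toy run the first attempt fails and produces a hint, the later attempts succeed with the hint in context, and the task is retained because its success rate is below 1.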

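Rewards are normalized within each group of $G$ candidates to compute the advantage, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the group's rewards: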
$A_{j,g} = \frac{R_{j,g} - \mu_j}{\sigma_j + \epsilon_{\text{std}}}$
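The reward and group-normalized advantage defined above can be illustrated with a small sketch. Several details are assumptions rather than the paper's choices: arguments are compared by exact equality, both weights default to 0.5, and the population standard deviation is used for $\sigma_j$.

```python
import statistics
from typing import Any


def action_reward(
    pred_type: str,
    pred_args: dict[str, Any],
    ref_type: str,
    ref_args: dict[str, Any],
    lam_type: float = 0.5,   # lambda_type (assumed value)
    lam_arg: float = 0.5,    # lambda_arg (assumed value)
) -> float:
    """R = lam_type * r_type + lam_arg * r_arg, with r_arg gated on a correct type."""
    r_type = 1.0 if pred_type == ref_type else 0.0
    # Arguments are only checked when the action type matches (exact match assumed).
    r_arg = 1.0 if r_type == 1.0 and pred_args == ref_args else 0.0
    return lam_type * r_type + lam_arg * r_arg


def group_advantages(rewards: list[float], eps_std: float = 1e-6) -> list[float]:
    """A_g = (R_g - mu) / (sigma + eps_std), computed within one candidate group."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps_std) for r in rewards]


# One group of G = 4 candidates scored against a hypothetical reference action.
ref = ("click", {"x": 120, "y": 640})
candidates = [
    ("click", {"x": 120, "y": 640}),   # type and arguments correct
    ("click", {"x": 10, "y": 10}),     # type correct, arguments wrong
    ("scroll", {"direction": "up"}),   # type wrong
    ("click", {"x": 0, "y": 0}),       # type correct, arguments wrong
]
rewards = [action_reward(t, a, ref[0], ref[1]) for t, a in candidates]
advantages = group_advantages(rewards)
```

When every candidate in a group earns the same reward, the numerator is zero for all of them, so that group contributes no gradient signal.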
The loss combines clipped importance ratios with KL regularization, where $\rho_{j,g}$ is the importance ratio and $\bar{\rho}_{j,g}$ its clipped version:
$\mathcal{L}_{\text{HiFPO}}(\theta) = -\mathbb{E}_{j,g}[\min(\rho_{j,g}A_{j,g}, \bar{\rho}_{j,g}A_{j,g})] + \beta \mathbb{E}_j[D_{\text{KL}}^j(\theta)]$
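A scalar sketch of this objective follows. It assumes the common PPO-style reading of the symbols, which the summary does not spell out: $\rho$ is the ratio of new to old policy probability for a sampled response, $\bar{\rho}$ clips it to $[1-\epsilon, 1+\epsilon]$, and the KL terms are supplied precomputed. The values of `clip_eps` and `beta` are placeholders, and a real implementation would typically work on per-token log-probabilities with automatic differentiation.

```python
import math


def hifpo_loss(
    logp_new: list[float],    # log pi_theta(o | s) for each sampled response
    logp_old: list[float],    # log pi_theta_old(o | s) for the same responses
    advantages: list[float],  # A_{j,g}
    kl_terms: list[float],    # precomputed KL estimates, one per training step
    clip_eps: float = 0.2,    # clipping range (assumed value)
    beta: float = 0.01,       # KL weight (assumed value)
) -> float:
    """Negative clipped surrogate plus a beta-weighted mean KL penalty."""
    surrogate = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        rho = math.exp(ln - lo)                                   # importance ratio
        rho_bar = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)   # clipped ratio
        surrogate.append(min(rho * adv, rho_bar * adv))
    policy_term = -sum(surrogate) / len(surrogate)
    kl_term = beta * sum(kl_terms) / len(kl_terms)
    return policy_term + kl_term
```

Taking the minimum of the clipped and unclipped terms caps the benefit of pushing the ratio far above 1 for positive advantages, while leaving the penalty for negative advantages uncapped.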

Empirical Validation / Results

Experimental Protocol

In-domain: AndroidWorld (116 tasks, Pass@1/2/3)
Out-of-domain: MobileWorld GUI-only (117 tasks, no training data used)
Base agents: Qwen3-VL-8B (generalist), GUI-Owl-1.5-8B (GUI-specialized)
Training data: 3,249 candidate tasks from 20 apps; main results use 900-task subsets
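The summary reports Pass@1/2/3 without defining it. The sketch below assumes the usual reading for multi-attempt agent benchmarks, namely that a task counts as solved at $k$ if any of its first $k$ attempts succeeds. The task names and outcomes are made up for illustration.

```python
def pass_at_k(results: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    solved = sum(any(outcomes[:k]) for outcomes in results.values())
    return solved / len(results)


# Hypothetical per-task outcomes over three attempts.
toy_results = {
    "task_a": [True, False, False],
    "task_b": [False, True, False],
    "task_c": [False, False, True],
    "task_d": [False, False, False],
}
scores = {k: pass_at_k(toy_results, k) for k in (1, 2, 3)}
```

Under this definition Pass@k can only stay flat or rise as $k$ grows, which is consistent with the reported Pass@1/2/3 columns.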

In-Domain Adaptation (AndroidWorld)

Agent	Tasks	Pass@1	Pass@2	Pass@3	Easy	Medium	Hard
Qwen3-VL-8B	0	40.5%	49.1%	55.2%	44.8%	35.2%	19.3%
ForgeQwen3-8B	900	50.9%	60.3%	67.2%	61.2%	41.7%	17.5%
GUI-Owl-1.5-8B	0	56.0%	68.1%	69.0%	66.7%	50.0%	19.3%
ForgeOwl-8B	900	67.2%	75.0%	77.6%	73.2%	57.4%	29.8%

Table 1 summary: ForgeOwl-8B achieves 77.6% Pass@3, a 12.5% relative gain (8.6 pp absolute) over its GUI-Owl-1.5-8B base at 69.0%. ForgeQwen3-8B narrows the gap to the GUI-Owl-1.5-8B base (67.2% vs 69.0%).
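Because the summary mixes relative gains with percentage-point differences, a short check of the arithmetic may help. The helper names are illustrative; the numbers are the Pass@3 values from the table above.

```python
def absolute_gain_pp(base: float, adapted: float) -> float:
    """Difference in percentage points."""
    return adapted - base


def relative_gain_pct(base: float, adapted: float) -> float:
    """Gain as a percentage of the base score."""
    return (adapted - base) / base * 100.0


# Pass@3 on AndroidWorld: base model -> adapted model.
owl_rel = relative_gain_pct(69.0, 77.6)    # GUI-Owl-1.5-8B -> ForgeOwl-8B
owl_abs = absolute_gain_pp(69.0, 77.6)
qwen_rel = relative_gain_pct(55.2, 67.2)   # Qwen3-VL-8B -> ForgeQwen3-8B
qwen_abs = absolute_gain_pp(55.2, 67.2)
```

Rounded to one decimal, the GUI-Owl pair gives 12.5% relative (8.6 pp) and the Qwen3-VL pair gives 21.7% relative (12.0 pp).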

Cross-Domain Generalization (MobileWorld GUI-only)

Agent	Success Rate
GUI-Owl-1.5-32B	43.9%
ForgeOwl-8B (Ours)	41.0%
GUI-Owl-1.5-8B	37.6%
MAI-UI-8B	27.5%
ForgeQwen3-8B (Ours)	10.3% (a 35.5% relative gain over Qwen3-VL-8B's 7.6%)

ForgeOwl-8B surpasses all open-data mobile GUI agents and approaches much larger closed-data models.

Key Ablations

Corrective Hints (Table 3) – Removing hints reduces overall rollout success from 77.0% to 52.0% (25 pp drop) and Pass@3 from 72.5% to 49.0%.

Training Objective (Table 4) – Hint-contextualized GRPO outperforms both no-hint SFT and hint SFT (47.4% vs 45.7% vs 34.5% Pass@1 at 200 tasks; 50.9% vs 47.4% vs 44.0% at 900 tasks).

Task Filtering (Table 5) – Retaining all-fail and mixed tasks (success rate in [0.0, 0.9]) gives the best combined AndroidWorld + MobileWorld result (48.3% Pass@1 on AndroidWorld, 15/117 on MobileWorld).

Evaluator Model (Table 6) – Gemini 2.5 Pro gives best results, but Qwen3-VL-8B as evaluator still improves base policy (44.8% Pass@1 vs 40.5%).

Curriculum Grounding (Table 7) – MobileGym-Curriculum covers broader functions (shopping lists, cooking assistant, meal planner) vs landing-screen baseline which over-concentrates on recipe creation/deletion.

Error Analysis

The largest failure reductions occur in verification (-38.1 pp), along with search, complex UI, screen reading, and repetition. Hard cases such as game-playing, multi-app tasks, and memorization or math-counting remain unsolved.

Theoretical and Practical Implications

Theoretical Contributions

Hierarchical feedback framework: Separates trajectory outcomes, step-level process labels, and corrective hints – each serving different roles in policy learning.
Hint-contextualized GRPO: Extends group-relative policy optimization to be conditioned on accumulated corrective hints, making step-level advantages depend on reusable feedback rather than just the current state.
Step-level supervision from trajectories: Demonstrates that long-horizon mobile trajectories can be converted into dense step-level training signals using process feedback, without requiring a learned reward model.

Practical Implications

Annotation-free adaptation is viable: Using only automatically generated data, ForgeQwen3-8B comes close to the closed-data GUI-Owl-1.5-8B (67.2% vs 69.0% Pass@3 on AndroidWorld), and ForgeOwl-8B exceeds it (77.6%).
Cross-domain transfer: Adaptation on AndroidWorld generalizes to MobileWorld without any target-platform data, suggesting the approach captures generalizable GUI skills.
Open-source release: Code, data, and models will be released at mobile-forge.github.io, enabling further research.

Conclusion

MobileForge addresses two key bottlenecks in annotation-free GUI learning: the lack of a unified adaptation substrate and the weakness of isolated rollouts with coarse feedback. MobileGym grounds task generation and evaluation in real target-app interaction, while HiFPO performs hint-guided multi-attempt rollout and transforms hierarchical feedback into step-level GRPO updates.

The strongest checkpoint, ForgeOwl-8B, achieves 77.6% Pass@3 on AndroidWorld and 41.0% on MobileWorld GUI-only – the strongest open-data mobile GUI agent in their evaluation. The system improves both generalist VLMs and GUI-specialized models.

Future work should extend to broader app ecosystems, longer multi-app workflows, and explicit safety constraints for real user devices.