# MobileForge: Annotation-Free Adaptation for Mobile GUI Agents with Hierarchical Feedback-Guided Policy Optimization

> MobileForge achieves state-of-the-art open-data mobile GUI agent performance at 77.6% Pass@3 on AndroidWorld with zero human annotation.

- **Source:** [arXiv](https://arxiv.org/abs/2606.19930)
- **Published:** 2026-06-25
- **Permalink:** https://picx.dev/p/g53rfs
- **Whiteboard:** https://picx.dev/p/g53rfs/image

## Summary (Overview)

- MobileForge is an **annotation-free adaptation system** for mobile GUI agents that requires no human-written tasks, demonstrations, or reward labels.
- It introduces **MobileGym**, a unified substrate that grounds task generation, exploration, rollout execution, and hierarchical evaluation in real target-app interaction.
- It proposes **HiFPO** (Hierarchical Feedback-Guided Policy Optimization), which converts trajectory outcomes, step-level process feedback, and corrective hints into hint-contextualized step-level GRPO updates.
- **Key results**: Using only automatically generated data, MobileForge adapts Qwen3-VL-8B to **67.2% Pass@3** on AndroidWorld (close to the closed-data GUI-Owl-1.5-8B at 69.0%). The adapted **ForgeOwl-8B** reaches **77.6% Pass@3** on AndroidWorld and **41.0%** on out-of-domain MobileWorld GUI-only, establishing the strongest open-data mobile GUI agent in the authors' evaluation.
- The system scales with generated tasks, improves both generalist and GUI-specialized agents, and transfers cross-domain without any MobileWorld training data.

## Introduction and Theoretical Foundation

### Background
MLLM-based mobile GUI agents have made substantial progress in UI understanding and action execution, but real deployment requires adaptation beyond fixed benchmarks. Mobile apps are numerous and fast-changing, making human-written tasks, expert demonstrations, and manual reward labels costly and quickly stale.

### Motivation
Existing annotation-free GUI learning work (TongUI, MobileA3gent, OS-Genesis, GUI-explorer, ZeroGUI, MobileGUI-RL, etc.) reduces manual supervision but suffers from two key bottlenecks:
1. **Lack of a unified mobile substrate** connecting target-app exploration, curriculum mining, rollout execution, and feedback. As a result, generated tasks may be weakly grounded and evaluator feedback may be detached from policy learning.
2. **Isolated rollouts with sparse reward** – even with step-level assessment, current loops rarely combine outcomes, process feedback, and corrective hints to accumulate reusable experience beyond the initial policy’s capability boundary.

### Research Question
> "Can we build an annotation-free adaptation system for mobile GUI agents that grounds tasks in target-app interaction, generates fine-grained feedback, and converts self-collected experience into policy-improvement signals without human-written tasks, demonstrations, or reward labels?"

## Methodology

### Problem Setup
Mobile GUI control is modeled as sequential decision making. For attempt $k$ on task $x$ and step $t$, the policy receives a decision state:
$$ s_k^{(t)} = (x, I_k^{(t)}, H_k^{(t)}, \eta_{<k}) $$
where $I_k^{(t)}$ is the screenshot, $H_k^{(t)}$ is interaction history, and $\eta_{<k}$ is corrective hint context from earlier attempts. The policy emits a structured GUI action $a_k^{(t)} = (\alpha_k^{(t)}, \psi_k^{(t)}) \sim \pi_\theta(\cdot|s_k^{(t)})$ with action type $\alpha$ and arguments $\psi$.
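
The decision state and structured action above map naturally onto two small record types. The following is a minimal sketch, not the authors' implementation: the class names, field names, screenshot encoding, and the `build_prompt` helper are all hypothetical and chosen only to mirror the symbols $x$, $I$, $H$, $\eta_{<k}$, $\alpha$, and $\psi$.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class DecisionState:
    """s_k^(t) = (x, I_k^(t), H_k^(t), eta_<k). Field names are illustrative."""
    task: str                  # x: task instruction
    screenshot: bytes          # I: current screenshot (encoding is an assumption)
    history: tuple[str, ...]   # H: actions already taken in this attempt
    hint_context: str = ""     # eta_<k: hints aggregated from earlier attempts


@dataclass(frozen=True)
class GUIAction:
    """a = (alpha, psi): an action type plus its arguments."""
    action_type: str                                         # alpha, e.g. "click"
    arguments: dict[str, Any] = field(default_factory=dict)  # psi, e.g. {"x": 10}


def build_prompt(state: DecisionState) -> str:
    """Render the text part of the state; hints are appended to the instruction."""
    instruction = state.task
    if state.hint_context:
        instruction += f"\nHints from earlier attempts: {state.hint_context}"
    history = "; ".join(state.history) or "(none)"
    return f"Task: {instruction}\nHistory: {history}"
```

On the first attempt `hint_context` is empty, so the prompt reduces to the plain task instruction and history.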

### MobileGym: Building an Adaptation Substrate
1. **Target-App Exploration**: Function-aware exploration (inspired by GUI-explorer) using app-level structural anchors (APK activities) combined with screenshots to generate goal-oriented tasks via depth-first traversal. Produces transition records forming evidence pool $Z$.
2. **MobileGym-Curriculum**: Converts exploration evidence into executable tasks $x = (\iota, B, c, v, p)$ where $\iota$ is instruction, $B$ step budget, $c$ core functionality, $v$ variation type, $p$ prerequisites.
3. **Hierarchical Rollout Evaluation** (MobileGym-Critic): An agentic evaluator that converts execution logs into visual action traces and produces a structured JSON verdict with:
   - **Trajectory outcome label** $z_k \in \{0,1\}$ (success/failure)
   - **Step-level process labels** $\ell_k^{(t)} = (v_k^{(t)}, e_k^{(t)})$ where $v_k^{(t)} \in \{0,1\}$ indicates reasonable/unreasonable step, $e_k^{(t)}$ is rationale
   - **Corrective hint** $h_k$ summarizing key mistakes and suggested alternatives
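
To make the three feedback levels concrete, here is a hedged sketch of how such a JSON verdict could be parsed. The summary does not give the schema, so the key names (`success`, `steps`, `reasonable`, `rationale`, `hint`) are assumptions made purely for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class StepLabel:
    reasonable: bool   # v_k^(t): 1 if the step was reasonable
    rationale: str     # e_k^(t): evaluator's explanation


@dataclass
class Verdict:
    success: bool            # z_k: trajectory outcome
    steps: list[StepLabel]   # step-level process labels
    hint: str                # h_k: corrective hint


def parse_verdict(raw: str) -> Verdict:
    """Parse a critic verdict; the JSON key names are hypothetical."""
    data = json.loads(raw)
    steps = [StepLabel(bool(s["reasonable"]), s["rationale"]) for s in data["steps"]]
    return Verdict(bool(data["success"]), steps, data["hint"])


def reasonable_step_indices(verdict: Verdict) -> list[int]:
    """Indices of steps the critic marked reasonable (v = 1)."""
    return [i for i, s in enumerate(verdict.steps) if s.reasonable]
```

Keeping the outcome, the per-step labels, and the hint as separate fields lets later stages consume each level independently.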

### HiFPO: Feedback-Guided Policy Optimization
1. **Hint-Guided Multi-Attempt Rollout**: For each task, run $K$ attempts serialized so earlier feedback conditions later attempts. Hint context $\eta_{<k} = \text{Aggregate}(h_1, \dots, h_{k-1})$ is appended to task instruction before next attempt.
2. **Task Filtering**: Compute empirical success rate $SR(x) = \frac{1}{K}\sum_{k=1}^K z_k$. Remove tasks with $SR(x)=1$ (already mastered); retain all-fail and mixed tasks.
3. **Trajectory and Step Selection**: For each retained task, select one informative attempt (best successful with cleanest process feedback, or best failure). Keep only reasonable steps (where $v_k^{(t)}=1$) as training set $D$.
4. **Hint-Contextualized Step-Level GRPO**: For each training step, sample $G$ candidate responses $\hat{o}_{j,g} \sim \pi_{\theta_{\text{old}}}(\cdot|\tilde{s}_j)$ where $\tilde{s}_j$ is hint-contextualized prompt. Compute adaptive GUI action reward:
   $$R_{j,g} = \lambda_{\text{type}} r_{j,g}^{\text{type}} + \lambda_{\text{arg}} r_{j,g}^{\text{arg}}$$
   where $r^{\text{type}}$ checks action type correctness and $r^{\text{arg}}$ checks argument correctness only when type is correct.

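   Rewards are normalized within each group of $G$ candidates to compute the advantage, where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of the group's rewards:
   $$A_{j,g} = \frac{R_{j,g} - \mu_j}{\sigma_j + \epsilon_{\text{std}}}$$

   The loss combines clipped importance ratios with KL regularization, where $\rho_{j,g}$ is the importance ratio between $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$ for response $\hat{o}_{j,g}$, $\bar{\rho}_{j,g}$ is its clipped counterpart, and $\beta$ weights the KL term:
   $$\mathcal{L}_{\text{HiFPO}}(\theta) = -\mathbb{E}_{j,g}[\min(\rho_{j,g}A_{j,g}, \bar{\rho}_{j,g}A_{j,g})] + \beta \mathbb{E}_j[D_{\text{KL}}^j(\theta)]$$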
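
The reward, advantage, and clipped surrogate from step 4 can be sketched in plain Python. This is an illustration under stated assumptions rather than the paper's code: the weights `lam_type` and `lam_arg`, the exact-match argument check, the use of the population standard deviation, and the clip range `clip_eps` are all hypothetical choices; the KL term is omitted.

```python
import math


def action_reward(pred: dict, target: dict,
                  lam_type: float = 0.5, lam_arg: float = 0.5) -> float:
    """R = lam_type * r_type + lam_arg * r_arg; r_arg counts only if the type matches."""
    type_ok = pred["type"] == target["type"]
    r_type = 1.0 if type_ok else 0.0
    # Exact argument match is an assumption; a real check may use tolerances.
    r_arg = 1.0 if type_ok and pred["args"] == target["args"] else 0.0
    return lam_type * r_type + lam_arg * r_arg


def group_advantages(rewards: list[float], eps_std: float = 1e-6) -> list[float]:
    """A_g = (R_g - mu) / (sigma + eps_std), normalized within one group."""
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (sigma + eps_std) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """min(rho * A, clip(rho, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

For a group in which one candidate is fully correct, one matches only the action type, and one is wrong, the rewards are 1.0, 0.5, and 0.0, so the first candidate receives a positive advantage and the last a negative one. If every candidate in a group earns the same reward, all advantages are zero and the step contributes no gradient signal.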

## Empirical Validation / Results

### Experimental Protocol
- **In-domain**: AndroidWorld (116 tasks, Pass@1/2/3)
- **Out-of-domain**: MobileWorld GUI-only (117 tasks, no training data used)
- **Base agents**: Qwen3-VL-8B (generalist), GUI-Owl-1.5-8B (GUI-specialized)
- **Training data**: 3,249 candidate tasks from 20 apps; main results use 900-task subsets
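
For readers unfamiliar with the metric, the sketch below shows one common reading of Pass@k, assumed here rather than confirmed by the summary: a task counts as passed if at least one of its first $k$ attempts succeeds. The function name and input layout are hypothetical.

```python
def pass_at_k(attempts: dict[str, list[bool]], k: int) -> float:
    """Fraction of tasks with at least one success among the first k attempts."""
    if k < 1 or not attempts:
        raise ValueError("need k >= 1 and at least one task")
    solved = sum(any(outcomes[:k]) for outcomes in attempts.values())
    return solved / len(attempts)


# Toy example with four tasks and three ordered attempts each.
toy = {
    "a": [False, True, False],
    "b": [False, False, False],
    "c": [True, True, True],
    "d": [False, False, True],
}
print(pass_at_k(toy, 1), pass_at_k(toy, 2), pass_at_k(toy, 3))  # 0.25 0.5 0.75
```

Under this reading Pass@k can only grow with $k$, which matches the monotone Pass@1/2/3 columns in the tables.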

### In-Domain Adaptation (AndroidWorld)

| Agent | Training tasks | Pass@1 | Pass@2 | Pass@3 | Easy | Medium | Hard |
|-------|-------|--------|--------|--------|------|--------|------|
| Qwen3-VL-8B | 0 | 40.5% | 49.1% | 55.2% | 44.8% | 35.2% | 19.3% |
| ForgeQwen3-8B | 900 | **50.9%** | **60.3%** | **67.2%** | **61.2%** | **41.7%** | 17.5% |
| GUI-Owl-1.5-8B | 0 | 56.0% | 68.1% | 69.0% | 66.7% | 50.0% | 19.3% |
| ForgeOwl-8B | 900 | **67.2%** | **75.0%** | **77.6%** | **73.2%** | **57.4%** | **29.8%** |

*Table 1 summary: ForgeOwl-8B achieves 77.6% Pass@3, a 12.5% relative gain over its GUI-Owl-1.5-8B base. ForgeQwen3-8B narrows the gap to the GUI-Owl-1.5-8B base (67.2% vs 69.0%).*
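
The gains can be checked directly from the Pass@3 column of the table; no figures beyond the table are assumed. In absolute percentage points (pp) and relative terms:

$$\text{ForgeOwl-8B: } 77.6 - 69.0 = 8.6\ \text{pp}, \qquad \frac{8.6}{69.0} \approx 12.5\%$$

$$\text{ForgeQwen3-8B: } 67.2 - 55.2 = 12.0\ \text{pp}, \qquad \frac{12.0}{55.2} \approx 21.7\%$$

The remaining Pass@3 gap between ForgeQwen3-8B and the unadapted GUI-Owl-1.5-8B is $69.0 - 67.2 = 1.8$ pp.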

### Cross-Domain Generalization (MobileWorld GUI-only)

| Agent | Success Rate |
|-------|-------------|
| GUI-Owl-1.5-32B | 43.9% |
| **ForgeOwl-8B (Ours)** | **41.0%** |
| GUI-Owl-1.5-8B | 37.6% |
| MAI-UI-8B | 27.5% |
| ForgeQwen3-8B (Ours) | 10.3% (a 35.5% relative gain over Qwen3-VL-8B's 7.6%) |

*ForgeOwl-8B surpasses all open-data mobile GUI agents and approaches much larger closed-data models (41.0% vs 43.9% for GUI-Owl-1.5-32B).*

### Key Ablations

**Corrective Hints (Table 3)** – Removing hints reduces overall rollout success from 77.0% to 52.0% (25 pp drop) and Pass@3 from 72.5% to 49.0%.
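
The hint mechanism being ablated here is the serialized multi-attempt rollout from HiFPO, in which hints from earlier attempts are appended to the instruction. A minimal sketch follows; the `aggregate` format, the prompt wording, the callback signatures, and the choice to always run all `k` attempts are assumptions, since the summary does not specify them.

```python
from typing import Callable


def aggregate(hints: list[str]) -> str:
    """Hypothetical Aggregate(h_1, ..., h_{k-1}): a numbered list of hints."""
    return "\n".join(f"{i}. {h}" for i, h in enumerate(hints, start=1))


def multi_attempt_rollout(
    instruction: str,
    run_attempt: Callable[[str], str],             # prompt -> trajectory log
    evaluate: Callable[[str], tuple[bool, str]],   # trajectory -> (z_k, h_k)
    k: int = 3,
) -> tuple[list[bool], list[str]]:
    """Run k serialized attempts; each attempt sees hints from all earlier ones."""
    hints: list[str] = []
    outcomes: list[bool] = []
    for _ in range(k):
        prompt = instruction
        if hints:
            prompt += "\n\nHints from earlier attempts:\n" + aggregate(hints)
        success, hint = evaluate(run_attempt(prompt))
        outcomes.append(success)
        hints.append(hint)
    return outcomes, hints
```

Removing hints corresponds to always passing the bare `instruction`, which turns the loop into independent retries.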

**Training Objective (Table 4)** – Hint-contextualized GRPO outperforms both no-hint SFT and hint SFT (47.4% vs 45.7% vs 34.5% Pass@1 at 200 tasks; 50.9% vs 47.4% vs 44.0% at 900 tasks).

**Task Filtering (Table 5)** – Retaining all-fail and mixed tasks ([0.0, 0.9]) gives best combined AndroidWorld + MobileWorld result (48.3% Pass@1, 15/117 MobileWorld).
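
The retained range can be expressed as a simple filter on the empirical success rate $SR(x)$. The sketch below assumes the bracket $[0.0, 0.9]$ is an inclusive interval on $SR(x)$; the function names and input layout are hypothetical.

```python
def success_rate(outcomes: list[int]) -> float:
    """SR(x) = (1/K) * sum of z_k over the K attempts."""
    return sum(outcomes) / len(outcomes)


def filter_tasks(rollouts: dict[str, list[int]],
                 low: float = 0.0, high: float = 0.9) -> dict[str, list[int]]:
    """Keep tasks whose success rate lies in [low, high]."""
    return {task: z for task, z in rollouts.items()
            if low <= success_rate(z) <= high}


rollouts = {"mastered": [1, 1, 1], "mixed": [0, 1, 1], "all_fail": [0, 0, 0]}
kept = filter_tasks(rollouts)
print(sorted(kept))  # ['all_fail', 'mixed']
```

With a small number of attempts, any upper bound below 1.0 removes exactly the tasks solved on every attempt, which agrees with the rule of dropping tasks where $SR(x)=1$.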

**Evaluator Model (Table 6)** – Gemini 2.5 Pro gives best results, but Qwen3-VL-8B as evaluator still improves base policy (44.8% Pass@1 vs 40.5%).

**Curriculum Grounding (Table 7)** – MobileGym-Curriculum covers broader functions (shopping lists, cooking assistant, meal planner) vs landing-screen baseline which over-concentrates on recipe creation/deletion.

### Error Analysis
The largest failure reductions occur in verification (-38.1 pp), along with search, complex UI, screen reading, and repetition. Hard cases remain unsolved: game-playing, multi-app tasks, and memorization/math-counting.

## Theoretical and Practical Implications

### Theoretical Contributions
1. **Hierarchical feedback framework**: Separates trajectory outcomes, step-level process labels, and corrective hints – each serving different roles in policy learning.
2. **Hint-contextualized GRPO**: Extends group-relative policy optimization to be conditioned on accumulated corrective hints, making step-level advantages depend on reusable feedback rather than just the current state.
3. **Step-level supervision from trajectories**: Demonstrates that long-horizon mobile trajectories can be converted into dense step-level training signals using process feedback, without requiring a learned reward model.

### Practical Implications
- **Annotation-free adaptation is viable**: Using only automatically generated data, ForgeQwen3-8B comes within 1.8 pp of the closed-data GUI-Owl-1.5-8B on AndroidWorld Pass@3 (67.2% vs 69.0%), and ForgeOwl-8B exceeds it (77.6%).
- **Cross-domain transfer**: Adaptation on AndroidWorld generalizes to MobileWorld without any target-platform data, suggesting the approach captures generalizable GUI skills.
- **Open-source release**: Code, data, and models will be released at mobile-forge.github.io, enabling further research.

## Conclusion

MobileForge addresses two key bottlenecks in annotation-free GUI learning: the lack of a unified adaptation substrate and the weakness of isolated rollouts with coarse feedback. **MobileGym** grounds task generation and evaluation in real target-app interaction, while **HiFPO** performs hint-guided multi-attempt rollout and transforms hierarchical feedback into step-level GRPO updates.

The strongest checkpoint, **ForgeOwl-8B**, achieves **77.6% Pass@3** on AndroidWorld and **41.0%** on MobileWorld GUI-only – the strongest open-data mobile GUI agent in the authors' evaluation. The system improves both generalist VLMs and GUI-specialized models.

**Future work** should extend to broader app ecosystems, longer multi-app workflows, and explicit safety constraints for real user devices.

---

_Markdown view of https://picx.dev/p/g53rfs, served by PicX — AI-generated visual whiteboard summaries of research papers._
