MOBILEGYM: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

Summary (Overview)

MOBILEGYM is a lightweight, browser-hosted simulation environment for mobile GUI agents that focuses on interaction fidelity (realistic screens and responses) rather than replicating proprietary app backends.
Its core innovation is representing the full environment state as structured JSON, enabling deterministic state-based verification, easy configuration, snapshotting/forking for parallel rollouts, and detection of unintended side effects.
The accompanying MOBILEGYM-BENCH provides 416 parameterized task templates (256 test, 160 train) across 28 simulated apps, with a structured AnswerSheet protocol to avoid unreliable free-text answer matching.
Empirical results show a wide performance range (9.4%–58.8% SR) across 9 agents, and a Sim-to-Real study demonstrates that 95.1% of training gains from online RL (GRPO) in the simulator transfer to real-device execution.
The platform is highly efficient (~400 MB RAM per instance, ~3 s cold start), so hundreds of instances can run in parallel on a single server, which makes large-scale online RL feasible without heavyweight emulator clusters.

Introduction and Theoretical Foundation

Mobile GUI agents, which operate smartphones from screenshots and instructions, face a fundamental trade-off in their training and evaluation environments. Emulator-based environments (e.g., AndroidWorld, AndroidLab) offer repeatability but are limited to system utilities and simple open-source apps, are resource-intensive, and are difficult to scale for online RL. Real-device benchmarks (e.g., MobileBench-OL) cover everyday apps but suffer from uncontrollable backend state, real-world consequences, app-version drift, and high costs, making episodes hard to reproduce and parallelize.

Neither approach provides the two capabilities needed for progress: 1) Verifiable outcome signals (deterministic, grounded in actual task state, not unreliable VLM judgments), and 2) Scalable online training (essential for handling dynamic GUI variations).

The barriers are inherent to real everyday apps: their state is unreadable (internal state is hard to inspect), unwritable (difficult to reset to known conditions), unforkable (no cheap replication for parallel rollouts), and many actions are irreversible.

The key insight of MOBILEGYM is that since GUI agents only observe screenshots and perform discrete actions, a lightweight simulator that provides interaction fidelity—producing realistic screens in response to agent actions—is sufficient. By representing all app data, OS settings, and device context as explicit structured JSON state, MOBILEGYM makes this state readable for verification, writable for configuration, forkable for parallelism, and fully sandboxed to avoid real-world consequences.

Methodology

3.1 System Design

Interaction Fidelity Target: MOBILEGYM simulates the agent-facing interaction surface: visual screens, touch/typing responses, navigation, cross-app handoffs, and task-relevant state transitions. It implements Android-like runtime mechanisms (task stacks, keyboard, notifications, intents, back-key dispatch) in the browser over structured local state.
Layered State Model: Environment state is separated into:
1. World Data: Large, read-only public entities (posts, products).
2. Runtime Overlay: Compact, per-environment mutable state (user profile, app settings, carts). Only this layer is exposed for configuration and judging.
3. OS Runtime: Core OS semantics and services.
The final UI is composed from all three layers: Final UI = World Data + Runtime Overlay + OS Runtime.
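The split between a shared read-only layer and a small mutable overlay can be illustrated with a minimal sketch. All names here (`WORLD_DATA`, `new_overlay`, `compose_cart_view`) and the data shapes are hypothetical and chosen for illustration; they are not the platform's actual schema.

```python
from types import MappingProxyType

# Hypothetical world data: large, read-only, shared by every environment instance.
WORLD_DATA = MappingProxyType({
    "products": MappingProxyType({
        "p1": MappingProxyType({"title": "Desk lamp", "price": 19.9}),
    }),
})

def new_overlay():
    """Per-environment mutable state (assumed shape); the only layer a judge would inspect."""
    return {"cart": [], "settings": {"dark_mode": False}}

def compose_cart_view(world, overlay):
    """Join overlay references against world data to get what a cart screen would show."""
    return [{"id": pid, **world["products"][pid]} for pid in overlay["cart"]]

env_a, env_b = new_overlay(), new_overlay()
env_a["cart"].append("p1")          # mutate only environment A's overlay
view_a = compose_cart_view(WORLD_DATA, env_a)
view_b = compose_cart_view(WORLD_DATA, env_b)
```

Because only the overlay is mutable, two environments can share one copy of the world data while their visible screens diverge.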
Declarative Navigation: Each app's UI navigation is modeled as a declarative Extended Finite State Machine (EFSM), defined in a specification file that drives runtime navigation and enables static analysis.
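A declarative transition table with guards could be evaluated roughly as follows. This is a sketch under assumptions: the field names (`to`, `when`, `op`, `ref`, `key`), the two guard operators, and the route strings are illustrative and may differ from the real specification format.

```python
def resolve(operand, ctx):
    """Resolve a literal, or a {'ref': scope, 'key': name} reference, against the context."""
    if isinstance(operand, dict) and "ref" in operand:
        return ctx[operand["ref"]][operand["key"]]
    return operand

def guard_holds(when, ctx):
    """Evaluate one guard; only 'always' and 'eq' are sketched here."""
    if when["op"] == "always":
        return True
    if when["op"] == "eq":
        return resolve(when["left"], ctx) == resolve(when["right"], ctx)
    raise ValueError(f"unsupported guard op: {when['op']}")

def next_route(transitions, ctx):
    """Return the target of the first transition whose guard holds (first match wins)."""
    for transition in transitions:
        if guard_holds(transition["when"], ctx):
            return transition["to"]
    return None

# Hypothetical transitions for a follow button on a user profile page.
FOLLOW_BUTTON = [
    {"to": "/user/recommend",
     "when": {"op": "eq", "left": {"ref": "appState", "key": "isFollowing"}, "right": False}},
    {"to": "/user/unfollow-menu", "when": {"op": "always"}},  # fallback
]
```

Keeping guards as data rather than code is what makes static analysis possible: a tool can enumerate every `to` target without executing the app.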
Interfaces: The benchmark layer maps agent outputs to a unified 17-action abstraction. Agents observe only screenshots; actions are executed via Playwright with normalized coordinates.
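Executing an action given in normalized coordinates requires mapping it to viewport pixels first. The sketch below assumes coordinates normalized to [0, 1]; the source does not state the range, and the helper name `to_pixels` is hypothetical.

```python
def to_pixels(x_norm, y_norm, width, height):
    """Map normalized [0, 1] coordinates (assumed range) to integer viewport pixels."""
    if not (0.0 <= x_norm <= 1.0 and 0.0 <= y_norm <= 1.0):
        raise ValueError("normalized coordinates must lie in [0, 1]")
    # Clamp so that 1.0 maps to the last valid pixel rather than one past the edge.
    x = min(int(x_norm * width), width - 1)
    y = min(int(y_norm * height), height - 1)
    return x, y

# With Playwright's Python API the result could be passed to page.mouse.click(x, y).
center = to_pixels(0.5, 0.5, 400, 800)
```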

3.2 State Programmability

Verifiable Outcome Signals: Task success is judged by programmatic state verification—a deterministic judge inspects the structured environment state, providing reliable signals for both benchmarks and RL rewards.
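A state-based judge of this kind can be sketched as a list of predicates over paths in the structured state. The dotted-path convention, the `judge` function, and the example state are assumptions made for illustration.

```python
def get_path(state, path):
    """Follow a dotted path such as 'shop.cart' through nested dicts."""
    node = state
    for key in path.split("."):
        node = node[key]
    return node

def judge(state, checks):
    """Return (success, progress): success requires every check to pass,
    progress is the fraction of checks that passed."""
    passed = [bool(predicate(get_path(state, path))) for path, predicate in checks]
    return all(passed), sum(passed) / len(passed)

# Hypothetical task: "add the desk lamp to the cart and enable dark mode".
checks = [
    ("shop.cart", lambda cart: "p1" in cart),
    ("settings.dark_mode", lambda flag: flag is True),
]
terminal_state = {"shop": {"cart": ["p1"]}, "settings": {"dark_mode": False}}
result = judge(terminal_state, checks)
```

The same function yields a binary benchmark verdict and a graded progress value that could serve as an RL reward signal.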
State Serialization & Forking: The full environment state can be serialized to and restored from JSON, enabling exact reset and snapshot-based forking from any point, which is crucial for RL methods like GRPO that require multiple rollouts from identical states.
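Snapshot-based forking over JSON state can be sketched in a few lines. The function names are hypothetical; the point is only that a serialized snapshot yields independent copies that start from identical states.

```python
import json

def snapshot(state):
    """Serialize the environment state to a canonical JSON string."""
    return json.dumps(state, sort_keys=True)

def fork(snap, n):
    """Restore n independent copies of the state from one snapshot."""
    return [json.loads(snap) for _ in range(n)]

snap = snapshot({"shop": {"cart": []}, "settings": {"wifi": True}})
rollouts = fork(snap, 4)                   # e.g. one group of rollouts from the same state
rollouts[0]["shop"]["cart"].append("p1")   # mutating one copy leaves the others untouched
```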
Full Environment State Comparison: By comparing the initial and terminal structured states of an episode, MOBILEGYM can detect any mutation outside the task's expected outcome, reporting it as an Unexpected Side Effect (USE). This provides a deterministic diagnostic that screenshot or UI-tree judges cannot reliably offer.
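The comparison can be illustrated with a recursive diff that reports every changed path and subtracts the paths the task is expected to touch. This is a simplified sketch: the dotted-path format and the idea of declaring expected paths per task are assumptions, not the platform's documented mechanism.

```python
def diff_paths(before, after, prefix=""):
    """Return the set of dotted paths whose values differ between two JSON-like states."""
    if isinstance(before, dict) and isinstance(after, dict):
        changed = set()
        for key in before.keys() | after.keys():
            path = f"{prefix}.{key}" if prefix else key
            if key not in before or key not in after:
                changed.add(path)          # key was added or removed
            else:
                changed |= diff_paths(before[key], after[key], path)
        return changed
    return set() if before == after else {prefix}

def unexpected_side_effects(initial, terminal, expected_paths):
    """Changed paths that the task did not intend to modify."""
    return diff_paths(initial, terminal) - set(expected_paths)

initial = {"shop": {"cart": []}, "settings": {"wifi": True}}
terminal = {"shop": {"cart": ["p1"]}, "settings": {"wifi": False}}
use = unexpected_side_effects(initial, terminal, expected_paths={"shop.cart"})
```

Here the agent completed the cart task but also switched off Wi-Fi, which the diff surfaces even though no screenshot of the final screen would show it.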

4. The MOBILEGYM-BENCH

Task Taxonomy: Tasks are categorized along four orthogonal axes:
- Scope: S1 (single-app), S2 (two-app), S3 (three+ apps).
- Objective: Operate (state-changing), Query (information retrieval), Hybrid (both).
- Composition: Atomic, Sequential, Transfer (cross-app handoff), Deep-dive.
- Difficulty: L1–L4, calibrated post-hoc using eight reference models.
Parameterized Instantiation: The 416 entries are templates. At runtime, they are instantiated with: (i) instruction variation, (ii) parameter sampling, and (iii) environment configuration (injected app state). This yields over 27,000 distinct instances.
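The three instantiation steps can be sketched with a seeded sampler. The template shape, parameter names, and injected state below are invented for illustration; only the three steps themselves come from the description above.

```python
import math
import random

# Hypothetical template with two phrasings and two sampled parameters.
TEMPLATE = {
    "instructions": ["Add {qty} {item} to the cart.",
                     "Put {qty} {item} in my shopping cart."],
    "params": {"qty": [1, 2, 3], "item": ["desk lamp", "notebook"]},
}

def instantiate(template, seed):
    """Build one task instance; the same seed always yields the same instance."""
    rng = random.Random(seed)
    params = {name: rng.choice(values) for name, values in template["params"].items()}  # (ii)
    instruction = rng.choice(template["instructions"]).format(**params)                 # (i)
    env_config = {"shop": {"cart": []}}                                                 # (iii)
    return {"instruction": instruction, "params": params, "env_config": env_config}

def instance_count(template):
    """Number of distinct (phrasing, parameter) combinations for one template."""
    sizes = [len(values) for values in template["params"].values()]
    return len(template["instructions"]) * math.prod(sizes)

task = instantiate(TEMPLATE, seed=0)
```

Multiplying phrasings by parameter choices shows how a few hundred templates can expand into tens of thousands of instances.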
AnswerSheet Protocol: To avoid failures in free-text answer matching, query tasks require the agent to fill a GUI form (AnswerSheet) with typed fields (e.g., number, choice). Submitted state is checked by type-specific matchers, eliminating false accepts/rejects due to phrasing differences.
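Type-specific matching might look like the sketch below. The two field types come from the text; the normalization rules (stripping thousands separators, case-insensitive choices) and all function names are assumptions.

```python
def match_number(submitted, expected, tol=1e-6):
    """Accept numerically equal answers regardless of formatting such as '1,299.00'."""
    try:
        return abs(float(str(submitted).replace(",", "")) - expected) <= tol
    except ValueError:
        return False  # missing or non-numeric input never matches

def match_choice(submitted, expected):
    """Compare a selected option, ignoring case and surrounding whitespace."""
    return str(submitted).strip().lower() == expected.lower()

MATCHERS = {"number": match_number, "choice": match_choice}

def check_sheet(sheet, answer_key):
    """Every typed field must match for the sheet to be accepted."""
    return all(
        MATCHERS[field["type"]](sheet.get(name), field["expected"])
        for name, field in answer_key.items()
    )

answer_key = {
    "price": {"type": "number", "expected": 1299.0},
    "in_stock": {"type": "choice", "expected": "yes"},
}
accepted = check_sheet({"price": "1,299.00", "in_stock": " Yes "}, answer_key)
```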

Empirical Validation / Results

5.1 Benchmark Results

Evaluation of 9 agents on the 256-task test set shows:

Model	Overall SR (%)	L1 SR (%)	L2 SR (%)	L3 SR (%)	L4 SR (%)	USE (%)
Proprietary models
Gemini 3.1 Pro	58.8 ± 1.4	97.5	83.6	63.3	21.9	5.5
Doubao-Seed-2.0-Pro	52.0 †	100.0	93.2	48.2	6.2	4.7
Qwen3.6-Plus	45.7 †	100.0	78.1	44.6	3.8	14.5
Open-source GUI-specialized models
AutoGLM-Phone-9B	20.0 ± 1.3	86.2	33.6	9.6	1.9	12.6
UI-TARS-1.5-8B	13.8 ± 1.7	77.5	21.9	3.0	1.6	11.0
UI-Venus-1.5-8B	15.4 ± 2.4	85.0	21.9

[EFSM transition specification example, truncated]
panel: 'recommend' }
when: { op: 'eq', left: { ref: 'appState', key: 'isFollowing' }, right: false } },
{ to: '/user/:mid', search: { menu: 'unfollow' }, when: { op: 'always' } } // fallback
]

Reward Function for RL: The training reward is a shaped progress signal with multiplicative penalties. Let $p \in [0,1]$ denote task progress; the base reward is $r = p$. For AnswerSheet tasks, if any field is wrong, progress is recomputed; let $p'$ denote the (potentially adjusted) progress. The final reward applies the following discounts, where $I[\cdot]$ is the indicator function:
$$r \leftarrow p' \cdot 0.8^{I[\text{goal success} \land \neg \text{clean}]} \cdot 0.8^{I[\text{false complete} \land p'>0]} \cdot 0.5^{I[\text{post-success abort}]} \cdot 0.5^{I[\text{overdue}]}$$
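The discount rule above translates directly into code. In this sketch the argument names are hypothetical, `progress` stands for the adjusted progress $p'$, and `clean` is assumed to mean that the episode produced no unexpected side effects.

```python
import math

def shaped_reward(progress, goal_success, clean, false_complete,
                  post_success_abort, overdue):
    """Apply the multiplicative discounts to the (adjusted) progress p'."""
    r = progress
    if goal_success and not clean:
        r *= 0.8   # goal reached, but not cleanly
    if false_complete and progress > 0:
        r *= 0.8   # declared completion without success
    if post_success_abort:
        r *= 0.5
    if overdue:
        r *= 0.5   # kept acting until truncation
    return r

clean_success = shaped_reward(1.0, True, True, False, False, False)
messy_overdue = shaped_reward(1.0, True, False, False, False, True)   # 1.0 * 0.8 * 0.5
```

Because the penalties multiply, an episode that succeeds with a side effect and also runs overdue keeps 40% of the full reward instead of losing it entirely.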

4.3 Evaluation Protocol

Metrics:
- Success Rate (SR): Fraction of tasks judged successful.
- Progress Rate (PR): Fraction of subtasks passed.
- False Complete (FC): Agent declares completion without success.
- Unexpected Side Effects (USE): Unexpected state changes detected via full-state comparison.
- Overdue Termination (OT): Agent reaches goal but continues until truncation.
Execution: Simulator is reset before each task. Tasks have fixed step budgets (15, 30, 45, or 60 steps), with an extra 15 steps for AnswerSheet tasks.
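The step budgets and several of the metrics can be computed as in the sketch below. The episode record fields are hypothetical, and averaging PR per episode (rather than pooling subtasks across episodes) is an assumption the source does not settle.

```python
ALLOWED_BUDGETS = (15, 30, 45, 60)
ANSWER_SHEET_EXTRA = 15

def step_budget(base_steps, uses_answer_sheet):
    """Fixed budget per task, plus 15 extra steps when an AnswerSheet must be filled."""
    if base_steps not in ALLOWED_BUDGETS:
        raise ValueError(f"unknown budget: {base_steps}")
    return base_steps + (ANSWER_SHEET_EXTRA if uses_answer_sheet else 0)

def aggregate(episodes):
    """Compute SR, PR, FC and USE as fractions over a list of episode records."""
    n = len(episodes)
    return {
        "SR": sum(e["success"] for e in episodes) / n,
        "PR": sum(e["passed"] / e["total"] for e in episodes) / n,
        "FC": sum(e["declared_complete"] and not e["success"] for e in episodes) / n,
        "USE": sum(e["side_effect"] for e in episodes) / n,
    }

episodes = [
    {"success": True,  "passed": 2, "total": 2, "declared_complete": True,  "side_effect": False},
    {"success": False, "passed": 1, "total": 2, "declared_complete": True,  "side_effect": True},
    {"success": False, "passed": 0, "total": 2, "declared_complete": False, "side_effect": False},
    {"success": True,  "passed": 2, "total": 2, "declared_complete": True,  "side_effect": False},
]
metrics = aggregate(episodes)
```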

Theoretical and Practical Implications

Democratizing Mobile Agent Research: MOBILEGYM lowers the barrier to entry for reproducible research and scalable training on everyday mobile tasks, eliminating the need for real accounts, device farms, or proprietary backends.
New Evaluation Paradigm: The combination of deterministic state-based judging and the AnswerSheet protocol moves beyond unreliable VLM judges and brittle string-matching heuristics, providing grounded, reliable metrics.
Enabling Scalable Online RL: The platform's efficiency (~400 MB/instance) makes large-scale parallel online RL feasible on commodity hardware, a capability previously restricted by the cost of emulator clusters or real devices.
Safety and Alignment Research: The sandboxed, consequence-free environment with full state control is ideal for studying agent robustness, prompt-injection susceptibility, refusal training, and the safety implications of high-risk operations (e.g., payments, deletions).
Sim-to-Real Transfer Validation: The study provides an existence proof that policies learned in an interaction-fidelity simulator can effectively transfer to real devices, retaining 95.1% of the simulation-side training gain.

Conclusion

MOBILEGYM transforms everyday mobile use into a fully controllable simulation environment for GUI agent research. By prioritizing interaction fidelity and representing state as structured JSON, it solves the core problems of state readability, writability, forkability, and consequence-free operation. The accompanying benchmark provides a diverse, reliably evaluated task suite. Results show significant headroom for improvement on everyday tasks and demonstrate effective Sim-to-Real transfer. The platform opens avenues for scalable online RL, safety alignment research, and the creation of custom, controllable mobile environments, advancing the field beyond the limitations of current emulator and real-device approaches.

Limitations include differences in visual details compared to real apps, modeling of server-driven dynamic content as controllable state rather than stochastic backends, and coverage of main app scenarios rather than every feature. Ethical considerations emphasize MOBILEGYM's role as a sandboxed research tool, its legality for academic use, and the double-edged nature of evaluating high-risk operation capabilities—highlighting the need for paired safety-alignment research.