MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management
Summary (Overview)
- Proactive context management: Introduces ConAct (Context-as-Action), a paradigm that treats context management as first-class actions emitted by the same policy that selects UI actions, replacing passive ReAct-style prompting.
- Three structured context fields: Folded Action History, Folded UI State, and Recent Step Record keep context compact while preserving critical UI facts across long trajectories.
- MemGUI-3K dataset: A 2,956-trajectory dataset with full ConAct annotations, averaging 28.8 steps (1.9× longer than previous datasets), supporting supervised training and offline analysis.
- State-of-the-art results: Zero-shot Qwen3-VL-235B-Thinking with ConAct achieves 62.5% Pass@3 on MemGUI-Bench, surpassing agentic frameworks with Gemini-2.5-Pro; MemGUI-8B-SFT achieves the best open-data 8B performance and generalizes to MobileWorld.
- Complementary components: Ablation shows that memory, folding, and self-description each target different failure modes; together they reduce total failures by 41%.
Introduction and Theoretical Foundation
Recent MLLM-based mobile GUI agents can understand screenshots, reason over user goals, and control devices, but remain unreliable on long-horizon tasks requiring retention of intermediate facts across many steps and app transitions. For example, GUI-Owl-1.5-8B drops from 71.6% on AndroidWorld (avg. 8.4 steps) to 38.2% on MobileWorld (avg. 27.8 steps) and further to 11.7% on MemGUI-Bench (avg. 36.2 steps).
The core problem is context management:
- Prompt explosion: Prompts grow linearly with task horizon as per-step records are passively appended.
- Information loss: Critical cross-app facts (prices, identifiers, copied text) are diluted, paraphrased, truncated, or forgotten.
Existing strategies fail to provide both compact working context and persistent UI-derived facts:
- External memory moves curation outside the end-to-end policy.
- Prompt-based/rule-based methods either accumulate passive logs or discard old facts mechanically.
This motivates a policy-level mechanism that decides what to compress, what to remember, and what to keep available while acting.
Methodology
Problem Formulation
Mobile GUI automation is formulated as a sequential decision problem with a structured working context. Given task goal and screenshot , the agent observes:
where , , and denote Folded Action History, Folded UI State, and Recent Step Record, storing compressed trajectory summaries, persistent UI-derived facts, and latest-step details.
At each step, the MLLM policy emits a joint ConAct decision:
where is reasoning, is a folding directive, is a UI or memory action, is the UI observation, and is the action intent.
Step Output Protocol
The model emits a 5-part structured output:
<thinking> reasoning </thinking><folding> {range, summary} </folding><tool_call> UI / memory action </tool_call><ui_observation> screen facts </ui_observation><action_intent> next-step plan </action_intent>
The action space is:
History Folding
The folding directive is , where is the history span to compress and is the generated summary. The folded history updates as:
UI Memory Actions
Each memory item is a structured triple (identifier, description, content). Memory actions induce:
Self-Describing Step Output
The two self-describing fields form:
State Transition
The environment/tool result is:
The complete state transition is:
MemGUI-3K Dataset
- Constructed from 128 MemGUI-Bench seed tasks expanded to a 7,303-task pool.
- Teacher: Qwen3-VL-235B-Thinking with ConAct.
- Filtered to 2,956 successful trajectories (64,430 SFT samples) across 26 apps.
- Average trajectory length: 28.8 steps (median 25).
- 65.1% of trajectories contain memory actions; 23.8% of folds are span-level (avg. span 6.25 steps).
Empirical Validation / Results
Main Results on MemGUI-Bench
| Agent | Memory Type | Easy P@1 | Easy P@3 | Medium P@1 | Medium P@3 | Hard P@1 | Hard P@3 | Overall P@1 | Overall P@3 | Overall IRR |
|---|---|---|---|---|---|---|---|---|---|---|
| M3A [16] | Memory Agent | 39.6 | 47.9 | 35.7 | 50.0 | 21.1 | 44.7 | 32.8 | 47.7 | 39.3 |
| Qwen3-VL-235B-Thinking [2] | Action-Thought | 18.8 | 47.9 | 28.6 | 47.6 | 26.3 | 44.7 | 24.2 | 46.9 | 30.0 |
| MemGUI-Agent-235B (Ours) | ConAct | 41.7 | 68.8 | 35.7 | 61.9 | 34.2 | 55.3 | 37.5 | 62.5 | 46.8 |
| Qwen3-VL-8B-Instruct [2] | Action-Thought | 18.8 | 35.4 | 4.8 | 14.3 | 2.6 | 7.9 | 9.4 | 20.3 | 15.1 |
| MemGUI-8B-SFT (Ours) | ConAct | 25.0 | 39.6 | 23.8 | 33.3 | 21.1 | 34.2 | 23.4 | 35.9 | 30.2 |
MobileWorld GUI-Only Results
| Agent | SR (%) |
|---|---|
| GUI-Owl-1.5-8B-Instruct [20] | 38.2 |
| MemGUI-Agent-235B (Ours, zero-shot) | 29.1 |
| Qwen3-VL-235B-Thinking [2] | 14.5 |
| OpenMobile-8B [3] | 17.7 |
| MemGUI-8B-SFT (Ours) | 17.9 |
Offline Skill Analysis (ConAct Subskills)
| Metric | 8B-Inst. (ZS) | MemGUI-8B-SFT | ∆ | MemGUI-Agent-235B |
|---|---|---|---|---|
| UI Action Match (%) | 29.2 | 36.3 | +7.1 | 33.9 |
| Memory Trigger F1 (%) | 19.9 | 48.0 | +28.1 | 34.3 |
| Deep Folding Ratio (%) | 8.8 | 26.1 | +17.3 | 18.7 |
| Deep Range Accuracy (%) | 45.2 | 58.9 | +13.7 | 56.7 |
| Format Tags Acc. (%) | 94.9 | 99.9 | +5.0 | 99.2 |
Ablation on MemGUI-Bench-40 (Qwen3-VL-235B-Thinking, zero-shot)
| Variant | P@1 | P@2 | P@3 | MTPR | IRR |
|---|---|---|---|---|---|
| ReAct baseline | 5.0 | 20.0 | 27.5 | 0.143 | 19.5 |
| + UI memory actions | 17.5 | 35.0 | 42.5 | 0.357 | 39.1 |
| + history folding | 22.5 | 25.0 | 32.5 | 0.179 | 25.7 |
| + self-describing step | 25.0 | 40.0 | 45.0 | 0.214 | 33.1 |
| Full ConAct (Ours) | 40.0 | 52.5 | 62.5 | 0.429 | 51.0 |
Error Analysis
Figure 7 (reported in text): Full ConAct reduces total failures by 41% (99 → 58), mainly through process hallucination (−22, −42%) and output hallucination (−17, −57%). Knowledge deficiency, intent misunderstanding, and other errors remain roughly stable.
Theoretical and Practical Implications
- Theoretical: Context management should be a first-class action in the policy, not a separate module or passive log. The three components (folding, memory, self-description) are complementary and each addresses a distinct failure mode.
- Practical: ConAct can be applied zero-shot to large models (benefits only emerge at 235B-Thinking scale) or learned via SFT at 8B scale. The learned skills transfer out-of-distribution (MobileWorld). The approach reduces context-induced hallucinations but does not solve knowledge or intent errors, indicating where further progress is needed.
- Dataset: MemGUI-3K provides long-horizon (avg. 28.8 steps) supervised trajectories with full context-management annotations, enabling training and offline analysis of proactive context decisions.
Conclusion
MemGUI-Agent is an end-to-end long-horizon mobile GUI agent that manages context inside the action policy through ConAct, unifying history folding, UI memory actions, and self-describing step outputs. Zero-shot ConAct with Qwen3-VL-235B-Thinking sets a new SOTA on MemGUI-Bench (62.5% Pass@3). The MemGUI-3K dataset enables MemGUI-8B-SFT to achieve the best open-data 8B performance and generalize to MobileWorld. The three ConAct components are complementary: folding controls context growth, memory preserves exact facts, and self-description grounds both.
Limitations: Experiments focus on Android-style mobile GUI environments; extending to iOS, desktop, and web interfaces is future work.
Related papers
- GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents
No current method excels at utility, access control, and active forgetting in shared-memory agent benchmarks, with long-context prompting best but costly.
- World Action Models: A Survey
World Action Models unify vision-language-action and world models; the field trend is generating less of the future while preserving control information.
- Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
ZPPO injects teacher knowledge only into prompts via BCQ and NCQ on hard questions, outperforming distillation and RL at small scales.