# MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

> Context-as-Action treats context management as first-class policy actions, achieving 62.5% Pass@3 on MemGUI-Bench and 41% fewer failures.

- **Source:** [arXiv](https://arxiv.org/abs/2606.19926)
- **Published:** 2026-06-25
- **Permalink:** https://picx.dev/p/mIwt7U
- **Whiteboard:** https://picx.dev/p/mIwt7U/image

## Summary

# MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

## Summary (Overview)
- **Proactive context management**: Introduces ConAct (Context-as-Action), a paradigm that treats context management as first-class actions emitted by the same policy that selects UI actions, replacing passive ReAct-style prompting.
- **Three structured context fields**: Folded Action History, Folded UI State, and Recent Step Record keep context compact while preserving critical UI facts across long trajectories.
- **MemGUI-3K dataset**: A 2,956-trajectory dataset with full ConAct annotations, averaging 28.8 steps (1.9× longer than previous datasets), supporting supervised training and offline analysis.
- **State-of-the-art results**: Zero-shot Qwen3-VL-235B-Thinking with ConAct achieves 62.5% Pass@3 on MemGUI-Bench, surpassing agentic frameworks with Gemini-2.5-Pro; MemGUI-8B-SFT achieves the best open-data 8B performance and generalizes to MobileWorld.
- **Complementary components**: Ablation shows that memory, folding, and self-description each target different failure modes; together they reduce total failures by 41%.

## Introduction and Theoretical Foundation
Recent MLLM-based mobile GUI agents can understand screenshots, reason over user goals, and control devices, but remain unreliable on long-horizon tasks requiring retention of intermediate facts across many steps and app transitions. For example, GUI-Owl-1.5-8B drops from 71.6% on AndroidWorld (avg. 8.4 steps) to 38.2% on MobileWorld (avg. 27.8 steps) and further to 11.7% on MemGUI-Bench (avg. 36.2 steps).

The core problem is **context management**:
- **Prompt explosion**: Prompts grow linearly with task horizon as per-step records are passively appended.
- **Information loss**: Critical cross-app facts (prices, identifiers, copied text) are diluted, paraphrased, truncated, or forgotten.

Existing strategies fail to provide both compact working context and persistent UI-derived facts:
- External memory moves curation outside the end-to-end policy.
- Prompt-based/rule-based methods either accumulate passive logs or discard old facts mechanically.

This motivates a **policy-level mechanism** that decides what to compress, what to remember, and what to keep available while acting.

## Methodology

### Problem Formulation
Mobile GUI automation is formulated as a sequential decision problem with a structured working context. Given task goal $G$ and screenshot $I_t$, the agent observes:

$$S_t = (G, H_t, M_t, L_t)$$

where $H_t$, $M_t$, and $L_t$ denote Folded Action History, Folded UI State, and Recent Step Record, storing compressed trajectory summaries, persistent UI-derived facts, and latest-step details.

At each step, the MLLM policy emits a joint ConAct decision:

$$y_t = (\tau_t, \phi_t, a_t, o_t, \iota_t) \sim \pi_\theta(\cdot \mid I_t, S_t)$$

where $\tau_t$ is reasoning, $\phi_t$ is a folding directive, $a_t$ is a UI or memory action, $o_t$ is the UI observation, and $\iota_t$ is the action intent.

### Step Output Protocol
The model emits a 5-part structured output:

- `<thinking> reasoning </thinking>`
- `<folding> {range, summary} </folding>`
- `<tool_call> UI / memory action </tool_call>`
- `<ui_observation> screen facts </ui_observation>`
- `<action_intent> next-step plan </action_intent>`

The action space is:

$$A = A_{ui} \cup A_{mem}, \quad A_{mem} = \{ \text{add}, \text{update}, \text{delete} \}$$

### History Folding
The folding directive is $\phi_t = ([s_t, t], z_t)$, where $[s_t, t]$ is the history span to compress and $z_t$ is the generated summary. The folded history updates as:

$$H_{t+1} = \text{Fold}(H_t, \phi_t)$$

### UI Memory Actions
Each memory item is a structured triple $m = (id, d, c)$ (identifier, description, content). Memory actions induce:

$$
M_{t+1} =
\begin{cases}
\text{Add}(M_t, m), & a_t = a^+, \\
\text{Update}(M_t, id, c), & a_t = a^\circ, \\
\text{Delete}(M_t, id), & a_t = a^-, \\
M_t, & a_t \in A_{ui}.
\end{cases}
$$

### Self-Describing Step Output
The two self-describing fields form:

$$L_{t+1} = (o_t, \iota_t, a_t, r_t)$$

### State Transition
The environment/tool result is:

$$
(r_t, I_{t+1}) =
\begin{cases}
\text{Env}(I_t, a_t), & a_t \in A_{ui}, \\
(\text{ok}, I_t), & a_t \in A_{mem}.
\end{cases}
$$

The complete state transition is:

$$S_{t+1} = T(S_t, y_t, r_t) = (G, H_{t+1}, M_{t+1}, L_{t+1})$$

### MemGUI-3K Dataset
- Constructed from 128 MemGUI-Bench seed tasks expanded to a 7,303-task pool.
- Teacher: Qwen3-VL-235B-Thinking with ConAct.
- Filtered to 2,956 successful trajectories (64,430 SFT samples) across 26 apps.
- Average trajectory length: 28.8 steps (median 25).
- 65.1% of trajectories contain memory actions; 23.8% of folds are span-level (avg. span 6.25 steps).

## Empirical Validation / Results

### Main Results on MemGUI-Bench

| Agent | Memory Type | Easy P@1 | Easy P@3 | Medium P@1 | Medium P@3 | Hard P@1 | Hard P@3 | Overall P@1 | Overall P@3 | Overall IRR |
|---|---|---|---|---|---|---|---|---|---|---|
| M3A [16] | Memory Agent | 39.6 | 47.9 | 35.7 | 50.0 | 21.1 | 44.7 | 32.8 | 47.7 | 39.3 |
| Qwen3-VL-235B-Thinking [2] | Action-Thought | 18.8 | 47.9 | 28.6 | 47.6 | 26.3 | 44.7 | 24.2 | 46.9 | 30.0 |
| **MemGUI-Agent-235B (Ours)** | **ConAct** | **41.7** | **68.8** | **35.7** | **61.9** | **34.2** | **55.3** | **37.5** | **62.5** | **46.8** |
| Qwen3-VL-8B-Instruct [2] | Action-Thought | 18.8 | 35.4 | 4.8 | 14.3 | 2.6 | 7.9 | 9.4 | 20.3 | 15.1 |
| **MemGUI-8B-SFT (Ours)** | **ConAct** | 25.0 | 39.6 | 23.8 | 33.3 | 21.1 | 34.2 | 23.4 | 35.9 | 30.2 |

### MobileWorld GUI-Only Results
| Agent | SR (%) |
|---|---|
| GUI-Owl-1.5-8B-Instruct [20] | 38.2 |
| **MemGUI-Agent-235B (Ours, zero-shot)** | **29.1** |
| Qwen3-VL-235B-Thinking [2] | 14.5 |
| OpenMobile-8B [3] | 17.7 |
| **MemGUI-8B-SFT (Ours)** | **17.9** |

### Offline Skill Analysis (ConAct Subskills)

| Metric | 8B-Inst. (ZS) | MemGUI-8B-SFT | ∆ | MemGUI-Agent-235B |
|---|---|---|---|---|
| UI Action Match (%) | 29.2 | 36.3 | +7.1 | 33.9 |
| Memory Trigger F1 (%) | 19.9 | 48.0 | +28.1 | 34.3 |
| Deep Folding Ratio (%) | 8.8 | 26.1 | +17.3 | 18.7 |
| Deep Range Accuracy (%) | 45.2 | 58.9 | +13.7 | 56.7 |
| Format Tags Acc. (%) | 94.9 | 99.9 | +5.0 | 99.2 |

### Ablation on MemGUI-Bench-40 (Qwen3-VL-235B-Thinking, zero-shot)
| Variant | P@1 | P@2 | P@3 | MTPR | IRR |
|---|---|---|---|---|---|
| ReAct baseline | 5.0 | 20.0 | 27.5 | 0.143 | 19.5 |
| + UI memory actions | 17.5 | 35.0 | 42.5 | 0.357 | 39.1 |
| + history folding | 22.5 | 25.0 | 32.5 | 0.179 | 25.7 |
| + self-describing step | 25.0 | 40.0 | 45.0 | 0.214 | 33.1 |
| **Full ConAct (Ours)** | **40.0** | **52.5** | **62.5** | **0.429** | **51.0** |

### Error Analysis
Figure 7 (reported in text): Full ConAct reduces total failures by 41% (99 → 58), mainly through process hallucination (−22, −42%) and output hallucination (−17, −57%). Knowledge deficiency, intent misunderstanding, and other errors remain roughly stable.

## Theoretical and Practical Implications
- **Theoretical**: Context management should be a first-class action in the policy, not a separate module or passive log. The three components (folding, memory, self-description) are complementary and each addresses a distinct failure mode.
- **Practical**: ConAct can be applied zero-shot to large models (benefits only emerge at 235B-Thinking scale) or learned via SFT at 8B scale. The learned skills transfer out-of-distribution (MobileWorld). The approach reduces context-induced hallucinations but does not solve knowledge or intent errors, indicating where further progress is needed.
- **Dataset**: MemGUI-3K provides long-horizon (avg. 28.8 steps) supervised trajectories with full context-management annotations, enabling training and offline analysis of proactive context decisions.

## Conclusion
MemGUI-Agent is an end-to-end long-horizon mobile GUI agent that manages context inside the action policy through ConAct, unifying history folding, UI memory actions, and self-describing step outputs. Zero-shot ConAct with Qwen3-VL-235B-Thinking sets a new SOTA on MemGUI-Bench (62.5% Pass@3). The MemGUI-3K dataset enables MemGUI-8B-SFT to achieve the best open-data 8B performance and generalize to MobileWorld. The three ConAct components are complementary: folding controls context growth, memory preserves exact facts, and self-description grounds both.

**Limitations**: Experiments focus on Android-style mobile GUI environments; extending to iOS, desktop, and web interfaces is future work.

---

_Markdown view of https://picx.dev/p/mIwt7U, served by PicX — AI-generated visual whiteboard summaries of research papers._
