MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Summary (Overview)

  • Proactive context management: Introduces ConAct (Context-as-Action), a paradigm that treats context management as first-class actions emitted by the same policy that selects UI actions, replacing passive ReAct-style prompting.
  • Three structured context fields: Folded Action History, Folded UI State, and Recent Step Record keep context compact while preserving critical UI facts across long trajectories.
  • MemGUI-3K dataset: A 2,956-trajectory dataset with full ConAct annotations, averaging 28.8 steps (1.9× longer than previous datasets), supporting supervised training and offline analysis.
  • State-of-the-art results: Zero-shot Qwen3-VL-235B-Thinking with ConAct achieves 62.5% Pass@3 on MemGUI-Bench, surpassing agentic frameworks with Gemini-2.5-Pro; MemGUI-8B-SFT achieves the best open-data 8B performance and generalizes to MobileWorld.
  • Complementary components: Ablation shows that memory, folding, and self-description each target different failure modes; together they reduce total failures by 41%.

Introduction and Theoretical Foundation

Recent MLLM-based mobile GUI agents can understand screenshots, reason over user goals, and control devices, but remain unreliable on long-horizon tasks requiring retention of intermediate facts across many steps and app transitions. For example, GUI-Owl-1.5-8B drops from 71.6% on AndroidWorld (avg. 8.4 steps) to 38.2% on MobileWorld (avg. 27.8 steps) and further to 11.7% on MemGUI-Bench (avg. 36.2 steps).

The core problem is context management:

  • Prompt explosion: Prompts grow linearly with task horizon as per-step records are passively appended.
  • Information loss: Critical cross-app facts (prices, identifiers, copied text) are diluted, paraphrased, truncated, or forgotten.

Existing strategies fail to provide both compact working context and persistent UI-derived facts:

  • External memory moves curation outside the end-to-end policy.
  • Prompt-based/rule-based methods either accumulate passive logs or discard old facts mechanically.

This motivates a policy-level mechanism that decides what to compress, what to remember, and what to keep available while acting.

Methodology

Problem Formulation

Mobile GUI automation is formulated as a sequential decision problem with a structured working context. Given task goal GG and screenshot ItI_t, the agent observes:

St=(G,Ht,Mt,Lt)S_t = (G, H_t, M_t, L_t)

where HtH_t, MtM_t, and LtL_t denote Folded Action History, Folded UI State, and Recent Step Record, storing compressed trajectory summaries, persistent UI-derived facts, and latest-step details.

At each step, the MLLM policy emits a joint ConAct decision:

yt=(τt,ϕt,at,ot,ιt)πθ(It,St)y_t = (\tau_t, \phi_t, a_t, o_t, \iota_t) \sim \pi_\theta(\cdot \mid I_t, S_t)

where τt\tau_t is reasoning, ϕt\phi_t is a folding directive, ata_t is a UI or memory action, oto_t is the UI observation, and ιt\iota_t is the action intent.

Step Output Protocol

The model emits a 5-part structured output:

  • <thinking> reasoning </thinking>
  • <folding> {range, summary} </folding>
  • <tool_call> UI / memory action </tool_call>
  • <ui_observation> screen facts </ui_observation>
  • <action_intent> next-step plan </action_intent>

The action space is:

A=AuiAmem,Amem={add,update,delete}A = A_{ui} \cup A_{mem}, \quad A_{mem} = \{ \text{add}, \text{update}, \text{delete} \}

History Folding

The folding directive is ϕt=([st,t],zt)\phi_t = ([s_t, t], z_t), where [st,t][s_t, t] is the history span to compress and ztz_t is the generated summary. The folded history updates as:

Ht+1=Fold(Ht,ϕt)H_{t+1} = \text{Fold}(H_t, \phi_t)

UI Memory Actions

Each memory item is a structured triple m=(id,d,c)m = (id, d, c) (identifier, description, content). Memory actions induce:

Mt+1={Add(Mt,m),at=a+,Update(Mt,id,c),at=a,Delete(Mt,id),at=a,Mt,atAui.M_{t+1} = \begin{cases} \text{Add}(M_t, m), & a_t = a^+, \\ \text{Update}(M_t, id, c), & a_t = a^\circ, \\ \text{Delete}(M_t, id), & a_t = a^-, \\ M_t, & a_t \in A_{ui}. \end{cases}

Self-Describing Step Output

The two self-describing fields form:

Lt+1=(ot,ιt,at,rt)L_{t+1} = (o_t, \iota_t, a_t, r_t)

State Transition

The environment/tool result is:

(rt,It+1)={Env(It,at),atAui,(ok,It),atAmem.(r_t, I_{t+1}) = \begin{cases} \text{Env}(I_t, a_t), & a_t \in A_{ui}, \\ (\text{ok}, I_t), & a_t \in A_{mem}. \end{cases}

The complete state transition is:

St+1=T(St,yt,rt)=(G,Ht+1,Mt+1,Lt+1)S_{t+1} = T(S_t, y_t, r_t) = (G, H_{t+1}, M_{t+1}, L_{t+1})

MemGUI-3K Dataset

  • Constructed from 128 MemGUI-Bench seed tasks expanded to a 7,303-task pool.
  • Teacher: Qwen3-VL-235B-Thinking with ConAct.
  • Filtered to 2,956 successful trajectories (64,430 SFT samples) across 26 apps.
  • Average trajectory length: 28.8 steps (median 25).
  • 65.1% of trajectories contain memory actions; 23.8% of folds are span-level (avg. span 6.25 steps).

Empirical Validation / Results

Main Results on MemGUI-Bench

AgentMemory TypeEasy P@1Easy P@3Medium P@1Medium P@3Hard P@1Hard P@3Overall P@1Overall P@3Overall IRR
M3A [16]Memory Agent39.647.935.750.021.144.732.847.739.3
Qwen3-VL-235B-Thinking [2]Action-Thought18.847.928.647.626.344.724.246.930.0
MemGUI-Agent-235B (Ours)ConAct41.768.835.761.934.255.337.562.546.8
Qwen3-VL-8B-Instruct [2]Action-Thought18.835.44.814.32.67.99.420.315.1
MemGUI-8B-SFT (Ours)ConAct25.039.623.833.321.134.223.435.930.2

MobileWorld GUI-Only Results

AgentSR (%)
GUI-Owl-1.5-8B-Instruct [20]38.2
MemGUI-Agent-235B (Ours, zero-shot)29.1
Qwen3-VL-235B-Thinking [2]14.5
OpenMobile-8B [3]17.7
MemGUI-8B-SFT (Ours)17.9

Offline Skill Analysis (ConAct Subskills)

Metric8B-Inst. (ZS)MemGUI-8B-SFTMemGUI-Agent-235B
UI Action Match (%)29.236.3+7.133.9
Memory Trigger F1 (%)19.948.0+28.134.3
Deep Folding Ratio (%)8.826.1+17.318.7
Deep Range Accuracy (%)45.258.9+13.756.7
Format Tags Acc. (%)94.999.9+5.099.2

Ablation on MemGUI-Bench-40 (Qwen3-VL-235B-Thinking, zero-shot)

VariantP@1P@2P@3MTPRIRR
ReAct baseline5.020.027.50.14319.5
+ UI memory actions17.535.042.50.35739.1
+ history folding22.525.032.50.17925.7
+ self-describing step25.040.045.00.21433.1
Full ConAct (Ours)40.052.562.50.42951.0

Error Analysis

Figure 7 (reported in text): Full ConAct reduces total failures by 41% (99 → 58), mainly through process hallucination (−22, −42%) and output hallucination (−17, −57%). Knowledge deficiency, intent misunderstanding, and other errors remain roughly stable.

Theoretical and Practical Implications

  • Theoretical: Context management should be a first-class action in the policy, not a separate module or passive log. The three components (folding, memory, self-description) are complementary and each addresses a distinct failure mode.
  • Practical: ConAct can be applied zero-shot to large models (benefits only emerge at 235B-Thinking scale) or learned via SFT at 8B scale. The learned skills transfer out-of-distribution (MobileWorld). The approach reduces context-induced hallucinations but does not solve knowledge or intent errors, indicating where further progress is needed.
  • Dataset: MemGUI-3K provides long-horizon (avg. 28.8 steps) supervised trajectories with full context-management annotations, enabling training and offline analysis of proactive context decisions.

Conclusion

MemGUI-Agent is an end-to-end long-horizon mobile GUI agent that manages context inside the action policy through ConAct, unifying history folding, UI memory actions, and self-describing step outputs. Zero-shot ConAct with Qwen3-VL-235B-Thinking sets a new SOTA on MemGUI-Bench (62.5% Pass@3). The MemGUI-3K dataset enables MemGUI-8B-SFT to achieve the best open-data 8B performance and generalize to MobileWorld. The three ConAct components are complementary: folding controls context growth, memory preserves exact facts, and self-description grounds both.

Limitations: Experiments focus on Android-style mobile GUI environments; extending to iOS, desktop, and web interfaces is future work.

Related papers