MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management

Summary (Overview)

Proactive context management: Introduces ConAct (Context-as-Action), a paradigm that treats context management as first-class actions emitted by the same policy that selects UI actions, replacing passive ReAct-style prompting.
Three structured context fields: Folded Action History, Folded UI State, and Recent Step Record keep context compact while preserving critical UI facts across long trajectories.
MemGUI-3K dataset: A 2,956-trajectory dataset with full ConAct annotations, averaging 28.8 steps (1.9× longer than previous datasets), supporting supervised training and offline analysis.
State-of-the-art results: Zero-shot Qwen3-VL-235B-Thinking with ConAct achieves 62.5% Pass@3 on MemGUI-Bench, surpassing agentic frameworks with Gemini-2.5-Pro; MemGUI-8B-SFT achieves the best open-data 8B performance and generalizes to MobileWorld.
Complementary components: Ablation shows that memory, folding, and self-description each target different failure modes; together they reduce total failures by 41%.

Introduction and Theoretical Foundation

Recent MLLM-based mobile GUI agents can understand screenshots, reason over user goals, and control devices, but remain unreliable on long-horizon tasks requiring retention of intermediate facts across many steps and app transitions. For example, GUI-Owl-1.5-8B drops from 71.6% on AndroidWorld (avg. 8.4 steps) to 38.2% on MobileWorld (avg. 27.8 steps) and further to 11.7% on MemGUI-Bench (avg. 36.2 steps).

The core problem is context management:

Prompt explosion: Prompts grow linearly with task horizon as per-step records are passively appended.
Information loss: Critical cross-app facts (prices, identifiers, copied text) are diluted, paraphrased, truncated, or forgotten.

Existing strategies fail to provide both compact working context and persistent UI-derived facts:

External memory moves curation outside the end-to-end policy.
Prompt-based/rule-based methods either accumulate passive logs or discard old facts mechanically.

This motivates a policy-level mechanism that decides what to compress, what to remember, and what to keep available while acting.

Methodology

Problem Formulation

Mobile GUI automation is formulated as a sequential decision problem with a structured working context. Given task goal $G$ and screenshot $I_t$ , the agent observes:

S_t = (G, H_t, M_t, L_t)

where $H_t$ , $M_t$ , and $L_t$ denote Folded Action History, Folded UI State, and Recent Step Record, storing compressed trajectory summaries, persistent UI-derived facts, and latest-step details.

At each step, the MLLM policy emits a joint ConAct decision:

y_t = (\tau_t, \phi_t, a_t, o_t, \iota_t) \sim \pi_\theta(\cdot \mid I_t, S_t)

where $\tau_t$ is reasoning, $\phi_t$ is a folding directive, $a_t$ is a UI or memory action, $o_t$ is the UI observation, and $\iota_t$ is the action intent.

Step Output Protocol

The model emits a 5-part structured output:

<thinking> reasoning </thinking>
<folding> {range, summary} </folding>
<tool_call> UI / memory action </tool_call>
<ui_observation> screen facts </ui_observation>
<action_intent> next-step plan </action_intent>

The action space is:

A = A_{ui} \cup A_{mem}, \quad A_{mem} = \{ \text{add}, \text{update}, \text{delete} \}

History Folding

The folding directive is $\phi_t = ([s_t, t], z_t)$ , where $[s_t, t]$ is the history span to compress and $z_t$ is the generated summary. The folded history updates as:

H_{t+1} = \text{Fold}(H_t, \phi_t)

UI Memory Actions

Each memory item is a structured triple $m = (id, d, c)$ (identifier, description, content). Memory actions induce:

M_{t+1} = \begin{cases} \text{Add}(M_t, m), & a_t = a^+, \\ \text{Update}(M_t, id, c), & a_t = a^\circ, \\ \text{Delete}(M_t, id), & a_t = a^-, \\ M_t, & a_t \in A_{ui}. \end{cases}

Self-Describing Step Output

The two self-describing fields form:

L_{t+1} = (o_t, \iota_t, a_t, r_t)

State Transition

The environment/tool result is:

(r_t, I_{t+1}) = \begin{cases} \text{Env}(I_t, a_t), & a_t \in A_{ui}, \\ (\text{ok}, I_t), & a_t \in A_{mem}. \end{cases}

The complete state transition is:

S_{t+1} = T(S_t, y_t, r_t) = (G, H_{t+1}, M_{t+1}, L_{t+1})

MemGUI-3K Dataset

Constructed from 128 MemGUI-Bench seed tasks expanded to a 7,303-task pool.
Teacher: Qwen3-VL-235B-Thinking with ConAct.
Filtered to 2,956 successful trajectories (64,430 SFT samples) across 26 apps.
Average trajectory length: 28.8 steps (median 25).
65.1% of trajectories contain memory actions; 23.8% of folds are span-level (avg. span 6.25 steps).

Empirical Validation / Results

Main Results on MemGUI-Bench

Agent	Memory Type	Easy P@1	Easy P@3	Medium P@1	Medium P@3	Hard P@1	Hard P@3	Overall P@1	Overall P@3	Overall IRR
M3A [16]	Memory Agent	39.6	47.9	35.7	50.0	21.1	44.7	32.8	47.7	39.3
Qwen3-VL-235B-Thinking [2]	Action-Thought	18.8	47.9	28.6	47.6	26.3	44.7	24.2	46.9	30.0
MemGUI-Agent-235B (Ours)	ConAct	41.7	68.8	35.7	61.9	34.2	55.3	37.5	62.5	46.8
Qwen3-VL-8B-Instruct [2]	Action-Thought	18.8	35.4	4.8	14.3	2.6	7.9	9.4	20.3	15.1
MemGUI-8B-SFT (Ours)	ConAct	25.0	39.6	23.8	33.3	21.1	34.2	23.4	35.9	30.2

MobileWorld GUI-Only Results

Agent	SR (%)
GUI-Owl-1.5-8B-Instruct [20]	38.2
MemGUI-Agent-235B (Ours, zero-shot)	29.1
Qwen3-VL-235B-Thinking [2]	14.5
OpenMobile-8B [3]	17.7
MemGUI-8B-SFT (Ours)	17.9

Offline Skill Analysis (ConAct Subskills)

Metric	8B-Inst. (ZS)	MemGUI-8B-SFT	∆	MemGUI-Agent-235B
UI Action Match (%)	29.2	36.3	+7.1	33.9
Memory Trigger F1 (%)	19.9	48.0	+28.1	34.3
Deep Folding Ratio (%)	8.8	26.1	+17.3	18.7
Deep Range Accuracy (%)	45.2	58.9	+13.7	56.7
Format Tags Acc. (%)	94.9	99.9	+5.0	99.2

Ablation on MemGUI-Bench-40 (Qwen3-VL-235B-Thinking, zero-shot)

Variant	P@1	P@2	P@3	MTPR	IRR
ReAct baseline	5.0	20.0	27.5	0.143	19.5
+ UI memory actions	17.5	35.0	42.5	0.357	39.1
+ history folding	22.5	25.0	32.5	0.179	25.7
+ self-describing step	25.0	40.0	45.0	0.214	33.1
Full ConAct (Ours)	40.0	52.5	62.5	0.429	51.0

Error Analysis

Figure 7 (reported in text): Full ConAct reduces total failures by 41% (99 → 58), mainly through process hallucination (−22, −42%) and output hallucination (−17, −57%). Knowledge deficiency, intent misunderstanding, and other errors remain roughly stable.

Theoretical and Practical Implications

Theoretical: Context management should be a first-class action in the policy, not a separate module or passive log. The three components (folding, memory, self-description) are complementary and each addresses a distinct failure mode.
Practical: ConAct can be applied zero-shot to large models (benefits only emerge at 235B-Thinking scale) or learned via SFT at 8B scale. The learned skills transfer out-of-distribution (MobileWorld). The approach reduces context-induced hallucinations but does not solve knowledge or intent errors, indicating where further progress is needed.
Dataset: MemGUI-3K provides long-horizon (avg. 28.8 steps) supervised trajectories with full context-management annotations, enabling training and offline analysis of proactive context decisions.

Conclusion

MemGUI-Agent is an end-to-end long-horizon mobile GUI agent that manages context inside the action policy through ConAct, unifying history folding, UI memory actions, and self-describing step outputs. Zero-shot ConAct with Qwen3-VL-235B-Thinking sets a new SOTA on MemGUI-Bench (62.5% Pass@3). The MemGUI-3K dataset enables MemGUI-8B-SFT to achieve the best open-data 8B performance and generalize to MobileWorld. The three ConAct components are complementary: folding controls context growth, memory preserves exact facts, and self-description grounds both.

Limitations: Experiments focus on Android-style mobile GUI environments; extending to iOS, desktop, and web interfaces is future work.