Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining - Summary

Summary (Overview)

Introduces Video2GUI, a fully automated, scalable framework for extracting high-quality GUI interaction trajectories from billions of unlabeled Internet tutorial videos, eliminating the need for costly manual annotation.
Constructs WildGUI, the largest open-source GUI pre-training dataset to date, containing 12.7 million trajectories and 124.5 million screenshots spanning over 1,500 applications and websites across desktop, mobile, and web platforms.
Demonstrates significant performance gains through pre-training on WildGUI: models (Qwen2.5-VL and Mimo-VL) show consistent improvements of 5–20% across multiple GUI grounding and agent benchmarks, matching or surpassing state-of-the-art performance.
Proposes a two-stage training strategy: (1) Continual pre-training on WildGUI with a mixed objective ($\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{ground}} + \mathcal{L}_{\text{action}} + \mathcal{L}_{\text{traj}}$), followed by (2) post-training on curated high-quality datasets to refine task-specific capabilities.
Shows strong scaling effects: Agent performance on grounding and agentic tasks improves steadily with increased pre-training data scale, up to 200 billion tokens, without evident saturation.

Introduction and Theoretical Foundation

Recent advances in multimodal LLMs have driven interest in developing autonomous agents capable of interacting with Graphical User Interfaces (GUIs) to automate tasks. A key bottleneck is the scarcity of large-scale, diverse training data that captures authentic user behavior patterns. Existing datasets rely heavily on manual annotation or simulation, which are costly and limit scalability and generalization.

Internet videos constitute a vast, untapped repository of real-world GUI demonstrations. However, leveraging this resource presents significant challenges: the immense diversity makes it difficult to filter high-quality instructional content, and raw videos lack the explicit, structured interaction annotations required for training. To address these challenges, this paper introduces Video2GUI, a framework to automatically filter and annotate GUI trajectories from web videos at scale.

The GUI agent interaction is formulated as a Partially Observable Markov Decision Process (POMDP), defined by the tuple $(\mathcal{U}, \mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T})$.

$\mathcal{U}$: High-level user instruction space.
$\mathcal{S}$: Environment state space.
$\mathcal{A}$: Action space of atomic GUI operations.
$\mathcal{O}$: Observation space (e.g., screenshots).
$\mathcal{T}$: Transition function $\mathcal{T}: \mathcal{S} \times \mathcal{A} \rightarrow \mathcal{S}$.

At each step $t$, the agent $\pi_\theta$ receives an observation $o_{t-1}$ and selects an action $a_t$ based on its policy and interaction history $e_{t-1} = (u, a_1, o_1, ..., a_{t-1}, o_{t-1})$. Each action $a_t$ is parameterized as a tuple $(\tau_t, b_t)$, where $\tau_t$ is the action type (e.g., click, type) and $b_t$ are the action parameters (e.g., coordinates). Executing $a_t$ leads to a new state $s_t$ and observation $o_t$, forming a trajectory $e_n = (u, a_1, o_1, ..., a_n, o_n)$.
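The trajectory notation above can be pictured as a small data structure. The following is a minimal sketch only: the class names `Action` and `Trajectory`, the field names, and the use of file names as observations are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Action:
    """One atomic GUI operation a_t = (tau_t, b_t). Names are hypothetical."""
    action_type: str          # tau_t, e.g. "click" or "type"
    params: dict[str, Any]    # b_t, e.g. {"x": 120, "y": 48}


@dataclass
class Trajectory:
    """Interaction record e_n = (u, a_1, o_1, ..., a_n, o_n)."""
    instruction: str                                   # u
    steps: list[tuple[Action, str]] = field(default_factory=list)  # (a_t, o_t)

    def append(self, action: Action, observation: str) -> None:
        """Record action a_t together with the observation o_t it produced."""
        self.steps.append((action, observation))

    def history(self, t: int) -> list[tuple[Action, str]]:
        """Return the (a, o) pairs contained in e_{t-1} for a 1-indexed step t."""
        return self.steps[: t - 1]


# Usage: observations are stood in for by screenshot file names (an assumption).
traj = Trajectory("Open the Wi-Fi settings page")
traj.append(Action("click", {"x": 120, "y": 48}), "frame_0001.png")
traj.append(Action("type", {"text": "wifi"}), "frame_0002.png")
```

Here `traj.history(2)` holds only the first (action, observation) pair, which matches $e_1$ in the notation above.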

Methodology

The Video2GUI pipeline converts raw internet videos into structured interaction trajectories $\mathcal{D} = \{ (u, e)^{(i)} \}_{i=1}^{|\mathcal{D}|}$ through three stages (see Figure 1 overview).

1. Coarse-to-Fine Video Filtering

Meta Info Filtering (Coarse): Starting from 500M+ YouTube video metadata entries, a lightweight Qwen2.5-7B model, fine-tuned on DeepSeek-V3 annotated data, classifies videos based on titles, descriptions, and keywords for GUI tutorial relevance. This reduces candidates to ~20M videos.
Video Quality Scoring (Fine): A Qwen2.5-Omni model, fine-tuned on Gemini-3-Pro annotated samples, directly analyzes the first minute of video content. It scores videos on three dimensions (see Table 5 for criteria):
1. Topic Relevance: Focus on teaching GUI operations.
2. Instruction Clarity: Clarity of narration/guidance.
3. Recording Quality: Visual clarity and stability.

Videos scoring ≥4.2 on all three dimensions are retained, yielding 4.16M high-quality videos (~300k hours).
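The retention rule can be sketched as a simple predicate over the three scores. This is an illustration under assumptions: the dictionary keys, the function name `keep_video`, and the sample scores are made up; only the 4.2 threshold and the "all dimensions" condition come from the text.

```python
THRESHOLD = 4.2  # retention threshold stated in the summary
DIMENSIONS = ("topic_relevance", "instruction_clarity", "recording_quality")


def keep_video(scores: dict[str, float], threshold: float = THRESHOLD) -> bool:
    """Retain a video only if every scored dimension meets the threshold."""
    return all(scores[d] >= threshold for d in DIMENSIONS)


# Hypothetical scores for two candidate videos.
candidates = {
    "vid_a": {"topic_relevance": 4.8, "instruction_clarity": 4.5, "recording_quality": 4.2},
    "vid_b": {"topic_relevance": 4.9, "instruction_clarity": 3.9, "recording_quality": 4.7},
}
retained = [vid for vid, s in candidates.items() if keep_video(s)]
```

In this toy example `vid_b` is dropped because a single dimension falls below the threshold, even though its other two scores are high.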

2. Trajectory Extraction

Given a filtered video $V$, the goal is to extract a set of instruction-trajectory pairs:

$$\mathcal{D}^{(V)} = \{ (u^{(k)}, e^{(k)}) \}_{k=1}^{N}$$

where $e^{(k)}$ is the interaction trajectory for task $u^{(k)}$. Gemini-3-Pro is used as the annotation model with a sliding-window strategy (4-minute segments) augmented with historical context memory to handle long videos and maintain task coherence. For each segment, the model extracts:

User task instruction.
Dense caption and task plan.
Platform, software, and website.
Action trajectory: A chronologically ordered list where each action includes timestamp, type, low-level grounding instruction, action rationale, and parameters (see Table 6 for desktop/mobile action spaces).
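The sliding-window strategy with context memory might look roughly like the sketch below. Several details are assumptions, since the summary does not specify them: windows are taken as consecutive and non-overlapping, the memory is a plain list of earlier segment summaries, and `annotate_segment` is a hypothetical caller-supplied stand-in for the annotation model.

```python
def segment_windows(duration_s: float, window_s: float = 240.0) -> list[tuple[float, float]]:
    """Split [0, duration_s) into consecutive windows of at most window_s seconds."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


def annotate_video(duration_s, annotate_segment):
    """Run the annotator over each window, passing earlier summaries as memory."""
    memory = []   # assumed form of the "historical context memory"
    results = []
    for start, end in segment_windows(duration_s):
        annotation = annotate_segment(start, end, list(memory))
        results.append(annotation)
        memory.append(annotation["summary"])
    return results


def stub_annotator(start, end, memory):
    """Placeholder annotator that only reports how much context it received."""
    return {"summary": f"segment {start:.0f}-{end:.0f}s", "seen_before": len(memory)}


# A 10-minute video yields two full 4-minute windows and one 2-minute remainder.
windows = segment_windows(600.0)
results = annotate_video(600.0, stub_annotator)
```

Each later segment sees the summaries of all earlier ones, which is one simple way to keep a task coherent when it spans a window boundary.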

3. Action Spatial Grounding

To map extracted low-level instructions to precise screen coordinates, high-resolution frames are retrieved for each action timestamp: $O_t = \{ o_{t-0.5s}, o_t, o_{t+0.5s} \}$ . Gemini-3-Pro is used again to determine if the action can be grounded on each frame and to predict the grounding target $b_t$ :

$$b_t = g_\phi(o_{t-0.5s}, o_t, o_{t+0.5s}, \tau_t)$$

The first frame with a valid grounding result is selected. Manual verification shows >95% accuracy.
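The frame-selection rule (take the first candidate frame with a valid grounding) can be sketched as follows. This is a simplification under assumptions: the grounder is modeled as a per-frame callable that returns a point or `None`, whereas the formula above passes all three frames to $g_\phi$ jointly; the names `ground_action` and `stub_grounder` are hypothetical.

```python
from typing import Callable, Optional

Point = tuple[int, int]


def ground_action(
    frames: list[str],
    instruction: str,
    grounder: Callable[[str, str], Optional[Point]],
) -> Optional[tuple[str, Point]]:
    """Return (frame, target) for the first frame that can be grounded, else None."""
    for frame in frames:
        target = grounder(frame, instruction)
        if target is not None:
            return frame, target
    return None


def stub_grounder(frame: str, instruction: str) -> Optional[Point]:
    """Placeholder: pretend the target is not yet visible in the earliest frame."""
    visible = {"t": (412, 96), "t+0.5s": (410, 95)}
    return visible.get(frame)


# Candidate frames in chronological order around the action timestamp.
choice = ground_action(["t-0.5s", "t", "t+0.5s"], "Click the Save button", stub_grounder)
```

In this example the earliest frame is skipped because the stub cannot ground the target there, so the frame at `t` is selected.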

4. Agent Training Strategy

A two-stage strategy is adopted to leverage the synthetic data:

Stage 1: Continual Pre-training on WildGUI: The model is trained for one epoch (~200B tokens) with a mixed objective combining three tasks: $\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{ground}} + \mathcal{L}_{\text{action}} + \mathcal{L}_{\text{traj}}$
1. GUI Grounding: Localize target UI elements.
2. GUI Action Prediction: Predict next action from a single screenshot and instruction.
3. GUI Trajectory Modeling: Autoregressively predict actions from sequences of screenshots and history.
Stage 2: Post-training: The pre-trained model is fine-tuned for three epochs (~15B tokens) on curated high-quality open-source datasets (e.g., AndroidControl, AITW) to align with precise human supervision and improve downstream task performance.
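The Stage 1 objective is an unweighted sum of three task losses. The sketch below illustrates only that combination; how each term is computed is not given in the summary, so treating each task loss as the mean of per-token negative log-likelihoods is an assumption, as are the function names and sample values.

```python
from statistics import fmean


def task_loss(token_nlls: list[float]) -> float:
    """Assumed form of a task loss: mean negative log-likelihood over target tokens."""
    return fmean(token_nlls)


def pretrain_loss(
    batches: dict[str, list[float]],
    tasks: tuple[str, ...] = ("ground", "action", "traj"),
) -> float:
    """L_pretrain = L_ground + L_action + L_traj, with equal weights as in the formula."""
    return sum(task_loss(batches[t]) for t in tasks)


# Hypothetical per-token NLL values for one batch of each task.
batches = {"ground": [1.0, 3.0], "action": [2.0], "traj": [0.5, 1.5]}
total = pretrain_loss(batches)
without_traj = pretrain_loss(batches, tasks=("ground", "action"))
```

Passing a shorter `tasks` tuple drops a term from the sum, which mirrors how an objective ablation would be set up.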

Empirical Validation / Results

Models are evaluated on GUI grounding, offline agent, and online agent benchmarks. Implementation uses the Megatron framework with a maximum of 4,096 visual tokens and a 32,768 sequence length.

Main Results

GUI Grounding Evaluation (Table 2)

On OSWorld-G, Mimo-VL-7B + WildGUI achieves a state-of-the-art average score of 67.6, surpassing Qwen3-VL-32B (60.6) and the proprietary Seed1.5-VL (62.9).
On ScreenSpot-Pro, Mimo-VL-7B + WildGUI scores 56.9, outperforming all open-source models and ranking second only to Seed1.5-VL (60.9). Qwen2.5-VL-7B shows a 15.1 point gain.

Offline GUI Agent Evaluation (Table 3)

On AndroidControl, Mimo-VL-7B + WildGUI achieves a Step Success Rate (SR) of 91.8 (Low-level) and 71.4 (High-level), significantly improving over the base model.
On the cross-lingual CAGUI benchmark, Mimo-VL-7B + WildGUI attains a Type Accuracy of 90.3 and SR of 71.0, demonstrating strong generalization.

Online GUI Agent Evaluation (Figure III)

On AndroidWorld, the full Stage1+Stage2 pipeline achieves a Success Rate (SR) of 31.9%, nearly doubling the base model's 16.4% and outperforming the Stage2-only baseline (23.3%).
On OSWorld, the model reaches 12.3% SR, compared to 10.4% for Stage2-only training. This shows that offline pre-training provides a critical foundation for generalization to dynamic, online environments.
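The comparisons above can be checked with a few lines of arithmetic. The figures are the success rates quoted in this section; the variable names are illustrative.

```python
# Success rates (%) quoted above.
android_world = {"base": 16.4, "stage2_only": 23.3, "stage1_plus_stage2": 31.9}
os_world = {"stage2_only": 10.4, "stage1_plus_stage2": 12.3}

# Ratio of the full pipeline to the base model on AndroidWorld ("nearly doubling").
ratio_vs_base = android_world["stage1_plus_stage2"] / android_world["base"]

# Absolute gains (percentage points) of Stage1+Stage2 over Stage2-only.
aw_gain = android_world["stage1_plus_stage2"] - android_world["stage2_only"]
osw_gain = os_world["stage1_plus_stage2"] - os_world["stage2_only"]

print(f"AndroidWorld ratio vs. base: {ratio_vs_base:.2f}x")
print(f"AndroidWorld gain over Stage2-only: {aw_gain:.1f} points")
print(f"OSWorld gain over Stage2-only: {osw_gain:.1f} points")
```

The ratio comes out at roughly 1.95, and the added pre-training stage contributes about 8.6 points on AndroidWorld and 1.9 points on OSWorld over Stage2-only training.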

Analysis

Scaling Effects (Figure 3)

Performance on ScreenSpot-Pro and OSWorld-G shows a strong positive correlation with the scale of pre-training data (tokens). Accuracy improves consistently up to 200B tokens, surpassing the Stage2-only baseline and showing no saturation.

Ablation Studies (Table 4)

Ablations on Mimo-VL-7B reveal the contribution of each training objective:

w/o $\mathcal{L}_{\text{traj}}$: Maintains static task performance but significantly drops on AndroidWorld (31.9 → 24.1), highlighting its importance for long-horizon planning.
w/o $\mathcal{L}_{\text{ground}}$: Causes substantial degradation on ScreenSpot-Pro (56.9 → 49.8), confirming its necessity for accurate action grounding.
w/o Stage 2: Leads to catastrophic drops across all metrics, especially on AndroidWorld (6.0), underscoring the need for alignment with high-quality human data.

Data Quality Check (Figure 4)

A user study with five expert evaluators (Krippendorff's $\alpha = 0.84$) confirms the effectiveness of the filtering pipeline:

Video Quality: Average score improved from 1.22 (No Filter) to 4.45 after meta info filtering and video scoring.
Trajectory Quality: WildGUI achieved the highest overall score of 4.62, outperforming baselines TongUI (3.35) and VideoAgentTrek (4.05).

Dataset Statistics (Figures 6 & 7)

WildGUI is diverse and large-scale:

Platforms: 65.8% Windows, 13.1% Mac, 12.7% Android, 4.5% iOS, 3.9% Linux.
Software Categories: 43.4% Internet & Communication, 20.4% Design & Media, 13.0% Development & IT.
Website Categories: 34.8% Development & AI Tools, 26.5% Business & Cloud.
Action Types: Click is the most frequent action (56.1% on Desktop, 67.0% on Mobile).

Theoretical and Practical Implications

Theoretical: Demonstrates that large-scale, diverse pre-training data, even if synthetically generated from unstructured sources, is crucial for building generalist GUI agents with strong grounding, planning, and execution capabilities. The formulation as a POMDP and the mixed training objective provide a robust framework for agent development.
Practical: The Video2GUI pipeline offers a scalable, cost-effective solution (estimated ~$0.0763 per sample for API costs) for generating massive GUI interaction datasets, reducing reliance on manual labor. The release of the WildGUI dataset (12.7M trajectories) provides a valuable resource for the community to advance GUI agent research. The performance gains show that pre-training on such data can elevate compact open-source models (7B parameters) to compete with or surpass much larger models and proprietary systems on complex GUI tasks.

Conclusion

Video2GUI addresses the data scarcity challenge in GUI agent training by introducing a fully automated framework to synthesize high-quality interaction trajectories from unlabeled internet videos. The constructed WildGUI dataset is the largest of its kind, enabling significant improvements in model generalization across grounding and agentic tasks. The results validate that scaling training with diverse, offline video data is a promising pathway toward more capable and generalized autonomous GUI agents. The authors will release both the WildGUI dataset and the Video2GUI pipeline to facilitate future research.

Critical Tables Preserved:

Table 1: Comparison with existing datasets.

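| Dataset | Platform: Website | Platform: Mobile | Platform: Desktop |
| :--- | :---: | :---: | :---: |
| MiniWoB++ | ✓ | ✓ | ✗ |
| MIND2WEB | ✓ | ✗ | ✗ |
| AITW | ✗ | ✓ | ✗ |
| AndroidControl | ✗ | ✓ | ✗ |
| GUI-World | ✓ | ✓ | ✓ |
| GUI-Odyssey | ✗ | ✓ | ✗ |
| GUI-Act | ✓ | ✗ | ✗ |
| GUI-Net | ✓ | ✓ | ✓ |
| MONDAY | ✗ | ✓ | ✗ |
| GUI-360° | ✗ | ✗ | ✓ |
| WildGUI (Ours) | ✓ | ✓ | ✓ |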

Table 2: Performance comparison on ScreenSpot-Pro and OSWorld-G.

| Agent Model | ScreenSpot-Pro Text | ScreenSpot-Pro Icon | ScreenSpot-Pro Avg | OSWorld-G Text Match. | OSWorld-G Elem. Rec. | OSWorld-G Layout Und. | OSWorld-G Fine-grained | OSWorld-G Avg |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Proprietary Models | | | | | | | | |
| Gemini-2.5-Pro | - | - | 11.4 | 59.8 | 45.5 | 49.0 | 33.6 | 45.2 |
| Seed1.5-VL | - | - | 60.9 | 73.9 | 66.7 | 69.6 | 47.0 | 62.9 |
| Open-Source Models | | | | | | | | |
| Qwen3-VL-2B* | 56.1 | 18.9 | 41.9 | 61.7 | 45.8 | 54.2 | 39.6 | 45.9 |
| GTA1-7B | 65.5 | 25.3 | 50.1 | 42.1 | 65.7 | 62.7 | 56.1 | 55.1 |
| UI-Venus-7B | 67.1 | 24.3 | 50.8 | 74.6 | 60.5 | 61.5 | 45.5 | 58.8 |
| OpenCUA-7B | - | - | 50.0 | - | - | - | - | 55.3 |
| GUI-Owl-7B | 69.4 | 31.5 | 54.9 | 64.8 | 63.6 | 61.3 | 41.0 | 55.9 |
| Qwen3-VL-8B* | 67.6 | 21.3 | 49.9 | 69.0 | 55.5 | 59.7 | 47.7 | 54.8 |
| Qwen3-VL-32B* | 73.4 | 25.0 | 54.9 | 72.8 | 63.3 | 66.4 | 51.7 | 60.6 |
| UI-TARS-72B | 50.9 | 17.5 | 38.1 | 69.4 | 60.6 | 62.9 | 45.6 | 57.1 |
| Effectiveness of WildGUI (Ours) | | | | | | | | |
| Qwen2.5-VL-7B* | - | - | 26.8 | 41.4 | 28.8 | 34.8 | 13.4 | 27.3 |
| + WildGUI | 57.0 | 17.6 | 41.9 (↑15.1) | 70.0 | 54.6 | 57.7 | 46.2 | 53.7 (↑26.4) |
| Mimo-VL-7B | 55.7 | 18.4 | 41.2 | 65.0 | 59.2 | 59.0 | 40.2 | 54.7 |
| + WildGUI | 70.1 | 33.6 | 56.9 (↑15.7) | 80.8 | 68.3 | 71.1 | 61.4 | 67.6 (↑12.9) |

Results marked with '*' are evaluated by us.

**Table