Summary of "CUA-SUITE: Massive Human-Annotated Video Demonstrations for Computer-Use Agents"

Summary (Overview)

  • Introduces CUA-SUITE, a comprehensive ecosystem for training and evaluating desktop computer-use agents (CUAs), addressing the scarcity of high-quality, continuous human demonstration data.
  • Core component is VIDEO-CUA, the largest open expert video corpus for desktop use: ~55 hours (~6 million frames) of 30 fps recordings across ~10,000 tasks and 87 professional desktop applications, with kinematic cursor traces and multi-layered reasoning annotations.
  • Complementary resources: UI-VISION (a benchmark for evaluating grounding and planning) and GROUND-CUA (a massive grounding dataset with 56K annotated screenshots and 3.6M UI element annotations).
  • Preliminary evaluation reveals current foundation action models struggle substantially with professional desktop applications, showing ~60% task failure rate and highlighting spatial grounding as a primary bottleneck.
  • Designed for universality: the continuous video format carries a superset of the information used by existing agent frameworks and can be losslessly transformed into their required formats, supporting emerging research such as screen parsing, continuous control, and visual world models.

Introduction and Theoretical Foundation

The vision of intelligent agents that can operate computers to execute complex workflows is compelling, but progress is bottlenecked by data scarcity. While recent works emphasize continuous video as the critical missing ingredient for scaling agents (Redkar et al., 2026), existing open datasets like ScaleCUA contain only sparse screenshots, equivalent to fewer than 20 hours of video. Such data lacks the temporal continuity required for visual world models or for learning continuous spatial control policies.

CUA-SUITE addresses this gap by providing richly annotated human data with dense, multi-faceted feedback: continuous video trajectories, kinematic action traces, and precise UI grounding. This unified data engine supports the full stack of computer-use intelligence, from visual perception to action planning.

Methodology

The data creation process involves a human-centric pipeline:

  1. Application Selection: 87 diverse, open-source desktop applications across 12 categories (e.g., Development, Graphics, Video/Audio) were selected to ensure broad coverage and permit free release.
  2. Expert-Driven Task Design: Human experts designed ~10,000 realistic, goal-oriented tasks they would perform in a real work setting.
  3. High-Fidelity Recording: Annotators executed tasks while the system captured continuous 30 fps screen video (~55 hours total) and logged every mouse/keyboard action with millisecond precision.
  4. Dense UI Annotation: Keyframes preceding state-changing actions were extracted. Annotators manually labeled every visible UI element with bounding boxes, textual labels, and (for ~50% of elements) one of eight high-level functional categories (see Table 5). OCR was applied for long text segments.
  5. Annotation Synthesis for VIDEO-CUA: A pipeline (adopted from OpenCUA) used Claude-Sonnet-4.5 to generate multi-layered reasoning annotations for each trajectory step: Observation, Thought, Action Description, and Reflection. This converts raw logs into the format τ_t = (s_t, o_t, r_t, d_t, a_t, s_{t+1}, ref_t), averaging 496.7 words per step.
  6. Derived Datasets: This core data was processed to create the three complementary resources: VIDEO-CUA (trajectories), GROUND-CUA (grounding data), and UI-VISION (evaluation benchmark).
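The per-step tuple τ_t from step 5 can be sketched as a simple record type. This is an illustrative data structure only; the field names and the released schema are assumptions, not the paper's actual format:

```python
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    """One annotated step tau_t = (s_t, o_t, r_t, d_t, a_t, s_{t+1}, ref_t).

    Field names are illustrative; the released schema may differ.
    """
    state: str        # s_t: frame reference before the action
    observation: str  # o_t: generated description of the current screen
    thought: str      # r_t: reasoning about what to do next
    action_desc: str  # d_t: natural-language action description
    action: str       # a_t: executable action with arguments
    next_state: str   # s_{t+1}: frame reference after the action
    reflection: str   # ref_t: post-hoc reflection on the outcome

# Hypothetical example step for an image-editing task.
step = TrajectoryStep(
    state="frame_000120.png",
    observation="The toolbox is open with the Move tool selected.",
    thought="To crop the image, I should activate the Crop tool.",
    action_desc="Click the Crop tool icon in the toolbox.",
    action="click(x=42, y=187)",
    next_state="frame_000145.png",
    reflection="The Crop tool is now active, as shown by the cursor change.",
)
```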

Empirical Validation / Results

UI-VISION Benchmark Results

UI-VISION evaluates three capabilities: Element Grounding, Layout Grounding, and Action Prediction. A focus on Element Grounding reveals:

Table 1: Element Grounding Performance on UI-VISION

| Model | Basic | Functional | Spatial | Avg. |
|---|---|---|---|---|
| MAI-UI-32B (Zhou et al., 2025b) | 59.1 | 57.1 | 26.9 | 47.7 |
| MAI-UI-8B (Zhou et al., 2025b) | 51.7 | 49.6 | 22.5 | 41.3 |
| OpenCUA-72B (Wang et al., 2025) | – | – | – | 37.3 |
| UI-Venus-Ground-72B (Gu et al., 2025) | 45.6 | 42.3 | 23.7 | 37.2 |
| PhiGround-7B + o3 (Zhang et al., 2025c) | 44.2 | 43.8 | 20.5 | 36.2 |
| OpenCUA-32B (Wang et al., 2025) | – | – | – | 33.3 |
| GUI-ARP-7B (Ye et al., 2025) | 39.6 | 35.4 | 18.6 | 31.2 |
| OpenCUA-7B (Wang et al., 2025) | – | – | – | 29.7 |
| Qwen3-VL-32B (Bai et al., 2025) | 32.8 | 34.2 | 14.7 | 27.2 |
| PhiGround-7B (Zhang et al., 2025c) | 36.8 | 37.1 | 7.6 | 27.2 |
| UI-Venus-Ground-7B (Gu et al., 2025) | 36.1 | 32.8 | 11.9 | 26.9 |
| InfiGUI-G1-7B (Liu et al., 2025b) | 36.2 | 31.9 | 11.5 | 26.5 |
| HyperClick (Zhang et al., 2025d) | 35.3 | 32.1 | 11.0 | 26.1 |
| UI-TARS-72B (Qin et al., 2025) | 31.4 | 30.5 | 14.7 | 25.5 |
| Qwen3-VL-8B (Bai et al., 2025) | 25.0 | 27.9 | 1.2 | 18.0 |
| UI-TARS-7B (Qin et al., 2025) | 20.1 | 24.3 | 8.4 | 17.6 |

Key Findings:

  • Performance has nearly doubled since UI-VISION's introduction, with MAI-UI-32B achieving a new high of 47.7% average accuracy.
  • The Spatial task split remains stubbornly difficult (~27% for the top model), indicating that reasoning about spatial relationships is a major hurdle.
  • Scaling model parameters yields consistent benefits (e.g., OpenCUA improves from 29.7% to 37.3% from 7B to 72B).
  • Reasoning-enhanced models perform better (e.g., PhiGround-7B + o3 planner improves by 9.0 points over the base model).
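Element-grounding benchmarks of this kind are typically scored by checking whether a model's predicted click point falls inside the target element's ground-truth bounding box. A minimal sketch of that scoring rule (the exact UI-VISION evaluation script may differ):

```python
def grounding_accuracy(preds, gt_boxes):
    """Fraction of predicted points (x, y) that land inside their
    corresponding ground-truth boxes (x_min, y_min, x_max, y_max)."""
    hits = 0
    for (x, y), (x0, y0, x1, y1) in zip(preds, gt_boxes):
        if x0 <= x <= x1 and y0 <= y <= y1:
            hits += 1
    return hits / len(preds)

# Two predictions: the first lands inside its box, the second misses.
acc = grounding_accuracy(
    preds=[(50, 40), (300, 10)],
    gt_boxes=[(30, 20, 80, 60), (100, 100, 200, 150)],
)
# acc == 0.5
```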

VIDEO-CUA Action Prediction Evaluation

A preliminary evaluation of OpenCUA models on 256 sampled VIDEO-CUA tasks (87 apps) under a task-level instruction setting with 5-step history context:

Table 3: Action Prediction Results

| Model | Preds | Mean Px ↓ | Med. Px ↓ | @20px ↑ | @50px ↑ |
|---|---|---|---|---|---|
| OpenCUA-7B | 1,946 | 387.5 | 236.0 | 7.9% | 16.5% |
| OpenCUA-32B | 1,999 | 274.2 | 97.0 | 22.0% | 37.7% |

Key Findings:

  • Models exhibit limited accuracy under task-level prediction (37.7% @50px for 32B model).
  • Scaling from 7B to 32B yields consistent improvement (+21.2 points @50px).
  • Per-application performance varies widely (from 3.6% to 73.3% @50px), with specialized creative tools (e.g., Darktable, Krita) posing the greatest challenge.
  • Qualitative error analysis shows models struggle to disambiguate visually similar elements across complex, multi-panel interfaces of professional desktop apps (see Figure 2 examples).
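The table's metrics can be reproduced from predicted and ground-truth coordinates. A sketch of mean/median pixel error and @Npx accuracy, assuming Euclidean distance thresholds (the paper's exact definition may differ):

```python
import math
from statistics import mean, median

def pixel_metrics(preds, targets, thresholds=(20, 50)):
    """Euclidean pixel error between predicted and ground-truth points,
    plus the fraction of predictions within each distance threshold."""
    dists = [math.dist(p, t) for p, t in zip(preds, targets)]
    out = {"mean_px": mean(dists), "med_px": median(dists)}
    for thr in thresholds:
        out[f"@{thr}px"] = sum(d <= thr for d in dists) / len(dists)
    return out

# Toy example: errors of 10, 10, and 500 pixels.
m = pixel_metrics(
    preds=[(100, 100), (210, 200), (500, 400)],
    targets=[(110, 100), (200, 200), (100, 100)],
)
# mean_px ≈ 173.3, med_px == 10.0, @20px == @50px == 2/3
```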

Human Evaluation of Predicted Trajectories

A human evaluation of the OpenCUA-32B model's predictions across 576 steps assessed functional correctness:

  • Combined stepwise accuracy: 57.6% (332/576).
  • Action correctness: 85.9% (495/576) - models frequently identify the correct action type.
  • Grounding accuracy (coordinate-based steps): 52.4% (195/372) - models often fail to localize the target UI element precisely.
  • Non-coordinate steps (keyboard, text) achieve higher accuracy (67.6%), consistent with not requiring spatial grounding.
  • Per-task accuracy ranges from 0% to 100%, confirming high application-dependence.

Theoretical and Practical Implications

The evaluations converge on a clear conclusion: current foundation action models struggle substantially with professional desktop applications, with spatial grounding as the primary bottleneck. VIDEO-CUA directly targets this domain gap through:

  1. Domain Coverage: 87 professional applications where models struggle most.
  2. Video Scale: ~55 hours of continuous 30 fps recordings capturing full temporal dynamics.
  3. Annotation Density: ~497 words per step providing rich supervisory signal.
  4. Action Diversity: Complex interaction primitives (drags, fine mouse control) underrepresented in web-centric datasets.

CUA-SUITE's universality enables support for emerging research paradigms:

  • Generalist Screen Parsing: Dense, human-verified bounding-box annotations for all interactable elements, including canvas-based widgets.
  • Continuous Spatial Control: Complete kinematic cursor trajectories preserve human movement priors (e.g., Fitts's Law) for training imitation/RL policies.
  • Visual World Models: 30 fps video paired with timestamped actions provides dense (s_t, a_t, s_{t+1}) triplets for action-conditioned video generation and lookahead planning.
  • Video-Based Reward Modeling: Continuous expert videos with task-level annotations are ideal for training reward models to assess task completion from execution video.
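The Fitts's-Law prior mentioned above relates movement time to target distance and width. A small sketch using the standard Shannon formulation; the coefficients a and b here are arbitrary illustrative values, not constants fitted to this dataset:

```python
import math

def fitts_movement_time(distance, width, a=0.1, b=0.15):
    """Shannon form of Fitts's Law: MT = a + b * log2(D/W + 1).

    a and b are device/user-specific constants (illustrative values here);
    the log term is the index of difficulty in bits.
    """
    return a + b * math.log2(distance / width + 1)

# A far, small target takes longer to acquire than a near, large one.
far_small = fitts_movement_time(distance=800, width=16)   # ID ≈ 5.67 bits
near_large = fitts_movement_time(distance=100, width=64)  # ID ≈ 1.36 bits
assert far_small > near_large
```

Kinematic cursor traces sampled at 30 fps make it possible to fit such constants from real human movements, which is what the "movement priors" bullet refers to.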

Conclusion

CUA-SUITE provides a comprehensive ecosystem to advance desktop computer-use agents. Its core, VIDEO-CUA, is the largest open expert video corpus for desktop use, complemented by pixel-precise grounding (GROUND-CUA) and rigorous evaluation (UI-VISION). Evaluations reveal persistent challenges in spatial grounding for professional applications. The dataset's continuous video streams, kinematic traces, and dense annotations support not only current training paradigms but also emerging research directions. All data, benchmarks, and models are publicly released to serve as a foundation for the next generation of general-purpose computer-use agents.