Paper summary: "Video-Based Reward Modeling for Computer-Use Agents"

Summary (Overview)

  • Main Contribution: Introduces a novel method for evaluating Computer-Use Agents (CUAs) by modeling rewards from execution videos (keyframe sequences of an agent's trajectory), which is independent of the agent's internal reasoning or actions.
  • Key Dataset: Introduces Execution Video Reward 53k (ExeVR-53k) dataset, containing 53k high-quality video-task-reward triplets for training reward models.
  • Key Technique: Proposes adversarial instruction translation to synthesize negative samples with step-level annotations, addressing the scarcity of failure data.
  • Key Innovation: Designs spatiotemporal token pruning (STP & TTP) to enable efficient learning from long, high-resolution execution videos by removing redundant tokens while preserving decisive UI changes.
  • Key Result: Fine-tunes an Execution Video Reward Model (ExeVRM) that achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini 3 Pro across multiple platforms.

Introduction and Theoretical Foundation

Background & Motivation: Computer-use agents are advancing rapidly, but scalable evaluation of whether an agent's trajectory truly fulfills a user instruction remains challenging. Traditional benchmarks rely on handcrafted scripts or task-specific rules, limiting scalability. A learned reward model that judges task success based on observable execution is a more flexible alternative.

Theoretical Basis: The research focuses on execution video—the observable sequence of interface states during interaction—as a method-agnostic representation comparable across different agent designs. This approach addresses two core challenges:

  1. High Redundancy: CUA trajectories contain large static interface regions, while correctness depends on subtle local changes.
  2. Limited Negative Supervision: Public datasets are dominated by successful trajectories, making it hard to build balanced training data for reward models.

Methodology

1. Dataset Construction (ExeVR-53k):

  • Unifies interaction data from multiple sources: AgentNet, ScaleCUA, and OSWorld.
  • Converts logs into a consistent step-level video representation: trajectories are segmented into atomic interaction steps, and a representative key frame is extracted per step.
  • The sequence of keyframes forms a compact video summary rendered at 1 FPS.
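The conversion described above can be sketched as follows; the `Step` structure and the last-frame selection heuristic are illustrative assumptions, not the paper's exact pipeline:

```python
# Minimal sketch of the step-level video representation: each trajectory
# log is split into atomic interaction steps, one representative keyframe
# is kept per step, and the keyframe sequence forms a 1-FPS summary video.
from dataclasses import dataclass

@dataclass
class Step:
    action: str      # e.g. "click", "type" -- illustrative fields
    frames: list     # screenshots captured during this step

def to_keyframe_video(steps):
    """Keep one representative frame per step; here the last frame,
    i.e. the post-action interface state (a common heuristic; the
    paper's selection rule may differ)."""
    return [step.frames[-1] for step in steps]

steps = [Step("click", ["f0", "f1"]), Step("type", ["f2", "f3", "f4"])]
print(to_keyframe_video(steps))  # ['f1', 'f4']
```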

2. Adversarial Instruction Translation (for Negative Samples):

  • Starts from a valid trajectory segment.
  • Uses a vision-language model (GPT-5.2) to generate an unpaired task instruction that is plausible in the same interface context but does not match the demonstrated segment.
  • The model outputs a justification for the mismatch and a reference step (time index where mismatch becomes evident), providing hard negatives and attribution labels.
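The translation step can be sketched as follows; the prompt wording, JSON schema, and field names are illustrative assumptions, not the paper's actual prompt:

```python
# Sketch of adversarial instruction translation: given keyframes from a
# valid trajectory segment, a VLM is asked for a plausible-but-unpaired
# instruction, a justification, and the step where the mismatch first
# becomes evident. The VLM call itself is omitted; only the prompt and
# reply handling are shown.
import json

def build_prompt(num_steps):
    return (
        f"You see keyframes of a UI trajectory with {num_steps} steps. "
        "Write a task instruction that is plausible in this interface "
        "but NOT fulfilled by the shown steps. Reply as JSON: "
        '{"instruction": ..., "justification": ..., "reference_step": <int>}'
    )

def parse_negative_sample(vlm_reply):
    """Turn the VLM reply into a video-task-reward triplet entry
    labeled as a hard negative with a step-level attribution."""
    obj = json.loads(vlm_reply)
    return {
        "instruction": obj["instruction"],
        "reward": 0,                         # hard negative
        "reference_step": obj["reference_step"],
    }

reply = ('{"instruction": "Delete the draft email", '
         '"justification": "The agent only opens the inbox", '
         '"reference_step": 2}')
print(parse_negative_sample(reply)["reference_step"])  # 2
```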

3. Spatiotemporal Token Pruning (for Efficient Training):

  • Spatial Token Pruning (STP): Removes visually homogeneous regions within each frame (e.g., large static backgrounds) by building a per-frame connectivity graph over UI patches and pruning its large connected components.
    • Algorithm 2 details the process. Patches are connected if they are neighbors and their feature distance is below a threshold $\tau_s$.
    • The spatial mask is defined as $M_s^{(t)}(i, j) = \begin{cases} 0 & \text{if } C^{(t)}(i, j) \in R^{(t)} \\ 1 & \text{otherwise} \end{cases}$, where $C^{(t)}(i, j)$ is the connected component containing patch $(i, j)$ and $R^{(t)}$ is the set of large components selected for pruning.
  • Temporal Token Pruning (TTP): Removes tokens that remain nearly unchanged across consecutive frames, focusing the model on meaningful state transitions.
    • Algorithm 3 details the process. For each spatial location $i$, a reference token $v_i^{(\mathrm{ref})}$ is maintained, initialized from the first frame.
    • The temporal mask $M_t(t, i)$ is set to 1 (kept) if the cosine similarity between the reference token and the current token is $\leq \tau_t$, and 0 (pruned) otherwise.
    • The reference is updated as $v_i^{(\mathrm{ref})} \gets v_i^{(t)}$ if $M_t(t, i) = 1$; otherwise it remains unchanged.
  • Combined Training: Algorithm 1 outlines the overall training process with spatiotemporal token pruning. The final token mask $M$ is the conjunction ($\land$) of the spatial and temporal masks. Pruned tokens are dropped, and the remaining sequence is packed for input to the LLM.
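The mask logic above can be sketched end to end in a toy form; this version uses scalar patch features, a 1-D patch layout, and an absolute-difference stand-in for cosine similarity, so the thresholds, size cutoff, and similarity function are all illustrative assumptions, not the paper's:

```python
# Toy sketch of STP + TTP mask construction (1 = keep, 0 = prune).

def spatial_mask(feats, tau_s=0.1, max_size=2):
    """STP: connect neighboring patches whose feature distance is below
    tau_s, then prune connected components larger than max_size."""
    n, comp, cid = len(feats), [-1] * len(feats), 0
    for i in range(n):
        if comp[i] == -1:
            stack, comp[i] = [i], cid
            while stack:                      # flood-fill one component
                j = stack.pop()
                for k in (j - 1, j + 1):
                    if 0 <= k < n and comp[k] == -1 \
                            and abs(feats[j] - feats[k]) < tau_s:
                        comp[k] = cid
                        stack.append(k)
            cid += 1
    size = {c: comp.count(c) for c in range(cid)}
    return [0 if size[comp[i]] > max_size else 1 for i in range(n)]

def temporal_mask(frames, tau_t=0.95):
    """TTP: keep a token whose similarity to its reference token is
    <= tau_t; the reference starts at frame 0 and is updated whenever
    the token is kept. sim() stands in for cosine similarity."""
    sim = lambda a, b: 1.0 - abs(a - b)
    ref, masks = list(frames[0]), []
    for t, frame in enumerate(frames):
        row = []
        for i, v in enumerate(frame):
            keep = 1 if t == 0 or sim(ref[i], v) <= tau_t else 0
            if keep:
                ref[i] = v                    # reference update rule
            row.append(keep)
        masks.append(row)
    return masks

frames = [[0.0, 0.5, 0.5, 0.5], [0.0, 0.9, 0.5, 0.5]]
m_s = [spatial_mask(f) for f in frames]       # per-frame spatial masks
m_t = temporal_mask(frames)                   # per-token temporal masks
m = [[a & b for a, b in zip(s, t)] for s, t in zip(m_s, m_t)]
print(m)  # [[1, 0, 0, 0], [0, 1, 0, 0]]
```

Only the token that actually changed between frames, and that does not sit inside a large homogeneous region, survives the conjunction, which is exactly the "decisive UI change" behavior the method targets.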

4. Model Architecture & Training:

  • The Execution Video Reward Model (ExeVRM) is fine-tuned based on Qwen3-VL (4B and 8B versions).
  • The model takes a user instruction and an execution video sequence as input and outputs a judgment of task success.
  • Training uses a learning rate of $5 \times 10^{-6}$ with a cosine decay schedule on 8× NVIDIA A100 80GB GPUs.
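The schedule can be sketched with the standard cosine-decay formula; warmup and the final learning rate are not specified in the summary, so both are assumptions here:

```python
# Generic cosine decay from lr_max down to lr_min over total_steps.
import math

def cosine_decay(step, total_steps, lr_max=5e-6, lr_min=0.0):
    """lr(t) = lr_min + (lr_max - lr_min) * 0.5 * (1 + cos(pi * t / T))."""
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos

print(cosine_decay(0, 1000))     # 5e-06 at the start
print(cosine_decay(1000, 1000))  # 0.0 at the end
```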

Empirical Validation / Results

Evaluation Benchmark (ExeVR-Bench):

  • Built from a held-out split of ExeVR-53k.
  • Contains 789 instances with approximately balanced class ratio (49.94% positive vs. 50.06% negative).
  • Evaluates two settings: (i) binary judgment (correct vs. incorrect), and (ii) attribution judgment (requires a time range indicating where the first error occurs).
  • Metrics: Accuracy, Precision, Recall, and Temporal Intersection-over-Union (tIoU) for localization quality.
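The tIoU metric can be computed with the standard overlap-over-union definition on time ranges; the exact range convention (step indices vs. seconds, open vs. closed intervals) is an assumption here, as the summary does not spell it out:

```python
# Temporal IoU between a predicted and a ground-truth error range.

def temporal_iou(pred, gt):
    """pred and gt are (start, end) ranges; returns |intersection| / |union|."""
    inter = max(0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union else 0.0

print(temporal_iou((2, 6), (4, 8)))  # 2 overlap / 6 union = 0.333...
```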

Key Quantitative Results:

Table 2: Detailed performance on ExeVR-Bench (each cell reports Acc. / Prec. / Rec.)

| Model | Ubuntu (Agent) | Ubuntu (Human) | Mac/Win | Android | Overall |
| --- | --- | --- | --- | --- | --- |
| ExeVRM 8B (Ours) | 82.5 / 85.9 / 77.7 | 84.0 / 84.0 / 84.0 | 89.0 / 85.5 / 94.0 | 83.5 / 77.2 / 95.0 | 84.7 / 82.9 / 87.7 |
| Seed-2.0 Pro | 85.1 / 85.1 / 85.1 | 77.2 / 81.7 / 69.1 | 81.0 / 86.1 / 74.0 | 78.0 / 82.6 / 71.0 | 80.3 / 83.9 / 74.7 |
| GPT-5.2 | 82.5 / 85.1 / 78.7 | 74.0 / 84.3 / 59.0 | 74.5 / 88.9 / 68.7 | 74.5 / 75.5 / 75.8 | 75.0 / 82.7 / 66.5 |
| Gemini 3 Pro | 80.4 / 75.0 / 90.0 | 71.2 / 71.7 / 71.0 | 75.6 / 75.3 / 76.8 | 73.5 / 74.7 / 71.0 | 75.1 / 74.2 / 76.7 |

  • ExeVRM 8B achieves the best overall balance, outperforming proprietary and open-source baselines.
  • Gains are consistent across Ubuntu (Agent/Human), Mac/Win, and Android settings.

Attribution Quality (tIoU):

  • ExeVRM attains consistently higher temporal IoU than all baselines (see Figure 3), indicating more precise credit assignment over time.

Ablation Studies & Key Findings:

Table 4: Effect of input resolution

| Model | Resolution | Accuracy | Precision | Recall |
| --- | --- | --- | --- | --- |
| Qwen3-VL 4B | 360p | 79.3 | 80.6 | 77.8 |
| Qwen3-VL 4B | 720p (w/ STP & TTP) | 80.1 | 79.2 | 82.5 |
| Qwen3-VL 8B | 360p | 81.5 | 82.5 | 80.5 |
| Qwen3-VL 8B | 720p (w/ STP & TTP) | 84.7 | 82.9 | 87.7 |

  • Finding 1: Dense video context outperforms sparse snapshot methods (e.g., judging only final or initial screenshots).
  • Finding 2: Higher resolution (720p) yields larger gains for reward modeling, especially in recall, while STP/TTP keeps training tractable.
  • Finding 3: Asymmetric effects of pruning. TTP alone provides the strongest overall balance (80.3 accuracy / 79.3 recall), while STP alone yields lower performance. The combination (STP+TTP) maintains robust performance with a precision-recall trade-off.
  • Finding 4: Spatiotemporal pruning improves training efficiency. Combining STP and TTP keeps GPU memory footprint and per-step training time lower compared to using only one method, especially as trajectory length increases (see Figure 4).

Theoretical and Practical Implications

  • Scalable Evaluation: Provides a model-agnostic, scalable evaluator for CUAs based on observable execution video, moving beyond environment-specific parsers and handcrafted rules.
  • Data Curation Paradigm: Introduces a method (adversarial instruction translation) to synthetically generate high-quality negative samples, addressing a major data bottleneck in reward modeling.
  • Efficient Video Processing: Demonstrates that spatiotemporal token pruning is crucial for handling long, high-resolution GUI videos, making detailed reward modeling practical.
  • Improved Debugging: The model's strong temporal attribution (high tIoU) can highlight the exact interaction steps that cause failures, enabling faster debugging and more targeted data collection for agent development.

Conclusion

The paper presents a video-execution paradigm for reward modeling of computer-use agents. Key contributions include:

  1. The ExeVR-53k dataset.
  2. Adversarial instruction translation for synthetic negative generation.
  3. Spatiotemporal token pruning (STP & TTP) for efficient high-resolution video training.
  4. The ExeVRM model, which achieves state-of-the-art performance (84.7% accuracy, 87.7% recall) on video-execution assessment across multiple platforms, outperforming strong baselines and providing more precise temporal attribution. This approach enables scalable, model-agnostic evaluation of CUAs.