# Video-Based Reward Modeling for Computer-Use Agents

> The paper introduces a reward model that evaluates computer-use agents by analyzing execution videos, achieving 84.7% accuracy and outperforming proprietary models like GPT-5.2.

- **Source:** [arXiv](https://arxiv.org/abs/2603.10178)
- **Published:** 2026-03-14
- **Permalink:** https://picx.dev/p/XEKSPC
- **Whiteboard:** https://picx.dev/p/XEKSPC/image

## Summary

## Summary (Overview)
*   **Main Contribution:** Introduces a novel method for evaluating Computer-Use Agents (CUAs) by modeling rewards from **execution videos** (keyframe sequences of an agent's trajectory), which is independent of the agent's internal reasoning or actions.
*   **Key Dataset:** Introduces the **Execution Video Reward 53k (ExeVR-53k)** dataset, containing 53k high-quality video-task-reward triplets for training reward models.
*   **Key Technique:** Proposes **adversarial instruction translation** to synthesize negative samples with step-level annotations, addressing the scarcity of failure data.
*   **Key Innovation:** Designs **spatiotemporal token pruning (STP & TTP)** to enable efficient learning from long, high-resolution execution videos by removing redundant tokens while preserving decisive UI changes.
*   **Key Result:** Fine-tunes an **Execution Video Reward Model (ExeVRM)** that achieves **84.7% accuracy** and **87.7% recall** on video-execution assessment, outperforming strong proprietary models like GPT-5.2 and Gemini 3 Pro across multiple platforms.

## Introduction and Theoretical Foundation
**Background & Motivation:** Computer-use agents are advancing rapidly, but scalable evaluation of whether an agent's trajectory truly fulfills a user instruction remains challenging. Traditional benchmarks rely on handcrafted scripts or task-specific rules, limiting scalability. A learned reward model that judges task success based on observable execution is a more flexible alternative.

**Theoretical Basis:** The research focuses on **execution video**—the observable sequence of interface states during interaction—as a method-agnostic representation comparable across different agent designs. This approach addresses two core challenges:
1.  **High Redundancy:** CUA trajectories contain large static interface regions, while correctness depends on subtle local changes.
2.  **Limited Negative Supervision:** Public datasets are dominated by successful trajectories, making it hard to build balanced training data for reward models.

## Methodology
**1. Dataset Construction (ExeVR-53k):**
*   Unifies interaction data from multiple sources: AgentNet, ScaleCUA, and OSWorld.
*   Converts logs into a consistent step-level video representation: trajectories are segmented into atomic interaction steps, and one representative keyframe is extracted per step.
*   The sequence of keyframes forms a compact video summary rendered at 1 FPS.
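
The step-to-keyframe conversion can be sketched as follows. This is a minimal illustration, not the paper's pipeline: the `Step` record, its field names, and the rule "use the last screenshot of a step as its keyframe" are assumptions made for the example, since the summary does not say how the representative frame is chosen.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Step:
    """One atomic interaction step (hypothetical schema)."""
    action: str             # e.g. "click" or "type"
    screenshots: List[str]  # paths of frames captured during the step


def build_keyframe_video(steps: List[Step], fps: float = 1.0) -> List[Tuple[float, str]]:
    """Return (timestamp_in_seconds, frame_path) pairs, one keyframe per step."""
    video = []
    for step in steps:
        if not step.screenshots:
            continue  # a step without frames contributes nothing to the video
        # Assumption: the last frame of a step shows the outcome of its action.
        keyframe = step.screenshots[-1]
        video.append((len(video) / fps, keyframe))
    return video
```

At 1 FPS the k-th keyframe sits at second k, so a step index and a timestamp in the video are interchangeable.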

**2. Adversarial Instruction Translation (for Negative Samples):**
*   Starts from a valid trajectory segment.
*   Uses a vision-language model (GPT-5.2) to generate an **unpaired task instruction** that is plausible in the same interface context but does **not** match the demonstrated segment.
*   The model outputs a justification for the mismatch and a reference step (time index where mismatch becomes evident), providing hard negatives and attribution labels.
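
To make the shape of a synthesized negative concrete, the sketch below parses a vision-language model reply into a training record. The JSON field names (`instruction`, `justification`, `reference_step`), the `NegativeSample` class, and the sanity checks are all hypothetical; the summary only states that the model returns an unpaired instruction, a justification, and a reference step.

```python
import json
from dataclasses import dataclass


@dataclass
class NegativeSample:
    video_id: str
    instruction: str     # unpaired instruction generated by the VLM
    justification: str   # why the video does not fulfil the instruction
    reference_step: int  # step index where the mismatch becomes evident
    reward: int = 0      # synthesized negatives are labelled as failures


def parse_negative(video_id: str, num_steps: int,
                   original_instruction: str, vlm_response: str) -> NegativeSample:
    """Validate a VLM reply (assumed to be JSON) and build a negative sample."""
    data = json.loads(vlm_response)
    instruction = data["instruction"].strip()
    step = int(data["reference_step"])
    # A negative is only useful if the instruction differs from the true task.
    if instruction.lower() == original_instruction.strip().lower():
        raise ValueError("generated instruction matches the original task")
    # The attribution label must point at a step that exists in the video.
    if not 0 <= step < num_steps:
        raise ValueError("reference step lies outside the video")
    return NegativeSample(video_id, instruction, data["justification"], step)
```

Rejecting replies that repeat the original task or point outside the video is one simple way to keep the hard negatives and their attribution labels usable.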

**3. Spatiotemporal Token Pruning (for Efficient Training):**
*   **Spatial Token Pruning (STP):** Removes visually homogeneous regions within each frame (e.g., large static backgrounds). It constructs a per-frame UI-connected graph and identifies large connected components for pruning.
    *   Algorithm 2 details the process. Patches are connected if they are neighbors and their feature distance is below a threshold $\tau_s$.
    *   The spatial mask $M_s^{(t)}(i, j)$ is defined as:
        $$M_s^{(t)}(i, j) = \begin{cases} 0 & \text{if } C^{(t)}(i, j) \in R^{(t)} \\ 1 & \text{otherwise} \end{cases}$$
        where $C^{(t)}(i, j)$ is the component containing patch $(i, j)$ and $R^{(t)}$ is the set of large components selected for pruning.
*   **Temporal Token Pruning (TTP):** Removes tokens that remain nearly unchanged across consecutive frames, focusing the model on meaningful state transitions.
    *   Algorithm 3 details the process. For each spatial location $i$, a reference token $v_i^{(\text{ref})}$ is maintained from the first frame.
    *   The temporal mask $M_t(t, i)$ is set to 1 (kept) if the cosine similarity between the reference and current token is $\leq \tau_t$, otherwise 0 (pruned).
    *   The reference is updated: $v_i^{(\text{ref})} \gets v_i^{(t)}$ if $M_t(t, i) = 1$, otherwise it remains unchanged.
*   **Combined Training:** Algorithm 1 outlines the overall training process with spatiotemporal token pruning. The final token mask $M$ is the conjunction ($\land$) of the spatial and temporal masks. Pruned tokens are dropped, and the remaining sequence is packed for input to the LLM.
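
The three pieces above (spatial mask, temporal mask, and their conjunction) can be sketched in plain Python. This is an illustration under stated assumptions rather than the paper's Algorithms 1 to 3: Euclidean distance stands in for the unspecified "feature distance", 4-neighbour connectivity is assumed, `min_size` is a hypothetical cut-off for what counts as a "large" component, and the first frame is assumed to be kept in full by TTP.

```python
import math
from collections import deque


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 1.0


def spatial_mask(frame, tau_s, min_size):
    """STP: 0 for patches inside large homogeneous components, 1 otherwise."""
    h, w = len(frame), len(frame[0])
    mask = [[1] * w for _ in range(h)]
    seen = [[False] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            if seen[i][j]:
                continue
            seen[i][j] = True
            component, queue = [], deque([(i, j)])
            while queue:  # breadth-first search over similar neighbours
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w and not seen[ny][nx]
                            and l2(frame[y][x], frame[ny][nx]) < tau_s):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) >= min_size:  # "large" component: prune it
                for y, x in component:
                    mask[y][x] = 0
    return mask


def temporal_masks(frames, tau_t):
    """TTP: 1 where a token differs enough from its reference token, else 0."""
    h, w = len(frames[0]), len(frames[0][0])
    ref = [[frames[0][i][j] for j in range(w)] for i in range(h)]
    masks = [[[1] * w for _ in range(h)]]  # assumption: first frame kept in full
    for frame in frames[1:]:
        mask = [[0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                if cosine(ref[i][j], frame[i][j]) <= tau_t:
                    mask[i][j] = 1
                    ref[i][j] = frame[i][j]  # update reference only when kept
        masks.append(mask)
    return masks


def prune(frames, tau_s, tau_t, min_size):
    """Return the (t, i, j) indices of tokens kept by the conjunction of both masks."""
    t_masks = temporal_masks(frames, tau_t)
    kept = []
    for t, frame in enumerate(frames):
        s_mask = spatial_mask(frame, tau_s, min_size)
        for i, row in enumerate(frame):
            for j, _ in enumerate(row):
                if s_mask[i][j] and t_masks[t][i][j]:
                    kept.append((t, i, j))
    return kept
```

Here `frames` is a list of frames, each an H×W grid of feature vectors. The returned index list plays the role of the packed sequence: only those tokens would be passed on to the LLM.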

**4. Model Architecture & Training:**
*   The **Execution Video Reward Model (ExeVRM)** is fine-tuned based on **Qwen3-VL** (4B and 8B versions).
*   The model takes a user instruction and an execution video sequence as input and outputs a judgment of task success.
*   Training uses a learning rate of $5 \times 10^{-6}$ with a cosine decay schedule on 8 × NVIDIA A100 80GB GPUs.
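
For reference, a cosine decay schedule with the stated peak learning rate can be written as below. The summary gives neither the number of training steps nor any warmup or floor value, so `total_steps` and `min_lr` are assumptions, and the schedule is the textbook form rather than the authors' exact configuration.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 5e-6, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `peak_lr`, reaches the midpoint between `peak_lr` and `min_lr` halfway through training, and ends at `min_lr`.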

## Empirical Validation / Results
**Evaluation Benchmark (ExeVR-Bench):**
*   Built from a held-out split of ExeVR-53k.
*   Contains 789 instances with approximately balanced class ratio (49.94% positive vs. 50.06% negative).
*   Evaluates two settings: (i) binary judgment (correct vs. incorrect), and (ii) attribution judgment (requires a time range indicating where the first error occurs).
*   Metrics: Accuracy, Precision, Recall, and Temporal Intersection-over-Union (tIoU) for localization quality.
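
The four metrics can be computed as in the sketch below. It uses the standard definitions and assumes that label 1 means "task succeeded" and that an attribution is a `(start, end)` time range in seconds; the benchmark's exact conventions may differ.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) time ranges."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0
```

For example, a predicted error range of seconds 2 to 5 against a ground-truth range of 3 to 7 overlaps for 2 seconds out of a 5-second union, giving a tIoU of 0.4.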

**Key Quantitative Results:**

**Table 2: Detailed performance on ExeVR-Bench (all values in %)**
| Model | Ubuntu (Agent) Acc. | Ubuntu (Agent) Prec. | Ubuntu (Agent) Rec. | Ubuntu (Human) Acc. | Ubuntu (Human) Prec. | Ubuntu (Human) Rec. | Mac/Win Acc. | Mac/Win Prec. | Mac/Win Rec. | Android Acc. | Android Prec. | Android Rec. | **Overall Acc.** | **Overall Prec.** | **Overall Rec.** |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **ExeVRM 8B (Ours)** | **82.5** | **85.9** | **77.7** | **84.0** | **84.0** | **84.0** | **89.0** | **85.5** | **94.0** | **83.5** | **77.2** | **95.0** | **84.7** | **82.9** | **87.7** |
| Seed-2.0 Pro | 85.1 | 85.1 | 85.1 | 77.2 | 81.7 | 69.1 | 81.0 | 86.1 | 74.0 | 78.0 | 82.6 | 71.0 | 80.3 | 83.9 | 74.7 |
| GPT-5.2 | 82.5 | 85.1 | 78.7 | 74.0 | 84.3 | 59.0 | 74.5 | 88.9 | 68.7 | 74.5 | 75.5 | 75.8 | 75.0 | 82.7 | 66.5 |
| Gemini 3 Pro | 80.4 | 75.0 | 90.0 | 71.2 | 71.7 | 71.0 | 75.6 | 75.3 | 76.8 | 73.5 | 74.7 | 71.0 | 75.1 | 74.2 | 76.7 |

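*   **ExeVRM 8B achieves the best overall balance**, with the highest overall accuracy (84.7) and recall (87.7) among the listed models; the paper reports the same advantage over open-source baselines. Seed-2.0 Pro has slightly higher overall precision (83.9 vs. 82.9).
*   ExeVRM 8B has the highest accuracy on Ubuntu (Human), Mac/Win, and Android. On Ubuntu (Agent), Seed-2.0 Pro is higher (85.1 vs. 82.5) and GPT-5.2 ties at 82.5.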

**Attribution Quality (tIoU):**
*   ExeVRM attains consistently higher temporal IoU than all baselines (see Figure 3), indicating more precise credit assignment over time.

**Ablation Studies & Key Findings:**

**Table 4: Effect of input resolution**
| Model | Resolution | Accuracy | Precision | Recall |
| :--- | :--- | :--- | :--- | :--- |
| Qwen3-VL 4B | 360p | 79.3 | 80.6 | 77.8 |
| Qwen3-VL 4B | 720p (w/STP & TTP) | **80.1** | 79.2 | **82.5** |
| Qwen3-VL 8B | 360p | 81.5 | 82.5 | 80.5 |
| Qwen3-VL 8B | 720p (w/STP & TTP) | **84.7** | **82.9** | **87.7** |

*   **Finding 1:** Dense video context outperforms sparse snapshot methods (e.g., judging only final or initial screenshots).
*   **Finding 2:** Moving from 360p to 720p (with STP & TTP) improves accuracy and especially recall (77.8 → 82.5 for the 4B model, 80.5 → 87.7 for the 8B model), while STP/TTP keeps training tractable.
*   **Finding 3:** Asymmetric effects of pruning. TTP alone provides the strongest overall balance (80.3 accuracy / 79.3 recall), while STP alone yields lower performance. The combination (STP+TTP) maintains robust performance with a precision-recall trade-off.
*   **Finding 4:** Spatiotemporal pruning improves training efficiency. Combining STP and TTP keeps GPU memory footprint and per-step training time lower compared to using only one method, especially as trajectory length increases (see Figure 4).

## Theoretical and Practical Implications
*   **Scalable Evaluation:** Provides a model-agnostic, scalable evaluator for CUAs based on observable execution video, moving beyond environment-specific parsers and handcrafted rules.
*   **Data Curation Paradigm:** Introduces a method (adversarial instruction translation) to synthetically generate high-quality negative samples, addressing a major data bottleneck in reward modeling.
*   **Efficient Video Processing:** Demonstrates that spatiotemporal token pruning is crucial for handling long, high-resolution GUI videos, making detailed reward modeling practical.
*   **Improved Debugging:** The model's strong temporal attribution (high tIoU) can highlight the exact interaction steps that cause failures, enabling faster debugging and more targeted data collection for agent development.

## Conclusion
The paper presents a **video-execution paradigm** for reward modeling of computer-use agents. Key contributions include:
1.  The **ExeVR-53k** dataset.
2.  **Adversarial instruction translation** for synthetic negative generation.
3.  **Spatiotemporal token pruning (STP & TTP)** for efficient high-resolution video training.
4.  The **ExeVRM** model, which achieves state-of-the-art performance (84.7% accuracy, 87.7% recall) on video-execution assessment across multiple platforms, outperforming strong baselines and providing more precise temporal attribution. This approach enables scalable, model-agnostic evaluation of CUAs.

---

_Markdown view of https://picx.dev/p/XEKSPC, served by PicX — AI-generated visual whiteboard summaries of research papers._
