OpenComputer: Verifiable Software Worlds for Computer-Use Agents

Summary (Overview)

Verifier-Grounded Framework: OpenComputer introduces a framework for synthesizing realistic, executable desktop software tasks where verification of application state is the central design principle, enabling automatic generation of environments with machine-checkable success criteria.
Four-Component Pipeline: The system integrates: 1) App-specific state verifiers for structured inspection, 2) a self-evolving verification layer that improves verifiers using execution feedback, 3) a verifier-aware task generation pipeline, and 4) an evaluation harness for trajectory recording and auditable partial-credit rewards.
Large-Scale Benchmark: The release covers 33 desktop applications and 1,000 finalized tasks across browsers, office tools, creative software, development environments, file managers, and communication apps.
Challenging for Current Agents: Experiments show frontier agents (GPT-5.4, Claude-Sonnet-4.6, Kimi-K2.6) achieve success rates between 58.8% and 68.3%, while open-source models exhibit sharp performance drops compared to scores on existing benchmarks like OSWorld, indicating a persistent gap in robust computer automation.
Superior Verification: Hard-coded programmatic verifiers demonstrate significantly higher alignment with human adjudication (94.1% task-level agreement) compared to LLM-as-judge evaluation (79.2%), especially for tasks requiring fine-grained application state inspection beyond screenshots.

Introduction and Theoretical Foundation

Computer-use agents that operate native desktop software interfaces represent a promising path toward general-purpose AI. However, scaling their training and evaluation is bottlenecked by the high cost of constructing realistic, reproducible desktop environments and tasks. Manually creating a task involves designing a plausible user goal, preparing the underlying environment state (files, configurations, data), and ensuring coherence and reproducibility—a tedious, application-specific process.

Beyond construction, trustworthy verification is equally challenging. Success in desktop settings is often reflected in application state, file contents, metadata, or persistent side effects, not just visible screenshots. While LLM-as-a-judge is a natural fallback, it introduces limitations: sensitivity to prompts, incomplete observations, model biases, and difficulty in auditing. Crucially, LLM judges may reward outcomes that appear plausible visually while missing errors in the underlying software state.

OpenComputer addresses these dual bottlenecks by making verification the organizing principle of environment and task construction. It formalizes the problem as synthesizing a verifiable computer-use task instance:

$$\tau = (x, e, c)$$

where $x$ is the task description, $e$ is an executable environment initialization procedure, and $c$ is a set of machine-checkable success criteria. The agent interacts from an initial state $s_0 \sim e$ to a final state $s_T$.

The core challenge is cast as a constrained synthesis problem: given an application $a$ and a goal $g$, generate $\tau$ such that the environment is realistic, the target state is reachable, and success can be checked programmatically. OpenComputer solves this via three coupled components:

A verifier generator $V(a) \rightarrow V_a$
A verifier-evolution procedure $U(V_a, D_a) \rightarrow V_a^+$ using calibration executions $D_a$
A verifier-aware task and environment synthesis pipeline $E(a, g, V_a^+) \rightarrow e$
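
To make the triple $\tau = (x, e, c)$ concrete, the following is a minimal sketch in Python, not the paper's implementation. The names `TaskInstance`, `init_env`, and `criteria` are hypothetical, and application state is modelled as a plain dictionary purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical modelling choice: a "state" is a plain dict.
State = Dict[str, object]
Check = Callable[[State], bool]


@dataclass
class TaskInstance:
    """tau = (x, e, c): description, environment initializer, success criteria."""
    description: str                                      # x
    init_env: Callable[[], State]                         # e, produces s_0
    criteria: List[Check] = field(default_factory=list)   # c


def make_example_task() -> TaskInstance:
    # Toy environment: a single document with a title and a body.
    def init_env() -> State:
        return {"title": "Untitled", "body": ""}

    return TaskInstance(
        description="Rename the document to 'Report' and add some body text.",
        init_env=init_env,
        criteria=[
            lambda s: s.get("title") == "Report",
            lambda s: bool(s.get("body")),
        ],
    )


task = make_example_task()
s0 = task.init_env()
# Simulate an agent that reaches a final state s_T.
sT = dict(s0, title="Report", body="Quarterly numbers")
results = [check(sT) for check in task.criteria]
```

Each criterion is an ordinary predicate over the final state, which is what makes the task machine-checkable without consulting screenshots.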

Methodology

OpenComputer consists of four tightly coupled layers, as illustrated in Figure 1 of the paper.

1. Verification Stack

Verifier Generation: For each supported application $a \in A$, a synthetic Python verifier module $V_a$ is built. It exposes CLI subcommands with JSON outputs, covering all reliably inspectable state surfaces: content, preferences, plugins, history, bookmarks, file I/O, project structure, media state, graphical attributes, and metadata.

Inspection Channels: Verifiers query the most reliable application-specific channels available in the sandbox (see Figure 2), such as:
- File Parsing: for Inkscape, Krita, Draw.io, FreeCAD, Blender, Godot4.
- SQLite: for Chrome, Firefox, VSCode, Darktable, Obsidian, Zotero.
- CDP (Chrome DevTools Protocol): for Chrome, Brave, VSCode (Electron), Slack.
- D-Bus / AT-SPI: for VLC, Gedit, Galculator.
- Headless Python APIs: for Blender.
Verifier Testing: Each verifier is treated as a software artifact with a written test plan, live integration tests, and a debug-fix-retry loop to ensure reliability.
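
As an illustration of a verifier that exposes CLI subcommands with JSON outputs over a SQLite inspection channel, here is a minimal sketch. The `bookmarks(title, url)` table, the subcommand names, and the `toy_verifier` program name are assumptions made for this example; they do not reflect the real schema of any listed application or the released verifier modules.

```python
import argparse
import json
import sqlite3
import tempfile
from pathlib import Path


def list_bookmarks(db_path: str) -> dict:
    """Read rows from the toy bookmarks table as a JSON-serializable dict."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT title, url FROM bookmarks ORDER BY title"
        ).fetchall()
    finally:
        conn.close()
    return {"bookmarks": [{"title": t, "url": u} for t, u in rows]}


def has_bookmark(db_path: str, url: str) -> dict:
    """Check whether a bookmark with the given URL exists."""
    urls = {b["url"] for b in list_bookmarks(db_path)["bookmarks"]}
    return {"check": "has_bookmark", "url": url, "passed": url in urls}


def main(argv=None) -> str:
    parser = argparse.ArgumentParser(prog="toy_verifier")
    sub = parser.add_subparsers(dest="command", required=True)
    p_list = sub.add_parser("list-bookmarks")
    p_list.add_argument("--db", required=True)
    p_has = sub.add_parser("has-bookmark")
    p_has.add_argument("--db", required=True)
    p_has.add_argument("--url", required=True)
    args = parser.parse_args(argv)
    if args.command == "list-bookmarks":
        out = list_bookmarks(args.db)
    else:
        out = has_bookmark(args.db, args.url)
    text = json.dumps(out, sort_keys=True)
    print(text)  # JSON on stdout so a harness can parse the verdict
    return text


def make_demo_db() -> str:
    """Create a throwaway database with the toy schema (illustration only)."""
    path = Path(tempfile.mkdtemp()) / "profile.sqlite"
    conn = sqlite3.connect(str(path))
    conn.execute("CREATE TABLE bookmarks (title TEXT, url TEXT)")
    conn.execute(
        "INSERT INTO bookmarks VALUES ('Docs', 'https://example.org/docs')"
    )
    conn.commit()
    conn.close()
    return str(path)
```

Calling `main(["has-bookmark", "--db", make_demo_db(), "--url", "https://example.org/docs"])` prints a JSON object whose `passed` field is `true`. Returning structured JSON rather than free text keeps each check auditable.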

Self-Evolving Verification Layer: This layer refines verifiers using execution-grounded feedback to expose residual issues.

Calibration Executions: ~15 easy-to-medium tasks per application are run by a strong agent, caching the final state $s_T$.
Disagreement Diagnosis: An LLM evaluator and the programmatic verifier produce independent verdicts. Disagreements are analyzed; those attributed to verifier-side errors (e.g., brittle assumptions, incomplete coverage) are used as feedback.
Bounded Verifier Refinement: The verifier $V_a$ is iteratively updated to $V_a^+$ by modifying checker code, endpoints, or documentation (never the task or trajectory) until agreement is reached or a budget is exhausted.
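
The bounded refinement loop can be sketched as follows. This is a simplified illustration under stated assumptions: `evolve_verifier`, `toy_repair`, and the three-round default budget are hypothetical, the reference verdicts stand in for disagreements already attributed to the verifier side, and the repair step (an LLM-driven code edit in the described system) is replaced by a hand-written function.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Verifier = Callable[[State], bool]


def evolve_verifier(
    verifier: Verifier,
    calibration: List[Tuple[State, bool]],
    repair: Callable[[Verifier, List[State]], Verifier],
    budget: int = 3,
) -> Tuple[Verifier, int]:
    """Refine `verifier` until it matches the reference verdicts or the
    budget is exhausted. Only the verifier changes; the cached final
    states in `calibration` are never modified.

    Returns the refined verifier and the number of repair rounds used.
    """
    rounds = 0
    while rounds < budget:
        disagreements = [s for s, ref in calibration if verifier(s) != ref]
        if not disagreements:
            break
        verifier = repair(verifier, disagreements)
        rounds += 1
    return verifier, rounds


def brittle(state: State) -> bool:
    # Brittle assumption: exact, case-sensitive title match.
    return state.get("title") == "Report"


def toy_repair(verifier: Verifier, disagreements: List[State]) -> Verifier:
    # Stand-in for a code edit: tolerate case and surrounding whitespace.
    return lambda s: s.get("title", "").strip().lower() == "report"


calibration = [
    ({"title": "Report"}, True),
    ({"title": "report "}, True),   # the brittle verifier wrongly rejects this
    ({"title": "Draft"}, False),
]
evolved, rounds_used = evolve_verifier(brittle, calibration, toy_repair)
```

In this toy run a single repair round removes the disagreement, so the loop stops before the budget is reached.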

2. Task Generation Pipeline

Tasks are synthesized through a verifier-aware process:

Proposal: Candidate tasks are proposed from realistic user goals.
Filtering: Tasks are filtered for complexity (prioritizing multi-step workflows) and data generatability.
Verification Grounding: Accepted proposals are matched against the verification stack. If an outcome is inspectable but not covered, the verifier is extended with a new endpoint.
Environment Synthesis: Required files, folders, profiles, and configurations are generated and packaged into the final task instance $\tau = (x, e, c)$.
Task-Extension Workflow: Periodic reviews identify coverage gaps, and new tasks are generated for missing workflows.
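
The verification-grounding step can be pictured as a set-membership decision. The sketch below is an assumption-laden illustration: `Proposal`, `ground_proposal`, and the string labels for state surfaces are hypothetical and only mirror the rule stated above (extend the verifier when an outcome is inspectable but not yet covered; reject otherwise).

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Proposal:
    goal: str
    required_outcomes: List[str]  # state surfaces that must be checked


def ground_proposal(
    proposal: Proposal,
    endpoints: Set[str],
    inspectable: Set[str],
) -> Tuple[bool, List[str]]:
    """Return (accepted, endpoints_to_add).

    A proposal is accepted when every required outcome is either already
    covered by a verifier endpoint or is inspectable, in which case a new
    endpoint is requested for it.
    """
    missing = [o for o in proposal.required_outcomes if o not in endpoints]
    if any(o not in inspectable for o in missing):
        return False, []
    return True, missing


endpoints = {"bookmarks", "history"}
inspectable = {"bookmarks", "history", "preferences"}

ok = ground_proposal(
    Proposal("Set the homepage and bookmark it", ["bookmarks", "preferences"]),
    endpoints,
    inspectable,
)
rejected = ground_proposal(
    Proposal("Make the layout look balanced", ["visual_balance"]),
    endpoints,
    inspectable,
)
```

Here `ok` is `(True, ["preferences"])`, signalling that one new endpoint is needed, while `rejected` is `(False, [])` because the outcome cannot be inspected programmatically.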

3. Evaluation Harness and Reward Computation

At evaluation time:

A fresh sandbox is initialized with $s_0 \sim e$.
The agent interacts via a screenshot-action loop.
After the agent stops, a final save action is attempted for persistence-critical apps.
Reward Computation: The verifier's checker commands are executed. The reward is the fraction of passed checks: $R = N_{\text{pass}} / N_{\text{total}}$. This provides partial credit while preserving exact, machine-checkable conditions.
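
The four evaluation steps above can be sketched as a single function. This is a toy harness under stated assumptions: `run_episode`, `agent_step`, and `final_save` are hypothetical names, state is a dictionary, and the screenshot-action loop is reduced to a callback that mutates that state.

```python
from typing import Callable, Dict, List

State = Dict[str, object]


def run_episode(
    init_env: Callable[[], State],
    agent_step: Callable[[State], bool],
    checks: List[Callable[[State], bool]],
    max_steps: int = 50,
    final_save: Callable[[State], None] = lambda s: None,
) -> float:
    """Run one task and return R = N_pass / N_total."""
    if not checks:
        raise ValueError("a task needs at least one check")
    state = init_env()              # fresh sandbox, s_0 ~ e
    for _ in range(max_steps):      # stand-in for the screenshot-action loop
        if agent_step(state):       # mutates state; returns True when done
            break
    final_save(state)               # attempted save for persistence-critical apps
    passed = sum(1 for check in checks if check(state))
    return passed / len(checks)


def toy_agent(state: State) -> bool:
    state["title"] = "Report"       # completes one of the two goals
    return True


def new_env() -> State:
    return {"title": "Untitled", "saved": False}


checks = [
    lambda s: s["title"] == "Report",
    lambda s: s["saved"] is True,
]

reward_without_save = run_episode(new_env, toy_agent, checks)
reward_with_save = run_episode(
    new_env, toy_agent, checks,
    final_save=lambda s: s.update(saved=True),
)
```

The two runs differ only in the final save step, which illustrates why persistence matters: the same trajectory earns partial credit in one case and full credit in the other.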

4. OpenComputer Release

The released infrastructure includes:

33 desktop applications and 1,000 finalized tasks.
App-specific verifier modules, task specs, and initialization scripts.
An execution harness supporting local (Docker) and cloud-scale deployment (AWS, Tencent Cloud, E2B).

Table 1: Summary statistics of the OpenComputer benchmark.

Applications	Tasks	Avg. Verifier Endpoints / App	Avg. Checks / Task	Avg. Seed Files / Task
33	1000	17.7	6.9	1.3

Empirical Validation / Results

Experimental Setup

Benchmark: The finalized OpenComputer suite (1000 tasks, 33 apps).
Models: Mix of frontier and open-source computer-use agents: GPT-5.4, Claude-Sonnet-4.6, Kimi-K2.6, Gemini-3-Flash, Qwen-3.5-27B, Qwen-3.5-9B, EvoCUA-8B, GUI-OWL-1.5-8B.
Metrics:
- Success Rate: Fraction of tasks where all criteria are satisfied.
- Average Reward: Mean fraction of passed verifier checks (partial credit).
- Avg. Steps & Time/Step: Interaction efficiency.
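
The difference between the two headline metrics is easiest to see on a small example. The sketch below assumes each task's outcome is a list of boolean check results; the function name `summarize` and the sample data are illustrative only.

```python
from typing import List, Tuple


def summarize(per_task_checks: List[List[bool]]) -> Tuple[float, float]:
    """Return (success_rate, average_reward) over a set of tasks.

    Success rate counts a task only if every check passes; average reward
    is the mean fraction of passed checks, so it grants partial credit.
    """
    if not per_task_checks or any(not c for c in per_task_checks):
        raise ValueError("every task needs at least one check result")
    n = len(per_task_checks)
    success_rate = sum(1 for c in per_task_checks if all(c)) / n
    average_reward = sum(sum(c) / len(c) for c in per_task_checks) / n
    return success_rate, average_reward


sample = [
    [True, True],                  # fully solved, reward 1.0
    [True, False],                 # half solved, reward 0.5
    [False, False, False, True],   # mostly failed, reward 0.25
]
success_rate, average_reward = summarize(sample)
```

On this sample one of three tasks is fully solved, while the mean per-task reward is (1.0 + 0.5 + 0.25) / 3, which shows how average reward can sit well above success rate.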

Main Results Analysis

Table 2: Performance and efficiency comparison across computer-use agents.

Model	OSWorld	Success Rate	Avg. Steps	Time/Step	Avg. Reward
GPT-5.4	75.0%	68.3%	19.0	16.5 s	88.4%
Claude-Sonnet-4.6	72.5%	64.4%	31.5	20.8 s	76.6%
Kimi-K2.6	73.1%	58.8%	35.7	33.0 s	70.7%
Qwen-3.5-27B	56.2%	32.3%	33.1	57.3 s	59.4%
Gemini-3-Flash	–	16.4%	25.4	9.0 s	37.0%
EvoCUA-8B	46.1%	10.9%	67.0	9.7 s	38.1%
Qwen-3.5-9B	41.8%	7.8%	39.3	17.8 s	31.7%
GUI-OWL-1.5-8B	52.3%	5.7%	73.6	9.43 s	27.8%

Key Findings:

Frontier agents struggle with end-to-end completion. GPT-5.4 leads with a 68.3% success rate, failing on nearly one-third of tasks.
GPT-5.4 is most efficient. Its lower average steps (19.0) are attributed to combining multiple operations per step and emitting only actions, not long reasoning traces.
Sharp drop for open-source models. Models such as GUI-OWL-1.5-8B decline dramatically from their OSWorld scores (52.3% → 5.7%), which suggests limited cross-benchmark generalization and indicates that OpenComputer poses a broader, more heterogeneous challenge.
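
The size of the OSWorld-to-OpenComputer gap can be tabulated directly from the results table. The short script below copies the reported scores for the seven models that have an OSWorld number (Gemini-3-Flash has none) and computes the drop in percentage points; it adds no new data.

```python
# (OSWorld %, OpenComputer success rate %) copied from the results table.
scores = {
    "GPT-5.4": (75.0, 68.3),
    "Claude-Sonnet-4.6": (72.5, 64.4),
    "Kimi-K2.6": (73.1, 58.8),
    "Qwen-3.5-27B": (56.2, 32.3),
    "EvoCUA-8B": (46.1, 10.9),
    "Qwen-3.5-9B": (41.8, 7.8),
    "GUI-OWL-1.5-8B": (52.3, 5.7),
}

# Drop in percentage points, rounded to one decimal to match the table.
drops = {
    model: round(osworld - opencomputer, 1)
    for model, (osworld, opencomputer) in scores.items()
}
largest = max(drops, key=drops.get)
smallest = min(drops, key=drops.get)

for model, drop in sorted(drops.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<20} {drop:5.1f} pp")
```

The three frontier models lose between roughly 7 and 14 points, whereas the open-source models lose between roughly 24 and 47 points, with GUI-OWL-1.5-8B showing the largest drop.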

Analysis

1. Agentic LLM-as-Judge vs. Hard-Coded Verification

A comparison on 120 human-annotated tasks shows that hard-coded verifiers align significantly better with human judgment.

Figure 3: Alignment with human adjudication on a 120-task comparison set.

Task-Level Agreement: Hard-coded Verifier: 94.1% (113/120) vs. LLM Judge: 79.2% (95/120).
Checklist Agreement: Hard-coded Verifier: 97.3% vs. LLM Judge: 92.2%.

"In dense desktop interfaces, semantically important mistakes are often visually tiny... These runs can look approximately correct from pixels alone. A hard-coded verifier instead reads the exact application state and can thus distinguish near-miss visual outputs from true task completion."

2. Comparing GUI Agents with CLI Agents

On a CLI-compatible subset (14 apps, 343 tasks), GUI agents achieved higher success rates, but CLI agents (Claude Code) were substantially faster.

Table 3: Overall GUI–CLI pass-rate and execution-time comparison.

Setting	Model	Success Rate (%)	Time (s)
GUI	GPT-5.4	75.2	288
GUI	Claude Sonnet 4.6	73.0	622
CLI	Claude Sonnet 4.6	67.2	141

3. Ablation: Self-Evolving Verification

The self-evolution layer effectively identifies and repairs checker-side errors.

Table 4: Repair efficiency and human-checker agreement improvement.

Metric	Value
Fixed in 1 round	47
Fixed in 2 rounds	15
Fixed in 3 rounds	6
Not fixed within budget	8
Agreement before evolution	85.2%
Agreement after evolution	94.1% (+8.9 percentage points)

Out of 76 identified checker-side errors, 68 were repaired (89.4% repair rate), improving human-checker agreement by 8.9 percentage points.

Theoretical and Practical Implications

Infrastructure for Scaling Research: OpenComputer provides a foundation for scaling computer-use research by coupling realistic software worlds with machine-checkable feedback. This supports not only evaluation but also grounded trajectory collection for training via SFT, rejection sampling, or RL.
Verification as a Core Design Principle: The work demonstrates that making inspectable application state a central constraint enables automatic, scalable generation of auditable and reproducible benchmarks, moving beyond manual curation or unreliable proxy evaluation.
Exposing Agent Limitations: The benchmark reveals that current agents, especially open-source ones, lack robustness in handling the diversity and fine-grained state dependencies of real desktop workflows. The gap between OSWorld and OpenComputer scores highlights the need for benchmarks that test generalization across broader software settings.
Evaluation Methodology: The superior performance of hard-coded verifiers over LLM judges underscores the importance of grounding evaluation in exact application state for tasks where success is not fully visually inferable.

Conclusion

OpenComputer presents a verifier-grounded framework for constructing verifiable software worlds, addressing key bottlenecks in environment construction and trustworthy verification for computer-use agents. The released benchmark, spanning 33 applications and 1000 tasks, proves challenging for current frontier and open-source models, exposing a persistent gap in robust desktop automation.

The framework highlights that progress requires trustworthy environments, grounded rewards, and reproducible construction pipelines alongside stronger models. By providing infrastructure for scalable, auditable task synthesis and evaluation, OpenComputer aims to make future computer-use systems more reliable, measurable, and aligned with real software outcomes.

Limitations and Future Work: Not all realistic desktop tasks can be fully reduced to programmatic checks (e.g., those requiring visual/spatial judgments like arrow connectivity in diagrams). The current benchmark excludes such tasks (17 identified cases) to maintain auditability. Future work could explore hybrid verification combining executable state checks with visual judgments for these scenarios. The excluded tasks will be released for diagnostic analysis.