GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
Summary (Overview)
- Introduces GameWorld, a comprehensive benchmark with 34 diverse browser games and 170 tasks for standardized evaluation of Multimodal Large Language Model (MLLM) agents, spanning five genres (Runner, Arcade, Platformer, Puzzle, Simulation).
- Proposes a unified evaluation framework featuring a browser-based sandbox that decouples inference latency from gameplay and an outcome-based state-verifiable evaluator that uses serialized game state for deterministic, noise-free assessment.
- Studies two agent interfaces: Computer-Use Agents (CUAs) emitting low-level keyboard/mouse controls and Generalist Multimodal Agents acting via deterministic Semantic Action Parsing into a shared executable action space.
- Key Findings: The best-performing agents (e.g., Gemini-3-Flash-Preview, Seed-1.8) achieve overall progress (PG) of ~40% but low success rates (SR ~20%), remaining far from novice human performance (SR 55.3%, PG 64.1%). Agents perform relatively well on strategic reasoning and reactive control but struggle with basic timing grounding, spatial navigation, and long-horizon coordination.
- Provides extensive analyses on benchmark robustness, real-time interaction (GameWorld-RT), context-memory sensitivity, and action validity, revealing distinct challenges and trade-offs between the two agent interfaces.
Introduction and Theoretical Foundation
Developing embodied generalist agents for real-world interaction faces challenges like latency, sparse feedback, and irreversible mistakes. Video games, particularly browser games, offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematic evaluation of MLLMs as game agents is hindered by heterogeneous action interfaces and reliance on heuristic verification methods (e.g., OCR, VLM-as-judge), which introduce noise and reduce reproducibility.
Prior benchmarks (e.g., LMGame-Bench, BALROG, VideoGame-Bench) have improved scale and realism but lack standardized, verifiable evaluation. GameWorld aims to bridge this gap by providing a standardized, comprehensive, and verifiable benchmark for multimodal game agents in browser environments. The core motivation is to create a reproducible measurement platform that isolates decision quality from inference speed and provides deterministic evaluation based on game state outcomes.
Methodology
1. Benchmark Design & Components
GameWorld consists of four integrated modules (see Fig. 2):
- MLLMs as Game Agents: Implements two agent interfaces.
- Browser-based Sandbox Environment: Manages game execution, can pause during inference, and ensures an isolated observation-action loop.
- Games & Tasks Library: 34 games across 5 genres with 170 natural-language task instructions.
- Outcome-based State-Verifiable Evaluation: Uses a JavaScript bridge to access serialized gameAPI state for deterministic metric computation.
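The state-verification idea can be sketched as follows. This assumes a Playwright-style `page` object (anything exposing an `evaluate(js)` method); the JavaScript expression, the `window.game` global, and the field names are illustrative assumptions, not the benchmark's actual bridge.

```python
# Hypothetical JS expression reading serialized game state from the page
# context. Real GameWorld games expose different, game-specific globals.
GAME_STATE_JS = """() => ({
    score:  window.game?.score ?? 0,
    lives:  window.game?.lives ?? 0,
    player: window.game?.player
        ? { x: window.game.player.x, y: window.game.player.y }
        : null,
})"""

def read_game_state(page):
    """Read serialized game state directly from the page's JS context,
    bypassing OCR or VLM-as-judge heuristics entirely."""
    return page.evaluate(GAME_STATE_JS)
```

With Playwright, calling `read_game_state(page)` after each agent step yields a plain dict of ground-truth values (score, lives, coordinates) from which metrics can be computed deterministically.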
2. Agent Interfaces and Action Space
Two agent interfaces are defined and normalized to a unified control space of atomic human-computer interaction events (mouse_move, key_down, wait, etc.).
Table 1: Two game agent interfaces and action-space taxonomy.
| Game Agent Interfaces | Action Space Description |
|---|---|
| Computer-Use Agent (CUA) | Action Space: Computer-use function calls. Native tools of mouse and keyboard events (e.g., left_click(x,y), press_key(key)). |
| Generalist Multimodal Agent | Action Space: Game-specific function calls. Semantic functions parsed into low-level controls (e.g., move_forward(), action_jump()). |
| Unified Control Space (Atomic Events) | Mouse: mouse_move(x,y) mouse_down(button) mouse_up(button) scroll(amount) <br> Keyboard: key_down(key) key_up(key) <br> Others: wait(duration) idle() |
- Computer-Use Agents (CUAs): Emit low-level controls directly. Must output exactly one executable action per step.
- Generalist Agents: Emit high-level semantic actions. A deterministic Semantic Action Parser maps each semantic action to a fixed low-level command. Also enforces one action per step.
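The deterministic mapping from semantic actions to the unified control space can be sketched like this. The atomic event names come from Table 1; the specific semantic-to-atomic bindings and timing values are illustrative assumptions.

```python
# Hypothetical bindings: each semantic action maps to one fixed sequence
# of atomic human-computer interaction events from the unified control space.
SEMANTIC_ACTIONS = {
    "action_jump":  [("key_down", "Space"), ("wait", 0.05), ("key_up", "Space")],
    "move_forward": [("key_down", "ArrowRight"), ("wait", 0.2), ("key_up", "ArrowRight")],
    "idle":         [("idle", None)],
}

def parse_semantic_action(name: str) -> list[tuple[str, object]]:
    """Deterministically map one high-level semantic action to its fixed
    low-level command; unknown names are Out-of-Space (OOS) invalid actions."""
    if name not in SEMANTIC_ACTIONS:
        raise ValueError(f"Out-of-space action: {name}")
    return SEMANTIC_ACTIONS[name]
```

Because the mapping is a fixed lookup, two runs emitting the same semantic action always produce identical low-level events, which is what lets both interfaces share one executable action space.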
3. Agent Harnesses
A shared agent harness standardizes components across all models:
- Structured Prompt: Fixed template with #Game Rules, #Role and Controls, #Task Instruction, and #Output Format.
- Context Memory: Rolling memory module storing recent interaction rounds (user_prompt → screenshot → reasoning → action).
- Reasoning: Supports deliberate thinking, which can aid planning but adds latency.
- Customized Function Calling: Game actions are registered as callable tools using each model's native function-calling interface.
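The harness components above can be sketched as follows. Only the section names (#Game Rules, #Role and Controls, #Task Instruction, #Output Format) and the round structure come from the text; the template wording and class design are assumptions.

```python
from collections import deque

# Fixed structured-prompt template with the four sections named in the text.
PROMPT_TEMPLATE = """#Game Rules
{rules}

#Role and Controls
{controls}

#Task Instruction
{task}

#Output Format
{output_format}"""

class ContextMemory:
    """Rolling memory of the last `max_rounds` interaction rounds
    (user_prompt -> screenshot -> reasoning -> action)."""

    def __init__(self, max_rounds: int):
        self.rounds = deque(maxlen=max_rounds)  # oldest rounds auto-evicted

    def add(self, user_prompt, screenshot, reasoning, action):
        self.rounds.append({
            "user_prompt": user_prompt,
            "screenshot": screenshot,
            "reasoning": reasoning,
            "action": action,
        })

    def as_messages(self):
        return list(self.rounds)
```

A `deque(maxlen=...)` keeps memory bounded: adding a round beyond the limit silently drops the oldest one, which matches the "rolling" behavior and the memory-round sensitivity experiments later in the paper.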
4. Games and Tasks
Table 3: Game genres of the GameWorld benchmark (summary).
| Genre (# Games) | Key Mechanics | Example Games |
|---|---|---|
| Arcade (7) | Fast-paced, reactive control, multi-entity tracking. | pacman, breakout, google-snake |
| Platformer (8) | Precise, physics-aware spatial navigation. | mario-game, captaincallisto, doodle-jump |
| Puzzle (7) | Discrete state-space, logical reasoning, long-horizon planning. | 2048, minesweeper, tetris |
| Runner (8) | High-frequency reactive control, precise timing. | temple-run-2, flappy-bird, chrome-dino |
| Simulation (4) | Open-ended, multi-objective, resource management. | minecraft-clone, monkey-mart, fireboy-and-watergirl |
Each task has a natural-language instruction, a quantitative target, and a verifiable evaluator. Evaluation uses two metrics:
- Success Rate (SR): Fraction of runs meeting the target.
- Progress (PG): Normalized measure of advancement toward the objective.
5. Outcome-Based State-Verifiable Evaluation
Unlike heuristic methods, GameWorld's evaluator reads serialized gameAPI state (e.g., score, coordinates, lives) directly via an injected JavaScript bridge. This yields deterministic, noise-free signals. At each step, the evaluator computes a task score and the run-level best progress:
$$\mathrm{PG}_t = \operatorname{clip}\!\left(\frac{s_t - s_0}{s^{\ast} - s_0},\ 0,\ 1\right), \qquad \mathrm{PG} = \max_{t} \mathrm{PG}_t,$$

where $s_t$ is the task score at step $t$, $s_0$ is the starting score, $s^{\ast}$ is the target score, and $\mathrm{PG} \in [0, 1]$. The model's overall metrics are then averaged over all runs ($N$):

$$\mathrm{SR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\mathrm{PG}^{(i)} = 1\right], \qquad \mathrm{PG} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{PG}^{(i)}.$$
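The SR and PG definitions can be computed with a short sketch. One assumption here: success is equated with normalized progress reaching 1 (i.e., the run hits the target score), matching SR's definition as the fraction of runs meeting the target.

```python
def run_progress(scores, s0, s_target):
    """Best normalized progress over a run's step-wise task scores,
    clipped to [0, 1]."""
    if s_target == s0:          # degenerate task: no progress measurable
        return 0.0
    pgs = [min(max((s - s0) / (s_target - s0), 0.0), 1.0) for s in scores]
    return max(pgs, default=0.0)

def overall_metrics(runs):
    """runs: list of (scores, s0, s_target) tuples.
    Returns (SR, PG), each averaged over all runs."""
    pgs = [run_progress(*r) for r in runs]
    sr = sum(pg >= 1.0 for pg in pgs) / len(pgs)  # fraction of runs at target
    pg = sum(pgs) / len(pgs)
    return sr, pg
```

Because PG takes the best progress over the run, transient score losses (e.g., losing a life) do not erase earlier advancement toward the objective.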
Empirical Validation / Results
1. Main Results
Table 6: Main results on GameWorld across 34 games and 170 tasks.
| Model | Arcade SR/PG | Platformer SR/PG | Puzzle SR/PG | Runner SR/PG | Simulation SR/PG | Overall SR/PG | Rank |
|---|---|---|---|---|---|---|---|
| Human Novice Player | 45.7 / 55.5 | 60.0 / 65.6 | 51.4 / 63.1 | 60.0 / 72.0 | 60.0 / 62.0 | 55.3 / 64.1 | – |
| Human Expert Player | 65.7 / 73.9 | 85.0 / 88.0 | 68.6 / 77.1 | 82.5 / 87.8 | 85.0 / 86.0 | 77.1 / 82.6 | – |
| Computer-Use Agents ||||||||
| Seed-1.8 | 8.6 / 31.1 | 25.0 / 40.3 | 25.7 / 52.0 | 27.5 / 50.6 | 5.0 / 11.0 | 20.0 / 39.8 | 1 |
| Claude-Sonnet-4.6 | 8.6 / 27.2 | 22.5 / 36.5 | 20.0 / 43.8 | 30.0 / 55.6 | 10.0 / 16.8 | 19.4 / 38.3 | 2 |
| Gemini-2.5-Computer-Use | 5.7 / 28.0 | 20.0 / 35.8 | 11.4 / 32.2 | 30.0 / 55.4 | 10.0 / 19.3 | 16.5 / 36.1 | 3 |
| Generalist Multimodal Agents ||||||||
| Gemini-3-Flash-Preview | 5.7 / 26.3 | 25.0 / 41.2 | 25.7 / 54.8 | 32.5 / 55.4 | 10.0 / 21.1 | 21.2 / 41.9 | 1 |
| GPT-5.2 | 8.6 / 29.3 | 22.5 / 36.7 | 28.6 / 56.2 | 27.5 / 52.6 | 10.0 / 16.9 | 20.6 / 40.6 | 2 |
| Claude-Sonnet-4.6 | 5.7 / 28.3 | 22.5 / 37.0 | 25.7 / 51.5 | 30.0 / 51.9 | 15.0 / 16.6 | 20.6 / 39.3 | 3 |
Key Findings:
- The best agents (Gemini-3-Flash-Preview PG=41.9%, Seed-1.8 CUA PG=39.8%) are far from novice human performance (PG=64.1%).
- Overall Success Rates (SR) are low (12.4–21.2%), indicating agents often make partial progress but fail to complete tasks.
- Runner games yield the highest progress for many models, while Simulation tasks are the most challenging.
2. Benchmark Robustness Under Repeated Evaluation
Repeated full-benchmark runs (10x) on open-source models show stable aggregate measurements, supporting GameWorld's reproducibility.
Table 7: Repeat-averaged overall SR and PG.
| Model | Agent Interface | Repeats | Overall SR | Overall PG |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B | Computer-Use Agent | 10 | 12.7 ± 1.2 | 30.9 ± 1.1 |
| Qwen3-VL-30B-A3B | Generalist Agent | 10 | 12.5 ± 1.3 | 30.7 ± 1.1 |
| Qwen3-VL-235B-A22B | Computer-Use Agent | 10 | 13.8 ± 0.7 | 30.4 ± 0.7 |
| Qwen3-VL-235B-A22B | Generalist Agent | 10 | 13.6 ± 1.4 | 30.1 ± 0.5 |
3. Capability-Aligned Curriculum Analysis
Games are grouped into a five-level curriculum based on dominant capability bottlenecks (see Fig. 5):
- Level-1 (Basic Control & Timing Grounding): e.g., breakout, stack. Tests basic action grounding.
- Level-2 (System-1 Reactive Control): e.g., flappy-bird, temple-run-2. Tests high-frequency reflexes.
- Level-3 (System-2 Spatial Navigation): e.g., mario-game, pacman. Tests deliberate pathfinding.
- Level-4 (Symbolic Reasoning & Strategy): e.g., 2048, tetris. Tests strategic planning.
- Level-5 (Open-World Coordination & Management): e.g., minecraft-clone, monkey-mart. Tests long-horizon coordination.
Finding: Both interfaces peak at Level-4 (Reasoning) and Level-2 (Reactive Control), but performance drops sharply at Level-1 (Timing Grounding) and Level-5 (Long-Horizon Coordination).
4. Challenges and Analyses
- Real-Time Interaction (GameWorld-RT): When the environment does not pause during inference, performance remains challenging: latency becomes part of the task.

  Table 8: GameWorld-RT results.

  | Model | Real-Time sec/step | SR | PG |
  |---|---|---|---|
  | Qwen3-VL-235B-A22B (CUA) | 6.2 | 17.1 | 33.2 |
  | Qwen3-VL-30B-A3B (CUA) | 2.4 | 15.6 | 33.0 |

- Context-Memory Sensitivity: Increasing memory rounds raises latency and input tokens. Performance improves modestly for Generalist agents but declines for CUAs, suggesting low-level action traces are harder for models to interpret usefully.

  Table 9: Memory-round sensitivity.

  | Memory Rounds | Model | Input Tokens | sec/step | PG |
  |---|---|---|---|---|
  | 0 | Qwen3-VL-235B-A22B (GEN) | 1278 | 5.5 | 30.0 |
  | 2 | Qwen3-VL-235B-A22B (GEN) | 3052 | 8.6 | 30.6 |
  | 157 | Qwen3-VL-235B-A22B (CUA) | 5627 | 12.8 | 28.7 |

- Action Validity: The Invalid Action Rate (IAR) measures instruction-following reliability, split into No-Tool-Call (NTC) and Out-of-Space (OOS) categories. Most proprietary models have near-zero IAR, while some open-source models (e.g., GLM-4.6V, IAR = 8.3%) struggle.

  Table 11: Invalid Action Rate (IAR) across agents.

  | Model | IAR (%) | NTC (%) | OOS (%) |
  |---|---|---|---|
  | GLM-4.6V (GEN) | 8.3 | 7.6 | 0.7 |
  | Qwen3-VL-30B-A3B (GEN) | 2.7 | 2.7 | <0.1 |
  | Overall Mean | 0.8 | 0.8 | 0.0 |

- Failure Modes: Four categories are identified: Perception failures (misreading visual state), Fine-grained action failures (mistimed execution), Instruction-following failures (violating controls), and Long-horizon memory failures (losing context or repeating loops).
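IAR accounting can be sketched directly from the NTC/OOS split. The valid-action set follows the unified control space in Table 1; the per-step record format (a tool name, or `None` when the model emitted no tool call) is an assumption.

```python
# Atomic events of the unified control space (Table 1).
VALID_ACTIONS = {"mouse_move", "mouse_down", "mouse_up", "scroll",
                 "key_down", "key_up", "wait", "idle"}

def invalid_action_rate(steps):
    """steps: list of emitted tool names, with None for steps where the
    model produced no tool call at all.
    Returns (IAR, NTC, OOS) as fractions of all steps."""
    ntc = sum(a is None for a in steps)                           # No-Tool-Call
    oos = sum(a is not None and a not in VALID_ACTIONS for a in steps)  # Out-of-Space
    n = len(steps)
    return (ntc + oos) / n, ntc / n, oos / n
```

By construction IAR = NTC + OOS, matching the column structure of Table 11.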
Theoretical and Practical Implications
- Standardization in Agent Evaluation: GameWorld demonstrates the importance and feasibility of creating standardized, verifiable, and reproducible benchmarks for interactive agents, moving beyond heuristic evaluation.
- Interface-Aware Design: The study of two distinct agent interfaces (CUA vs. Generalist) under a shared runtime provides a framework for understanding the trade-offs between low-level control precision and high-level semantic planning, informing future agent architecture design.
- Diagnostic Benchmarking: The introduced metrics (SR, PG), curriculum analysis, and failure mode categorization offer tools for diagnosing specific capability bottlenecks in agents (e.g., timing grounding vs. long-horizon planning), guiding targeted improvements.
- Real-World Relevance: The challenges exposed—especially in real-time interaction and long-horizon coordination—directly mirror hurdles for deploying embodied AI in dynamic real-world environments, making game agents a relevant stepping stone.
Conclusion
GameWorld establishes a robust, standardized, and verifiable benchmark for evaluating multimodal game agents. The results across 34 games and 18 model-interface pairs reveal that while current agents can make partial progress, they remain far from reliable task completion and human-level performance. The analyses highlight distinct challenges: real-time interaction couples latency with performance, context-memory benefits are interface-dependent, and instruction-following reliability varies. GameWorld provides a reproducible foundation for advancing research on multimodal agents, with future work needed to automate task/action-space generation and improve agents' capabilities in timing, navigation, and long-horizon coordination.