GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Summary (Overview)

  • Introduces GameWorld, a comprehensive benchmark with 34 diverse browser games and 170 tasks for standardized evaluation of Multimodal Large Language Model (MLLM) agents, spanning five genres (Runner, Arcade, Platformer, Puzzle, Simulation).
  • Proposes a unified evaluation framework featuring a browser-based sandbox that decouples inference latency from gameplay and an outcome-based state-verifiable evaluator that uses serialized game state for deterministic, noise-free assessment.
  • Studies two agent interfaces: Computer-Use Agents (CUAs) emitting low-level keyboard/mouse controls and Generalist Multimodal Agents acting via deterministic Semantic Action Parsing into a shared executable action space.
  • Key Findings: The best-performing agents (e.g., Gemini-3-Flash-Preview, Seed-1.8) achieve overall progress (PG) of ~40% but low success rates (SR ~20%), remaining far from novice human performance (SR 55.3%, PG 64.1%). Agents perform relatively well on strategic reasoning and reactive control but struggle with basic timing grounding, spatial navigation, and long-horizon coordination.
  • Provides extensive analyses on benchmark robustness, real-time interaction (GameWorld-RT), context-memory sensitivity, and action validity, revealing distinct challenges and trade-offs between the two agent interfaces.

Introduction and Theoretical Foundation

Developing embodied generalist agents for real-world interaction faces challenges like latency, sparse feedback, and irreversible mistakes. Video games, particularly browser games, offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematic evaluation of MLLMs as game agents is hindered by heterogeneous action interfaces and reliance on heuristic verification methods (e.g., OCR, VLM-as-judge), which introduce noise and reduce reproducibility.

Prior benchmarks (e.g., LMGame-Bench, BALROG, VideoGame-Bench) have improved scale and realism but lack standardized, verifiable evaluation. GameWorld aims to bridge this gap by providing a standardized, comprehensive, and verifiable benchmark for multimodal game agents in browser environments. The core motivation is to create a reproducible measurement platform that isolates decision quality from inference speed and provides deterministic evaluation based on game state outcomes.

Methodology

1. Benchmark Design & Components

GameWorld consists of four integrated modules (see Fig. 2):

  1. MLLMs as Game Agents: Implements two agent interfaces.
  2. Browser-based Sandbox Environment: Manages game execution, can pause during inference, and ensures an isolated observation-action loop.
  3. Games & Tasks Library: 34 games across 5 genres with 170 natural-language task instructions.
  4. Outcome-based State-Verifiable Evaluation: Uses a JavaScript bridge to access serialized gameAPI state for deterministic metric computation.

2. Agent Interfaces and Action Space

Two agent interfaces are defined and normalized to a unified control space of atomic human-computer interaction events (mouse_move, key_down, wait, etc.).

Table 1: Two game agent interfaces and action-space taxonomy.

| Game Agent Interface | Action Space Description |
|---|---|
| Computer-Use Agent (CUA) | Computer-use function calls: native mouse and keyboard events, e.g., left_click(x,y), press_key(key). |
| Generalist Multimodal Agent | Game-specific function calls: semantic functions parsed into low-level controls, e.g., move_forward(), action_jump(). |
| Unified Control Space (Atomic Events) | Mouse: mouse_move(x,y), mouse_down(button), mouse_up(button), scroll(amount); Keyboard: key_down(key), key_up(key); Others: wait(duration), idle() |
  • Computer-Use Agents (CUAs): Emit low-level controls directly. Must output exactly one executable action per step.
  • Generalist Agents: Emit high-level semantic actions. A deterministic Semantic Action Parser maps each semantic action to a fixed low-level command. Also enforces one action per step.
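The deterministic Semantic Action Parser can be sketched as a fixed lookup from semantic actions to atomic-event sequences. This is an illustrative reconstruction, assuming a table-based mapping; the event tuples and table entries are hypothetical, not the benchmark's actual API:

```python
# Illustrative Semantic Action Parser: each high-level semantic action
# expands deterministically into a fixed sequence of atomic events from
# the unified control space. The mapping below is a hypothetical example.
ATOMIC_EVENTS = {"mouse_move", "mouse_down", "mouse_up", "scroll",
                 "key_down", "key_up", "wait", "idle"}

SEMANTIC_TABLE = {
    "move_forward": [("key_down", "ArrowRight"), ("wait", 0.1), ("key_up", "ArrowRight")],
    "action_jump":  [("key_down", "Space"), ("wait", 0.05), ("key_up", "Space")],
}

def parse_semantic_action(name):
    """Deterministically expand one semantic action into atomic events."""
    if name not in SEMANTIC_TABLE:
        # An unrecognized name would be counted as an Out-of-Space action.
        raise ValueError(f"out-of-space action: {name}")
    events = SEMANTIC_TABLE[name]
    assert all(kind in ATOMIC_EVENTS for kind, _ in events)
    return events
```

Because the mapping is a fixed table, the same semantic action always yields the same low-level command sequence, which keeps the two interfaces comparable within the shared control space.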

3. Agent Harnesses

A shared agent harness standardizes components across all models:

  • Structured Prompt: Fixed template with #Game Rules, #Role and Controls, #Task Instruction, and #Output Format.
  • Context Memory: Rolling memory module storing recent interaction rounds (user_prompt → screenshot → reasoning → action).
  • Reasoning: Supports deliberate thinking, which can aid planning but adds latency.
  • Customized Function Calling: Game actions are registered as callable tools using each model's native function-calling interface.
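As a rough sketch, the rolling context memory can be implemented with a bounded deque. The round fields follow the harness description above, but the class name and storage format are assumptions:

```python
from collections import deque

class ContextMemory:
    """Rolling memory keeping only the most recent K interaction rounds
    (user_prompt -> screenshot -> reasoning -> action)."""

    def __init__(self, max_rounds):
        self.rounds = deque(maxlen=max_rounds)  # oldest rounds are evicted

    def add_round(self, user_prompt, screenshot, reasoning, action):
        self.rounds.append({
            "user_prompt": user_prompt,
            "screenshot": screenshot,
            "reasoning": reasoning,
            "action": action,
        })

    def as_messages(self):
        # Flatten the stored rounds into a list for the next model call.
        return list(self.rounds)

mem = ContextMemory(max_rounds=2)
for step in range(5):
    mem.add_round(f"prompt-{step}", f"img-{step}", "...", f"act-{step}")
# Only the two most recent rounds survive in the context window.
```

Larger max_rounds values directly increase input tokens and latency, which is the trade-off measured in the context-memory sensitivity analysis below.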

4. Games and Tasks

Table 3: Game genres of the GameWorld benchmark (summary).

| Genre (# Games) | Key Mechanics | Example Games |
|---|---|---|
| Arcade (7) | Fast-paced, reactive control, multi-entity tracking | pacman, breakout, google-snake |
| Platformer (8) | Precise, physics-aware spatial navigation | mario-game, captaincallisto, doodle-jump |
| Puzzle (7) | Discrete state space, logical reasoning, long-horizon planning | 2048, minesweeper, tetris |
| Runner (8) | High-frequency reactive control, precise timing | temple-run-2, flappy-bird, chrome-dino |
| Simulation (4) | Open-ended, multi-objective, resource management | minecraft-clone, monkey-mart, fireboy-and-watergirl |

Each task has a natural-language instruction, a quantitative target, and a verifiable evaluator. Evaluation uses two metrics:

  • Success Rate (SR): Fraction of runs meeting the target.
  • Progress (PG): Normalized measure of advancement toward the objective.

5. Outcome-Based State-Verifiable Evaluation

Unlike heuristic methods, GameWorld's evaluator reads serialized gameAPI state (e.g., score, coordinates, lives) directly via an injected JavaScript bridge. This yields deterministic, noise-free signals. At each step, the evaluator computes a task score q_{i,t} and the run-level best progress:

\text{progress}_i = \text{clip}_{[0,1]} \left( \frac{q_i^{\max} - b_i}{\tau_i - b_i} \right)

where b_i is the starting score, τ_i is the target score, and q_i^{max} = max_t q_{i,t}. The model's overall metrics are then averaged over all runs R (N = |R|):

\text{SR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[\text{status}_i = \text{success}], \qquad \text{PG} = \frac{1}{N} \sum_{i=1}^{N} \text{progress}_i
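The progress and SR/PG definitions translate directly into code. This is a minimal sketch; the per-run record fields (q_max, b, tau, success) are assumed names for the quantities defined above:

```python
def progress(q_max, b, tau):
    """clip_[0,1]((q_max - b) / (tau - b)): normalized best advancement."""
    return max(0.0, min(1.0, (q_max - b) / (tau - b)))

def aggregate(runs):
    """runs: list of dicts with the start score b, target tau, best score
    q_max, and a success flag. Returns (SR, PG) averaged over all runs."""
    n = len(runs)
    sr = sum(1 for r in runs if r["success"]) / n
    pg = sum(progress(r["q_max"], r["b"], r["tau"]) for r in runs) / n
    return sr, pg

runs = [
    {"q_max": 120, "b": 0, "tau": 100, "success": True},   # exceeded target
    {"q_max": 50,  "b": 0, "tau": 100, "success": False},  # reached halfway
]
sr, pg = aggregate(runs)  # SR = 0.5, PG = (1.0 + 0.5) / 2 = 0.75
```

Note how the clip keeps over-achieving runs at progress 1.0, so PG can never exceed SR's ceiling of full success on every run.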

Empirical Validation / Results

1. Main Results

Table 6: Main results on GameWorld across 34 games and 170 tasks.

| Model | Arcade SR/PG | Platformer SR/PG | Puzzle SR/PG | Runner SR/PG | Simulation SR/PG | Overall SR/PG | Rank |
|---|---|---|---|---|---|---|---|
| Human Novice Player | 45.7 / 55.5 | 60.0 / 65.6 | 51.4 / 63.1 | 60.0 / 72.0 | 60.0 / 62.0 | 55.3 / 64.1 | – |
| Human Expert Player | 65.7 / 73.9 | 85.0 / 88.0 | 68.6 / 77.1 | 82.5 / 87.8 | 85.0 / 86.0 | 77.1 / 82.6 | – |
| Computer-Use Agents | | | | | | | |
| Seed-1.8 | 8.6 / 31.1 | 25.0 / 40.3 | 25.7 / 52.0 | 27.5 / 50.6 | 5.0 / 11.0 | 20.0 / 39.8 | 1 |
| Claude-Sonnet-4.6 | 8.6 / 27.2 | 22.5 / 36.5 | 20.0 / 43.8 | 30.0 / 55.6 | 10.0 / 16.8 | 19.4 / 38.3 | 2 |
| Gemini-2.5-Computer-Use | 5.7 / 28.0 | 20.0 / 35.8 | 11.4 / 32.2 | 30.0 / 55.4 | 10.0 / 19.3 | 16.5 / 36.1 | 3 |
| Generalist Multimodal Agents | | | | | | | |
| Gemini-3-Flash-Preview | 5.7 / 26.3 | 25.0 / 41.2 | 25.7 / 54.8 | 32.5 / 55.4 | 10.0 / 21.1 | 21.2 / 41.9 | 1 |
| GPT-5.2 | 8.6 / 29.3 | 22.5 / 36.7 | 28.6 / 56.2 | 27.5 / 52.6 | 10.0 / 16.9 | 20.6 / 40.6 | 2 |
| Claude-Sonnet-4.6 | 5.7 / 28.3 | 22.5 / 37.0 | 25.7 / 51.5 | 30.0 / 51.9 | 15.0 / 16.6 | 20.6 / 39.3 | 3 |

Key Findings:

  • The best agents (Gemini-3-Flash-Preview PG=41.9%, Seed-1.8 CUA PG=39.8%) are far from novice human performance (PG=64.1%).
  • Overall Success Rates (SR) are low (12.4–21.2%), indicating agents often make partial progress but fail to complete tasks.
  • Runner games yield the highest progress for many models. Simulation tasks are most challenging.

2. Benchmark Robustness Under Repeated Evaluation

Repeated full-benchmark runs (10x) on open-source models show stable aggregate measurements, supporting GameWorld's reproducibility.

Table 7: Repeat-averaged overall SR and PG.

| Model | Agent Interface | Repeats | Overall SR | Overall PG |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B | Computer-Use Agent | 10 | 12.7 ± 1.2 | 30.9 ± 1.1 |
| Qwen3-VL-30B-A3B | Generalist Agent | 10 | 12.5 ± 1.3 | 30.7 ± 1.1 |
| Qwen3-VL-235B-A22B | Computer-Use Agent | 10 | 13.8 ± 0.7 | 30.4 ± 0.7 |
| Qwen3-VL-235B-A22B | Generalist Agent | 10 | 13.6 ± 1.4 | 30.1 ± 0.5 |
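The ± entries in Table 7 can be reproduced as mean plus sample standard deviation over the repeated runs. A minimal sketch, with invented SR values purely for illustration:

```python
import statistics

def repeat_summary(values):
    """Mean and sample standard deviation over repeated benchmark runs,
    matching the 'mean ± sd' reporting style of Table 7."""
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical overall-SR values from five repeats (not real data):
sr_runs = [12.0, 13.0, 12.5, 11.5, 13.5]
mean_sr, sd_sr = repeat_summary(sr_runs)  # 12.5 ± ~0.79
```

Using the sample (not population) standard deviation is the conventional choice when the repeats are treated as draws from the run-to-run variability.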

3. Capability-Aligned Curriculum Analysis

Games are grouped into a five-level curriculum based on dominant capability bottlenecks (see Fig. 5):

  1. Level-1 (Basic Control & Timing Grounding): e.g., breakout, stack. Tests basic action grounding.
  2. Level-2 (System-1 Reactive Control): e.g., flappy-bird, temple-run-2. Tests high-frequency reflexes.
  3. Level-3 (System-2 Spatial Navigation): e.g., mario-game, pacman. Tests deliberate pathfinding.
  4. Level-4 (Symbolic Reasoning & Strategy): e.g., 2048, tetris. Tests strategic planning.
  5. Level-5 (Open-World Coordination & Management): e.g., minecraft-clone, monkey-mart. Tests long-horizon coordination.

Finding: Both interfaces peak at Level-4 (Reasoning) and Level-2 (Reactive Control), but performance drops sharply at Level-1 (Timing Grounding) and Level-5 (Long-Horizon Coordination).

4. Challenges and Analyses

  • Real-Time Interaction (GameWorld-RT): When the environment does not pause during inference, latency becomes part of the task: slow decisions consume game time, coupling decision quality with inference speed. Performance remains well below human level. Table 8: GameWorld-RT results.

    | Model | Real-Time sec/step | SR | PG |
    |---|---|---|---|
    | Qwen3-VL-235B-A22B (CUA) | 6.2 | 17.1 | 33.2 |
    | Qwen3-VL-30B-A3B (CUA) | 2.4 | 15.6 | 33.0 |
  • Context-Memory Sensitivity: Increasing memory rounds raises latency and input tokens. Performance improves modestly for Generalist agents but declines for CUAs, suggesting low-level action traces are harder for models to interpret usefully. Table 9: Memory-round sensitivity.

    | Memory Rounds | Model | Input Tokens | sec/step | PG |
    |---|---|---|---|---|
    | 0 | Qwen3-VL-235B-A22B (GEN) | 1278 | 5.5 | 30.0 |
    | 2 | Qwen3-VL-235B-A22B (GEN) | 3052 | 8.6 | 30.6 |
    | 15 | Qwen3-VL-235B-A22B (CUA) | 5627 | 12.8 | 28.7 |
  • Action Validity: The Invalid Action Rate (IAR) measures instruction-following reliability.

    \text{IAR} = 1 - \frac{\sum_{r \in R} \#\mathrm{valid\_actions}(r)}{\sum_{r \in R} \#\mathrm{proposed\_actions}(r)}

    Categories are No-Tool-Call (NTC) and Out-of-Space (OOS). Most proprietary models have near-zero IAR, while some open-source models (e.g., GLM-4.6V IAR=8.3%) struggle. Table 11: Invalid Action Rate (IAR) across agents.

    | Model | IAR (%) | NTC (%) | OOS (%) |
    |---|---|---|---|
    | GLM-4.6V (GEN) | 8.3 | 7.6 | 0.7 |
    | Qwen3-VL-30B-A3B (GEN) | 2.7 | 2.7 | <0.1 |
    | Overall Mean | 0.8 | 0.8 | 0.0 |
  • Failure Modes: Four categories are identified: Perception failures (misreading visual state), Fine-grained action failures (mistimed execution), Instruction-following failures (violating controls), and Long-horizon memory failures (losing context or repeating loops).
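The IAR definition can be sketched as a count over per-step logs, splitting invalid actions into the two reported categories. The record format here is an assumption about how steps might be logged, not the benchmark's actual schema:

```python
def action_validity(records):
    """records: per-step logs like {"tool_call": name-or-None, "in_space": bool}.
    Returns (IAR, NTC, OOS) as fractions of all proposed actions."""
    n = len(records)
    # No-Tool-Call: the model produced no tool call at all this step.
    ntc = sum(1 for r in records if r["tool_call"] is None) / n
    # Out-of-Space: a tool call was made but falls outside the action space.
    oos = sum(1 for r in records
              if r["tool_call"] is not None and not r["in_space"]) / n
    return ntc + oos, ntc, oos  # IAR = 1 - valid/proposed = NTC + OOS

# Toy log: 8 valid steps, 1 missing tool call, 1 out-of-space call.
records = ([{"tool_call": "move_forward", "in_space": True}] * 8
           + [{"tool_call": None, "in_space": False},
              {"tool_call": "teleport", "in_space": False}])
iar, ntc, oos = action_validity(records)
```

Since every proposed action is either valid, NTC, or OOS, the two categories partition IAR exactly, matching the Table 11 breakdown.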

Theoretical and Practical Implications

  • Standardization in Agent Evaluation: GameWorld demonstrates the importance and feasibility of creating standardized, verifiable, and reproducible benchmarks for interactive agents, moving beyond heuristic evaluation.
  • Interface-Aware Design: The study of two distinct agent interfaces (CUA vs. Generalist) under a shared runtime provides a framework for understanding the trade-offs between low-level control precision and high-level semantic planning, informing future agent architecture design.
  • Diagnostic Benchmarking: The introduced metrics (SR, PG), curriculum analysis, and failure mode categorization offer tools for diagnosing specific capability bottlenecks in agents (e.g., timing grounding vs. long-horizon planning), guiding targeted improvements.
  • Real-World Relevance: The challenges exposed—especially in real-time interaction and long-horizon coordination—directly mirror hurdles for deploying embodied AI in dynamic real-world environments, making game agents a relevant stepping stone.

Conclusion

GameWorld establishes a robust, standardized, and verifiable benchmark for evaluating multimodal game agents. The results across 34 games and 18 model-interface pairs reveal that while current agents can make partial progress, they remain far from reliable task completion and human-level performance. The analyses highlight distinct challenges: real-time interaction couples latency with performance, context-memory benefits are interface-dependent, and instruction-following reliability varies. GameWorld provides a reproducible foundation for advancing research on multimodal agents, with future work needed to automate task/action-space generation and improve agents' capabilities in timing, navigation, and long-horizon coordination.