GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
Summary (Overview)
- Introduces GameWorld, a comprehensive benchmark with 34 diverse browser games and 170 tasks for standardized evaluation of Multimodal Large Language Model (MLLM) agents, spanning five genres (Runner, Arcade, Platformer, Puzzle, Simulation).
- Proposes a unified evaluation framework featuring a browser-based sandbox that decouples inference latency from gameplay and an outcome-based state-verifiable evaluator that uses serialized game state for deterministic, noise-free assessment.
- Studies two agent interfaces: Computer-Use Agents (CUAs) emitting low-level keyboard/mouse controls and Generalist Multimodal Agents acting via deterministic Semantic Action Parsing into a shared executable action space.
- Key Findings: The best-performing agents (e.g., Gemini-3-Flash-Preview, Seed-1.8) achieve overall progress (PG) of ~40% but low success rates (SR ~20%), remaining far from novice human performance (SR 55.3%, PG 64.1%). Agents perform relatively well on strategic reasoning and reactive control but struggle with basic timing grounding, spatial navigation, and long-horizon coordination.
- Provides extensive analyses on benchmark robustness, real-time interaction (GameWorld-RT), context-memory sensitivity, and action validity, revealing distinct challenges and trade-offs between the two agent interfaces.
Introduction and Theoretical Foundation
Developing embodied generalist agents for real-world interaction faces challenges like latency, sparse feedback, and irreversible mistakes. Video games, particularly browser games, offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematic evaluation of MLLMs as game agents is hindered by heterogeneous action interfaces and reliance on heuristic verification methods (e.g., OCR, VLM-as-judge), which introduce noise and reduce reproducibility.
Prior benchmarks (e.g., LMGame-Bench, BALROG, VideoGame-Bench) have improved scale and realism but lack standardized, verifiable evaluation. GameWorld aims to bridge this gap by providing a standardized, comprehensive, and verifiable benchmark for multimodal game agents in browser environments. The core motivation is to create a reproducible measurement platform that isolates decision quality from inference speed and provides deterministic evaluation based on game state outcomes.
Methodology
1. Benchmark Design & Components
GameWorld consists of four integrated modules (see Fig. 2):
- MLLMs as Game Agents: Implements two agent interfaces.
- Browser-based Sandbox Environment: Manages game execution, can pause during inference, and ensures an isolated observation-action loop.
- Games & Tasks Library: 34 games across 5 genres with 170 natural-language task instructions.
- Outcome-based State-Verifiable Evaluation: Uses a JavaScript bridge to access serialized gameAPI state for deterministic metric computation.
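The state-verification idea can be sketched as follows. This assumes a Playwright-style `page` object (anything exposing an `evaluate(js)` method); the JavaScript expression, the `window.game` global, and the field names are illustrative assumptions, not the benchmark's actual bridge.

```python
# Hypothetical JS expression reading serialized game state from the page
# context. Real GameWorld games expose different, game-specific globals.
GAME_STATE_JS = """() => ({
    score:  window.game?.score ?? 0,
    lives:  window.game?.lives ?? 0,
    player: window.game?.player
        ? { x: window.game.player.x, y: window.game.player.y }
        : null,
})"""

def read_game_state(page):
    """Read serialized game state directly from the page's JS context,
    bypassing OCR or VLM-as-judge heuristics entirely."""
    return page.evaluate(GAME_STATE_JS)
```

With Playwright, calling `read_game_state(page)` after each agent step yields a plain dict of ground-truth values (score, lives, coordinates) from which metrics can be computed deterministically.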
2. Agent Interfaces and Action Space
Two agent interfaces are defined and normalized to a unified control space of atomic human-computer interaction events (mouse_move, key_down, wait, etc.).
Table 1: Two game agent interfaces and action-space taxonomy.
| Game Agent Interfaces | Action Space Description |
|---|---|
| Computer-Use Agent (CUA) | Action Space: Computer-use function calls. Native tools of mouse and keyboard events (e.g., left_click(x,y), press_key(key)). |
| Generalist Multimodal Agent | Action Space: Game-specific function calls. Semantic functions parsed into low-level controls (e.g., move_forward(), action_jump()). |
| Unified Control Space (Atomic Events) | Mouse: mouse_move(x,y) mouse_down(button) mouse_up(button) scroll(amount) <br> Keyboard: key_down(key) key_up(key) <br> Others: wait(duration) idle() |
- Computer-Use Agents (CUAs): Emit low-level controls directly. Must output exactly one executable action per step.
- Generalist Agents: Emit high-level semantic actions. A deterministic Semantic Action Parser maps each semantic action to a fixed low-level command. Also enforces one action per step.
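The deterministic mapping from semantic actions to the unified control space can be sketched like this. The atomic event names come from Table 1; the specific semantic-to-atomic bindings and timing values are illustrative assumptions.

```python
# Hypothetical bindings: each semantic action maps to one fixed sequence
# of atomic human-computer interaction events from the unified control space.
SEMANTIC_ACTIONS = {
    "action_jump":  [("key_down", "Space"), ("wait", 0.05), ("key_up", "Space")],
    "move_forward": [("key_down", "ArrowRight"), ("wait", 0.2), ("key_up", "ArrowRight")],
    "idle":         [("idle", None)],
}

def parse_semantic_action(name: str) -> list[tuple[str, object]]:
    """Deterministically map one high-level semantic action to its fixed
    low-level command; unknown names are Out-of-Space (OOS) invalid actions."""
    if name not in SEMANTIC_ACTIONS:
        raise ValueError(f"Out-of-space action: {name}")
    return SEMANTIC_ACTIONS[name]
```

Because the mapping is a fixed lookup, two runs emitting the same semantic action always produce identical low-level events, which is what lets both interfaces share one executable action space.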
3. Agent Harnesses
A shared agent harness standardizes components across all models:
- Structured Prompt: Fixed template with #Game Rules, #Role and Controls, #Task Instruction, and #Output Format.
- Context Memory: Rolling memory module storing recent interaction rounds (user_prompt → screenshot → reasoning → action).
- Reasoning: Supports deliberate thinking, which can aid planning but adds latency.
- Customized Function Calling: Game actions are registered as callable tools using each model's native function-calling interface.
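The harness components above can be sketched as follows. Only the section names (#Game Rules, #Role and Controls, #Task Instruction, #Output Format) and the round structure come from the text; the template wording and class design are assumptions.

```python
from collections import deque

# Fixed structured-prompt template with the four sections named in the text.
PROMPT_TEMPLATE = """#Game Rules
{rules}

#Role and Controls
{controls}

#Task Instruction
{task}

#Output Format
{output_format}"""

class ContextMemory:
    """Rolling memory of the last `max_rounds` interaction rounds
    (user_prompt -> screenshot -> reasoning -> action)."""

    def __init__(self, max_rounds: int):
        self.rounds = deque(maxlen=max_rounds)  # oldest rounds auto-evicted

    def add(self, user_prompt, screenshot, reasoning, action):
        self.rounds.append({
            "user_prompt": user_prompt,
            "screenshot": screenshot,
            "reasoning": reasoning,
            "action": action,
        })

    def as_messages(self):
        return list(self.rounds)
```

A `deque(maxlen=...)` keeps memory bounded: adding a round beyond the limit silently drops the oldest one, which matches the "rolling" behavior and the memory-round sensitivity experiments later in the paper.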
4. Games and Tasks
Table 3: Game genres of the GameWorld benchmark (summary).
| Genre (# Games) | Key Mechanics | Example Games |
|---|---|---|
| Arcade (7) | Fast-paced, reactive control, multi-entity tracking. | pacman, breakout, google-snake |
| Platformer (8) | Precise, physics-aware spatial navigation. | mario-game, captaincallisto, doodle-jump |
| Puzzle (7) | Discrete state-space, logical reasoning, long-horizon planning. | 2048, minesweeper, tetris |
| Runner (8) | High-frequency reactive control, precise timing. | temple-run-2, flappy-bird, chrome-dino |
| Simulation (4) | Open-ended, multi-objective, resource management. | minecraft-clone, monkey-mart, fireboy-and-watergirl |
Each task has a natural-language instruction, a quantitative target, and a verifiable evaluator. Evaluation uses two metrics:
- Success Rate (SR): Fraction of runs meeting the target.
- Progress (PG): Normalized measure of advancement toward the objective.
5. Outcome-Based State-Verifiable Evaluation
Unlike heuristic methods, GameWorld's evaluator reads serialized gameAPI state (e.g., score, coordinates, lives) directly via an injected JavaScript bridge. This yields deterministic, noise-free signals. At each step, the evaluator computes a task score and the run-level best progress:
$$\mathrm{PG}_t = \operatorname{clip}\!\left(\frac{s_t - s_0}{s^{\ast} - s_0},\ 0,\ 1\right), \qquad \mathrm{PG} = \max_{t} \mathrm{PG}_t,$$

where $s_t$ is the task score at step $t$, $s_0$ is the starting score, $s^{\ast}$ is the target score, and $\mathrm{PG} \in [0, 1]$. The model's overall metrics are then averaged over all runs ($N$):

$$\mathrm{SR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\!\left[\mathrm{PG}^{(i)} = 1\right], \qquad \mathrm{PG} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{PG}^{(i)}.$$
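The SR and PG definitions can be computed with a short sketch. One assumption here: success is equated with normalized progress reaching 1 (i.e., the run hits the target score), matching SR's definition as the fraction of runs meeting the target.

```python
def run_progress(scores, s0, s_target):
    """Best normalized progress over a run's step-wise task scores,
    clipped to [0, 1]."""
    if s_target == s0:          # degenerate task: no progress measurable
        return 0.0
    pgs = [min(max((s - s0) / (s_target - s0), 0.0), 1.0) for s in scores]
    return max(pgs, default=0.0)

def overall_metrics(runs):
    """runs: list of (scores, s0, s_target) tuples.
    Returns (SR, PG), each averaged over all runs."""
    pgs = [run_progress(*r) for r in runs]
    sr = sum(pg >= 1.0 for pg in pgs) / len(pgs)  # fraction of runs at target
    pg = sum(pgs) / len(pgs)
    return sr, pg
```

Because PG takes the best progress over the run, transient score losses (e.g., losing a life) do not erase earlier advancement toward the objective.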
Empirical Validation / Results
1. Main Results
Table 6: Main results on GameWorld across 34 games and 170 tasks.
| Model | Arcade SR/PG | Platformer SR/PG | Puzzle SR/PG | Runner SR/PG | Simulation SR/PG | Overall SR/PG | Rank |
|---|---|---|---|---|---|---|---|
| Human Novice Player | 45.7 / 55.5 | 60.0 / 65.6 | 51.4 / 63.1 | 60.0 / 72.0 | 60.0 / 62.0 | 55.3 / 64.1 | – |
| Human Expert Player | 65.7 / 73.9 | 85.0 / 88.0 | 68.6 / 77.1 | 82.5 / 87.8 | 85.0 / 86.0 | 77.1 / 82.6 | – |
| Computer-Use Agents ||||||||
| Seed-1.8 | 8.6 / 31.1 | 25.0 / 40.3 | 25.7 / 52.0 | 27.5 / 50.6 | 5.0 / 11.0 | 20.0 / 39.8 | 1 |
| Claude-Sonnet-4.6 | 8.6 / 27.2 | 22.5 / 36.5 | 20.0 / 43.8 | 30.0 / 55.6 | 10.0 / 16.8 | 19.4 / 38.3 | 2 |
| Gemini-2.5-Computer-Use | 5.7 / 28.0 | 20.0 / 35.8 | 11.4 / 32.2 | 30.0 / 55.4 | 10.0 / 19.3 | 16.5 / 36.1 | 3 |
| Generalist Multimodal Agents ||||||||
| Gemini-3-Flash-Preview | 5.7 / 26.3 | 25.0 / 41.2 | 25.7 / 54.8 | 32.5 / 55.4 | 10.0 / 21.1 | 21.2 / 41.9 | 1 |
| GPT-5.2 | 8.6 / 29.3 | 22.5 / 36.7 | 28.6 / 56.2 | 27.5 / 52.6 | 10.0 / 16.9 | 20.6 / 40.6 | 2 |
| Claude-Sonnet-4.6 | 5.7 / 28.3 | 22.5 / 37.0 | 25.7 / 51.5 | 30.0 / 51.9 | 15.0 / 16.6 | 20.6 / 39.3 | 3 |
Key Findings:
- The best agents (Gemini-3-Flash-Preview PG=41.9%, Seed-1.8 CUA PG=39.8%) are far from novice human performance (PG=64.1%).
- Overall Success Rates (SR) are low (12.4–21.2%), indicating agents often make partial progress but fail to complete tasks.
- Runner games yield the highest progress for many models, while Simulation tasks are the most challenging.
2. Benchmark Robustness Under Repeated Evaluation
Repeated full-benchmark runs (10x) on open-source models show stable aggregate measurements, supporting GameWorld's reproducibility.
Table 7: Repeat-averaged overall SR and PG.
| Model | Agent Interface | Repeats | Overall SR | Overall PG |
|---|---|---|---|---|
| Qwen3-VL-30B-A3B | Computer-Use Agent | 10 | 12.7 ± 1.2 | 30.9 ± 1.1 |
| Qwen3-VL-30B-A3B | Generalist Agent | 10 | 12.5 ± 1.3 | 30.7 ± 1.1 |
| Qwen3-VL-235B-A22B | Computer-Use Agent | 10 | 13.8 ± 0.7 | 30.4 ± 0.7 |
| Qwen3-VL-235B-A22B | Generalist Agent | 10 | 13.6 ± 1.4 | 30.1 ± 0.5 |
3. Capability-Aligned Curriculum Analysis
Games are grouped into a five-level curriculum based on dominant capability bottlenecks (see Fig. 5):
- Level-1 (Basic Control & Timing Grounding): e.g., breakout, stack. Tests basic action grounding.
- Level-2 (System-1 Reactive Control): e.g., flappy-bird, temple-run-2. Tests high-frequency reflexes.
- Level-3 (System-2 Spatial Navigation): e.g., mario-game, pacman. Tests deliberate pathfinding.
- Level-4 (Symbolic Reasoning & Strategy): e.g., 2048, tetris. Tests strategic planning.
- Level-5 (Open-World Coordination & Management): e.g., minecraft-clone, monkey-mart. Tests long-horizon coordination.
Finding: Both interfaces peak at Level-4 (Reasoning) and Level-2 (Reactive Control), but performance drops sharply at Level-1 (Timing Grounding) and Level-5 (Long-Horizon Coordination).
4. Challenges and Analyses
- Real-Time Interaction (GameWorld-RT): When the environment does not pause during inference, performance remains challenging: latency becomes part of the task.

  Table 8: GameWorld-RT results.

  | Model | Real-Time sec/step | SR | PG |
  |---|---|---|---|
  | Qwen3-VL-235B-A22B (CUA) | 6.2 | 17.1 | 33.2 |
  | Qwen3-VL-30B-A3B (CUA) | 2.4 | 15.6 | 33.0 |

- Context-Memory Sensitivity: Increasing memory rounds raises latency and input tokens. Performance improves modestly for Generalist agents but declines for CUAs, suggesting low-level action traces are harder for models to interpret usefully.

  Table 9: Memory-round sensitivity.

  | Memory Rounds | Model | Input Tokens | sec/step | PG |
  |---|---|---|---|---|
  | 0 | Qwen3-VL-235B-A22B (GEN) | 1278 | 5.5 | 30.0 |
  | 2 | Qwen3-VL-235B-A22B (GEN) | 3052 | 8.6 | 30.6 |
  | 157 | Qwen3-VL-235B-A22B (CUA) | 5627 | 12.8 | 28.7 |

- Action Validity: The Invalid Action Rate (IAR) measures instruction-following reliability, split into No-Tool-Call (NTC) and Out-of-Space (OOS) categories. Most proprietary models have near-zero IAR, while some open-source models (e.g., GLM-4.6V, IAR = 8.3%) struggle.

  Table 11: Invalid Action Rate (IAR) across agents.

  | Model | IAR (%) | NTC (%) | OOS (%) |
  |---|---|---|---|
  | GLM-4.6V (GEN) | 8.3 | 7.6 | 0.7 |
  | Qwen3-VL-30B-A3B (GEN) | 2.7 | 2.7 | <0.1 |
  | Overall Mean | 0.8 | 0.8 | 0.0 |

- Failure Modes: Four categories are identified: Perception failures (misreading visual state), Fine-grained action failures (mistimed execution), Instruction-following failures (violating controls), and Long-horizon memory failures (losing context or repeating loops).
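IAR accounting can be sketched directly from the NTC/OOS split. The valid-action set follows the unified control space in Table 1; the per-step record format (a tool name, or `None` when the model emitted no tool call) is an assumption.

```python
# Atomic events of the unified control space (Table 1).
VALID_ACTIONS = {"mouse_move", "mouse_down", "mouse_up", "scroll",
                 "key_down", "key_up", "wait", "idle"}

def invalid_action_rate(steps):
    """steps: list of emitted tool names, with None for steps where the
    model produced no tool call at all.
    Returns (IAR, NTC, OOS) as fractions of all steps."""
    ntc = sum(a is None for a in steps)                           # No-Tool-Call
    oos = sum(a is not None and a not in VALID_ACTIONS for a in steps)  # Out-of-Space
    n = len(steps)
    return (ntc + oos) / n, ntc / n, oos / n
```

By construction IAR = NTC + OOS, matching the column structure of Table 11.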
Theoretical and Practical Implications
- Standardization in Agent Evaluation: GameWorld demonstrates the importance and feasibility of creating standardized, verifiable, and reproducible benchmarks for interactive agents, moving beyond heuristic evaluation.
- Interface-Aware Design: The study of two distinct agent interfaces (CUA vs. Generalist) under a shared runtime provides a framework for understanding the trade-offs between low-level control precision and high-level semantic planning, informing future agent architecture design.
- Diagnostic Benchmarking: The introduced metrics (SR, PG), curriculum analysis, and failure mode categorization offer tools for diagnosing specific capability bottlenecks in agents (e.g., timing grounding vs. long-horizon planning), guiding targeted improvements.
- Real-World Relevance: The challenges exposed—especially in real-time interaction and long-horizon coordination—directly mirror hurdles for deploying embodied AI in dynamic real-world environments, making game agents a relevant stepping stone.
Conclusion
GameWorld establishes a robust, standardized, and verifiable benchmark for evaluating multimodal game agents. The results across 34 games and 18 model-interface pairs reveal that while current agents can make partial progress, they remain far from reliable task completion and human-level performance. The analyses highlight distinct challenges: real-time interaction couples latency with performance, context-memory benefits are interface-dependent, and instruction-following reliability varies. GameWorld provides a reproducible foundation for advancing research on multimodal agents, with future work needed to automate task/action-space generation and improve agents' capabilities in timing, navigation, and long-horizon coordination.