# OpenGame: Open Agentic Coding for Games

> OpenGame introduces the first open-source agentic framework for end-to-end web game creation, featuring reusable Game Skills and a specialized model that outperforms existing methods.

- **Source:** [arXiv](https://arxiv.org/abs/2604.18394)
- **Published:** 2026-04-22
- **Permalink:** https://picx.dev/p/SFZENs
- **Whiteboard:** https://picx.dev/p/SFZENs/image

## Summary (Overview)
*   **Framework**: Introduces **OpenGame**, the first open-source agentic framework designed for **end-to-end web game creation** from natural-language specifications.
*   **Core Innovation**: Develops **Game Skill**, a reusable capability composed of a **Template Skill** (evolving library of project skeletons) and a **Debug Skill** (living protocol of verified fixes) to stabilize project architecture and systematically repair integration errors.
*   **Specialized Model**: Trains **GameCoder-27B**, a domain-specialized code LLM through a three-stage pipeline (Continual Pre-training, Supervised Fine-tuning, Reinforcement Learning) to master game engine patterns.
*   **Evaluation**: Introduces **OpenGame-Bench**, a new evaluation paradigm that dynamically assesses generated games along **Build Health**, **Visual Usability**, and **Intent Alignment** via headless browser execution and VLM judging.
*   **Performance**: Achieves state-of-the-art results across 150 diverse game prompts, outperforming strong baseline models and agentic frameworks.

## Introduction and Theoretical Foundation
Game development is a complex challenge requiring the orchestration of game engines, real-time loops, and tightly coupled state across many files. While LLMs and code agents excel at isolated programming tasks, they consistently fail at **end-to-end game creation**, collapsing under:
1.  **Logical Incoherence**: Loss of global state across the game loop.
2.  **Engine-Specific Knowledge Gaps**: Misuse or ignorance of framework-native systems.
3.  **Cross-File Inconsistencies**: Broken scene wiring, mismatched asset keys, or flawed initialization order.

The paper argues that the field must move beyond generalist code agents toward **specialist frameworks** that understand the intrinsic structure of games. **OpenGame** is proposed to bridge this gap. Its theoretical foundation rests on structuring the generation process through reusable skills and a specialized model to address these systemic failures.

## Methodology
The methodology consists of three pillars: base model training, autonomous agent workflow design, and agent evolution with Game Skills.

### Base Model Training (GameCoder-27B)
GameCoder-27B is built on a Qwen-3.5-27B backbone and trained via a three-stage pipeline:
1.  **Continual Pre-Training (CPT)**: Adapts the model to interactive web games using a corpus from open-source Phaser/JS repositories and documentation.
2.  **Supervised Fine-Tuning (SFT)**: Aligns the model with instruction-following using a synthetic QA dataset curated by GPT-5.1 and solutions from MiniMax-2.5.
3.  **Reinforcement Learning (RL)**: Refines code generation with execution-based feedback at the component level, rewarding successful execution and higher test pass rates.
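
As a rough illustration of the execution-based reward in stage 3, the sketch below gates on build success and then scales with the test pass rate. The summary does not give the actual reward formula, so `ExecutionResult`, `component_reward`, the gating rule, and the 0.3/0.7 weights are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ExecutionResult:
    """Outcome of executing one generated component (hypothetical structure)."""
    built: bool          # did the component build and load without errors?
    tests_passed: int    # number of passing tests
    tests_total: int     # number of tests that were run


def component_reward(result: ExecutionResult,
                     build_weight: float = 0.3,
                     test_weight: float = 0.7) -> float:
    """Return a reward in [0, 1]; the weights are illustrative assumptions."""
    if not result.built:
        # A component that does not build earns nothing, regardless of tests.
        return 0.0
    if result.tests_total == 0:
        pass_rate = 0.0
    else:
        pass_rate = result.tests_passed / result.tests_total
    return build_weight + test_weight * pass_rate
```

Gating on the build keeps the policy from being rewarded for code that cannot run at all, while the pass-rate term gives partial credit for partially correct components.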

### Code Agent Design
The agent follows a structured six-phase workflow:
1.  **Initialization & Classification**: Uses `classify-game-type` tool with a **Physics-First Classification** rule.
2.  **Scaffolding**: Executes `run_shell_command` to copy a shared core and archetype-specific modules/docs.
3.  **Design Generation**: Invokes `generate-gdd` to produce a technical Game Design Document (GDD), then uses `todo_write` for granular planning.
4.  **Multimodal Asset Synthesis**: Reads `asset_protocol.md`, then uses `generate-game-assets` and `generate-tilemap` to synthesize assets, recording keys from `asset-pack.json`.
5.  **Context-Aware Code Implementation**: Merges GDD parameters into `gameConfig.json`. Uses a **Three-Layer Reading Strategy** (API summary, source file, implementation guide) and follows a **Template Method Pattern**, overriding designated hook methods (e.g., `setupCustomCollisions`).
6.  **Verification & Self-Correction**: Reads `debug_protocol.md`, runs `npm run build` and `npm run test`, and iteratively repairs failures.
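
The Template Method Pattern from phase 5 can be sketched in a language-agnostic way as follows. The class names, the step names, and the Python rendering of the `setupCustomCollisions` hook are hypothetical; the actual templates target a web game engine and are not reproduced in this summary.

```python
class BaseLevelScene:
    """Hypothetical template-provided base class; generated code does not edit it."""

    def __init__(self):
        self.log = []  # ordered record of the setup steps that ran

    def create(self):
        # Fixed skeleton: the order of these steps is owned by the template.
        self.log.append("load_map")
        self.log.append("spawn_player")
        self.setup_custom_collisions()  # hook: game-specific behaviour goes here
        self.log.append("start_loop")

    def setup_custom_collisions(self):
        """Hook with a safe default: no extra collisions."""


class CoinPlatformerScene(BaseLevelScene):
    """Generated game code: overrides only the designated hook."""

    def setup_custom_collisions(self):
        self.log.append("collide:player<->coin")


scene = CoinPlatformerScene()
scene.create()
print(scene.log)
```

Because the subclass never touches `create`, the initialization order stays fixed by the template, which is the kind of cross-file wiring error the hook-driven approach is meant to avoid.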

### Agent Evolution with Game Skills
**Game Skill** is defined as a reusable capability for converting a specification $x$ into a runnable project $y$. It consists of two components:

**Template Skill**: Begins with a single **meta template** $M_0$ (a minimal game-agnostic skeleton). Maintains an evolving **template library** $L$ by abstracting stable, reusable fragments from successful tasks. $L$ grows into specialized families (e.g., gravity-based side view, top-down continuous motion).

**Debug Skill**: Maintains a **living debugging protocol** $P$, updated from observed outcomes. Each failure adds a structured entry (error signature, root cause, verified fix). $P$ includes lightweight pre-execution validations for high-frequency inconsistency classes.
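
A minimal sketch of what a protocol entry and one pre-execution validation might look like is given below, using mismatched asset keys as the inconsistency class. The field names, the `find_unknown_asset_keys` helper, and the regex-based scan for Phaser-style `add.sprite` / `add.image` calls are assumptions; the summary does not specify how $P$ is stored or how its checks are implemented.

```python
import re
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class ProtocolEntry:
    """One record in the debugging protocol P (field names are assumptions)."""
    signature: str   # pattern that identifies the failure
    root_cause: str  # why the failure happens
    fix: str         # repair that was verified to work


def find_unknown_asset_keys(source: str, declared_keys: Set[str]) -> Set[str]:
    """Pre-execution check: asset keys used in code but never declared.

    Scans for a quoted key inside add.sprite(...) or add.image(...) calls.
    A real check would parse the code rather than rely on a regex.
    """
    pattern = r"add\.(?:sprite|image)\([^)]*?[\"']([\w-]+)[\"']"
    return set(re.findall(pattern, source)) - declared_keys


declared = {"player", "bg_sky"}  # keys recorded from the asset pack
code = 'this.add.sprite(100, 200, "player"); this.add.image(0, 0, "coin");'
missing = find_unknown_asset_keys(code, declared)

entries = [
    ProtocolEntry(
        signature=f"undeclared asset key: {key}",
        root_cause="key used in scene code is absent from the asset pack",
        fix="use a declared key or generate the missing asset",
    )
    for key in sorted(missing)
]
```

Running a check like this before the build catches the mismatch without paying for a full build-and-run cycle.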

The overall execution is summarized in **Algorithm 1**:
```pseudo
Algorithm 1: Game Skill execution
Input: User specification x, meta template M0, template library L, debug protocol P
Output: Runnable game project y
Select a template family T ∈ L (initialized as M0 at the beginning of training);
Instantiate T to scaffold a project skeleton y;
Generate game-specific content conditioned on x within the extension points of y;
repeat // until convergence
    Run verification and execution (build, test, run) guided by P;
    if failure observed then
        Diagnose the failure using P and repair y;
        Append a verified (signature, cause, fix) entry to P if the pattern is new;
until y is buildable and runnable;
Optionally extract reusable fragments from y and merge into L;
return y
```
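
Algorithm 1 can be approximated in Python as below. This is a sketch under stated assumptions: the project is a plain dictionary, content generation is a stand-in assignment, `verify` and `repair` are caller-supplied stubs, and the `max_iters` cap is added here so the loop always terminates (the pseudocode itself repeats until the project runs).

```python
from typing import Callable, List, Optional, Tuple

Entry = Tuple[str, str, str]  # (signature, cause, fix)


def run_game_skill(spec: str,
                   template: dict,
                   protocol: List[Entry],
                   verify: Callable[[dict], Optional[str]],
                   repair: Callable[[dict, str, List[Entry]], Entry],
                   max_iters: int = 5) -> dict:
    """Scaffold, generate, then verify and repair until the project passes."""
    project = dict(template)       # instantiate the selected template
    project["content"] = spec      # stand-in for content generation
    for _ in range(max_iters):
        error = verify(project)    # None means buildable and runnable
        if error is None:
            break
        entry = repair(project, error, protocol)
        if all(sig != entry[0] for sig, _, _ in protocol):
            protocol.append(entry)  # record only new failure patterns
    return project


# Hypothetical stubs: the project fails until its scene is registered.
def verify(project):
    return None if project.get("scene_registered") else "SceneNotFound"


def repair(project, error, protocol):
    project["scene_registered"] = True
    return (error, "scene missing from config", "register scene at startup")


protocol = []
game = run_game_skill("a coin platformer", {"engine": "phaser"},
                      protocol, verify, repair)
```

A second call with the same `protocol` hits the same failure but does not add a duplicate entry, mirroring the "if the pattern is new" condition in the pseudocode.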

## Empirical Validation / Results
Evaluation is conducted on **OpenGame-Bench**, a benchmark of 150 browser game tasks spanning five genres.

### Experimental Setup & Metrics
*   **Benchmark**: 150 unique natural-language prompts, sourced from game-jam repositories and AI-assisted design briefs.
*   **Evaluation Protocol**: Each generated project is served locally. A run counts as valid only if the build succeeds, no fatal runtime errors occur, and the screenshot is non-empty. Each task is evaluated three times.
*   **Metrics** (scaled to [0, 100]):
    *   **Build Health (BH)**: Measures compilation, loading, and rendering without critical errors.
    *   **Visual Usability (VU)**: Combines pixel-level heuristics (frame entropy, motion detection) with a VLM judge score.
    *   **Intent Alignment (IA)**: Weighted pass rate from per-requirement VLM verdicts against a structured requirement specification.
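
Two of the ingredients above can be sketched directly: a weighted pass rate for Intent Alignment and a Shannon-entropy heuristic for detecting blank frames. The function names, the `(weight, passed)` input shape, and the use of a flat list of grayscale values are assumptions; the summary does not state how the benchmark weights requirements or computes frame entropy.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


def intent_alignment(verdicts: Iterable[Tuple[float, bool]]) -> float:
    """Weighted pass rate on a 0-100 scale from (weight, passed) pairs."""
    verdicts = list(verdicts)
    total = sum(weight for weight, _ in verdicts)
    if total == 0:
        return 0.0
    passed = sum(weight for weight, ok in verdicts if ok)
    return 100.0 * passed / total


def frame_entropy(pixels: List[int]) -> float:
    """Shannon entropy in bits of grayscale values; 0.0 for a uniform frame."""
    counts = Counter(pixels)
    n = len(pixels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


# Three requirements: a core mechanic (weight 2) passes, one of two
# secondary requirements (weight 1 each) fails.
score = intent_alignment([(2.0, True), (1.0, False), (1.0, True)])
blank = frame_entropy([0] * 16)          # single-colour screenshot
checker = frame_entropy([0, 255] * 8)    # two equally frequent values
```

A frame whose entropy is near zero is almost certainly a blank or crashed canvas, which is why a cheap pixel-level check is a useful complement to the VLM judge.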

### Main Results
**Table 1** reports mean performance across valid runs.

| Category | System / Model | Build Health | Visual Usability | Intent Alignment |
| :--- | :--- | :--- | :--- | :--- |
| **Direct LLMs (Open-Source)** | Qwen-3.5-Max | 51.8 | 35.5 | 38.9 |
| | MiniMax m2.5 | 39.7 | 39.3 | 31.8 |
| | GLM-4.5 | 46.5 | 45.0 | 31.2 |
| | Kimi K2.5 | 45.6 | 46.8 | 44.6 |
| | DeepSeek V3.2 | 57.0 | 38.9 | 33.5 |
| **Direct LLMs (Closed-Source)** | Claude Sonnet 4.6 | 58.5 | 50.8 | 50.3 |
| | GPT-5.1 | 57.4 | 52.9 | 49.4 |
| | Gemini 3.1 Pro | 53.6 | 60.2 | 42.1 |
| **Agentic Frameworks** | qwen-code (w/ Qwen-3.5-Max) | 57.7 | 41.3 | 40.2 |
| | qwen-code (w/ MiniMax m2.5) | 48.1 | 39.1 | 34.6 |
| | qwen-code (w/ Kimi K2.5) | 59.6 | 52.1 | 49.9 |
| | qwen-code (w/ Claude Sonnet 4.6) | 63.2 | 54.3 | 57.8 |
| | Cursor (w/ Kimi K2.5) | 57.1 | 55.2 | 54.2 |
| | Cursor (w/ Claude Sonnet 4.6) | **66.8** | **61.4** | **58.9** |
| **Ours (OpenGame)** | w/ Qwen-3.5-27B | 62.8 | 53.8 | 49.8 |
| | w/ GameCoder-27B | 63.9 | 57.0 | 54.1 |
| | w/ Claude Sonnet 4.6 | **72.4** | **67.2** | **65.1** |

**Key Findings**:
*   OpenGame with Claude Sonnet 4.6 establishes a new state-of-the-art, outperforming the strongest baseline (Cursor w/ Claude 4.6) by **+5.6 BH**, **+5.8 VU**, and **+6.2 IA**.
*   The largest gain is in **Intent Alignment**, indicating better preservation of user-specified mechanics.
*   OpenGame with GameCoder-27B outperforms all direct LLM baselines on BH and IA.

### Ablation Studies
**Ablation I: Base Code Model Training Pipeline** (Table 2)

| Model Stage | Training Components | Build Health | Visual Usability | Intent Alignment |
| :--- | :--- | :--- | :--- | :--- |
| Base Model | Qwen-3.5-27B (in OpenGame) | 62.8 | 53.8 | 49.8 |
| Stage 1 | + CPT | 63.2 | 54.7 | 50.6 |
| Stage 2 | + CPT + SFT | 63.5 | 55.7 | 52.5 |
| Stage 3 (Full) | + CPT + SFT + RL | 63.9 | 57.0 | 54.1 |

Each training stage provides incremental gains, with SFT delivering the largest boost in Intent Alignment.
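
The per-stage contribution can be checked by differencing consecutive rows of Table 2. The snippet below does this for Intent Alignment; the variable names are illustrative, and the scores are copied from the table.

```python
# Intent Alignment after each training stage, copied from Table 2.
ia_by_stage = [
    ("Base", 49.8),
    ("+ CPT", 50.6),
    ("+ SFT", 52.5),
    ("+ RL", 54.1),
]

# Gain contributed by a stage = its score minus the previous stage's score.
gains = {
    name: round(score - prev_score, 1)
    for (_, prev_score), (name, score) in zip(ia_by_stage, ia_by_stage[1:])
}
largest = max(gains, key=gains.get)
print(gains, largest)
```

The differences are +0.8 for CPT, +1.9 for SFT, and +1.6 for RL, so SFT is the largest single step, with RL close behind.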

**Ablation II: Agent Architecture and Reading Strategies** (Table 3)

| Agent Configuration | Build Health | Visual Usability | Intent Alignment |
| :--- | :--- | :--- | :--- |
| OpenGame (Full Workflow) | 72.4 | 67.2 | 65.1 |
| w/o Hook-Driven Implementation | 62.3 | 57.6 | 53.5 |
| w/o Three-Layer Reading | 67.8 | 61.9 | 56.5 |
| w/o Physics-First Classification | 70.2 | 64.6 | 61.6 |

The **Template Method Pattern (Hook-Driven Implementation)** is the most critical component: removing it costs 10.1 BH, 9.6 VU, and 11.6 IA. Removing the **Three-Layer Reading Strategy** also hurts substantially, most visibly on Intent Alignment (65.1 to 56.5).

**Ablation III: Agent Evolution and Game Skills** (Table 4)

| Template Architecture (L) | Debugging Strategy (P) | Build Health | Visual Usability | Intent Alignment |
| :--- | :--- | :--- | :--- | :--- |
| Static Skeleton (M0) | Static Rule Checklist | 60.5 | 54.8 | 51.2 |
| Static Skeleton (M0) | Full Living Protocol (P) | 65.4 | 59.2 | 56.3 |
| Partial Evolved Library (2 Families) | Static Rule Checklist | 63.1 | 57.3 | 53.8 |
| Full Evolved Library (5 Families) | Static Rule Checklist | 66.3 | 60.7 | 57.9 |
| Full Evolved Library (5 Families) | Post-Execution Fixes Only | 69.5 | 63.8 | 61.4 |
| Full Evolved Library (5 Families) | Full Living Protocol (P) | **72.4** | **67.2** | **65.1** |

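Both **Template Skill** (evolved library) and **Debug Skill** (full living protocol) are needed for peak performance. Pre-execution validations in the protocol prevent catastrophic failures: with the full library, relying on post-execution fixes alone scores 2.9 BH, 3.4 VU, and 3.7 IA lower than the full living protocol.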

**Figure 3** shows performance improves monotonically with the maximum allowed debugging iterations $T$, with steep gains between $T=0$ and $T=3$, plateauing toward $T=5$.

**Figure 4** shows a genre breakdown of Intent Alignment scores. OpenGame performs best on physics-centric genres (**Platformers: 76.8, Top-Down Shooters: 71.4**) and degrades on more abstract genres (**Strategy: 58.2, Puzzle/UI: 52.6**).

## Theoretical and Practical Implications
*   **Specialization is Key**: Reliable game generation requires not just stronger code models, but **persistent structural priors** (Template Skill) and **cumulative debugging knowledge** (Debug Skill).
*   **Beyond Static Code Evaluation**: Evaluating interactive software requires **dynamic playability assessment** (OpenGame-Bench) rather than static unit tests.
*   **Democratizing Game Creation**: The framework lowers the barrier to entry, allowing diverse users (creators, educators, content producers) to turn natural-language ideas into executable games.
*   **Direction for Agentic Software Engineering**: OpenGame pushes code agents beyond discrete problems toward building **complex, interactive real-world applications**. The evolved skill approach may be applicable to other domains requiring multi-file, stateful systems.

## Conclusion
OpenGame combines a structured multi-phase workflow, Game Skill (Template and Debug), and a domain-specialized model (GameCoder-27B) to significantly improve the ability of code agents to generate fully playable games from natural language. The introduced OpenGame-Bench provides a dynamic evaluation paradigm. The results demonstrate that reliable game generation requires structural priors, reusable debugging knowledge, and execution-grounded evaluation. The framework is open-sourced to serve as a foundation for future research on agentic software engineering and AI systems for creative, interactive applications.

