Summary (Overview)

  • Framework: Introduces OpenGame, the first open-source agentic framework designed for end-to-end web game creation from natural-language specifications.
  • Core Innovation: Develops Game Skill, a reusable capability composed of a Template Skill (evolving library of project skeletons) and a Debug Skill (living protocol of verified fixes) to stabilize project architecture and systematically repair integration errors.
  • Specialized Model: Trains GameCoder-27B, a domain-specialized code LLM through a three-stage pipeline (Continual Pre-training, Supervised Fine-tuning, Reinforcement Learning) to master game engine patterns.
  • Evaluation: Introduces OpenGame-Bench, a new evaluation paradigm that dynamically assesses generated games along Build Health, Visual Usability, and Intent Alignment via headless browser execution and VLM judging.
  • Performance: Achieves state-of-the-art results across 150 diverse game prompts, outperforming strong baseline models and agentic frameworks.

Introduction and Theoretical Foundation

Game development is a complex challenge requiring the orchestration of game engines, real-time loops, and tightly coupled state across many files. While LLMs and code agents excel at isolated programming tasks, they consistently fail at end-to-end game creation, collapsing under:

  1. Logical Incoherence: Loss of global state across the game loop.
  2. Engine-Specific Knowledge Gaps: Misuse or ignorance of framework-native systems.
  3. Cross-File Inconsistencies: Broken scene wiring, mismatched asset keys, or flawed initialization order.

The paper argues that the field must move beyond generalist code agents toward specialist frameworks that understand the intrinsic structure of games. OpenGame is proposed to bridge this gap. Its theoretical foundation rests on structuring the generation process through reusable skills and a specialized model to address these systemic failures.

Methodology

The methodology consists of three pillars: base model training, autonomous agent workflow design, and agent evolution with Game Skills.

Base Model Training (GameCoder-27B)

GameCoder-27B is built on a Qwen3.5-27B backbone and trained via a three-stage pipeline:

  1. Continual Pre-Training (CPT): Adapts the model to interactive web games using a corpus from open-source Phaser/JS repositories and documentation.
  2. Supervised Fine-Tuning (SFT): Aligns the model with instruction-following using a synthetic QA dataset curated by GPT-5.1 and solutions from MiniMax-2.5.
  3. Reinforcement Learning (RL): Refines code generation with execution-based feedback at the component level, rewarding success and test pass rates.
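The execution-based RL reward in stage 3 can be sketched as a function of build success and test pass rate. This is a hypothetical shaping, not the paper's actual reward: the weights `0.4`/`0.6` and the exact signals are illustrative assumptions.

```javascript
// Hypothetical sketch of a component-level execution reward: no credit for
// code that fails to build, a base reward for building, and a bonus scaled
// by the test pass rate. Weights are illustrative, not the paper's values.
function componentReward({ buildSucceeded, testsPassed, testsTotal }) {
  if (!buildSucceeded) return 0;                       // build failure → zero reward
  const passRate = testsTotal > 0 ? testsPassed / testsTotal : 0;
  return 0.4 + 0.6 * passRate;                         // base reward + test bonus
}
```

Gating the entire reward on a successful build mirrors the paper's emphasis on Build Health: a component that compiles but fails some tests still earns partial credit, while one that breaks the build earns none.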

Code Agent Design

The agent follows a structured six-phase workflow:

  1. Initialization & Classification: Uses the classify-game-type tool with a Physics-First Classification rule.
  2. Scaffolding: Executes run_shell_command to copy a shared core and archetype-specific modules/docs.
  3. Design Generation: Invokes generate-gdd to produce a technical Game Design Document (GDD), then uses todo_write for granular planning.
  4. Multimodal Asset Synthesis: Reads asset_protocol.md, then uses generate-game-assets and generate-tilemap to synthesize assets, recording keys from asset-pack.json.
  5. Context-Aware Code Implementation: Merges GDD parameters into gameConfig.json. Uses a Three-Layer Reading Strategy (API summary, source file, implementation guide) and follows a Template Method Pattern, overriding designated hook methods (e.g., setupCustomCollisions).
  6. Verification & Self-Correction: Reads debug_protocol.md, runs npm run build and npm run test, and iteratively repairs failures.
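The Template Method Pattern in phase 5 can be illustrated with a minimal sketch: the template fixes the scene lifecycle, and game-specific code only overrides designated hooks. The `BaseScene` class and every hook name except setupCustomCollisions are hypothetical (a real scene would subclass Phaser.Scene); this is a sketch of the pattern, not the framework's actual template.

```javascript
// Sketch of the hook-driven Template Method Pattern: create() fixes the
// initialization order, so subclasses cannot break scene wiring; they only
// fill in designated extension points. BaseScene is illustrative.
class BaseScene {
  create() {
    this.steps = [];
    this.loadAssets();
    this.setupWorld();
    this.setupCustomCollisions(); // designated hook from the workflow above
  }
  loadAssets() { this.steps.push('assets'); }
  setupWorld() { this.steps.push('world'); }
  setupCustomCollisions() { /* default: no extra collisions */ }
}

class PlatformerScene extends BaseScene {
  // Game-specific behavior lives only in the overridden hook.
  setupCustomCollisions() { this.steps.push('player-vs-platforms'); }
}
```

Because the template owns the call order, cross-file failure modes like flawed initialization order are prevented structurally rather than repaired after the fact.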

Agent Evolution with Game Skills

Game Skill is defined as a reusable capability for converting a specification x into a runnable project y. It consists of two components:

Template Skill: Begins with a single meta template M0 (a minimal game-agnostic skeleton). Maintains an evolving template library L by abstracting stable, reusable fragments from successful tasks. L grows into specialized families (e.g., gravity-based side view, top-down continuous motion).

Debug Skill: Maintains a living debugging protocol P, updated from observed outcomes. Each failure adds a structured entry (error signature, root cause, verified fix). P includes lightweight pre-execution validations for high-frequency inconsistency classes.
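A Debug Skill entry could be stored as a structured record matched against new error logs. The field names, the regex-based matching, and the example entry below are illustrative assumptions; the paper specifies only the (signature, cause, fix) triple.

```javascript
// Sketch of a living debug protocol P: each verified failure becomes a
// (signature, root cause, fix) entry. diagnose() returns null for an
// unrecognized pattern, signalling that a fresh entry is needed.
const protocol = [];

function recordFix(signaturePattern, rootCause, verifiedFix) {
  protocol.push({ signature: new RegExp(signaturePattern), rootCause, verifiedFix });
}

function diagnose(errorLog) {
  const entry = protocol.find((e) => e.signature.test(errorLog));
  return entry ? entry.verifiedFix : null;
}

// Illustrative entry for a mismatched-asset-key failure (a high-frequency
// inconsistency class noted in the introduction).
recordFix(
  'Texture .* not found',
  'asset key in code does not match asset-pack.json',
  'reconcile keys against asset-pack.json before scene load'
);
```

Keying entries on an error signature rather than a full stack trace is what makes the protocol reusable across projects: any future game hitting the same class of failure retrieves the verified fix directly.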

The overall execution is summarized in Algorithm 1:

Algorithm 1: Game Skill execution
Input: User specification x, meta template M0, template library L, debug protocol P
Output: Runnable game project y
Select a template family T ∈ L (initialized as M0 at the beginning of training);
Instantiate T to scaffold a project skeleton y;
Generate game-specific content conditioned on x within the extension points of y;
repeat // until convergence
    Run verification and execution (build, test, run) guided by P;
    if failure observed then
        Diagnose the failure using P and repair y;
        Append a verified (signature, cause, fix) entry to P if the pattern is new;
until y is buildable and runnable;
Optionally extract reusable fragments from y and merge into L;
return y
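Algorithm 1 can be sketched as a plain control loop. The `tools` object stubs out the agent's real capabilities (template selection, scaffolding, content generation, verification, repair), and the bounded iteration count stands in for the debugging budget T discussed in the results; all names here are illustrative.

```javascript
// Sketch of Algorithm 1: select a template from L, scaffold, generate
// content, then verify-and-repair guided by P, appending new verified
// entries to P. The tool implementations are supplied by the caller.
function runGameSkill(spec, library, protocol, maxIters, tools) {
  const template = tools.selectTemplate(spec, library);   // T ∈ L
  let project = tools.generateContent(spec, tools.instantiate(template));
  for (let i = 0; i < maxIters; i++) {
    const failure = tools.verify(project, protocol);      // build, test, run
    if (!failure) break;                                  // buildable and runnable
    project = tools.repair(project, failure, protocol);
    if (!protocol.some((e) => e.signature === failure.signature)) {
      protocol.push(failure);                             // new (signature, cause, fix)
    }
  }
  return project;
}
```

The loop makes the two evolution channels explicit: the protocol grows as a side effect of repair, while template extraction (the optional final step of Algorithm 1) would happen outside this function.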

Empirical Validation / Results

Evaluation is conducted on OpenGame-Bench, a benchmark of 150 browser game tasks spanning five genres.

Experimental Setup & Metrics

  • Benchmark: 150 unique natural-language prompts, sourced from game-jam repositories and AI-assisted design briefs.
  • Evaluation Protocol: The generated project is served locally; a valid run requires a successful build, no fatal runtime errors, and a non-empty screenshot. Each task is evaluated three times.
  • Metrics (scaled to [0, 100]):
    • Build Health (BH): Measures compilation, loading, and rendering without critical errors.
    • Visual Usability (VU): Combines pixel-level heuristic (frame entropy, motion detection) with a VLM judge score.
    • Intent Alignment (IA): Weighted pass rate from per-requirement VLM verdicts against a structured requirement specification.
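The Intent Alignment metric can be illustrated as a weighted pass rate over per-requirement VLM verdicts, scaled to [0, 100]. The exact weighting scheme is an assumption; the paper states only that IA is a weighted pass rate.

```javascript
// Sketch of Intent Alignment as a weighted pass rate: each requirement
// carries a weight, and the score is the passed weight over total weight,
// scaled to [0, 100]. The weights themselves are illustrative.
function intentAlignment(verdicts) {
  // verdicts: [{ requirement, weight, passed }]
  const totalWeight = verdicts.reduce((sum, v) => sum + v.weight, 0);
  if (totalWeight === 0) return 0;
  const passedWeight = verdicts
    .filter((v) => v.passed)
    .reduce((sum, v) => sum + v.weight, 0);
  return (100 * passedWeight) / totalWeight;
}
```

For example, three requirements with weights 2, 1, 1 where only the first two pass would score 100 × 3/4 = 75; weighting lets core mechanics count more than cosmetic requirements.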

Main Results

Table 1 reports mean performance across valid runs.

| Category | System / Model | Build Health | Visual Usability | Intent Alignment |
|---|---|---|---|---|
| Direct LLMs (Open-Source) | Qwen-3.5-Max | 51.8 | 35.5 | 38.9 |
| | MiniMax m2.5 | 39.7 | 39.3 | 31.8 |
| | GLM-4.5 | 46.5 | 45.0 | 31.2 |
| | Kimi K2.5 | 45.6 | 46.8 | 44.6 |
| | DeepSeek V3.2 | 57.0 | 38.9 | 33.5 |
| Direct LLMs (Closed-Source) | Claude Sonnet 4.6 | 58.5 | 50.8 | 50.3 |
| | GPT-5.1 | 57.4 | 52.9 | 49.4 |
| | Gemini 3.1 Pro | 53.6 | 60.2 | 42.1 |
| Agentic Frameworks | qwen-code (w/ Qwen-3.5-Max) | 57.7 | 41.3 | 40.2 |
| | qwen-code (w/ MiniMax m2.5) | 48.1 | 39.1 | 34.6 |
| | qwen-code (w/ Kimi K2.5) | 59.6 | 52.1 | 49.9 |
| | qwen-code (w/ Claude Sonnet 4.6) | 63.2 | 54.3 | 57.8 |
| | Cursor (w/ Kimi K2.5) | 57.1 | 55.2 | 54.2 |
| | Cursor (w/ Claude Sonnet 4.6) | 66.8 | 61.4 | 58.9 |
| Ours (OpenGame) | w/ Qwen-3.5-27B | 62.8 | 53.8 | 49.8 |
| | w/ GameCoder-27B | 63.9 | 57.0 | 54.1 |
| | w/ Claude Sonnet 4.6 | 72.4 | 67.2 | 65.1 |

Key Findings:

  • OpenGame with Claude Sonnet 4.6 establishes a new state-of-the-art, outperforming the strongest baseline (Cursor w/ Claude 4.6) by +5.6 BH, +5.8 VU, and +6.2 IA.
  • The largest gain is in Intent Alignment, indicating better preservation of user-specified mechanics.
  • OpenGame with GameCoder-27B outperforms all direct LLM baselines on BH and IA.

Ablation Studies

Ablation I: Base Code Model Training Pipeline (Table 2)

| Model Stage | Training Components | Build Health | Visual Usability | Intent Alignment |
|---|---|---|---|---|
| Base Model | Qwen-3.5-27B (in OpenGame) | 62.8 | 53.8 | 49.8 |
| Stage 1 | + CPT | 63.2 | 54.7 | 50.6 |
| Stage 2 | + CPT + SFT | 63.5 | 55.7 | 52.5 |
| Stage 3 (Full) | + CPT + SFT + RL | 63.9 | 57.0 | 54.1 |

Each training stage provides incremental gains, with SFT delivering the largest boost in Intent Alignment.

Ablation II: Agent Architecture and Reading Strategies (Table 3)

| Agent Configuration | Build Health | Visual Usability | Intent Alignment |
|---|---|---|---|
| OpenGame (Full Workflow) | 72.4 | 67.2 | 65.1 |
| w/o Hook-Driven Implementation | 62.3 | 57.6 | 53.5 |
| w/o Three-Layer Reading | 67.8 | 61.9 | 56.5 |
| w/o Physics-First Classification | 70.2 | 64.6 | 61.6 |

The Template Method Pattern (Hook-Driven Implementation) is the most critical component. The Three-Layer Reading Strategy also significantly impacts Intent Alignment.

Ablation III: Agent Evolution and Game Skills (Table 4)

| Template Architecture (L) | Debugging Strategy (P) | Build Health | Visual Usability | Intent Alignment |
|---|---|---|---|---|
| Static Skeleton (M0) | Static Rule Checklist | 60.5 | 54.8 | 51.2 |
| Static Skeleton (M0) | Full Living Protocol (P) | 65.4 | 59.2 | 56.3 |
| Partial Evolved Library (2 Families) | Static Rule Checklist | 63.1 | 57.3 | 53.8 |
| Full Evolved Library (5 Families) | Static Rule Checklist | 66.3 | 60.7 | 57.9 |
| Full Evolved Library (5 Families) | Post-Execution Fixes Only | 69.5 | 63.8 | 61.4 |
| Full Evolved Library (5 Families) | Full Living Protocol (P) | 72.4 | 67.2 | 65.1 |

Both Template Skill (evolved library) and Debug Skill (full living protocol) are essential for peak performance. Pre-execution validations in the protocol prevent catastrophic failures.

Figure 3 shows performance improves monotonically with the maximum allowed debugging iterations T, with steep gains between T=0 and T=3, plateauing toward T=5.

Figure 4 shows a genre breakdown of Intent Alignment scores. OpenGame performs best on physics-centric genres (Platformers: 76.8, Top-Down Shooters: 71.4) and degrades on more abstract genres (Strategy: 58.2, Puzzle/UI: 52.6).

Theoretical and Practical Implications

  • Specialization is Key: Reliable game generation requires not just stronger code models, but persistent structural priors (Template Skill) and cumulative debugging knowledge (Debug Skill).
  • Beyond Static Code Evaluation: Evaluating interactive software requires dynamic playability assessment (OpenGame-Bench) rather than static unit tests.
  • Democratizing Game Creation: The framework lowers the barrier to entry, allowing diverse users (creators, educators, content producers) to turn natural-language ideas into executable games.
  • Direction for Agentic Software Engineering: OpenGame pushes code agents beyond discrete problems toward building complex, interactive real-world applications. The evolved skill approach may be applicable to other domains requiring multi-file, stateful systems.

Conclusion

OpenGame combines a structured multi-phase workflow, Game Skill (Template and Debug), and a domain-specialized model (GameCoder-27B) to significantly improve the ability of code agents to generate fully playable games from natural language. The introduced OpenGame-Bench provides a dynamic evaluation paradigm. The results demonstrate that reliable game generation requires structural priors, reusable debugging knowledge, and execution-grounded evaluation. The framework is open-sourced to serve as a foundation for future research on agentic software engineering and AI systems for creative, interactive applications.