GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers

Summary (Overview)

  • Introduces GBQA, a novel benchmark containing 30 diverse games with 124 human-verified bugs across three difficulty levels (Easy, Medium, Hard) to evaluate LLMs' ability to autonomously discover software bugs in interactive environments.
  • Proposes a scalable multi-agent system for constructing game environments and injecting bugs, with a human-in-the-loop verification process to ensure annotation correctness.
  • Develops a baseline interactive QA agent equipped with a ReAct-driven exploration loop, verification-based reflection, and a hierarchical memory module for long-horizon bug discovery.
  • Empirical results show autonomous bug discovery is highly challenging: the best-performing model (Claude-4.6-Opus in thinking mode) identifies only 48.39% of bugs, revealing a significant gap compared to LLM performance on code generation/fixing tasks.
  • Provides comprehensive analysis including ablation studies on step budgets and memory components, reliability validation of annotations and evaluation, and a case study demonstrating a fully autonomous discovery-to-patch pipeline.

Introduction and Theoretical Foundation

The evolution of software development in the LLM era is moving from human-driven workflows (Fig. 1a) and human-LLM collaborative coding (Fig. 1b) towards fully autonomous systems capable of generating code, detecting bugs, and fixing them without human intervention (Fig. 1c). While significant progress has been made in code generation and fixing, the testing and bug discovery side of the development cycle remains largely unexplored.

Bug discovery poses fundamentally different and harder challenges than code generation or fixing:

  1. Ill-defined Objective: The agent must proactively determine that "something is wrong" without being told what to look for.
  2. Comprehensive Exploration: It demands systematic planning over large behavioral state spaces rather than targeted edits.
  3. Reasoning about Specifications: The agent must reason about the gap between expected and actual runtime behavior, often without explicit specifications.

The paper takes game development as a representative domain because games are self-contained software systems with internal state management, user input handling, and output rendering. They require long-horizon, dynamic interaction, making them a strong proxy for real-world software engineering. Bug discovery in games corresponds directly to Quality Assurance (QA) in real applications.

Motivated by these considerations, the authors introduce the Game Benchmark for Quality Assurance (GBQA) to evaluate LLMs' ability to autonomously discover bugs in interactive environments, addressing the upstream gap in the autonomous software engineering pipeline.

Methodology

Task Definition

A game environment is defined as a tuple E = (S, A, T, s_0), where:

  • S: state space
  • A: action space
  • T: S × A → S: state transition function
  • s_0 ∈ S: initial state

Optionally, documentation context D (design documents, source code) may be provided. The agent interacts over multiple turns, forming an exploration trajectory τ = (s_0, a_0, s_1, a_1, …, s_N).

Let B = {B_1, B_2, …, B_M} denote the set of ground-truth bugs. After exploring E, the agent produces a set of bug reports R = {R_1, R_2, …, R_K}. The objective is to maximize coverage of B by R.

Two operational modes are defined:

  • Player Exploring Mode: D = ∅; the agent relies solely on interactive observations.
  • Quality Assurance Mode: D is provided; the agent performs specification-driven testing.

The general procedure is summarized in Algorithm 1.
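A minimal sketch of this interaction loop, using a hypothetical toy environment and agent (none of these names come from the paper, and Algorithm 1 itself is not reproduced here):

```python
from dataclasses import dataclass

@dataclass
class BugReport:
    trigger_action: str
    observed: str
    expected: str

class ToyEnv:
    """Stand-in environment: a counter with an injected wrap-around bug."""
    def reset(self):
        self.state = 0          # s_0
        return self.state
    def step(self, action):
        # Bug: the counter wrongly resets to 0 instead of reaching 3.
        self.state = 0 if self.state == 2 else self.state + 1
        return self.state

class ToyAgent:
    """Stand-in agent expecting the counter to increase monotonically."""
    def act(self, obs):
        return "increment"
    def reflect(self, obs, action, next_obs):
        # Flag a discrepancy between observed and expected behaviour.
        if next_obs <= obs:
            return [BugReport(action, f"counter={next_obs}", f"counter={obs + 1}")]
        return []

def run_qa_episode(env, agent, max_steps):
    """Explore for up to max_steps transitions, collecting bug reports."""
    obs = env.reset()
    reports = []
    for _ in range(max_steps):
        action = agent.act(obs)
        next_obs = env.step(action)      # s_{t+1} = T(s_t, a_t)
        reports.extend(agent.reflect(obs, action, next_obs))
        obs = next_obs
    return reports
```

In the paper's agent, `reflect` is LLM-driven reasoning rather than a hand-coded check; the loop structure is the same.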

Game Environment Builder

A hierarchical multi-agent collaboration system simulates a professional game studio:

  • Producer Agent: Decomposes high-level concepts into structured proposals.
  • Specialized Teams (Design, Programming, Art): Each with a Team Lead Agent that decomposes tasks and coordinates worker agents.
  • Shared Support Platform: Uses the Agent Skills paradigm with reusable, self-contained modules.

All games are deployed as lightweight web applications with strict frontend-backend separation. An iterative complexity scaling mechanism ensures non-trivial environments: if the initial bug count is below a threshold τ, additional features are automatically introduced.
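The scaling mechanism amounts to a loop that keeps adding features until the environment can host enough bugs. A sketch with hypothetical names (`count_injectable_bugs` would be an LLM-driven estimate in practice):

```python
def scale_complexity(game, tau, feature_backlog, count_injectable_bugs):
    """Add features from the backlog until the game supports at least
    tau injectable bugs (hypothetical sketch of the scaling loop)."""
    added = []
    for feature in feature_backlog:
        # Stop as soon as the environment clears the complexity threshold.
        if count_injectable_bugs(game, added) >= tau:
            break
        added.append(feature)
    return added
```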

Benchmark Construction

GBQA consists of 30 diverse game environments across six genres (Action, Adventure, Role-Playing, Strategy, Simulation, Puzzle) with 124 human-verified bugs.

Discovery Difficulty Taxonomy:

  • Easy: Surface-level perception inconsistencies identifiable from a single observation.
  • Medium: Violations of gameplay logic requiring reasoning over short interaction sequences.
  • Hard: Long-horizon consistency tracking across extended trajectories.

Ground-Truth Curation: A two-phase protocol integrates automated discovery with expert validation by three professional QA engineers, with disagreements resolved by majority voting.

Evaluation Metrics

The primary metric is Recall, defined as:

Recall = |B⁺| / |B|

where B⁺ = {B_j ∈ B : ∃ R_i ∈ R, f(R_i, B_j) = 1} and f : R × B → {0, 1} is a critic agent determining semantic correspondence.

Recall is prioritized because false negatives (undetected defects) carry higher costs in practical QA workflows than false positives.
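Computing the metric is straightforward once a matching function f is fixed. Below, a hypothetical substring critic stands in for the paper's LLM critic agent:

```python
def recall(reports, bugs, critic):
    """Recall = |B+| / |B|: the fraction of ground-truth bugs matched by
    at least one report, as judged by the critic f(R_i, B_j) -> {0, 1}."""
    if not bugs:
        return 0.0
    matched = {j for j, bug in enumerate(bugs)
               if any(critic(report, bug) for report in reports)}
    return len(matched) / len(bugs)

# Hypothetical stand-in critic: substring match instead of an LLM judge.
def simple_critic(report, bug):
    return bug.lower() in report.lower()
```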

Baseline Agent Architecture

ReAct-Driven Exploration with Verification-Based Reflection

The agent follows the ReAct paradigm, augmented with a step-level reflection and verification mechanism. After each transition (o_t, a_t, o_{t+1}), the agent:

  1. Evaluates if the outcome aligns with expected behavior.
  2. Upon detecting discrepancy, formulates a bug hypothesis (triggering action, observed vs. expected behavior, violation type).
  3. Initiates a local verification phase to collect corroborating evidence.
  4. Assigns a confidence score, with only high-confidence candidates promoted to final reports.
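These four steps can be condensed into a single reflection routine; `expect` and `verify` are hypothetical callables standing in for LLM reasoning:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    trigger_action: str
    observed: str
    expected: str
    confidence: float = 0.0

def reflect_and_verify(transition, expect, verify, threshold=0.8):
    """Step-level reflection: compare the outcome against expectation,
    form a hypothesis on mismatch, verify it, keep only high confidence."""
    obs, action, next_obs = transition
    predicted = expect(obs, action)                          # 1. expected behaviour
    if next_obs == predicted:
        return None                                          # outcome aligns
    hyp = Hypothesis(action, str(next_obs), str(predicted))  # 2. bug hypothesis
    hyp.confidence = verify(hyp)                             # 3. local verification
    return hyp if hyp.confidence >= threshold else None      # 4. promotion gate
```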

Hierarchical Memory Module

  • In-Session Memory: Maintains structured working memory tracking game state evolution. Uses a sliding-window strategy: the k most recent steps are preserved in detail, while older steps are compressed into abstraction-oriented summaries that preserve causal structure.
  • Cross-Session Memory: Persistent store for each game that distills accumulated experience (explored regions, confirmed bugs, unresolved hypotheses) into structured summaries injected into subsequent sessions.
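The in-session half of the module can be sketched with a sliding window; the `summarize` callable (an LLM call in the paper's agent) is stubbed here with simple truncation:

```python
from collections import deque

class InSessionMemory:
    """Keep the k most recent steps verbatim; compress evicted older
    steps into a running summary (summarization is a stub here)."""
    def __init__(self, k, summarize=None):
        self.k = k
        self.recent = deque()
        self.summary = []
        # Default summarizer: keep a truncated trace of the evicted step.
        self.summarize = summarize or (lambda old, step: old + [step[:40]])

    def add(self, step):
        self.recent.append(step)
        if len(self.recent) > self.k:
            evicted = self.recent.popleft()
            self.summary = self.summarize(self.summary, evicted)

    def context(self):
        """Prompt context: compressed history plus detailed recent steps."""
        return {"summary": self.summary, "recent": list(self.recent)}
```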

Empirical Validation / Results

Experimental Setup

  • Models: Evaluated diverse frontier LLMs including Claude-4.6-Opus, GPT-5.2, Gemini-3, DeepSeek, Llama, and Qwen series in both instruct and thinking modes.
  • Settings: Each model serves as backbone for the baseline agent, evaluated under both Player Exploring and Quality Assurance modes.
  • Budgets: Maximum interaction steps T ∈ {50, 100, 200, 500}.
  • Metric: Recall computed via automated evaluation by critic agent.

Main Results

Table 1: GBQA Leaderboard (Recall % in Player Exploring Mode at step budgets of 50, 100, and 200)

| Model | 50 | 100 | 200 |
|---|---:|---:|---:|
| **LLMs in Instruct Mode** | | | |
| Claude-4.6-Opus | 14.52 | 20.97 | 25.81 |
| Claude-4.5-Sonnet | 11.29 | 16.13 | 18.55 |
| GPT-5.2 | 7.26 | 10.48 | 12.90 |
| Kimi-K2.5-1T-A32B | 6.45 | 9.68 | 11.29 |
| Gemini-3-Flash | 6.45 | 8.87 | 10.48 |
| DeepSeek-V3.2 | 6.45 | 9.68 | 10.48 |
| Llama-3.1-8B | 2.42 | 3.23 | 4.84 |
| Llama-3.1-70B | 4.03 | 6.45 | 8.06 |
| Qwen3-8B | 4.03 | 5.65 | 6.45 |
| Qwen3-32B | 4.84 | 7.26 | 9.68 |
| Qwen3-235B-A22B | 5.65 | 9.68 | 10.48 |
| Qwen3.5-397B-A17B | 8.06 | 11.29 | 13.71 |
| **LLMs in Thinking Mode** | | | |
| Claude-4.6-Opus-Thinking | 16.94 | 23.39 | 29.03 |
| Claude-4.5-Sonnet-Thinking | 12.10 | 17.74 | 21.77 |
| OpenAI-o3 | 11.29 | 16.13 | 20.97 |
| Kimi-K2.5-1T-A32B-Thinking | 8.87 | 12.90 | 16.13 |
| Gemini-3-Pro | 10.48 | 15.32 | 19.35 |
| DeepSeek-R1 | 11.29 | 17.74 | 22.58 |
| Qwen3-8B-Thinking | 7.26 | 10.48 | 12.90 |
| Qwen3-32B-Thinking | 9.68 | 14.52 | 19.35 |
| Qwen3-235B-A22B-Thinking | 10.48 | 16.13 | 20.97 |
| Qwen3.5-397B-A17B-Thinking | 13.71 | 19.35 | 25.00 |

Key Findings:

  1. Challenging Benchmark: Autonomous bug discovery remains highly challenging. The best configuration (Claude-4.6-Opus-Thinking in QA Mode with 500 steps) achieves only 48.39% recall, leaving over half of bugs undetected.
  2. Scaling Law: Reasoning capability proves more parameter-efficient than model scale alone. Qwen3-32B-Thinking (33.87%) outperforms the much larger Llama-3.1-70B (14.52%) and even surpasses Qwen3-235B-A22B (18.55%).
  3. Testing Mode Advantage: Quality Assurance mode consistently outperforms Player Exploring mode across all models and step budgets, but performance remains suboptimal.
  4. Primary Bottlenecks: The gap indicates limitations in (i) susceptibility to hallucinations and logical inconsistencies during complex multi-step reasoning, and (ii) deficit in systematic testing heuristics due to scarcity of QA-specific training.

Reliability of GBQA

Table 2: Inter-Annotator Agreement analysis for human annotation

| Annotation Set | Count | Krippendorff's α [95% CI] |
|---|---:|---|
| Valid Bug | 124 | 0.8920 [−0.0613, +0.0614] |
| Non-Bug | 254 | 0.9180 [−0.0462, +0.0461] |
| Overall Candidates | 378 | 0.9010 [−0.0391, +0.0389] |

Table 3: Pearson correlation of critic agent with human evaluators

| Model | Pearson ρ [95% CI] | p-value |
|---|---|---|
| Gemini-3-Pro | 0.858 [−0.0548, +0.0404] | < 0.0001 |
| Claude-4.6-Opus | 0.821 [−0.0672, +0.0502] | < 0.0001 |
| DeepSeek-R1 | 0.807 [−0.0717, +0.0538] | < 0.0001 |
| GPT-5.2 | 0.903 [−0.0273, +0.0196] | < 0.0001 |

The benchmark achieves high annotation reliability (α = 0.901) and the critic agent shows high correlation with human evaluation (GPT-5.2 achieves ρ = 0.903).

Ablation Studies

Step Budget Analysis (Fig. 3): Easy bugs are largely discovered within the first 300 steps; Medium bugs reach ~30% recall at 500 steps; Hard bugs show the strongest dependence on step budget, with no clear saturation trend.

Memory Ablation (Fig. 4): The full memory module (IS+CS) consistently dominates other settings, demonstrating complementary benefits from intra-session trajectory tracking and inter-session knowledge accumulation.

Case Study: Autonomous Detection-to-Patch Pipeline

A case study on the CASTLE environment demonstrates a fully autonomous pipeline where a QA agent discovers bugs and a coding agent (Claude Code) repairs them. Over three sessions:

  • Session 1: QA discovers BUG-2 and BUG-3 → Coding agent repairs them.
  • Session 2: QA verifies fixes and discovers BUG-1 → Coding agent repairs it.
  • Session 3: QA verifies all fixes are correct.

Result: 100% discovery and fixing rates (3/3 bugs) on CASTLE, demonstrating feasibility of automating the defect discovery stage.

Theoretical and Practical Implications

Theoretical Implications:

  1. Formalizes Autonomous Bug Discovery: GBQA provides a formal framework for evaluating LLMs on proactive defect detection in interactive environments.
  2. Highlights Capability Gaps: Reveals that bug discovery is substantially harder than code generation/fixing, with different cognitive demands.
  3. Advances Agent Benchmarking: Introduces a new evaluation dimension where the environment itself is the object of evaluation, complementing existing task-completion benchmarks.

Practical Implications:

  1. Benchmark for QA Agent Development: Provides a standardized testbed for developing and comparing QA agents.
  2. Scalable Construction Methodology: The multi-agent game builder enables controllable, scalable benchmark expansion.
  3. Towards Autonomous Software Engineering: Demonstrates a step toward fully autonomous coding systems that can discover and fix bugs without human intervention.
  4. Training Data for QA: Could generate training data for improving LLMs' systematic testing heuristics.

Conclusion

GBQA presents a scalable benchmark for evaluating autonomous bug discovery capabilities of LLMs in interactive game environments. Experimental results reveal that state-of-the-art LLMs remain substantially limited in bug discovery, particularly for long-horizon and state-dependent errors, highlighting a significant gap between current agent capabilities and real-world QA demands.

The benchmark provides standardized environments, quantitative metrics, and reliable evaluation, offering a foundation for principled design and comparison of future QA agents. This opens a new research direction at the intersection of agentic reasoning and software development.

Future Directions: Extend GBQA beyond games towards broader domains, incorporate multimodal perception and GUI interaction to better reflect real-world scenarios, and explore training methods to improve systematic testing heuristics in LLMs.