Summary of "QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation"

Summary (Overview)

QuanBench+ is a new benchmark for evaluating Large Language Models (LLMs) on generating quantum code across three major frameworks: Qiskit, PennyLane, and Cirq. It contains 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation.
The benchmark uses executable functional tests for evaluation, reporting Pass@1 and Pass@5 scores, and employs KL-divergence-based acceptance for tasks with probabilistic outputs.
Key findings show significant framework-dependent performance: the best one-shot (Pass@1) scores are 59.5% (Qiskit), 54.8% (Cirq), and 42.9% (PennyLane). Qiskit is consistently the easiest, PennyLane the hardest.
Feedback-based repair substantially improves performance, raising the best scores to 83.3% (Qiskit), 76.2% (Cirq), and 66.7% (PennyLane), but does not eliminate deeper semantic reasoning errors.
The results indicate that while progress is evident, reliable multi-framework quantum code generation remains unsolved and performance is still heavily dependent on framework-specific knowledge rather than portable quantum reasoning.

Introduction and Theoretical Foundation

Large Language Models (LLMs) excel at classical code generation, but their ability to generate correct quantum programs is less understood. Quantum programming differs fundamentally because outputs are probabilistic measurement statistics. A qubit state is represented as:

$| \psi \rangle = \alpha | 0 \rangle + \beta | 1 \rangle$

where $| \alpha |^2$ and $| \beta |^2$ are the probabilities of measuring 0 and 1, respectively, with $| \alpha |^2 + | \beta |^2 = 1$.
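To make the probabilistic nature of the output concrete, the following is a minimal sketch in plain Python (no quantum framework) that turns the amplitudes into measurement probabilities and simulated shot counts. The function names, the shot count, and the seed are illustrative assumptions, not part of the benchmark.

```python
import math
import random
from collections import Counter

def measurement_probabilities(alpha, beta):
    """Return (P(0), P(1)) for the state alpha|0> + beta|1>."""
    p0, p1 = abs(alpha) ** 2, abs(beta) ** 2
    if not math.isclose(p0 + p1, 1.0, rel_tol=1e-9):
        raise ValueError("state is not normalized")
    return p0, p1

def sample_counts(alpha, beta, shots=1000, seed=0):
    """Simulate repeated measurements; each shot yields '0' or '1'."""
    p0, _ = measurement_probabilities(alpha, beta)
    rng = random.Random(seed)
    return Counter("0" if rng.random() < p0 else "1" for _ in range(shots))

# Equal superposition with a relative phase on |1>.
# The phase changes the state but not the measurement probabilities.
alpha, beta = 1 / math.sqrt(2), 1j / math.sqrt(2)
p0, p1 = measurement_probabilities(alpha, beta)
counts = sample_counts(alpha, beta)
print(p0, p1, dict(counts))
```

Two runs with different seeds give different counts for the same correct program, which is why a grader for such tasks has to compare distributions rather than exact outputs.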

Existing quantum code benchmarks (e.g., Qiskit HumanEval, QHackBench, QuanBench) are mostly single-framework evaluations. This makes it difficult to disentangle whether model failures stem from weak quantum reasoning or weak framework familiarity (e.g., API misuse).

QuanBench+ addresses this by holding task intent constant while varying only the target framework (Qiskit, PennyLane, Cirq). This design lets the authors address three research questions (RQs):

RQ1: How accurately can modern LLMs generate correct quantum code across frameworks?
RQ2: To what extent are performance gains driven by framework-specific boilerplate (prefill) vs. true task-level reasoning?
RQ3: How much can an automated feedback loop improve one-shot performance?

Methodology

Benchmark Construction

Source: Derived from the original QuanBench task set (Guo et al., 2025), adapted for cross-framework compatibility.
Tasks: 42 tasks across three categories: Quantum Algorithms (31), Gate Decomposition (5), and State Preparation (6).
Prompt Standardization: Prompts were modified for each framework to ensure correct library imports and a strict "code-only" output requirement. A small subset of tasks required edits for consistent grading (see Table 1).
Canonical Solutions: A unified set of reference solutions was created for all frameworks to ensure fair comparison.

Evaluation Metrics

Pass@k: The probability that at least one of $k$ solutions sampled from the model's generations is correct.
$\text{Pass@}k = 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}}$
where $n$ is the number of generated samples and $c$ is the number of correct samples. Reported for $k=1$ and $k=5$.
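The estimator above can be sketched directly with `math.comb`. The per-task sample counts in the usage lines are made-up values for illustration; the summary does not state how many samples were drawn per task.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@k = 1 - C(n - c, k) / C(n, k) for one task.

    n: number of generated samples, c: number of correct samples.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical task: 10 samples, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)   # equals c / n
p5 = pass_at_k(n=10, c=3, k=5)
print(p1, p5)
```

For $k=1$ the formula reduces to $c/n$, the fraction of correct samples. A benchmark-level score is then the mean of the per-task values.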
KL Divergence for Probabilistic Outputs: For tasks with probabilistic outputs, correctness is determined by comparing the generated distribution $Q$ to the canonical distribution $P$.
$D_{\text{KL}}(P \parallel Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$
A small smoothing constant $\epsilon$ is applied to avoid undefined values when $Q(x) = 0$. A solution is accepted if $D_{\text{KL}} < 0.05$. This threshold was calibrated from repeated executions of canonical circuits (see Appendix C).
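A minimal sketch of this acceptance check follows. The 0.05 threshold comes from the text; the smoothing scheme (add $\epsilon$ to every outcome, then renormalize), the value of `eps`, and the shot counts are assumptions made for illustration.

```python
import math

KL_THRESHOLD = 0.05  # acceptance threshold stated in the text

def normalize(counts):
    """Turn raw shot counts into a probability distribution."""
    total = sum(counts.values())
    return {outcome: n / total for outcome, n in counts.items()}

def kl_divergence(p, q, eps=1e-9):
    """D_KL(P || Q) over the union of outcomes, with additive smoothing.

    Assumed scheme: add eps to every outcome of both distributions and
    renormalize, so log(P/Q) is defined even when Q(x) = 0.
    """
    support = set(p) | set(q)
    p_s = {x: p.get(x, 0.0) + eps for x in support}
    q_s = {x: q.get(x, 0.0) + eps for x in support}
    zp, zq = sum(p_s.values()), sum(q_s.values())
    return sum(
        (p_s[x] / zp) * math.log((p_s[x] / zp) / (q_s[x] / zq))
        for x in support
    )

def is_accepted(ref_counts, gen_counts):
    """Accept a generated solution if its distribution is close to the reference."""
    return kl_divergence(normalize(ref_counts), normalize(gen_counts)) < KL_THRESHOLD

# Hypothetical Bell-state counts: shot noise only vs. a missing outcome.
reference = {"00": 512, "11": 488}
noisy_but_correct = {"00": 495, "11": 505}
wrong = {"00": 1000}
print(is_accepted(reference, noisy_but_correct), is_accepted(reference, wrong))
```

The first candidate differs from the reference only by sampling noise and falls well under the threshold. The second never produces `11`, so the smoothed divergence is large and the candidate is rejected.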
Exclusion of Fidelity: The benchmark does not use process fidelity (unitary overlap) as a correctness metric.
$F(U_{\text{ref}}, U_{\text{gen}}) = \frac{1}{d^2} \left| \text{Tr}\left( U_{\text{ref}}^{\dagger} U_{\text{gen}} \right) \right|^2, \quad d = 2^{n_q}$
The authors argue that fidelity can yield false negatives by penalizing functionally correct circuits that are syntactically different (e.g., due to compilation/optimization), and it may not align with task-relevant error.
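One way fidelity can disagree with task-level correctness is sketched below. This example is constructed here and is not taken from the paper: two single-qubit circuits that both map $|0\rangle$ to an equal superposition but are different unitaries. The overlap is normalized so that identical unitaries score 1, which is an assumption about the intended normalization.

```python
import math

def dagger(m):
    """Conjugate transpose of a square matrix given as nested lists."""
    n = len(m)
    return [[m[j][i].conjugate() for j in range(n)] for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def unitary_fidelity(u_ref, u_gen):
    """|Tr(U_ref^dagger U_gen)|^2 / d^2, equal to 1 for identical unitaries."""
    d = len(u_ref)
    prod = matmul(dagger(u_ref), u_gen)
    trace = sum(prod[i][i] for i in range(d))
    return abs(trace) ** 2 / d ** 2

def output_probs(u):
    """Measurement probabilities after applying u to |0> (first column of u)."""
    return [abs(row[0]) ** 2 for row in u]

s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]            # Hadamard gate
RY_HALF_PI = [[s, -s], [s, s]]   # rotation about Y by pi/2

print(unitary_fidelity(H, H))            # identical unitaries
print(unitary_fidelity(H, RY_HALF_PI))   # different unitaries
print(output_probs(H), output_probs(RY_HALF_PI))
```

For a state-preparation task that starts from $|0\rangle$, both circuits yield the same measurement distribution, yet their unitary overlap is zero. A fidelity-based grader would reject the second circuit, whereas a distribution-based check would accept it.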

Experimental Setup

Models: A diverse set of 12 frontier and open-weight LLMs was evaluated, including GPT-5.1, Gemini-3-Pro, Claude-3.7-Sonnet, and Llama-4-Maverick (see Appendix A, Table 2).
Execution: A controlled Python 3.10 environment with fixed library versions (Qiskit v0.46.0, Cirq v1.6.1, PennyLane v0.43.1).
Feedback Loop (Pass@1 (FB)): Models were given up to 5 chances to repair their code after receiving feedback, which was either a runtime exception trace or a signal that the output was wrong.
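The feedback loop can be sketched as follows. Everything here is a hypothetical harness, not the authors' implementation: `repair_loop`, `EvalResult`, the toy task, and the scripted "model" are illustrative names, and the sketch assumes one initial attempt followed by at most 5 repair rounds.

```python
import traceback
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalResult:
    passed: bool
    feedback: Optional[str] = None  # exception trace or a wrong-output signal

def repair_loop(prompt, generate, evaluate, max_repairs=5):
    """Run one initial attempt, then up to max_repairs feedback-driven retries."""
    code = generate(prompt, None, None)
    result = evaluate(code)
    repairs = 0
    while not result.passed and repairs < max_repairs:
        code = generate(prompt, code, result.feedback)
        result = evaluate(code)
        repairs += 1
    return result.passed, repairs

def evaluate(code):
    """Toy grader: execute the candidate and compare solve() with the expected value."""
    namespace = {}
    try:
        exec(code, namespace)
        output = namespace["solve"]()
    except Exception:
        return EvalResult(False, traceback.format_exc())  # runtime feedback
    if output != 4:
        return EvalResult(False, "wrong output")          # semantic feedback
    return EvalResult(True)

# Scripted stand-in for an LLM: runtime error, then wrong answer, then correct.
attempts = iter([
    "def solve():\n    return undefined_name",
    "def solve():\n    return 5",
    "def solve():\n    return 4",
])

def toy_generate(prompt, previous_code, feedback):
    return next(attempts)

passed, repairs = repair_loop("return 2 + 2", toy_generate, evaluate)
print(passed, repairs)
```

The two feedback types carry very different amounts of information. A traceback points at the failing line, while a bare "wrong output" signal leaves the model to find the semantic error on its own.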

Empirical Validation / Results

RQ1: Cross-Framework Functional Correctness

Framework Asymmetry is Dominant: Performance is not uniform. Qiskit is consistently the easiest target, PennyLane the hardest.
Best One-Shot Scores (Pass@1):
- Qiskit: 59.5% (Gemini-3-Pro)
- Cirq: 54.8% (Gemini-3-Pro)
- PennyLane: 42.9% (GPT-5.1)
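With only 42 tasks, each percentage corresponds to a small whole number of solved tasks. The sketch below converts the best one-shot scores to task counts, assuming each score is simply the share of the 42 tasks solved. That reading is consistent with the reported values but is an assumption about how the scores were computed.

```python
TOTAL_TASKS = 42  # benchmark size stated in the text

best_one_shot = {"Qiskit": 59.5, "Cirq": 54.8, "PennyLane": 42.9}

def solved_tasks(percent, total=TOTAL_TASKS):
    """Nearest whole number of tasks matching a reported percentage."""
    return round(percent / 100 * total)

counts = {fw: solved_tasks(p) for fw, p in best_one_shot.items()}

# Consistency check: each count rounds back to the reported percentage.
for fw, n in counts.items():
    assert round(100 * n / TOTAL_TASKS, 1) == best_one_shot[fw]

print(counts)
```

One task is worth about 2.4 percentage points, so small gaps between models on a single framework can come down to one or two tasks.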

Table 3: Exact Pass@1 and Pass@1 (FB) values (%) for all models.

Model	Qiskit Pass@1	Qiskit Pass@1 (FB)	Cirq Pass@1	Cirq Pass@1 (FB)	PennyLane Pass@1	PennyLane Pass@1 (FB)
Gemini-3-Pro	59.5	73.8	54.8	76.2	40.5	66.7
GPT-5.1	57.1	83.3	52.4	73.8	42.9	66.7
DeepSeek-R1	50.0	71.4	45.2	61.9	33.3	52.4
MoonshotAI-Kimi-K2-Thinking	47.6	69.0	38.1	57.1	26.2	38.1
Claude-3.7-Sonnet	45.2	57.1	35.7	59.5	26.2	47.6
GPT-4.1	50.0	57.1	33.3	57.1	23.8	45.2
DeepSeek-Chat	45.2	42.9	28.6	40.5	31.0	45.2
Z-ai-GLM-4.7	42.9	69.0	38.1	61.9	23.8	64.3
Gemini-2.5-Flash	40.5	61.9	35.7	50.0	23.8	40.5
Llama-4-Maverick	38.1	54.8	28.6	42.9	19.0	38.1
MiniMax-M2.1	28.6	57.1	23.8	47.6	31.0	47.6
Qwen-2.5-7B-Instruct	16.7	19.0	4.8	7.1	11.9	19.0
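The table can be explored with a short script. The numbers below are copied from Table 3. The derived quantities (per-framework mean gain, largest gain, any regressions) are computed here for illustration and are not figures reported by the authors.

```python
FRAMEWORKS = ("Qiskit", "Cirq", "PennyLane")

# model: (Qiskit P@1, Qiskit FB, Cirq P@1, Cirq FB, PennyLane P@1, PennyLane FB)
TABLE_3 = {
    "Gemini-3-Pro": (59.5, 73.8, 54.8, 76.2, 40.5, 66.7),
    "GPT-5.1": (57.1, 83.3, 52.4, 73.8, 42.9, 66.7),
    "DeepSeek-R1": (50.0, 71.4, 45.2, 61.9, 33.3, 52.4),
    "MoonshotAI-Kimi-K2-Thinking": (47.6, 69.0, 38.1, 57.1, 26.2, 38.1),
    "Claude-3.7-Sonnet": (45.2, 57.1, 35.7, 59.5, 26.2, 47.6),
    "GPT-4.1": (50.0, 57.1, 33.3, 57.1, 23.8, 45.2),
    "DeepSeek-Chat": (45.2, 42.9, 28.6, 40.5, 31.0, 45.2),
    "Z-ai-GLM-4.7": (42.9, 69.0, 38.1, 61.9, 23.8, 64.3),
    "Gemini-2.5-Flash": (40.5, 61.9, 35.7, 50.0, 23.8, 40.5),
    "Llama-4-Maverick": (38.1, 54.8, 28.6, 42.9, 19.0, 38.1),
    "MiniMax-M2.1": (28.6, 57.1, 23.8, 47.6, 31.0, 47.6),
    "Qwen-2.5-7B-Instruct": (16.7, 19.0, 4.8, 7.1, 11.9, 19.0),
}

def feedback_gains(table):
    """Map (model, framework) to Pass@1 (FB) minus Pass@1, in percentage points."""
    gains = {}
    for model, row in table.items():
        for i, fw in enumerate(FRAMEWORKS):
            one_shot, repaired = row[2 * i], row[2 * i + 1]
            gains[(model, fw)] = repaired - one_shot
    return gains

gain = feedback_gains(TABLE_3)
mean_gain = {
    fw: sum(g for (_, f), g in gain.items() if f == fw) / len(TABLE_3)
    for fw in FRAMEWORKS
}
largest = max(gain, key=gain.get)
regressions = [key for key, g in gain.items() if g < 0]

print(mean_gain)
print("largest gain:", largest, round(gain[largest], 1))
print("lower after feedback:", regressions)
```

As the table is given here, exactly one entry (DeepSeek-Chat on Qiskit) is lower after feedback than in the one-shot setting; every other model and framework pair improves.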

RQ2: Prefill vs. No-Prefill

Providing boilerplate code (imports, function signatures) in the prompt (prefill) mainly reduces interface friction and surface-level errors.
Gains from prefill are largest for smaller and mid-tier models, especially in PennyLane. Stronger models benefit less, indicating that prefill does not solve the harder semantic reasoning problems.
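To illustrate the difference between the two settings, the sketch below builds a prompt with and without prefilled boilerplate. The task text, the function name `ghz_probabilities`, and the prompt wording are hypothetical and do not reproduce the benchmark's actual prompts.

```python
TASK = "Prepare a 3-qubit GHZ state and return the measurement probabilities."

# Hypothetical boilerplate: the import and the function signature are given,
# so the model only has to write the body.
PREFILL = (
    "import pennylane as qml\n"
    "\n"
    "def ghz_probabilities():\n"
)

def build_prompt(task, framework, prefill=None):
    """Assemble a code-only prompt, optionally ending with starter code."""
    lines = [
        f"Write {framework} code for the following task.",
        "Return only code, with no explanation.",
        f"Task: {task}",
    ]
    if prefill is not None:
        lines.append("Complete the following code:")
        lines.append(prefill)
    return "\n".join(lines)

no_prefill_prompt = build_prompt(TASK, "PennyLane")
prefill_prompt = build_prompt(TASK, "PennyLane", PREFILL)
print(prefill_prompt)
```

In the no-prefill variant the model must also pick the import path and the entry-point name, which is where surface-level interface errors can arise before any quantum reasoning is tested.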

RQ3: Feedback-Based Repair

Feedback loops lead to substantial performance improvements across all frameworks.
Best Scores After Repair (Pass@1 (FB)):
- Qiskit: 83.3% (GPT-5.1)
- Cirq: 76.2% (Gemini-3-Pro)
- PennyLane: 66.7% (Gemini-3-Pro & GPT-5.1)
Error Analysis: Feedback effectively fixes runtime and interface errors (syntax, missing APIs). However, the remaining failures are dominated by deeper semantic mistakes (wrong answers, logic errors). Post-feedback, wrong answers account for 53.4% of all failures.

Theoretical and Practical Implications

Benchmarking Practice: By holding tasks fixed across frameworks, QuanBench+ makes it possible to separate quantum reasoning from framework proficiency, which single-framework benchmarks cannot do.
LLM Development for Quantum: The results indicate that achieving reliable quantum code generation requires more than scaling model size. It necessitates:
1. Stronger exposure to diverse quantum software data across frameworks.
2. Better support for compositional reasoning and repair.
3. Closer alignment with framework-specific APIs and execution patterns.
The Role of Feedback: The effectiveness of automated repair suggests that integrating such loops into developer tools could significantly boost practical utility, even before models achieve perfect one-shot accuracy.

Conclusion

QuanBench+ establishes that modern LLMs show real progress in quantum code generation but are not yet reliably correct across multiple frameworks. Performance remains strongly asymmetric, favoring Qiskit, indicating a heavy reliance on framework-specific knowledge.

The central conclusion is that reliable multi-framework quantum code generation is still an unsolved problem. Future progress will depend on addressing the core challenge of portable quantum reasoning, not just memorizing API patterns. QuanBench+ provides a reproducible foundation for tracking this progress.