Summary of "QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation"
Summary (Overview)
- QuanBench+ is a new benchmark for evaluating Large Language Models (LLMs) on generating quantum code across three major frameworks: Qiskit, PennyLane, and Cirq. It contains 42 aligned tasks covering quantum algorithms, gate decomposition, and state preparation.
- The benchmark uses executable functional tests for evaluation, reporting Pass@1 and Pass@5 scores, and employs KL-divergence-based acceptance for tasks with probabilistic outputs.
- Key findings show significant framework-dependent performance: the best one-shot (Pass@1) scores are 59.5% (Qiskit), 54.8% (Cirq), and 42.9% (PennyLane). Qiskit is consistently the easiest, PennyLane the hardest.
- Feedback-based repair substantially improves performance, raising the best scores to 83.3% (Qiskit), 76.2% (Cirq), and 66.7% (PennyLane), but does not eliminate deeper semantic reasoning errors.
- The results indicate that while progress is evident, reliable multi-framework quantum code generation remains unsolved and performance is still heavily dependent on framework-specific knowledge rather than portable quantum reasoning.
Introduction and Theoretical Foundation
Large Language Models (LLMs) excel at classical code generation, but their ability to generate correct quantum programs is less understood. Quantum programming differs fundamentally because outputs are probabilistic measurement statistics. A qubit state is represented as:
$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \qquad |\alpha|^2 + |\beta|^2 = 1$$
where $|\alpha|^2$ and $|\beta|^2$ denote the measurement probabilities of outcomes 0 and 1.
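As a small illustration of why quantum outputs are statistics rather than fixed values, the sketch below computes the two measurement probabilities from a pair of amplitudes. The function name `measurement_probabilities` is hypothetical and the normalization step is an assumption added for robustness; neither comes from the paper.

```python
import math


def measurement_probabilities(alpha: complex, beta: complex) -> tuple[float, float]:
    """Return (P(0), P(1)) for the state alpha|0> + beta|1>, normalizing first."""
    norm_sq = abs(alpha) ** 2 + abs(beta) ** 2
    if norm_sq == 0:
        raise ValueError("the zero vector is not a valid qubit state")
    return abs(alpha) ** 2 / norm_sq, abs(beta) ** 2 / norm_sq


# Equal-weight superposition with a relative phase of i.
# The phase changes the state but not the measurement probabilities.
p0, p1 = measurement_probabilities(1 / math.sqrt(2), 1j / math.sqrt(2))
print(p0, p1)
```

Both probabilities come out at roughly one half, so a grader can only compare distributions over many shots, not a single deterministic output.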
Existing quantum code benchmarks (e.g., Qiskit HumanEval, QHackBench, QuanBench) are mostly single-framework evaluations. This makes it difficult to disentangle whether model failures stem from weak quantum reasoning or weak framework familiarity (e.g., API misuse).
QuanBench+ addresses this by holding task intent constant while varying only the target framework (Qiskit, PennyLane, Cirq). This design allows the research to answer three key questions (RQs):
- RQ1: How accurately can modern LLMs generate correct quantum code across frameworks?
- RQ2: To what extent are performance gains driven by framework-specific boilerplate (prefill) vs. true task-level reasoning?
- RQ3: How much can an automated feedback loop improve one-shot performance?
Methodology
Benchmark Construction
- Source: Derived from the original QuanBench task set (Guo et al., 2025), adapted for cross-framework compatibility.
- Tasks: 42 tasks across three categories: Quantum Algorithms (31), Gate Decomposition (5), and State Preparation (6).
- Prompt Standardization: Prompts were modified for each framework to ensure correct library imports and a strict "code-only" output requirement. A small subset of tasks required edits for consistent grading (see Table 1).
- Canonical Solutions: A unified set of reference solutions was created for all frameworks to ensure fair comparison.
Evaluation Metrics
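- Pass@k: The probability that at least one of the top-$k$ generated solutions is correct:
  $$\text{Pass@}k = \mathbb{E}_{\text{tasks}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$
  where $n$ is the number of generated samples and $c$ is the number of correct samples. Reported for $k=1$ and $k=5$.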
- KL Divergence for Probabilistic Outputs: For tasks with probabilistic outputs, correctness is determined by comparing the generated distribution $P$ to the canonical distribution $Q$:
  $$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
  A small smoothing constant $\epsilon$ is applied to avoid undefined values. A solution is accepted if $D_{\mathrm{KL}}$ does not exceed a fixed threshold. This threshold was calibrated from repeated executions of canonical circuits (see Appendix C).
- Exclusion of Fidelity: The benchmark does not use process fidelity (unitary overlap) as a correctness metric. The authors argue that fidelity can yield false negatives by penalizing functionally correct circuits that are syntactically different (e.g., due to compilation/optimization), and it may not align with task-relevant error.
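The two acceptance metrics above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the benchmark's implementation: `pass_at_k` uses the standard unbiased per-task estimator, `smoothed_kl` adds the smoothing constant to every outcome and renormalizes (the paper's exact smoothing scheme and divergence direction are not given here), and the `threshold` value of 0.05 is purely hypothetical.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one task: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def smoothed_kl(p: dict[str, float], q: dict[str, float], eps: float = 1e-9) -> float:
    """KL(p || q) over the union of outcomes; eps is added, then both are renormalized."""
    outcomes = set(p) | set(q)
    p_s = {x: p.get(x, 0.0) + eps for x in outcomes}
    q_s = {x: q.get(x, 0.0) + eps for x in outcomes}
    p_total, q_total = sum(p_s.values()), sum(q_s.values())
    return sum(
        (p_s[x] / p_total) * math.log((p_s[x] / p_total) / (q_s[x] / q_total))
        for x in outcomes
    )


def accept(generated: dict[str, float], canonical: dict[str, float], threshold: float) -> bool:
    """Accept a solution when its divergence from the canonical distribution is small."""
    return smoothed_kl(generated, canonical) <= threshold


# Hypothetical measurement distributions for a Bell-state task.
bell = {"00": 0.5, "11": 0.5}
noisy = {"00": 0.52, "11": 0.48}   # shot noise around the correct answer
wrong = {"01": 1.0}                # a different state entirely

print(pass_at_k(n=5, c=2, k=1))
print(accept(noisy, bell, threshold=0.05), accept(wrong, bell, threshold=0.05))
```

Without smoothing, the `wrong` distribution would place mass on an outcome the canonical distribution never produces, and the logarithm would be undefined; the added constant turns that case into a large but finite divergence.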
Experimental Setup
- Models: A diverse set of 12 frontier and open-weight LLMs were evaluated, including GPT-5.1, Gemini-3-Pro, Claude-3.7-Sonnet, and Llama-4-Maverick (see Appendix A, Table 2).
- Execution: A controlled Python 3.10 environment with fixed library versions (Qiskit v0.46.0, Cirq v1.6.1, PennyLane v0.43.1).
- Feedback Loop (Pass@1 (FB)): Models were given up to 5 chances to repair their code after receiving feedback, which was either a runtime exception trace or a signal that the output was wrong.
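The feedback loop described above can be sketched as follows. All names here (`Verdict`, `repair_loop`, `generate`, `evaluate`) are hypothetical, the repair prompt format is invented for illustration, and the sketch assumes the 5 chances are repair rounds that follow an initial attempt; the paper may count them differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    feedback: Optional[str] = None  # exception trace or a "wrong output" signal


def repair_loop(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical model call
    evaluate: Callable[[str], Verdict],  # hypothetical functional test
    max_repairs: int = 5,
) -> tuple[bool, int]:
    """Return (passed, number of repair rounds used)."""
    code = generate(prompt)
    verdict = evaluate(code)
    repairs = 0
    while not verdict.passed and repairs < max_repairs:
        repairs += 1
        # Feed the failing code and the feedback back to the model.
        repair_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{code}\n\nFeedback:\n{verdict.feedback}"
        )
        code = generate(repair_prompt)
        verdict = evaluate(code)
    return verdict.passed, repairs


# Stub model and tester, for demonstration only.
def fake_generate(prompt: str) -> str:
    return "good" if "Feedback:" in prompt else "bad"


def fake_evaluate(code: str) -> Verdict:
    if code == "good":
        return Verdict(passed=True)
    return Verdict(passed=False, feedback="wrong output")


result = repair_loop("Prepare a Bell state.", fake_generate, fake_evaluate)
print(result)
```

With these stubs the first attempt fails and the first repair succeeds. A model that never recovers exhausts all repair rounds and is scored as a failure.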
Empirical Validation / Results
RQ1: Cross-Framework Functional Correctness
- Framework Asymmetry is Dominant: Performance is not uniform. Qiskit is consistently the easiest target, PennyLane the hardest.
- Best One-Shot Scores (Pass@1):
  - Qiskit: 59.5% (Gemini-3-Pro)
  - Cirq: 54.8% (Gemini-3-Pro)
  - PennyLane: 42.9% (GPT-5.1)
Table 3: Exact Pass@1 and Pass@1 (FB) values for all models.
| Model | Qiskit Pass@1 | Qiskit Pass@1 (FB) | Cirq Pass@1 | Cirq Pass@1 (FB) | PennyLane Pass@1 | PennyLane Pass@1 (FB) |
|---|---|---|---|---|---|---|
| Gemini-3-Pro | 59.5 | 73.8 | 54.8 | 76.2 | 40.5 | 66.7 |
| GPT-5.1 | 57.1 | 83.3 | 52.4 | 73.8 | 42.9 | 66.7 |
| DeepSeek-R1 | 50.0 | 71.4 | 45.2 | 61.9 | 33.3 | 52.4 |
| MoonshotAI-Kimi-K2-Thinking | 47.6 | 69.0 | 38.1 | 57.1 | 26.2 | 38.1 |
| Claude-3.7-Sonnet | 45.2 | 57.1 | 35.7 | 59.5 | 26.2 | 47.6 |
| GPT-4.1 | 50.0 | 57.1 | 33.3 | 57.1 | 23.8 | 45.2 |
| DeepSeek-Chat | 45.2 | 42.9 | 28.6 | 40.5 | 31.0 | 45.2 |
| Z-ai-GLM-4.7 | 42.9 | 69.0 | 38.1 | 61.9 | 23.8 | 64.3 |
| Gemini-2.5-Flash | 40.5 | 61.9 | 35.7 | 50.0 | 23.8 | 40.5 |
| Llama-4-Maverick | 38.1 | 54.8 | 28.6 | 42.9 | 19.0 | 38.1 |
| MiniMax-M2.1 | 28.6 | 57.1 | 23.8 | 47.6 | 31.0 | 47.6 |
| Qwen-2.5-7B-Instruct | 16.7 | 19.0 | 4.8 | 7.1 | 11.9 | 19.0 |

All values are percentages.
RQ2: Prefill vs. No-Prefill
- Providing boilerplate code (imports, function signatures) in the prompt (prefill) mainly reduces interface friction and surface-level errors.
- Gains from prefill are largest for smaller and mid-tier models, especially in PennyLane. Stronger models benefit less, indicating that prefill does not solve the harder semantic reasoning problems.
RQ3: Feedback-Based Repair
- Feedback loops lead to substantial performance improvements across all frameworks.
- Best Scores After Repair (Pass@1 (FB)):
  - Qiskit: 83.3% (GPT-5.1)
  - Cirq: 76.2% (Gemini-3-Pro)
  - PennyLane: 66.7% (Gemini-3-Pro & GPT-5.1)
- Error Analysis: Feedback effectively fixes runtime and interface errors (syntax, missing APIs). However, the remaining failures are dominated by deeper semantic mistakes (wrong answers, logic errors). Post-feedback, wrong answers account for 53.4% of all failures.
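To make the best-score percentages concrete, the sketch below converts them into implied task counts out of 42. It assumes each reported score is simply the fraction of the 42 tasks solved, which the rounded values are consistent with but which is not stated explicitly here. Note that the best one-shot and best post-repair scores for a framework may come from different models, so the differences are between best scores, not per-model gains.

```python
TOTAL_TASKS = 42

# Best reported scores per framework, in percent: (Pass@1, Pass@1 (FB)).
best_scores = {
    "Qiskit": (59.5, 83.3),
    "Cirq": (54.8, 76.2),
    "PennyLane": (42.9, 66.7),
}


def implied_task_count(percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks corresponding to a percentage score."""
    return round(percent / 100 * total)


task_counts = {}
for framework, (one_shot, repaired) in best_scores.items():
    before = implied_task_count(one_shot)
    after = implied_task_count(repaired)
    task_counts[framework] = (before, after)
    # Sanity check: the counts reproduce the reported percentages to one decimal.
    assert round(before / TOTAL_TASKS * 100, 1) == one_shot
    assert round(after / TOTAL_TASKS * 100, 1) == repaired
    print(f"{framework}: {before}/{TOTAL_TASKS} -> {after}/{TOTAL_TASKS} tasks")
```

Under that assumption, even the best post-repair result on Qiskit leaves 7 of the 42 tasks unsolved, and PennyLane leaves 14.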
Theoretical and Practical Implications
- Benchmarking Practice: QuanBench+ provides a crucial tool for disentangling quantum reasoning from framework proficiency, offering a more nuanced view of LLM capabilities than single-framework benchmarks.
- LLM Development for Quantum: The results indicate that achieving reliable quantum code generation requires more than scaling model size. It necessitates:
  - Stronger exposure to diverse quantum software data across frameworks.
  - Better support for compositional reasoning and repair.
  - Closer alignment with framework-specific APIs and execution patterns.
- The Role of Feedback: The effectiveness of automated repair suggests that integrating such loops into developer tools could significantly boost practical utility, even before models achieve perfect one-shot accuracy.
Conclusion
QuanBench+ establishes that modern LLMs show real progress in quantum code generation but are not yet reliably correct across multiple frameworks. Performance remains strongly asymmetric, favoring Qiskit, indicating a heavy reliance on framework-specific knowledge.
The central conclusion is that reliable multi-framework quantum code generation is still an unsolved problem. Future progress will depend on addressing the core challenge of portable quantum reasoning, not just memorizing API patterns. QuanBench+ provides a reproducible foundation for tracking this progress.