ACES: Leave-One-Out AUC Consistency for Code Generation

Summary (Overview)

  • Core Insight: Test votes should be used to rank code candidates, not merely to count them. The value of a test lies in its ability to distinguish correct from incorrect code, not necessarily in its own correctness.
  • Key Method: The paper proposes ACES (AUC Consistency Scoring), which breaks the circular dependency between code and test quality assessment via leave-one-out evaluation. A test's quality is measured by its agreement (LOO-AUC) with the ranking induced by all other tests, requiring only the binary pass matrix.
  • Theoretical Foundation: The authors prove the LOO-AUC Identity, $\mathbb{E}[\text{LOO-AUC}_j(w)] - \frac{1}{2} = c_j(w) \cdot \delta_j$, linking a test's observable consistency to its latent discriminative power $\delta_j = \alpha_j - \beta_j$.
  • Two Variants: ACES-C provides closed-form weights that approximate oracle-optimal weights in expectation under a mild average-quality assumption. ACES-O iteratively optimizes a differentiable LOO-AUC objective without this assumption.
  • Empirical Results: Both ACES variants achieve state-of-the-art Pass@k on multiple benchmarks (HumanEval, HumanEval+, MBPP) using only the pass matrix. ACES-O leads as a standalone method, while ACES-C excels as a plug-and-play component when combined with pre-filtering (e.g., DS³).

Introduction and Theoretical Foundation

The challenge in selecting the best LLM-generated code candidate is that both the candidates and the generated tests used to judge them are unreliable. Existing methods either treat all tests equally (majority voting) or rely on heuristics without formal guarantees, failing to address the circular dependency: we need reliable tests to judge code, and reliable code to judge tests.

The key insight is to reframe the problem as a ranking task. A test's value is its discriminative power $\delta_j$, defined as the difference in pass rates between correct and incorrect codes: $\delta_j = \alpha_j - \beta_j$, where $\alpha_j = P(B_{ij}=1 \mid y_i=1)$ and $\beta_j = P(B_{ij}=1 \mid y_i=0)$. For ranking, only $\delta_j$ matters, not the test's absolute correctness.

The paper formalizes code ranking as a weighted voting problem over the binary pass matrix $B \in \{0,1\}^{n \times m}$, where $B_{ij} = \mathbb{1}[\text{code } c_i \text{ passes test } t_j]$. A weight vector $w \in \Delta_m$ induces code scores $s_i(w) = \sum_{j=1}^m w_j B_{ij}$. Performance is evaluated via Pass@k.
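Concretely, scoring candidates is a single matrix-vector product. A minimal sketch with a random, purely illustrative pass matrix (uniform weights reduce to majority voting):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 5                            # n code candidates, m generated tests
B = rng.integers(0, 2, size=(n, m))    # illustrative random binary pass matrix

w = np.full(m, 1.0 / m)                # uniform weights = majority voting
scores = B @ w                         # s_i(w) = sum_j w_j * B_ij

k = 3
top_k = np.argsort(-scores)[:k]        # candidates kept for Pass@k evaluation
```

Everything that follows only changes how the weight vector `w` is chosen; the scoring step stays this simple.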

Key Theoretical Results:

  • Pass@k Bound (Theorem 2): For weights $w$ with positive mean signal $M(w) = \sum_j w_j \delta_j$, Pass@k is lower-bounded by $\text{Pass@}k(w) \geq 1 - \frac{n_-}{k} \exp\left(-\frac{R(w)}{2}\right)$, where $R(w) = \frac{M(w)^2}{\sum_j w_j^2}$ and $n_-$ is the number of incorrect candidates. The oracle-optimal weights are $w_j^* \propto \max(0, \delta_j)$.
  • LOO-AUC Identity (Theorem 3): This is the core theoretical contribution. For any weights $w$: $\mathbb{E}[\text{LOO-AUC}_j(w)] - \frac{1}{2} = c_j(w) \cdot \delta_j$, with $c_j(w) := \frac{\pi(1-\pi)}{p_j(1-p_j)}\left(A^{(-j)}(w) - \frac{1}{2}\right)$. Here, $\text{LOO-AUC}_j(w) = \text{AUC}(S^{(-j)}, B_{:,j})$ measures the consistency of test $t_j$ with the ranking induced by all other tests, $\pi$ is the fraction of correct codes, $p_j$ is test $j$'s marginal pass rate, and $A^{(-j)}(w)$ is the true AUC of the leave-one-out ranking. This identity shows that a test's observable LOO-AUC excess over $\frac{1}{2}$ is proportional to its latent discriminative power $\delta_j$.
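The identity lends itself to a numerical sanity check. The sketch below simulates a conditional-independence model (the pass probabilities $\alpha_j$, $\beta_j$ and all problem sizes are assumed values for illustration, not taken from the paper) and compares the empirical LOO-AUC excess of one test with $c_j(w) \cdot \delta_j$:

```python
import numpy as np

rng = np.random.default_rng(1)

def auc(scores, labels):
    # Empirical AUC, counting ties between a positive and a negative as 1/2.
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

n, m, trials = 200, 20, 300
pi = 0.5                                   # fraction of correct codes (assumed)
alpha = rng.uniform(0.6, 0.9, size=m)      # pass rate on correct codes (assumed)
beta = rng.uniform(0.1, 0.4, size=m)       # pass rate on incorrect codes (assumed)
p = pi * alpha + (1 - pi) * beta           # marginal pass rate of each test

j = 0                                      # test whose identity we check
lhs, a_loo = [], []
for _ in range(trials):
    y = (rng.random(n) < pi).astype(int)
    B = (rng.random((n, m)) < np.where(y[:, None] == 1, alpha, beta)).astype(int)
    S = B[:, np.arange(m) != j].mean(axis=1)   # leave-one-out score, uniform weights
    lhs.append(auc(S, B[:, j]) - 0.5)          # observable LOO-AUC excess
    a_loo.append(auc(S, y))                    # true AUC of the LOO ranking

c_j = pi * (1 - pi) / (p[j] * (1 - p[j])) * (np.mean(a_loo) - 0.5)
rhs = c_j * (alpha[j] - beta[j])
print(round(np.mean(lhs), 3), round(rhs, 3))
```

Under this simulated model the two printed sides should match up to Monte Carlo error, without the simulation ever using the labels $y$ to compute the left-hand side.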

Methodology

The LOO-AUC identity enables the estimation of test quality without knowing code correctness. Two algorithms are developed based on this principle.

ACES-C: Closed-Form Weighting

Under a mild Assumption 4 (average discriminative power $\bar{\delta} > 2\sqrt{\ln 2 / m}$), the coefficient $c_j(w_{\text{unif}})$ is positive and approximately constant across tests. Therefore, the pass-rate-corrected LOO-AUC excess directly estimates $\delta_j$:

$$q_j := \left(\text{LOO-AUC}_j(w_{\text{unif}}) - \frac{1}{2}\right) p_j(1-p_j).$$

Theorem 6 proves that $\mathbb{E}[q_j] \propto \delta_j$, and its sign identifies informative ($\delta_j > 0$) vs. misleading ($\delta_j < 0$) tests. The ACES-C weight is:

$$w_j = \max\left(0, \text{LOO-AUC}_j(w_{\text{unif}}) - \frac{1}{2}\right) \cdot p_j(1-p_j).$$

This is a one-shot, closed-form calculation with negligible overhead.
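A minimal NumPy implementation of this closed form, assuming only the binary pass matrix as input (the function name and the toy matrix are illustrative, not the authors' code):

```python
import numpy as np

def aces_c_weights(B):
    """ACES-C: w_j = max(0, LOO-AUC_j(w_unif) - 1/2) * p_j * (1 - p_j)."""
    n, m = B.shape
    p = B.mean(axis=0)                        # marginal pass rate of each test
    row_sum = B.sum(axis=1)
    w = np.zeros(m)
    for j in range(m):
        S = (row_sum - B[:, j]) / (m - 1)     # leave-one-out uniform-weight scores
        pos, neg = S[B[:, j] == 1], S[B[:, j] == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue                          # test passed by all or none: no signal
        diff = pos[:, None] - neg[None, :]
        loo_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()  # ties get 1/2
        w[j] = max(0.0, loo_auc - 0.5) * p[j] * (1 - p[j])
    return w

# Toy check: three mutually consistent tests and one anti-correlated test.
B = np.array([[1, 1, 1, 0]] * 3 + [[0, 0, 0, 1]] * 3)
w = aces_c_weights(B)
```

On the toy matrix, the anti-correlated test receives zero weight while the three consistent tests share equal positive weight; the final ranking is then just `B @ w`.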

ACES-O: Optimized Weighting

For settings where Assumption 4 may not hold, ACES-O takes an optimization approach. It maximizes an objective that aggregates each test's contribution to the overall ranking consistency:

$$\max_{w \in \Delta_m} J(w) = \sum_{j=1}^m w_j \left(\text{LOO-AUC}_j(w) - \frac{1}{2}\right).$$

Since the exact AUC is non-differentiable, a logistic surrogate is used for optimization:

$$\widehat{\text{LOO-AUC}}_j(w) = \frac{1}{|\mathcal{P}_j|\,|\mathcal{N}_j|} \sum_{i \in \mathcal{P}_j} \sum_{k \in \mathcal{N}_j} \sigma\left(\gamma \left(S_i^{(-j)}(w) - S_k^{(-j)}(w)\right)\right),$$

where $\mathcal{P}_j$ and $\mathcal{N}_j$ are the sets of codes that pass and fail test $t_j$, $\sigma$ is the logistic function, and $\gamma$ is a sharpness parameter. Weights are parameterized via softmax and optimized by gradient ascent.
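A compact sketch of this procedure. For brevity it estimates the gradient of the surrogate by central finite differences rather than analytic differentiation; all names, step sizes, and the toy matrix are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def surrogate_objective(theta, B, gamma=10.0):
    """J(w) with the logistic LOO-AUC surrogate, where w = softmax(theta)."""
    w = np.exp(theta - theta.max())
    w /= w.sum()
    total = 0.0
    for j in range(B.shape[1]):
        S = B @ w - w[j] * B[:, j]            # leave-one-out scores S^{(-j)}(w)
        pos, neg = S[B[:, j] == 1], S[B[:, j] == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        auc_j = sigmoid(gamma * (pos[:, None] - neg[None, :])).mean()
        total += w[j] * (auc_j - 0.5)
    return total

def aces_o_weights(B, steps=200, lr=0.5, eps=1e-4):
    """Gradient ascent on theta via central finite differences (sketch only)."""
    m = B.shape[1]
    theta = np.zeros(m)                       # softmax(0) = uniform start
    for _ in range(steps):
        grad = np.zeros(m)
        for j in range(m):
            e = np.zeros(m)
            e[j] = eps
            grad[j] = (surrogate_objective(theta + e, B)
                       - surrogate_objective(theta - e, B)) / (2 * eps)
        theta += lr * grad
    w = np.exp(theta - theta.max())
    return w / w.sum()

# Toy demo: three mutually consistent tests and one anti-correlated test.
B = np.array([[1, 1, 1, 0]] * 3 + [[0, 0, 0, 1]] * 3)
w = aces_o_weights(B)                         # misleading test is driven toward 0
```

The softmax parameterization keeps $w$ on the simplex $\Delta_m$ throughout, so no projection step is needed.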

Empirical Validation / Results

Experiments are conducted on three benchmarks: HumanEval (164 problems), HumanEval+ (stricter tests), and MBPP (427 problems). Candidates and tests (~200 codes, ~500 tests per problem) are generated by GPT-3.5-Turbo.

Table 2: Pass@k (%) on HumanEval, HumanEval+, and MBPP (excerpt for execution-only methods)

| Method | HumanEval Pass@1 | HumanEval+ Pass@1 | MBPP Pass@1 |
| --- | --- | --- | --- |
| GPT-3.5-Turbo (Direct) | 68.38 | 58.75 | 66.80 |
| Majority Voting (MV) | 80.49<sub>+12.1</sub> | 69.51<sub>+10.8</sub> | 68.62<sub>+1.8</sub> |
| ACES-C | 82.93<sub>+14.6</sub> | 71.34<sub>+12.6</sub> | 71.19<sub>+4.4</sub> |
| ACES-O | 84.15<sub>+15.8</sub> | 74.39<sub>+15.6</sub> | 72.37<sub>+5.6</sub> |

Key Findings:

  1. State-of-the-art Performance: Using only the binary pass matrix, ACES-O achieves the best Pass@k among all execution-only methods on all three benchmarks. It even surpasses methods like DS³ that use additional static analysis signals on HumanEval and HumanEval+.
  2. Complementary Strengths: When combined with the static analysis method DS³, ACES-C + DS³ yields the best overall results (e.g., 85.37% Pass@1 on HumanEval), as pre-filtering makes the average-quality assumption better satisfied, a regime where ACES-C excels.
  3. Assumption Validation: Empirical analysis (Figure 3) shows Assumption 4 is well-satisfied on most tasks. Performance gains are concentrated in the "Middle" difficulty region where informative and misleading tests coexist.
  4. Test Quality Detection: Figure 4 shows that the sign of the ACES-C weight $(\text{LOO-AUC}_j - 0.5)\,p_j(1-p_j)$ correctly identifies informative ($\delta_j > 0$) vs. misleading ($\delta_j < 0$) tests with high accuracy (≥94.8% true positive rate across benchmarks).

Theoretical and Practical Implications

  • Theoretical Significance: The LOO-AUC identity provides the first provable criterion for distinguishing informative from misleading tests using only the binary pass matrix, breaking the circular dependency without external supervision.
  • Algorithmic Impact: ACES offers two lightweight, practical algorithms. ACES-C is a near-zero-overhead, plug-and-play improvement over majority voting. ACES-O provides robust optimization for more challenging settings.
  • Practical Utility: The methods require only the pass matrix, adding negligible computational cost compared to code execution. They are compatible with and can enhance existing pipelines (e.g., when combined with pre-filtering or static analysis).
  • Broader Applicability: The principle of evaluating assessor quality via internal consistency (LOO-AUC) extends to other domains with noisy evaluators, such as LLM-as-judge ensembles or crowdsourced annotation.

Conclusion

ACES introduces a principled, theoretically grounded framework for test-weighting in code generation. By leveraging leave-one-out AUC consistency, it successfully distinguishes informative from misleading tests based solely on the binary pass matrix. The two variants, ACES-C and ACES-O, offer complementary strengths: closed-form efficiency under common conditions and robust optimization for harder cases. Empirical results demonstrate state-of-the-art performance, advancing the frontier of execution-based code selection.

Future Directions include incorporating correlations among generated tests, extending the LOO-AUC principle to other noisy-evaluator settings (e.g., process reward models, LLM-as-judge), and exploring tighter bounds with more refined modeling of test dependencies.