Visual Summary | A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

Summary (Overview)

Proposes TASTE (Task Synthesis from Tool Sequence Evolution): an automatic, three-stage method for generating challenging agent benchmarks with broad tool-use coverage, reversing the traditional scenario-first approach by starting from tool sequences.
Introduces three desiderata for agent benchmarks: validity, difficulty, and coverage, operationalizing coverage through gold tool sequences (ordered tool names in the target trajectory).
Constructs τ_c-Bench: a challenging extension of the three domains of τ₂-Bench (Airline, Retail, Telecom), doubling unique tool combinations and increasing weighted edit distance by up to 124%.
Evaluates 11 agent/user LLM pairs: models that nearly saturate τ₂-Bench suffer severe drops (e.g., Gemini-3-Flash falls from 0.82–0.94 on τ₂-Bench to 0.28–0.61 on τ_c-Bench), indicating saturation rather than robust ability.
Demonstrates that automatic generation can produce valid, difficult, and diverse tasks at scale (total cost ~$1,245), enabling continuous evaluation as agents improve.

Introduction and Theoretical Foundation

Tool-using agents interact with environments through multi-step tool invocations, making evaluation depend on cumulative world-state changes. The standard paradigm uses final-state evaluation: a task is solved if the terminal world state matches a predefined target state induced by a gold tool-call sequence. Manual authoring of such tasks is costly, error-prone, and increasingly saturated as agent capabilities advance.
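As a minimal sketch of final-state evaluation, the following Python replays the gold calls and an agent's calls from the same initial state and compares the resulting states. The tool names, the dictionary-based world state, and the `solved` helper are illustrative assumptions, not the benchmark's actual implementation.

```python
from copy import deepcopy
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
ToolCall = Tuple[str, Dict[str, Any]]           # (tool name, arguments)
Tool = Callable[[State, Dict[str, Any]], None]  # mutates the state in place


def cancel_reservation(state: State, args: Dict[str, Any]) -> None:
    # Hypothetical write tool: changes the world state.
    state["reservations"][args["reservation_id"]]["status"] = "cancelled"


def get_reservation(state: State, args: Dict[str, Any]) -> None:
    # Hypothetical read tool: looks a record up and leaves the state untouched.
    _ = state["reservations"][args["reservation_id"]]


TOOLS: Dict[str, Tool] = {
    "cancel_reservation": cancel_reservation,
    "get_reservation": get_reservation,
}


def run(initial: State, calls: List[ToolCall]) -> State:
    """Replay tool calls on a copy of the initial state and return the result."""
    state = deepcopy(initial)
    for name, args in calls:
        TOOLS[name](state, args)
    return state


def solved(initial: State, gold: List[ToolCall], agent: List[ToolCall]) -> bool:
    """A task counts as solved when the agent's terminal state equals the gold one."""
    return run(initial, agent) == run(initial, gold)
```

Because only the terminal state is compared, an agent that skips the read call but performs the same write still passes under this sketch, while an agent that only reads does not.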

The paper identifies three desiderata for agent benchmarks:

Validity: tasks must be automatically verifiable (final-state check) and correct (gold state reachable from specification).
Coverage: tasks should span structurally diverse tool-use patterns, operationalized via gold tool sequences (ordered list of tool names, ignoring arguments). Coverage is measured by sequence-level diversity (e.g., weighted edit distance, type-token ratio of tool n-grams).
Difficulty: tasks must differentiate capability levels, arising from structural difficulty (sequence length, write vs. read operations) and interaction difficulty (instruction ambiguity, user behavior, distractor records).
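To make the notion of a gold tool sequence concrete, here is a small sketch that strips arguments from a trajectory and computes the two structural-difficulty signals named above (sequence length and the read/write split). The function names and the set of write tools are assumptions made for illustration.

```python
from typing import Any, Dict, List, Set, Tuple

ToolCall = Tuple[str, Dict[str, Any]]  # (tool name, arguments)


def gold_tool_sequence(calls: List[ToolCall]) -> List[str]:
    """Keep the ordered tool names and drop the arguments."""
    return [name for name, _ in calls]


def structural_features(seq: List[str], write_tools: Set[str]) -> Dict[str, int]:
    """Count total calls and split them into write and read operations."""
    writes = sum(1 for tool in seq if tool in write_tools)
    return {"length": len(seq), "writes": writes, "reads": len(seq) - writes}
```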

The key insight: the prevailing approach writes scenarios first and derives tool sequences from them, so a benchmark ends up covering only whichever tool combinations its scenarios happen to exercise. TASTE reverses this by sampling diverse tool sequences first and then synthesizing full tasks around them.

Methodology

TASTE operates in three stages, each targeting the desiderata:

Stage 1: Tool Sequence Sampling (Validity + Coverage)

An Adaptive Contrastive n-gram Model is trained over the space of tool sequences. Let $\text{ctx}_i = (t_{i-n+1}, \dots, t_{i-1})$ be the context. Two count tables $C^+$ (for plausibly valid sequences) and $C^-$ (for implausible sequences) are maintained. The conditional sampling probability is defined via a contrastive ratio:

$$S^\pm(t_i | \text{ctx}_i) = \frac{C^\pm(\text{ctx}_i, t_i) + \lambda_0}{\sum_{t' \in \mathcal{T}} C^\pm(\text{ctx}_i, t') + |\mathcal{T}| \lambda_0}$$

$$P(t_i | \text{ctx}_i) \propto \left( \frac{S^+(t_i | \text{ctx}_i)}{S^-(t_i | \text{ctx}_i)^{\lambda_{\text{neg}}}} \right)^{1/T(k)}$$

where $S^+$ and $S^-$ are the smoothed conditional estimates computed from $C^+$ and $C^-$, $\lambda_0>0$ is the Dirichlet smoothing constant, $\lambda_{\text{neg}} \geq 0$ controls the influence of negative evidence, and $T(k)$ is an exponentially decaying temperature. The model is trained via iterative sample-and-validate loops with an LLM plausibility judge. Negative evidence is crucial (e.g., penalizing n-grams that modify a reservation after cancelling it). Adaptive training raises the validity rate of sampled sequences from 6.7% (uniform sampling) to 86.7%.
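The two formulas can be sketched in Python as follows. This is an illustration under stated assumptions rather than the paper's implementation: the class name, the fixed sequence length, the shortened context at the start of a sequence, and the specific schedule in `annealed_temperature` (the source only says the temperature decays exponentially) are all hypothetical. The LLM plausibility judge is not modelled; its verdict is passed in as the `valid` flag.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Context = Tuple[str, ...]


class ContrastiveNgram:
    def __init__(self, tools: Sequence[str], n: int = 2,
                 lam0: float = 0.1, lam_neg: float = 1.0) -> None:
        self.tools = list(tools)
        self.n, self.lam0, self.lam_neg = n, lam0, lam_neg
        self.pos = defaultdict(lambda: defaultdict(float))  # C^+
        self.neg = defaultdict(lambda: defaultdict(float))  # C^-

    def _ctx(self, prefix: Sequence[str]) -> Context:
        k = self.n - 1
        return tuple(prefix[-k:]) if k > 0 else ()

    def update(self, seq: Sequence[str], valid: bool) -> None:
        """Add one judged sequence to the positive or negative count table."""
        table = self.pos if valid else self.neg
        for i, tool in enumerate(seq):
            table[self._ctx(seq[:i])][tool] += 1.0

    def _smoothed(self, table, ctx: Context, tool: str) -> float:
        """S^±(t | ctx): additively smoothed conditional estimate."""
        row = table.get(ctx, {})
        total = sum(row.values())
        return (row.get(tool, 0.0) + self.lam0) / (total + len(self.tools) * self.lam0)

    def probs(self, prefix: Sequence[str], temperature: float) -> Dict[str, float]:
        """P(t | ctx) ∝ (S^+ / (S^-)^lam_neg)^(1/T), computed in log space."""
        ctx = self._ctx(prefix)
        logits = [(math.log(self._smoothed(self.pos, ctx, t))
                   - self.lam_neg * math.log(self._smoothed(self.neg, ctx, t)))
                  / temperature for t in self.tools]
        top = max(logits)
        weights = [math.exp(x - top) for x in logits]
        z = sum(weights)
        return {t: w / z for t, w in zip(self.tools, weights)}

    def sample(self, length: int, temperature: float, rng: random.Random) -> List[str]:
        seq: List[str] = []
        for _ in range(length):
            p = self.probs(seq, temperature)
            seq.append(rng.choices(self.tools, weights=[p[t] for t in self.tools])[0])
        return seq


def annealed_temperature(k: int, t0: float = 2.0, decay: float = 0.5,
                         t_min: float = 0.5) -> float:
    """Assumed schedule: T(k) = max(t_min, t0 * decay**k)."""
    return max(t_min, t0 * decay ** k)
```

In a sample-and-validate loop, each sampled sequence would be judged and fed back through `update`, so that contexts seen in rejected sequences (such as a modification directly after a cancellation) become less likely in later rounds.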

Stage 2: Clustering and Selection (Coverage)

From a large pool ($N=2000$), $K$ representative sequences are selected via K-medoids clustering using a weighted Levenshtein distance that assigns costs based on tool semantic similarity:

Substituting functionally similar tools (e.g., search_direct_flight ↔ search_onestop_flight): cost 0.33.
Substituting same-type tools (both read or both write): cost 0.66.
Cross-type substitutions, insertions, deletions: cost 1.

This yields coherent clusters, and the medoids capture structurally distinct patterns. Invalid medoids are replaced by the next-closest cluster member; if a cluster is entirely invalid, it is removed and K-medoids is rerun with the valid medoids held fixed.
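The cost scheme above can be sketched as a standard dynamic-programming edit distance with a custom substitution cost. The function names, the way similar tools are grouped, and every tool name other than `search_direct_flight` and `search_onestop_flight` are assumptions made for this example.

```python
from typing import List, Sequence, Set


def substitution_cost(a: str, b: str, similar_groups: List[Set[str]],
                      write_tools: Set[str]) -> float:
    if a == b:
        return 0.0
    if any(a in group and b in group for group in similar_groups):
        return 0.33  # functionally similar tools
    if (a in write_tools) == (b in write_tools):
        return 0.66  # same type: both read or both write
    return 1.0       # cross-type substitution


def weighted_levenshtein(x: Sequence[str], y: Sequence[str],
                         similar_groups: List[Set[str]],
                         write_tools: Set[str]) -> float:
    """Edit distance with unit insert/delete cost and graded substitution cost."""
    prev = [float(j) for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        curr = [float(i)]
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1.0,      # delete a
                curr[j - 1] + 1.0,  # insert b
                prev[j - 1] + substitution_cost(a, b, similar_groups, write_tools),
            ))
        prev = curr
    return prev[-1]
```

A pairwise distance matrix built with `weighted_levenshtein` could then be handed to any K-medoids routine that accepts precomputed distances.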

Stage 3: Task Generation and Evolution (Validity + Difficulty)

For each medoid sequence:

Base task generation: An LLM creates a coherent scenario (user instruction $u$, database state $s_0$) that motivates the tool sequence, inventing concrete entities and producing a verbose, unambiguous instruction.
Validity checks: Rule-based checks (structural, schema conformance) and a hint-assisted verifier agent that receives a shuffled, partially masked gold sequence; success verifies solvability. Precision is near 1.0; recall ~0.8.
Task evolution: Transforms base tasks into harder variants by:
- Strategy analysis: Identifying adversarial opportunities (e.g., policy-forbidden demands, user misinformation).
- Environment perturbation: Adding decoy database records (e.g., a fully booked flight on the desired route).
- Scenario rewriting: Making user instructions ambiguous and the user less cooperative.

Evolved tasks are re-validated; if they fail, simpler variants are attempted before falling back to the base task. Evolution lowers success rates by 36–55% (Gemini-3 evolved) or 16–37% (GPT-5.2 evolved).
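Two pieces of Stage 3 lend themselves to a short sketch: building the shuffled, partially masked hint given to the verifier agent, and the re-validation fallback from evolved variants to the base task. The names `make_hint`, `select_task`, `MASK`, and the `mask_fraction` parameter are hypothetical; the source does not state how much of the gold sequence is masked, and `is_valid` stands in for the full validity check.

```python
import random
from typing import Callable, Iterable, List, Sequence, TypeVar

Task = TypeVar("Task")
MASK = "<masked>"


def make_hint(gold: Sequence[str], mask_fraction: float,
              rng: random.Random) -> List[str]:
    """Shuffle the gold tool names, then hide a fraction of them."""
    hint = list(gold)
    rng.shuffle(hint)
    n_mask = int(len(hint) * mask_fraction)
    for i in rng.sample(range(len(hint)), n_mask):
        hint[i] = MASK
    return hint


def select_task(base: Task, evolved: Iterable[Task],
                is_valid: Callable[[Task], bool]) -> Task:
    """Return the first valid evolved variant (ordered hardest first), else the base task."""
    for candidate in evolved:
        if is_valid(candidate):
            return candidate
    return base
```

Ordering the candidates from hardest to simplest means the loop keeps as much added difficulty as validation allows before giving up on evolution for that task.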

Empirical Validation / Results

Main Results (Table 1): Performance of agents on τ₂-Bench Verified (τ_BV) vs. τ_c-Bench across three domains and two user simulators.

Agent	Airline (pass^1) τ_BV → Ours (∆%)	Airline (pass^3) τ_BV → Ours (∆%)	Retail (pass^1) τ_BV → Ours (∆%)	Telecom (pass^1) τ_BV → Ours (∆%)
User: Gemini-3-flash
Gemini-3-flash	0.72 → 0.34 (-52.8%)	0.56 → 0.22 (-60.7%)	0.88 → 0.44 (-50.0%)	0.90 → 0.55 (-38.9%)
Gemini-2.5-flash	0.58 → 0.21 (-63.8%)	0.40 → 0.10 (-75.0%)	0.79 → 0.36 (-54.4%)	0.35 → 0.27 (-22.9%)
GPT-5.2	0.57 → 0.49 (-14.0%)	0.34 → 0.26 (-23.5%)	0.92 → 0.59 (-35.9%)	0.55 → 0.44 (-20.0%)
Qwen-32B	0.50 → 0.13 (-74.0%)	0.26 → 0.06 (-76.9%)	0.48 → 0.27 (-43.8%)	0.40 → 0.38 (-5.0%)
deepseek-3.1	0.49 → 0.41 (-16.3%)	0.30 → 0.16 (-46.7%)	0.47 → 0.47 (0.0%)	0.53 → 0.48 (-9.4%)
claude-sonnet-4.6	0.72 → 0.64 (-11.1%)	–	0.67 → 0.54 (-19.4%)	0.81 → 0.72 (-11.1%)
User: GPT-5.2
Gemini-3-flash	0.82 → 0.56 (-31.7%)	0.68 → 0.28 (-58.8%)	0.87 → 0.55 (-36.8%)	0.94 → 0.61 (-35.1%)
Gemini-2.5-flash	0.66 → 0.36 (-45.5%)	0.40 → 0.08 (-80.0%)	0.60 → 0.50 (-16.7%)	0.46 → 0.33 (-28.3%)
GPT-5.2	0.58 → 0.34 (-41.4%)	0.36 → 0.20 (-44.4%)	0.69 → 0.61 (-11.6%)	0.62 → 0.47 (-24.2%)
Qwen-32B	0.36 → 0.10 (-72.2%)	0.22 → 0.06 (-72.7%)	0.56 → 0.39 (-30.4%)	0.28 → 0.31 (+10.7%)
deepseek-3.1	0.48 → 0.36 (-25.0%)	0.24 → 0.18 (-25.0%)	0.67 → 0.54 (-19.4%)	0.80 → 0.61 (-23.8%)

Table 1: Performance drops on τ_c-Bench. Relative changes (∆%) are negative in all but two cells: deepseek-3.1 on Retail (Gemini-3-flash user) is unchanged, and Qwen-32B on Telecom (GPT-5.2 user) slightly improves.
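Assuming ∆% is the relative change from the τ_BV score to the τ_c-Bench score (an assumption that reproduces the tabulated values; the symbols $s_{\text{BV}}$ and $s_{\text{ours}}$ are introduced here only for illustration), the first cell of the table works out as:

$$\Delta\% = \frac{s_{\text{ours}} - s_{\text{BV}}}{s_{\text{BV}}} \times 100 = \frac{0.34 - 0.72}{0.72} \times 100 \approx -52.8\%$$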

Coverage Gains (Figure 2): τ_c-Bench shows substantial increases in tool-sequence diversity:

Weighted Edit Distance (WED) on gold sequences: +124% (Airline), +45% (Retail), +27% (Telecom).
Type-Token Ratio (TTR) averaged over n=2..6 on gold sequences: +111% (Airline), +67% (Retail), +87% (Telecom).
Tool-frequency entropy increases by 35% in Airline (from 0.68 to 0.92).
Similar gains on sequences extracted from successful agent simulations.
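The TTR and entropy metrics above can be sketched as follows. Several details are assumptions, since the summary does not specify them: n-grams are pooled across all sequences before computing the ratio, n values that yield no n-grams are skipped when averaging, and entropy is normalized by the log of the number of observed tools (the reported values of 0.68 and 0.92 suggest a [0, 1] scale, but the exact normalization is not stated).

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def ngrams(seq: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def type_token_ratio(sequences: Iterable[Sequence[str]], n: int) -> float:
    """Unique tool n-grams divided by total tool n-grams, pooled over sequences."""
    tokens = [g for s in sequences for g in ngrams(s, n)]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_ttr(sequences: List[Sequence[str]],
             n_values: Iterable[int] = range(2, 7)) -> float:
    """Average TTR over n = 2..6, skipping n values with no n-grams."""
    ratios = [type_token_ratio(sequences, n) for n in n_values
              if any(len(s) >= n for s in sequences)]
    return sum(ratios) / len(ratios) if ratios else 0.0


def normalized_entropy(sequences: Iterable[Sequence[str]]) -> float:
    """Shannon entropy of tool frequencies, scaled to [0, 1]."""
    counts = Counter(tool for s in sequences for tool in s)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    h = -sum(c / total * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))
```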

Ablation Results (Figure 3 left):

Full adaptive contrastive model achieves 86.7% validity rate vs. 6.7% for uniform sampling.
Removing adaptive training (using only seed n-grams) yields only 16.7%.
Removing contrastive negative evidence reduces validity by 10–20 points (depending on iteration).
The temperature-annealed contrastive ratio provides an additional ~10 points.

Verifier Agent Reliability: Precision 1.0 (Airline) and 0.97 (Retail); recall 0.75 and 0.83. Manual inspection of all zero-success tasks (1 Airline, 7 Retail, 7 Telecom) confirmed all were valid; failures were due to agent mistakes.

Theoretical and Practical Implications

Benchmark Saturation: The large performance gaps (up to 80% relative drop) for top models suggest that high scores on existing benchmarks often reflect saturation rather than robust task-solving ability. τ_c-Bench provides a more demanding and diverse evaluation.
Coverage as a Design Principle: Operationalizing coverage via tool sequences and using sequence-level diversity metrics (WED, TTR, entropy) provides a systematic way to measure and improve benchmark quality, moving beyond domain or tool counts.
Automatic Generation Scalability: TASTE’s total cost (~$1,245) is modest compared to manual curation, and its three-stage design ensures validity (86.7% validity rate, near-perfect verifier precision) while targeting difficulty and coverage explicitly.
Practical Use: τ_c-Bench can serve as a drop-in replacement or augmentation for τ₂-Bench, enabling continuous evaluation. The method is domain-agnostic and applicable to single-turn and non-conversational settings.

Conclusion

TASTE automatically generates agent benchmarks that are more challenging and have broader tool-use coverage than existing manually constructed benchmarks. The key technical contributions are: (1) an Adaptive Contrastive n-gram model for sampling valid tool sequences, (2) weighted edit-distance clustering to select representative sequences, and (3) a task evolution process that increases interaction difficulty while preserving validity. τ_c-Bench, built from TASTE, exposes capability gaps that τ₂-Bench misses, with relative performance drops of up to 80% and coverage metrics (WED, TTR) improving by over 100% in the Airline domain.

Future directions: apply TASTE to generate training data, extend coverage definitions to additional axes (e.g., tool argument diversity, temporal dependencies), and incorporate more sophisticated difficulty evolution patterns. The method’s reliance on tool sequences and environment specifications makes it adaptable to non-conversational and single-turn settings.