# PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

> PlanBench-XL reveals that even frontier LLMs struggle with long-horizon planning under partial tool visibility, with silent failures causing the most severe performance collapse.

- **Source:** [arXiv](https://arxiv.org/abs/2606.22388)
- **Published:** 2026-06-24
- **Permalink:** https://picx.dev/p/Kz7hcE
- **Whiteboard:** https://picx.dev/p/Kz7hcE/image

## Summary (Overview)

- **PlanBench-XL** is a new interactive benchmark for evaluating long-horizon adaptive planning by LLM tool-use agents in large-scale, retrieval-mediated tool ecosystems.
- It comprises **327 retail tasks** over **1,665 tools**, requiring agents to iteratively retrieve usable tools, infer implicit sub-goals, and adapt to dynamic environments.
- The benchmark introduces **retrieval-time blocking** with three failure types (explicit error, implicit silent failure, semantically misleading), forcing agents to detect disrupted paths and re-plan.
- Evaluation of 10 leading LLMs shows that even frontier models struggle: the best model, **Gemini-3.1-Pro, reaches 77.06% accuracy** in the default (no-block) setting, and **GPT-5.4 collapses from 51.90% to 11.36%** under severe blocking.
- Key failure modes include **trajectory drift** (agents obtain partial evidence then diverge), **silent failures** (hardest to detect), and **inability to re-plan through longer recovery paths**.

---

## Introduction and Theoretical Foundation

### Background and Motivation
- Real-world LLM agents operate in large-scale tool ecosystems (enterprise MCP servers, software APIs, web platforms) where context-length limits force **retrieval-mediated tool access** – only a relevant subset of tools is visible at each step.
- Long-horizon tasks require agents to explore intermediate sub-goals, discover tools incrementally, and adapt plans as new information emerges.
- Existing benchmarks assume fixed visible toolsets, explicit sub-goals, clean tool descriptions, or one-shot retrieval – they **fail to capture the uncertainty** introduced by partial tool visibility and unreliable tool retrieval.

### Central Research Question
> *"Can LLM agents solve long-horizon tasks in large tool ecosystems by iteratively exploring partial tool retrieval results and adapting when plausible tool-use paths fail?"*

### Key Requirements for a New Benchmark
1. **Partial tool visibility**: agents access only retrieved subsets of a large tool space and must iteratively discover tools for intermediate information.
2. **Unreliable/noisy tools**: retrieved tools may be missing, failing, or misleading, requiring runtime adaptation.

### Comparison with Prior Work (Table 1)
| Benchmark | Tool-Use | Tool Retrieval | Implicit Sub-goals | Bi-directional Exploration | Unreliable Tools | Long-Horizon | Scalable Generation |
|-----------|----------|----------------|--------------------|----------------------------|------------------|--------------|---------------------|
| ToolBench (Qin et al., 2023a) | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✓ |
| ... | ... | ... | ... | ... | ... | ... | ... |
| **PlanBench-XL (Ours)** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Among the benchmarks compared in Table 1, PlanBench-XL is the **only one** that covers all seven traits.

---

## Methodology

### Environment Setup

**Tool Library Construction**
- Define a set of typed retail datatypes $D$ (e.g., `person_name`, `refund_status`). A generation LLM $M_{gen}$ proposes the initial set, and a second LLM $M_{fil}$ filters out vague, redundant, or unrealistic types.
- Construct candidate tools by considering all pairs of input/output datatype sets $(D_{in}, D_{out})$ where $|D_{in}| = m$, $|D_{out}| = n$. $M_{gen}$ proposes each tool's functionality, and the final library $T$ is obtained after filtering.
- Augment the library with **noisy tools**: tools that are semantically similar to valid ones but whose descriptions explicitly disclose that they are unavailable or unreliable.
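
The enumeration of candidate tool signatures can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `candidate_signatures`, the datatypes `order_id` and `refund_amount`, and the choice to keep input and output sets disjoint are all assumptions made for the example. The LLM proposal and filtering steps are omitted.

```python
from itertools import combinations


def candidate_signatures(datatypes, m, n):
    """Yield (D_in, D_out) pairs with |D_in| = m and |D_out| = n.

    Assumption: inputs and outputs are kept disjoint, so a tool never
    "produces" a datatype it already requires.
    """
    types = sorted(datatypes)
    for d_in in combinations(types, m):
        remaining = [d for d in types if d not in d_in]
        for d_out in combinations(remaining, n):
            yield frozenset(d_in), frozenset(d_out)


# Toy datatype set; only person_name and refund_status come from the summary.
datatypes = {"person_name", "order_id", "refund_status", "refund_amount"}
signatures = list(candidate_signatures(datatypes, m=1, n=1))
print(len(signatures))
```

Each signature would then be handed to a generator model to propose a concrete tool, and implausible ones would be filtered out.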

**Backend Database**
- For each retail case, instantiate values for the full datatype set, producing a complete structured record. Tool execution matches input arguments against the record and returns output values.
- Ensures answers cannot be inferred from common sense – tool use is necessary.
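
A record-backed tool call might look like the sketch below. The `Tool` class, the `execute` function, the error strings, and the sample values are hypothetical; the only behavior taken from the description above is that arguments are matched against the case record before output values are returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    inputs: frozenset
    outputs: frozenset


def execute(tool, args, record):
    """Return output values only if every argument matches the case record."""
    if set(args) != set(tool.inputs):
        return {"error": "invalid arguments"}
    if any(record.get(key) != value for key, value in args.items()):
        return {"error": "no matching record"}
    return {d: record[d] for d in tool.outputs}


# Illustrative record; the values are invented for this example.
record = {
    "person_name": "Ada Example",
    "order_id": "ORD-001",
    "refund_status": "approved",
}
tool = Tool("get_refund_status", frozenset({"order_id"}), frozenset({"refund_status"}))

print(execute(tool, {"order_id": "ORD-001"}, record))
print(execute(tool, {"order_id": "ORD-999"}, record))
```

Because the output depends on instantiated record values, a guessed argument fails the match and yields no usable information.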

**Query Generation (Three-Step Pipeline)**
1. Specify task internally as $r = (D_0, Y)$ where $D_0$ is initial input datatype set, $Y$ is target datatype set.
2. Compute ground-truth tool-call sequences $\Pi(r)$ via state-graph reachability; discard unsolvable tasks.
3. Instantiate with concrete values, verbalize as natural-language query $q$, derive ground-truth answer $o^*$ following one valid sequence $\pi \in \Pi(r)$.
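
Step 2 can be illustrated with a breadth-first search over sets of obtained datatypes. This sketch finds only one shortest sequence rather than the full set $\Pi(r)$, and the tool names and the `customer_id` datatype are assumptions made for the example.

```python
from collections import deque


def shortest_tool_sequence(tools, d0, targets):
    """BFS over datatype states; returns a shortest list of tool names or None.

    tools maps a tool name to (input datatypes, output datatypes).
    A tool is applicable when all of its inputs are already in the state.
    """
    start = frozenset(d0)
    targets = frozenset(targets)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if targets <= state:
            return path
        for name, (inputs, outputs) in tools.items():
            if inputs <= state:
                nxt = state | outputs
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [name]))
    return None  # unsolvable: such a task would be discarded


tools = {
    "find_customer": (frozenset({"person_name"}), frozenset({"customer_id"})),
    "list_orders": (frozenset({"customer_id"}), frozenset({"order_id"})),
    "get_refund_status": (frozenset({"order_id"}), frozenset({"refund_status"})),
}
path = shortest_tool_sequence(tools, {"person_name"}, {"refund_status"})
print(path)
```

The length of the returned path corresponds to the shortest path length $L^*$ for the task.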

**Environment Parameters (Table 2)**
| Parameter | Value |
|-----------|-------|
| Number of datatypes | 56 |
| Number of queries | 327 |
| Number of tools | 1,665 |
| Shortest path length ($L^*$) | 5–9 |
| Maximum turns ($T_{max}$) | 100 |
| Per-retrieval return cap ($\Lambda^{cap}_{ret}$) | 30 |
| Global seed ($\sigma_0$) | 42 |

### Agent–Environment Interaction
- At each step, agent outputs one of three actions: `retrieve`, `tool-call`, or `answer`. Environment responds accordingly.
- **Agent state** $s_t = (q, U_t, D_t)$ where $q$ is query, $U_t$ is discovered callable tools, $D_t \subseteq D$ is obtained datatypes.
- **Tool Retriever** supports three modes aligned with **bi-directional anticipation**:
  - *Input-conditioned retrieval* (Forward Anticipation): "What can be reached from current evidence?"
  - *Output-conditioned retrieval* (Backward Anticipation): "What tools lead to desired outcome?"
  - *Input-output-conditioned retrieval* (Bridging): specify both available info and desired result.
- **Retrieval-Time Blocking** – an optional module that replaces path-critical tools with alternatives:
  - **Explicit Failure Blocks**: return error messages (e.g., `error: endpoint unavailable`).
  - **Implicit Failure Blocks**: return unhelpful responses that silently violate documented behavior (e.g., wrong fixed value like `refund_status = tuna`).
  - **Semantically Misleading Blocks**: return tools with related-but-different functionality (e.g., `get_order_status` instead of `get_refund_status`).
  - Each blocked instance preserves at least one feasible valid path.
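
The three retrieval modes and the blocking substitution can be sketched together. This is a deliberate simplification: it matches on exact datatype sets, whereas the benchmark's retriever is not specified here and may rank tools differently. The names `retrieve`, `apply_blocking`, `have`, and `want` are hypothetical; the cap of 30 comes from Table 2.

```python
def retrieve(tools, have=None, want=None, cap=30):
    """Return up to `cap` tool names.

    have only  -> input-conditioned (forward anticipation)
    want only  -> output-conditioned (backward anticipation)
    have + want -> input-output-conditioned (bridging)
    """
    hits = []
    for name, (inputs, outputs) in tools.items():
        forward_ok = have is None or inputs <= have
        backward_ok = want is None or bool(outputs & want)
        if forward_ok and backward_ok:
            hits.append(name)
    return hits[:cap]


def apply_blocking(hits, blocked):
    """Swap each path-critical tool for its blocked substitute, if one is set."""
    return [blocked.get(name, name) for name in hits]


tools = {
    "find_customer": (frozenset({"person_name"}), frozenset({"customer_id"})),
    "list_orders": (frozenset({"customer_id"}), frozenset({"order_id"})),
    "get_refund_status": (frozenset({"order_id"}), frozenset({"refund_status"})),
}

forward = retrieve(tools, have={"person_name"})
backward = retrieve(tools, want={"refund_status"})
bridge = retrieve(tools, have={"order_id"}, want={"refund_status"})
# Semantically misleading block: a related-but-different tool is returned instead.
misled = apply_blocking(backward, {"get_refund_status": "get_order_status"})
print(forward, backward, bridge, misled)
```

Under blocking, the agent sees only the substitute, so it has to notice that the returned tool does not serve the sub-goal and search for another route.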

### Metrics
1. **Accuracy (%)**: proportion of queries with correct final answer.
2. **Executed Ground-Truth Datatype Precision (EGT Prec. %)**: fraction of unique datatypes produced by executed tool calls that belong to ground-truth datatype set.
3. **Average Turns (Avg. Turns)**: average interaction turns per query.
4. **Mean Explored Datatypes (Mean EDT)**: average number of new datatypes uncovered beyond initial input datatypes.
5. **Search-to-Call Ratio (S/C Ratio)**: ratio of tool-retrieval turns to tool-call turns.
6. **Invalid Tool Call Rate (ITCR %)**: fraction of tool calls that are structurally or procedurally invalid.
7. **Untrusted Input Rejection Rate (UIRR %)**: fraction of tool calls rejected because an argument value comes from a noisy tool response.
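
Several of these metrics can be computed from a logged trajectory, as in the sketch below. The log schema (`action`, `valid`, `outputs`) and the per-query aggregation are assumptions made for illustration; the paper's exact definitions may differ in detail.

```python
def trajectory_metrics(turns, ground_truth_types):
    """Compute EGT precision, S/C ratio, ITCR, and turn count for one query."""
    calls = [t for t in turns if t["action"] == "tool-call"]
    retrievals = [t for t in turns if t["action"] == "retrieve"]

    # Unique datatypes produced by successfully executed tool calls.
    produced = set()
    for t in calls:
        if t["valid"]:
            produced |= set(t["outputs"])

    egt_prec = 100 * len(produced & ground_truth_types) / len(produced) if produced else 0.0
    sc_ratio = len(retrievals) / len(calls) if calls else float("inf")
    itcr = 100 * sum(not t["valid"] for t in calls) / len(calls) if calls else 0.0
    return {"egt_prec": egt_prec, "sc_ratio": sc_ratio, "itcr": itcr, "turns": len(turns)}


# Hypothetical six-turn trajectory.
turns = [
    {"action": "retrieve"},
    {"action": "tool-call", "valid": True, "outputs": {"customer_id"}},
    {"action": "retrieve"},
    {"action": "tool-call", "valid": True, "outputs": {"order_status"}},
    {"action": "tool-call", "valid": False, "outputs": set()},
    {"action": "answer"},
]
metrics = trajectory_metrics(turns, {"customer_id", "order_id", "refund_status"})
print(metrics)
```

In this toy trajectory one of the two produced datatypes lies on the ground-truth path, so EGT precision is 50%.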

---

## Empirical Validation / Results

### Main Results (Default Setting, Table 3)

| Model | Accuracy (%) ↑ | EGT Prec. (%) ↑ | Avg. Turns | Mean EDT | S/C Ratio | ITCR (%) ↓ | UIRR (%) ↓ |
|-------|----------------|----------------|------------|----------|-----------|------------|------------|
| Qwen3-8B | 0.00 | 35.31 | 25.65 | 7.64 | 0.20 | 6.11 | 0.10 |
| Qwen3-14B | 0.92 | 47.77 | 35.74 | 12.01 | 0.09 | 3.94 | 0.93 |
| Qwen3-32B | 2.75 | 62.36 | 12.03 | 18.54 | 1.59 | 10.05 | 7.43 |
| Llama-3.1-8B-Instruct | 0.00 | 41.33 | 21.62 | 9.89 | 1.49 | 18.03 | 5.25 |
| Llama-3.3-70B-Instruct | 18.96 | 59.67 | 19.13 | 19.20 | 2.22 | 21.47 | 2.13 |
| DeepSeek-V4-Flash | 63.08 | 65.57 | 31.41 | 25.34 | 2.80 | 8.27 | 3.29 |
| **Gemini-3.1-Pro** | **77.06** | **91.47** | 19.55 | **27.41** | 1.59 | **0.68** | 0.30 |
| Gemini-3.5-Flash | 52.19 | 85.29 | 57.87 | 25.16 | 10.44 | 2.94 | **0.00** |
| GPT-5.4-Mini | 3.07 | 71.25 | 10.81 | 9.22 | 1.97 | 51.71 | 4.42 |
| GPT-5.4 | 51.90 | 72.92 | 22.92 | 20.65 | 2.70 | 6.28 | 1.91 |

**Key Findings (Takeaways 1–4):**
- **Takeaway 1**: Massive-tool planning remains highly challenging: in Table 3 only **Gemini-3.1-Pro** (77.06%) and DeepSeek-V4-Flash (63.08%) exceed 60% accuracy, and both 8B models score 0%.
- **Takeaway 2**: Broad exploration (Mean EDT) strongly correlates with accuracy (Pearson $r = 0.902$), but frequent retrieval alone does not guarantee success. **Bi-directional** exploration matters: output-conditioned retrieval frequency correlates with accuracy ($r = 0.800$).
- **Takeaway 3**: Precise exploitation (EGT Precision) is also critical ($r = 0.781$). Top models combine broad exploration with high execution relevance.
- **Takeaway 4**: Basic tool reliability (low ITCR, low UIRR) is necessary – invalid calls reduce accuracy ($r = -0.443$).
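
The exploration correlation in Takeaway 2 can be checked against the Table 3 columns. The sketch below uses a hand-written Pearson coefficient over the ten reported (Mean EDT, Accuracy) pairs; it assumes the paper's correlation was computed over these same per-model values, in which case the result should land close to the reported $r = 0.902$.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Table 3, in row order (Qwen3-8B ... GPT-5.4).
accuracy = [0.00, 0.92, 2.75, 0.00, 18.96, 63.08, 77.06, 52.19, 3.07, 51.90]
mean_edt = [7.64, 12.01, 18.54, 9.89, 19.20, 25.34, 27.41, 25.16, 9.22, 20.65]

r = pearson(mean_edt, accuracy)
print(round(r, 3))
```

With only ten models, the coefficient describes a trend across this particular model set rather than a causal relationship.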

### Blocking Analysis (Takeaway 5, Figures 2–3)

**Viable-path reduction sharply weakens performance**: As the block ratio increases (from 0 to 0.8, and finally to a single remaining path), accuracy drops across all models. GPT-5.4 falls from 51.90% to ~30% when one path remains and to **11.36%** when only the longest path is preserved.

**Silent tool failures are most harmful**: Implicit failures yield the lowest accuracy among single-type perturbations. UIRR is highest under implicit failures (11.99%) vs. explicit (9.67%) and misleading (9.89%). Silent failures inject invalid values that propagate into later tool calls.

**Agents struggle with longer recovery paths**: When only the longest valid path remains, accuracy drops sharply – GPT-5.4 falls to slightly above 10%, vs. ~30% under standard blocking.

### Inference-Time Augmentation (Takeaway 6, Figure 4)
- Enforced exploration (adding continuation prompts after incorrect termination) yields **only limited gains** (<5 percentage points). Block-setting performance remains far below no-block accuracy, indicating deeper limitations in adaptive re-planning.

### Path Length Effects (Takeaway 7, Figure 5)
- Accuracy decreases as shortest valid solution length $L^*$ increases, both in default and block settings. Longer minimal horizons amplify difficulty.

### Error Analysis (Takeaways 8–11)
- **Most failures occur after partial progress**: 72.4% of GPT-5.4 failures are "Irrecoverable Drift" – the agent makes progress, then a non-progress call, and never recovers.
- **Drift is a tool-selection failure, not retrieval failure**: In 78.0% of default cases, a valid progress tool had already been retrieved before the non-progress call.
- **Recency bias**: Models over-select recently retrieved tools; older valid tools are often ignored even when they reappear.
- **Semantically misleading blocks are rarely invoked** (≤3%), but **implicit failure blocks are frequently invoked and cause value contamination** (a 42.2% "Value Reused" rate across models).
- **Model-specific termination policies**:
  - GPT-5.4: **Surrenders** (77.3% of failures) despite solvability.
  - DeepSeek-V4-Flash, Llama-3.3-70B-Instruct: **Commit wrong tool values** (58.8% and 81.7% respectively).
  - Gemini-3.5-Flash: **Search exhaustion** (90.8% of failures) – keeps searching without progress.
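
One way to operationalize the "Irrecoverable Drift" label is sketched below. It encodes one reading of the description above (some progress, then a non-progress call, with no progress afterwards) and is not the authors' annotation procedure. The function name and the boolean `progress_flags` encoding are assumptions.

```python
def is_irrecoverable_drift(progress_flags):
    """Classify a failed trajectory from per-call progress flags.

    progress_flags[i] is True if the i-th tool call advanced along a valid path.
    Drift here means: at least one progress call happened, and every call
    after the last progress call made no progress.
    """
    if True not in progress_flags:
        return False  # no partial progress at all, so not drift
    last_progress = len(progress_flags) - 1 - progress_flags[::-1].index(True)
    return last_progress < len(progress_flags) - 1


print(is_irrecoverable_drift([True, True, False, False]))
print(is_irrecoverable_drift([False, False]))
print(is_irrecoverable_drift([True, False, True]))
```

A trajectory that ends on a progress call is excluded because the agent was still advancing when it stopped, which would fall under a different failure category.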

---

## Theoretical and Practical Implications

### Theoretical Significance
- Confirms that **adaptive planning under partial observability** is a distinct challenge from tool selection or task decomposition alone. The concept of **bi-directional anticipation** (forward from known evidence, backward from desired outcome) is formalized and empirically shown to be necessary.
- Highlights a **hierarchical failure model**: (1) solution-path execution weakness → (2) amplified by corrupted tools → (3) manifested through model-specific termination biases.

### Practical Implications
- LLM agents deployed in real-world tool ecosystems (e.g., enterprise MCP servers, web APIs) will face exactly the conditions in PlanBench-XL: partial visibility, unreliable retrieval, and silent tool failures.
- **Silent failures are especially dangerous** – agents need mechanisms to detect and recover from superficially plausible but wrong tool outputs.
- **Test-time compute scaling is insufficient** – deeper architectural changes (e.g., dedicated planning modules, uncertainty-aware re-selection) are needed.
- The benchmark can serve as an **RL playground** for training agents that actively explore, detect unreliability, and re-plan.

---

## Conclusion

PlanBench-XL introduces a scalable, interactive benchmark for evaluating long-horizon adaptive planning in massive, retrieval-mediated tool ecosystems. Through 327 retail tasks over 1,665 tools with bi-directional exploration and retrieval-time blocking, the benchmark reveals that current LLM agents remain brittle:

- Even frontier models struggle with reliable tool discovery and exploitation.
- Silent failures disrupt planning more than explicit errors or misleading tools.
- Longer recovery paths amplify failure rates, and extra interaction brings only marginal gains.
- The dominant failure mode is trajectory drift: agents make partial progress, then diverge and rarely recover.

**Future directions** (Appendix F.3) include extending to multiple domains, integrating more realistic dynamic failures, and using PlanBench-XL as a training environment for robust agentic planning.

---

_Markdown view of https://picx.dev/p/Kz7hcE, served by PicX — AI-generated visual whiteboard summaries of research papers._
