Summary (Overview)
- Defines Agentic Abstention as the sequential decision problem of when an LLM agent should stop acting rather than continue interacting with an environment, distinguishing it from single-turn LLM abstention.
- Constructs a large benchmark of over 28,000 instructions across web shopping (WebShop), terminal-based tasks (Terminal-Bench 2.0), and interactive QA (AbstentionBench), covering both request-based and environment-based abstention scenarios.
- Evaluates 13 LLM-as-agent systems and 2 agent scaffolds, revealing that abstention is a major challenge: most models achieve <50% average abstention recall and <40% average timely recall, with agents often abstaining too late or not at all.
- Proposes CONVOLVE, a context engineering method that distills full interaction trajectories into a reusable playbook of stopping rules; on WebShop, it raises Llama-3.3-70B’s timely abstention recall from 26.7% to 57.4% and overall recall from 83.2% to 100.0% without updating model parameters.
Introduction and Theoretical Foundation
Large language model (LLM) agents are designed to act over multiple turns, using search, browsing, and terminal tools to complete user goals. However, not every goal is well-specified or achievable in the available environment. In such cases, a reliable agent should recognize that further interaction is unlikely to help and abstain from additional tool calls.
Agentic Abstention is defined as the ability of an agent to recognize when a task is infeasible and to abstain rather than answering directly (and incorrectly) or taking unnecessary actions. Unlike standard LLM abstention—typically evaluated as a single-turn answer-or-abstain decision—agentic abstention is a sequential decision problem: at each turn, the agent can answer, abstain, or gather more information, and the need to abstain may only become clear after interacting with the environment.
This setting is formalized as a partially observable Markov decision process (POMDP):
$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, T, \Omega)$$
where:
- $s \in \mathcal{S}$ denotes the latent task state (including properties not directly observable, such as whether the task is resolvable).
- The action space is $\mathcal{A} = \{\text{ANSWER}, \text{ABSTAIN}, \text{ACT}\}$, where ANSWER denotes a terminal task-completion action, ABSTAIN a terminal decision not to proceed, and ACT a non-terminal external action (e.g., search, click).
- $\mathcal{O}$ is the observation space; at turn $t$, the agent receives observation $o_t \in \mathcal{O}$ consisting of the current context (instruction, history, environment feedback).
- $T$ is the environment transition function.
- $\Omega$ is the observation function.
The agent makes decisions based on interaction history $h_t = (x, a_1, o_1, \ldots, a_{t-1}, o_{t-1})$, where $x$ is the initial user instruction, and selects actions according to a history-dependent policy $\pi(a_t \mid h_t)$. Each instance is annotated with a binary label $y \in \{0, 1\}$, where $y = 1$ denotes an abstain-warranted instance (cannot be reliably solved under the given interaction setting).
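The sequential structure described above can be illustrated with a small rollout loop. This is a minimal sketch, not the paper's implementation: the names `ActionType`, `History`, `run_episode`, `toy_policy`, and `empty_env` are hypothetical, and the toy policy and environment exist only to show how a terminal ABSTAIN ends an episode.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional, Tuple


class ActionType(Enum):
    ANSWER = "answer"    # terminal: commit to a task completion
    ABSTAIN = "abstain"  # terminal: decide not to proceed
    ACT = "act"          # non-terminal: external action such as search or click


@dataclass
class Action:
    kind: ActionType
    payload: str = ""


@dataclass
class History:
    instruction: str
    # (action, observation) pairs collected so far
    steps: List[Tuple[Action, str]] = field(default_factory=list)


Policy = Callable[[History], Action]
Environment = Callable[[Action], str]


def run_episode(instruction: str, policy: Policy, env: Environment,
                max_turns: int = 10) -> Tuple[Optional[ActionType], int, History]:
    """Roll out a history-dependent policy until a terminal action or the turn budget."""
    history = History(instruction)
    for turn in range(1, max_turns + 1):
        action = policy(history)
        if action.kind is not ActionType.ACT:
            return action.kind, turn, history  # ANSWER or ABSTAIN ends the episode
        observation = env(action)
        history.steps.append((action, observation))
    return None, max_turns, history  # budget exhausted without a terminal action


def toy_policy(history: History) -> Action:
    # Hypothetical rule: abstain once two searches have come back empty.
    empty = sum(1 for _, obs in history.steps if obs == "no results")
    if empty >= 2:
        return Action(ActionType.ABSTAIN)
    return Action(ActionType.ACT, payload=f"search[{history.instruction}]")


def empty_env(action: Action) -> str:
    # Hypothetical environment in which the target item is missing.
    return "no results"


outcome, turn, _ = run_episode("buy a left-handed teapot", toy_policy, empty_env)
print(outcome, turn)
```

In this toy run the need to abstain only becomes visible after interaction, which mirrors the environment-based case: the policy acts twice and abstains on its third decision.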
Methodology
Datasets. Three scenarios are constructed:
- Web-based decision-making (WebShop): 1,000 instances with a balanced 1:1 split of solvable and unsolvable tasks. Unsolvable instances include Request-based Abstention (249 tasks: Subjective Preference, Underspecified Intent, False Premise or Contradiction) where the instruction itself makes the task infeasible, and Environment-based Abstention (251 Missing Target tasks) where the environment is modified to remove target items, making infeasibility apparent only after interaction.
- Terminal-based task execution (Terminal-Bench 2.0): 277 instances (89 solvable, 188 unsolvable). Unsolvable instances include Request-based Abstention (87 False Premise or Contradiction, 80 Underspecified Intent) and Environment-based Abstention (21 Missing Prerequisite tasks where files, dependencies, or permissions are removed).
- Interactive QA (AbstentionBench): 27,073 samples from 16 datasets, including answerable and should-abstain questions across five categories (Answer Unknown, False Premise, Subjective, Underspecified Context, Underspecified Intent). Each instance becomes a sequential decision problem with up to 10 search calls over a Wikipedia dump.
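As a quick consistency check on the counts above, the following sketch tallies the reported splits. The dictionary layout and variable names are illustrative assumptions; the WebShop solvable count of 500 follows from the stated 1:1 split of 1,000 instances, and AbstentionBench is included only as a total because no per-category counts are given.

```python
# Counts as reported in the text; AbstentionBench is given only as a total.
splits = {
    "WebShop": {"solvable": 500, "request_based": 249, "environment_based": 251},
    "Terminal-Bench 2.0": {"solvable": 89, "request_based": 87 + 80, "environment_based": 21},
}
abstention_bench_total = 27_073

# Per-scenario totals and the share of abstain-warranted (unsolvable) instances.
totals = {name: sum(parts.values()) for name, parts in splits.items()}
unsolvable_share = {
    name: (parts["request_based"] + parts["environment_based"]) / totals[name]
    for name, parts in splits.items()
}
grand_total = sum(totals.values()) + abstention_bench_total

for name in splits:
    print(f"{name}: {totals[name]} instances, {unsolvable_share[name]:.1%} abstain-warranted")
print(f"All scenarios: {grand_total} instances")
```

The per-scenario totals match the stated 1,000 and 277 instances, and the grand total is consistent with the "over 28,000 instructions" figure in the summary.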
Models and Scaffolds. The study evaluates 13 LLM-as-agent systems:
- Proprietary: GPT-5.4-mini (OpenAI), Grok 4.1 Fast (xAI).
- Open-weight: Llama 3.3 8B/70B Instruct, GPT-OSS 120B, MiniMax M2.5, Qwen 3 8B/14B/32B/235B-A22B (Instruct and Thinking), Gemma 4 31B it, GLM 5.1.
Two agent scaffolds are compared: Terminus 2 (simple scaffold, neutral testbed) and Codex CLI. Different scenarios use different model-scaffold combinations due to cost and practicality.
Evaluation Metrics.
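- AbsRec@K (Abstention Recall at step K): Proportion of abstain-warranted instances where the agent abstains within $K$ steps after abstention becomes warranted. Let $t^*$ be the earliest step at which abstention is warranted. Timely Recall is AbsRec@$t^*$: $t^* = 1$ for request-based abstention, where the instruction alone makes the task infeasible, and a later step for environment-based abstention, where infeasibility surfaces only after interaction. Overall Recall is AbsRec@10 (max turns).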
- SPL (Success weighted by normalized inverse path length): Measures whether the agent abstains successfully while penalizing delayed decisions.
- Over-abstention rate: Proportion of solvable instances where the agent incorrectly abstains.
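The metrics above can be sketched in a few lines. This is an illustrative reading rather than the paper's evaluation code: the `Episode` record and its fields are hypothetical, K is assumed to count steps from the start of the episode, and the SPL formula is assumed to follow the standard form used in embodied navigation (success weighted by the ratio of the earliest warranted step to the actual abstention step), since the text gives no exact formula.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    abstain_warranted: bool      # ground-truth label for the instance
    abstain_step: Optional[int]  # step at which the agent abstained, None if it never did
    t_star: int = 1              # earliest step at which abstention is warranted


def _warranted(episodes: List[Episode]) -> List[Episode]:
    return [e for e in episodes if e.abstain_warranted]


def abs_rec_at_k(episodes: List[Episode], k: int) -> float:
    """Share of abstain-warranted episodes in which the agent abstained by step k."""
    pool = _warranted(episodes)
    if not pool:
        return 0.0
    hits = sum(1 for e in pool if e.abstain_step is not None and e.abstain_step <= k)
    return hits / len(pool)


def timely_recall(episodes: List[Episode]) -> float:
    """Share of abstain-warranted episodes in which the agent abstained by t_star."""
    pool = _warranted(episodes)
    if not pool:
        return 0.0
    hits = sum(1 for e in pool if e.abstain_step is not None and e.abstain_step <= e.t_star)
    return hits / len(pool)


def spl(episodes: List[Episode]) -> float:
    """Assumed SPL form: success times t_star / max(abstain_step, t_star), averaged."""
    pool = _warranted(episodes)
    if not pool:
        return 0.0
    total = 0.0
    for e in pool:
        if e.abstain_step is not None:
            total += e.t_star / max(e.abstain_step, e.t_star)
    return total / len(pool)


def over_abstention_rate(episodes: List[Episode]) -> float:
    """Share of solvable episodes in which the agent abstained anyway."""
    pool = [e for e in episodes if not e.abstain_warranted]
    if not pool:
        return 0.0
    return sum(1 for e in pool if e.abstain_step is not None) / len(pool)


episodes = [
    Episode(True, 1, t_star=1),     # timely abstention
    Episode(True, 4, t_star=1),     # late abstention
    Episode(True, None, t_star=2),  # never abstains
    Episode(True, 2, t_star=2),     # timely for an environment-based case
    Episode(False, None),           # solvable, correctly not abstained
    Episode(False, 6),              # solvable, over-abstained
]
print(abs_rec_at_k(episodes, 1), abs_rec_at_k(episodes, 10))
print(timely_recall(episodes), spl(episodes), over_abstention_rate(episodes))
```

Under these assumptions a late abstention still counts toward overall recall but contributes only a fraction to SPL, which is how the metric penalizes delayed decisions.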
CONVOLVE (Context Evolution). A context engineering method that learns abstention from multi-step trajectories without updating model parameters. For each training episode, the agent produces a trajectory. A reflection agent analyzes the trajectory to produce episode-level feedback, and a curator converts this feedback into concise playbook updates. The evolving playbook is appended to the agent's system prompt. CONVOLVE is instantiated on WebShop using 20 training examples from the abstention subset, with a playbook budget of 80,000 tokens and a held-out evaluation set of 101 examples.
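The reflect-then-curate loop can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: `evolve_playbook`, `build_system_prompt`, `toy_reflect`, and `toy_curate` are hypothetical names, the reflection agent and curator would be LLM calls in practice, and tokens are approximated by whitespace splitting rather than a real tokenizer.

```python
from typing import Callable, List

TOKEN_BUDGET = 80_000  # playbook budget stated in the text

Trajectory = List[str]


def count_tokens(text: str) -> int:
    # Crude whitespace proxy (assumption); a real tokenizer would be used in practice.
    return len(text.split())


def evolve_playbook(
    trajectories: List[Trajectory],
    reflect: Callable[[Trajectory, List[str]], str],
    curate: Callable[[str, List[str]], List[str]],
    budget: int = TOKEN_BUDGET,
) -> List[str]:
    """Turn episode-level feedback into playbook rules, within a token budget."""
    playbook: List[str] = []
    for trajectory in trajectories:
        feedback = reflect(trajectory, playbook)
        for rule in curate(feedback, playbook):
            if rule in playbook:
                continue  # skip rules the playbook already contains
            used = sum(count_tokens(r) for r in playbook)
            if used + count_tokens(rule) <= budget:
                playbook.append(rule)
    return playbook


def build_system_prompt(base_prompt: str, playbook: List[str]) -> str:
    """Append the playbook to the system prompt; model parameters are untouched."""
    if not playbook:
        return base_prompt
    rules = "\n".join(f"- {rule}" for rule in playbook)
    return f"{base_prompt}\n\nStopping rules learned from past episodes:\n{rules}"


def toy_reflect(trajectory: Trajectory, playbook: List[str]) -> str:
    # Stand-in for the reflection agent.
    if trajectory.count("no results") >= 2:
        return "kept searching after repeated empty results"
    return ""


def toy_curate(feedback: str, playbook: List[str]) -> List[str]:
    # Stand-in for the curator.
    if "empty results" in feedback:
        return ["If two searches return no results, abstain."]
    return []


trajectories = [
    ["search", "no results", "search", "no results", "search"],
    ["search", "item page", "buy"],
    ["search", "no results", "search", "no results"],
]
playbook = evolve_playbook(trajectories, toy_reflect, toy_curate)
prompt = build_system_prompt("You are a shopping agent.", playbook)
print(prompt)
```

Because the learned rules live only in the prompt, a playbook produced with one model can be handed to another, which is the setup behind the cross-model rows in the results table.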
Empirical Validation / Results
Abstention is an open challenge. Figure 3 shows that across all three scenarios, AbsRec@1 is very low (most models fall in the 0.0–0.3 range), meaning agents rarely abstain at the earliest possible moment. Recall improves only as the interaction budget increases, and even at 10 turns many models remain below 0.5.
- Web scenario: Llama-3.3-70B performs best (≈0.84 AbsRec@10), while 6 of 8 models achieve <0.5 AbsRec@10.
- Terminal scenario: Codex CLI (0.38 AbsRec@10) significantly outperforms Terminus 2 (0.18), showing scaffold dependency.
- QA scenario: Qwen3-235B achieves best results (0.59 AbsRec@1, 0.71 AbsRec@10), but many models show little gain from search.
Abstention difficulty varies by category (Figure 4):
- Web: Missing Target (environment-based) is hardest; False Premise is easiest.
- Terminal: Underspecified Intent is especially difficult for both scaffolds.
- QA: False Premise and Underspecified Intent are hardest; Answer Unknown and Subjective are easier.
Factors affecting abstention:
- Reasoning (Figure 5): More reasoning improves timely recall but lowers overall recall. For web (Qwen-3-235B-thinking vs. instruct), timely recall rises from 7.2 to 12.6, but overall recall drops from 62.2 to 45.2. A similar tradeoff appears in the terminal scenario.
- Over-abstention on solvable tasks increases with longer interaction (up to 34% for Qwen3-235B-Instruct by turn 10 on WebShop). Medium/high reasoning reduces this effect in terminal tasks (0–2%).
- Model scaling (Figure 7): Larger models improve overall recall but not timely recall. Timely recall remains nearly flat across Qwen-3 8B–235B.
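CONVOLVE results (Table 1; AbsRec values are percentages, and in the CONVOLVE rows "X + Y" denotes agent X using lessons learned by model Y):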
| Method | AbsRec@1 | AbsRec@10 | SPL |
|---|---|---|---|
| Base Model | |||
| Llama-3.3-8B | 6.9 | 92.1 | 39.4 |
| Llama-3.3-70B | 26.7 | 83.2 | 55.3 |
| Base Model + ICL | |||
| Llama-3.3-8B | 11.2 (+4.3) | 81.1 (-10.0) | 39.1 (-0.3) |
| Llama-3.3-70B | 55.1 (+28.4) | 97.0 (+13.8) | 77.2 (+21.9) |
| Base Model + CONVOLVE | |||
| Llama-3.3-8B + 8B | 7.9 (+1.0) | 94.1 (+2.0) | 39.5 (+0.1) |
| Llama-3.3-8B + 70B | 12.9 (+6.0) | 94.1 (+10.9) | 39.7 (+0.3) |
| Llama-3.3-70B + 8B | 55.3 (+28.6) | 99.0 (+15.8) | 76.4 (+21.1) |
| Llama-3.3-70B + 70B | 57.4 (+30.7) | 100.0 (+16.8) | 78.9 (+23.6) |
Key findings:
- CONVOLVE using only 20 training trajectories substantially improves timely abstention for Llama-3.3-70B (AbsRec@1: 26.7 → 57.4; AbsRec@10: 83.2 → 100.0; SPL: 55.3 → 78.9).
- Lessons learned by a smaller model (8B) transfer effectively to a larger model (70B), achieving gains close to those from the larger model's own lessons (AbsRec@1: 55.3 vs. 57.4).
Theoretical and Practical Implications
This work demonstrates that reliable agents require not only stronger task-completion abilities, but also better judgment about when continued action is no longer useful. The finding that many agents fail to abstain—or abstain only after wasteful interaction—highlights a critical gap in current agent evaluation and design.
Key implications:
- Timely abstention is a distinct capability that does not automatically improve with model scale: timely recall stays nearly flat across Qwen-3 8B–235B, and additional reasoning raises timely recall only at the cost of lower overall recall.
- Agent scaffolds matter significantly: in terminal settings, Codex CLI roughly doubles the AbsRec@10 of Terminus 2 (0.38 vs. 0.18), suggesting that the structure of agent-environment interaction can either facilitate or hinder abstention.
- CONVOLVE offers a practical method for improving abstention that requires no parameter updates, distilling experience into context instead. Its data efficiency (20 trajectories) and cross-model transferability (small-to-large) make it a lightweight approach for deploying more reliable agents.
- Over-abstention on solvable tasks is a practical risk, especially in web environments; methods like CONVOLVE that incorporate balanced training data can help mitigate false abstentions.
Conclusion
The paper introduces Agentic Abstention as a critical problem for LLM agents: deciding when to stop acting rather than continue interacting with an environment. Through evaluations across web shopping, terminal interaction, and interactive QA, the authors find that agents systematically struggle with abstention, often acting unnecessarily or failing to abstain at all. The timing of abstention is especially challenging—agents typically abstain too late, wasting tool calls.
The proposed method, CONVOLVE, demonstrates that context engineering—distilling interaction experience into reusable stopping rules—can substantially improve timely abstention without parameter updates. Key results show that smaller models can generate lessons that transfer to larger models, and that even a small amount of training data (20 examples) yields large gains.
Future work should explore better abstention mechanisms integrated into agent training, more comprehensive benchmarks covering diverse environments and abstention scenarios, and methods that balance timely abstention with avoiding over-abstention on solvable tasks. The findings underscore that building truly reliable agents requires not only stronger abilities to act, but also the wisdom to know when not to act.