# TUA-Bench: A Benchmark for General-Purpose Terminal-Use Agents

> Even the best terminal-use agent achieves only 65.8% success on TUA-Bench, a new benchmark for CLI agents.

- **Source:** [arXiv](https://arxiv.org/abs/2606.28480)
- **Published:** 2026-07-01
- **Permalink:** https://picx.dev/p/8ISXwp
- **Whiteboard:** https://picx.dev/p/8ISXwp/image

## Summary

- **TUA-Bench** is a new benchmark for evaluating **general-purpose terminal-use agents (TUAs)** operating exclusively via command-line interfaces (CLI), covering both everyday digital workflows and expert scientific tasks.
- It contains **120 manually curated, realistic tasks** across **five families**: Office & Productivity, Web & Information, System & Software Operations, Scientific & Engineering, and Multimedia & Design.
- Everyday tasks are converted from the GUI-based OSWorld benchmark (GUI-to-CLI), and professional scientific tasks are co-designed with PhD-level domain experts in biology, medical physics, architectural engineering, and mechanical engineering.
- The strongest evaluated agent—**Claude Code with Claude Opus 4.8 (max reasoning effort)**—achieves **65.8% success rate**, revealing substantial remaining challenges in long-horizon planning, tool use, execution monitoring, and error recovery.
- TUA-Bench is open-sourced with a reproducible execution environment based on Harbor, along with deterministic setup scripts and automatic execution-based verification.

## Introduction and Theoretical Foundation

Large language models (LLMs) have evolved from conversational tools to programming assistants and, more recently, to autonomous agents capable of complex multi-step workflows. Evaluating computer-use agents has become an important problem. Most existing computer-use benchmarks assume **graphical user interfaces (GUIs)**, requiring agents to combine language reasoning with visual perception (screenshots, coordinate grounding). This introduces perception and grounding challenges that partly measure visual understanding rather than core planning and tool-use ability.

**Command-line interfaces (CLIs)** offer a text-native form of interaction: commands are explicit, feedback is textual, and complex workflows can be composed via scripts, pipes, and specialized programs. These properties align naturally with the strengths of language models. Moreover, many high-value professional workflows—software engineering, data analysis, scientific computing, system administration, and multimedia processing—are already conducted primarily through terminals.

Despite this, existing terminal benchmarks (e.g., Terminal-Bench) focus narrowly on shell-native technical and programming workflows, leaving **general-purpose terminal-based computer use unevaluated**. TUA-Bench fills this gap by combining native command-line interaction with broad task coverage spanning routine digital work and expert professional procedures.

## Methodology

### Task Execution Environment

TUA-Bench is built on top of **Harbor**, the orchestration framework also used by Terminal-Bench. Each task runs inside an isolated, resettable Linux container (Docker or Podman) to ensure reproducibility. The environment includes realistic software, files, optional internet access, and native CLI-based agent interfaces. Each task is packaged as a self-contained specification with:
- Dockerfile, input artifacts, natural-language instructions
- Environment variables, model and runtime settings
- In-environment verifier for execution-based scoring
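As a rough illustration of what such a self-contained specification might hold, the sketch below models one task as a Python dataclass. The class name, field names, example instruction, and `validate` helper are all assumptions made for illustration; they are not Harbor's or TUA-Bench's actual schema.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class TaskSpec:
    """Hypothetical container for one task; field names are illustrative only."""
    task_id: str
    instruction: str                                        # natural-language instructions for the agent
    dockerfile: Path                                        # builds the isolated Linux container
    input_artifacts: list[Path] = field(default_factory=list)
    env: dict[str, str] = field(default_factory=dict)       # environment variables
    time_limit_s: int = 2400                                # default per-task limit from the summary
    verifier_cmd: list[str] = field(default_factory=list)   # assumed to run inside the container

    def validate(self) -> list[str]:
        """Return human-readable problems; an empty list means the spec looks complete."""
        problems = []
        if not self.instruction.strip():
            problems.append("empty instruction")
        if self.time_limit_s <= 0:
            problems.append("time limit must be positive")
        if not self.verifier_cmd:
            problems.append("missing verifier command")
        return problems


# Made-up example task, not one of the benchmark's 120 tasks.
spec = TaskSpec(
    task_id="example-task",
    instruction="Convert report.docx to PDF and save it as report.pdf.",
    dockerfile=Path("Dockerfile"),
    input_artifacts=[Path("report.docx")],
    verifier_cmd=["python", "verify.py"],
)
print(spec.validate())
```

Keeping the verifier command inside the specification mirrors the in-environment, execution-based scoring described above: the same container that the agent modifies is the one that gets checked.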

### Task Curation

The benchmark comprises **120 real-world tasks**, selected from an initial pool of 394 candidates through rigorous human verification and difficulty-aware selection.

**Everyday Digital Tasks (breadth):** Sourced from the GUI-based OSWorld benchmark (369 tasks). Each task is **converted from GUI to CLI**, preserving the underlying user goal but not constraining the tools used. After conversion, human verification removes tasks with input–gold artifact mismatches. Then a **difficulty-aware selection** is applied: each candidate is evaluated with three frontier models (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro) within the Terminus-2 agent framework (5 trials each). The **100 tasks with the lowest solvability** are retained to ensure sustained difficulty.
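The difficulty-aware selection step can be sketched as a simple ranking. This is a minimal sketch that assumes solvability means the pooled fraction of successful trials across the evaluated models; the summary does not give the exact formula, and the function names and toy data are hypothetical.

```python
def solvability(trials_by_model: dict[str, list[bool]]) -> float:
    """Fraction of successful trials pooled over all models (an assumed definition)."""
    outcomes = [ok for trials in trials_by_model.values() for ok in trials]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def select_hardest(results: dict[str, dict[str, list[bool]]], keep: int) -> list[str]:
    """Return the `keep` task ids with the lowest solvability, ties broken by task id."""
    ranked = sorted(results, key=lambda task_id: (solvability(results[task_id]), task_id))
    return ranked[:keep]


# Toy data: three made-up tasks, two placeholder models, five trials each.
results = {
    "task-a": {"model-x": [True] * 5, "model-y": [True, True, False, True, True]},
    "task-b": {"model-x": [False] * 5, "model-y": [False, True, False, False, False]},
    "task-c": {"model-x": [True, False, True, False, False], "model-y": [False] * 5},
}
hardest = select_hardest(results, keep=2)
print(hardest)
```

In the benchmark itself the same idea would be applied with `keep=100` over the converted OSWorld candidates that survive human verification.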

**Professional Scientific Tasks (depth):** 20 tasks co-designed with PhD-level domain experts across four subjects:

| Subject | Scope | Task Example |
|---------|-------|--------------|
| Biology | Counting/localizing cell nuclei, image-based cytometry | Counting cell nuclei from fluorescence micrographs |
| Medical Physics | Histopathology segmentation, MRI volumetry, anatomical morphometry | Anatomical segmentation and morphometry in medical image computing |
| Architectural Engineering | Building energy performance simulation with OpenStudio/EnergyPlus | Whole-building energy performance reconstruction and simulation |
| Mechanical Engineering | CFD and heat transfer analysis with OpenFOAM | Heater placement, cold-plate optimization, conjugate heat transfer |

Tasks are designed to reflect meaningful professional workflows, require complex multi-step procedures, and support reliable evaluation via programmatic verifiers or LLM-as-a-judge.

### Evaluation Protocol

- **Agent frameworks:** Terminus-2, Codex, OpenHands, Mini-SWE-Agent, Claude Code
- **Models:** GPT-5.5, GPT-5.4 mini, Claude Opus 4.8/4.7, Claude Sonnet 4.6, Claude Haiku 4.5, Gemini 3.1 Pro, GLM-5.1, MiniMax-M3, DeepSeek-V4 Pro, Qwen3.7-Max, Kimi K2.6
- **Metrics:** Execution-grounded task success. For each agent–model–thinking configuration, 5 independent trials per task are run. Reported metrics: mean success rate across all trials, **Pass@1** (single-run), **Pass@5** (best-of-5), and **All-5** (consistent across all 5 trials).
- **Default time limit:** 2400s per task (nearly eliminates timeouts).
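To make the four metrics concrete, the sketch below computes them from per-task trial outcomes under one plausible reading: outcomes are binary, Pass@1 uses the first trial, Pass@5 requires at least one success in five, and All-5 requires five successes. These definitions are assumptions. In the reported tables the success rate exceeds Pass@1 in every row and sometimes exceeds Pass@5, which cannot happen under this binary reading, so the paper's exact definitions evidently differ.

```python
def summarize(trials: dict[str, list[bool]]) -> dict[str, float]:
    """Compute percentage metrics from {task_id: [outcome of each trial]}."""
    n_tasks = len(trials)
    n_trials = sum(len(t) for t in trials.values())
    return {
        "mean_success": 100 * sum(sum(t) for t in trials.values()) / n_trials,
        "pass_at_1": 100 * sum(t[0] for t in trials.values()) / n_tasks,   # first trial only
        "pass_at_5": 100 * sum(any(t) for t in trials.values()) / n_tasks,  # any success
        "all_5": 100 * sum(all(t) for t in trials.values()) / n_tasks,      # every trial succeeds
    }


# Toy outcomes for four made-up tasks, five trials each.
toy = {
    "t1": [True, True, True, True, True],
    "t2": [False, True, False, True, False],
    "t3": [False, False, False, False, False],
    "t4": [True, False, True, True, True],
}
metrics = summarize(toy)
print(metrics)
```

The gap between `pass_at_5` and `all_5` is what separates an agent that can solve a task from one that solves it reliably.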

## Empirical Validation / Results

### Main Results

**Table 3(a): Model sweep on Terminus-2 agent** (highest reasoning effort)

| Model | Success Rate (%) | Pass@1 | Pass@5 | All-5 |
|-------|----------------:|-------:|-------:|------:|
| GPT-5.5 | 60.1 ± 0.6 | 52.3% | 64.2% | 31.7% |
| Claude Opus 4.8 | 59.7 ± 1.0 | 53.8% | 62.5% | 42.5% |
| Claude Opus 4.7 | 58.0 ± 0.8 | 51.0% | 64.2% | 39.2% |
| Gemini 3.1 Pro | 49.3 ± 1.8 | 41.2% | 57.5% | 24.2% |
| GLM-5.1 | 48.1 ± 1.3 | 40.3% | 59.2% | 20.8% |
| MiniMax-M3 | 47.0 ± 1.3 | 41.2% | 59.2% | 22.5% |
| DeepSeek-V4 Pro | 46.2 ± 0.8 | 38.0% | 57.5% | 18.3% |
| Qwen3.7-Max | 44.9 ± 0.7 | 37.7% | 57.5% | 21.7% |
| Kimi K2.6 | 42.8 ± 1.8 | 35.3% | 55.8% | 18.3% |
| Claude Sonnet 4.6 | 42.8 ± 0.3 | 34.8% | 49.2% | 20.0% |
| GPT-5.4 mini | 27.2 ± 1.4 | 20.0% | 41.7% | 6.7% |
| Claude Haiku 4.5 † | 23.9 ± 1.5 | 15.7% | 30.8% | 3.3% |

† Thinking disabled for Claude Haiku 4.5.

Key findings:
- Frontier models (GPT-5.5, Claude Opus 4.8/4.7) form a leading group near 60%.
- Claude Opus 4.8 achieves higher **All-5 reliability** (42.5%) than GPT-5.5 (31.7%), despite similar average success.
- A clear capability hierarchy exists across Claude family tiers (Opus > Sonnet > Haiku).
- Substantial gaps remain below the frontier.

**Table 3(b): Best model per agent** (highest reasoning effort)

| Agent | Model | Success Rate (%) | Pass@1 | Pass@5 | All-5 |
|-------|-------|----------------:|-------:|-------:|------:|
| Claude Code | Claude Opus 4.8 | **65.8 ± 0.7** | 58.8% | 64.2% | 51.7% |
| Codex | GPT-5.5 | 64.7 ± 0.7 | 57.7% | 68.3% | 42.5% |
| OpenHands | Claude Opus 4.8 | 63.4 ± 0.6 | 57.3% | 67.5% | 45.0% |
| Mini-SWE-Agent | GPT-5.5 | 62.4 ± 0.8 | 54.2% | 67.5% | 40.0% |
| Terminus-2 | GPT-5.5 | 60.1 ± 0.6 | 52.3% | 64.2% | 31.7% |

- **Claude Code + Claude Opus 4.8 (max)** achieves the highest overall success rate (65.8%) and All-5 (51.7%).
- All top agent–model combinations fall within about 5.7 percentage points of each other (60.1% to 65.8%), indicating that strong frontier models yield competitive results across scaffolds.

### Ablation Studies

**Task-execution time budget:** Increasing the per-task time limit from 150s to 2400s reduces timeouts from 337 to 4 out of 600 trials and raises success rate from 33.0% to 60.1% (Terminus-2 + GPT-5.5). Gains diminish beyond 1200s.
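The timeout counts above translate into rates as follows. The arithmetic uses only the figures quoted in this paragraph; the helper name is illustrative.

```python
def timeout_rate(timeouts: int, trials: int) -> float:
    """Share of trials that hit the time limit, as a percentage rounded to one decimal."""
    return round(100 * timeouts / trials, 1)


TRIALS = 600  # 120 tasks x 5 trials

rate_150s = timeout_rate(337, TRIALS)    # at the 150 s limit
rate_2400s = timeout_rate(4, TRIALS)     # at the 2400 s limit
success_gain = round(60.1 - 33.0, 1)     # percentage points gained (Terminus-2 + GPT-5.5)

print(rate_150s, rate_2400s, success_gain)
```

So more than half of all trials timed out at the shortest budget, compared with under 1% at the default limit.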

**Thinking-effort scaling:** For GPT-5.5, success rate improves monotonically with reasoning budget, from 36.5% (none) to 60.1% (xhigh). The largest gains occur at lower effort levels (none to medium: +15 points). Beyond high, returns diminish (high to xhigh: +2.3 points, while token cost nearly doubles). Medium-to-high effort offers the best accuracy–cost trade-off.

**Cost–performance trade-off:** Across 39 configurations, cost ranges from about $12 to $304 per run and success rates from 23.9% to 65.8%. On the Pareto frontier, the low-cost end is occupied by Terminus-2 with open-weight models (about 47–48% at $12–23 per run), and the highest success comes from Claude Code + Opus 4.8 (65.8% at $173.61 per run). Returns flatten beyond about $105 per run.
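For readers unfamiliar with the term, a Pareto frontier here is the set of configurations for which no other configuration is both cheaper (or equal in cost) and more successful. The sketch below shows one standard way to extract it. Only the Claude Code + Opus 4.8 point comes from the text; every other point is a made-up placeholder, not a reported result.

```python
Point = tuple[str, float, float]  # (label, cost in USD per run, success rate in %)


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep points not dominated by a cheaper-or-equal point with higher-or-equal success."""
    frontier: list[Point] = []
    best_success = float("-inf")
    # Scan from cheapest to most expensive; at equal cost, consider the best success first.
    for label, cost, success in sorted(points, key=lambda p: (p[1], -p[2])):
        if success > best_success:
            frontier.append((label, cost, success))
            best_success = success
    return frontier


points = [
    ("placeholder-a", 12.0, 47.0),            # hypothetical
    ("placeholder-b", 23.0, 48.0),            # hypothetical
    ("placeholder-c", 60.0, 45.0),            # hypothetical, dominated by a and b
    ("placeholder-d", 105.0, 62.0),           # hypothetical
    ("Claude Code + Opus 4.8", 173.61, 65.8), # from the text
    ("placeholder-e", 304.0, 60.0),           # hypothetical, dominated
]
frontier = pareto_frontier(points)
print([label for label, _, _ in frontier])
```

A configuration off the frontier is never the right choice on these two axes alone, since some other configuration matches or beats it on both cost and success.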

**Agent-dependent model performance:** Table 4 shows that model ranking depends on the scaffold.

| Agent | Claude Opus 4.8 | GPT-5.5 | ∆ (GPT-5.5 − Opus 4.8) |
|-------|----------------:|--------:|--:|
| Mini-SWE-Agent | 57.4 | **62.4** | +5.0 |
| OpenHands | **63.4** | 61.4 | −2.0 |
| Terminus-2 | 59.7 | **60.1** | +0.4 |
| **Mean** | **60.2** | **61.3** | +1.1 |

GPT-5.5 leads with Mini-SWE-Agent, Opus 4.8 leads with OpenHands, and the two are nearly tied with Terminus-2.
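The ∆ column and the mean row of Table 4 can be reproduced directly from the per-agent scores. The short check below uses only the table's own numbers and takes ∆ as GPT-5.5 minus Claude Opus 4.8, which matches the signs shown; variable names are illustrative.

```python
# (Claude Opus 4.8, GPT-5.5) success rates per agent, copied from Table 4.
table4 = {
    "Mini-SWE-Agent": (57.4, 62.4),
    "OpenHands": (63.4, 61.4),
    "Terminus-2": (59.7, 60.1),
}

# Positive delta means GPT-5.5 is ahead on that scaffold.
deltas = {agent: round(gpt - opus, 1) for agent, (opus, gpt) in table4.items()}
mean_opus = round(sum(opus for opus, _ in table4.values()) / len(table4), 1)
mean_gpt = round(sum(gpt for _, gpt in table4.values()) / len(table4), 1)
mean_delta = round(mean_gpt - mean_opus, 1)
leader = {agent: ("GPT-5.5" if d > 0 else "Claude Opus 4.8") for agent, d in deltas.items()}

print(deltas, mean_opus, mean_gpt, mean_delta)
print(leader)
```

The sign of the delta flips between scaffolds, which is the point of the table: a ranking measured on one scaffold does not transfer to another.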

**Per-category performance:** GPT-5.5 is the most consistent model across categories. System & Software Operations is comparatively tractable, while Office & Productivity and Multimedia & Design are consistently difficult. Opus 4.8 leads by a wide margin on Web & Information. Within-category heterogeneity is strong (Figure 7 heatmap): each category contains both easy and extremely hard tasks, revealing specific capability gaps that category averages do not capture.

## Theoretical and Practical Implications

- **Benchmark design:** TUA-Bench demonstrates the value of **text-native CLI evaluation** as an alternative to GUI-based benchmarks, avoiding visual grounding challenges while preserving realistic task complexity.
- **Agent capabilities:** The evaluation reveals that even frontier agents struggle with long-horizon planning, tool composition, execution monitoring, and error recovery in terminal environments. The low All-5 scores (best 51.7%) indicate reliability remains a major challenge.
- **Cost-awareness:** The analysis of thinking-effort scaling and cost-performance trade-offs provides practical guidance for deploying terminal agents: medium-to-high reasoning effort offers the best balance, and scaffold choice significantly affects efficiency.
- **Domain coverage:** Including professional scientific tasks co-designed with experts highlights the need for agents to operate specialized domain software, expanding the scope beyond typical coding or office tasks.
- **Scaffold importance:** Relative model performance is not invariant to the agent scaffold; conclusions drawn from a single scaffold may misrepresent model capabilities.

## Conclusion

TUA-Bench is a benchmark for evaluating general-purpose terminal-use agents, containing **120 manually curated tasks** spanning everyday digital work (office, web, system and software operations, multimedia) and professional scientific workflows (biology, medical physics, architectural engineering, mechanical engineering). It uses a reproducible execution environment (Harbor), deterministic setup scripts, and automatic execution-based verification.

The evaluation of 12 models and 5 agent frameworks reveals that even the strongest configuration (Claude Code with Claude Opus 4.8 max) achieves only **65.8% success rate**, highlighting reliable terminal-based computer use as a challenging open problem. Key findings include:
- Frontier models cluster near 60%, with significant gaps to mid-tier models.
- Thinking-effort scaling reliably improves performance with diminishing returns.
- Cost–performance trade-offs vary substantially across scaffolds and models.
- Many tasks remain near-impossible for all current agents.

By open-sourcing TUA-Bench, the authors aim to enable reproducible evaluation, lower the barrier for developing new terminal-use agents, and support community-driven progress toward general-purpose computer-use systems. Future work may expand domain coverage, refresh tasks to avoid data contamination, and maintain compatibility with evolving CLI tools.
