Summary: Terminal Agents Suffice for Enterprise Automation

Overview

  • Minimalism over Complexity: A simple coding agent equipped only with a terminal and filesystem, interacting directly with platform APIs, can effectively solve many enterprise automation tasks.
  • Performance and Efficiency: These terminal-based agents match or outperform more complex agent architectures (GUI-driven web agents and MCP-based tool-augmented agents) while achieving significantly lower operational costs (often by a factor of 5 or more).
  • Direct API Interaction: The success stems from the flexibility of direct programmatic interaction, which avoids the brittleness of GUI navigation and the expressivity constraints of predefined tool schemas.
  • Parametric Knowledge vs. External Docs: Terminal agents often perform effectively using their internal parametric knowledge; external platform documentation can be helpful or harmful depending on its structure.
  • Persistent Skills Improve Efficiency: Allowing agents to create and reuse "skills" (persistent procedures and notes) leads to improved success rates and reduced costs, especially on platforms with less common APIs.

Introduction and Theoretical Foundation

Large Language Models (LLMs) have evolved from code assistants to agents capable of executing multi-step tasks across software systems. In enterprise settings, this shift is significant as agents are expected to perceive system state, reason about business context, and perform actions that modify operational data under real-world constraints.

Two prominent architectural directions have emerged to address these challenges:

  1. GUI-driven web agents that operate through browser interfaces (e.g., using DOM elements and screenshots).
  2. Tool-augmented agents that expose curated action schemas through frameworks like Model Context Protocol (MCP).

Both approaches introduce structured abstractions between the model and the underlying platform. However, these abstractions come with tradeoffs: GUI agents face brittle, long action chains, while curated tool registries restrict expressivity to predefined operations.

Modern enterprise platforms already expose expressive APIs for programmatic interaction. Recent generalist code agents (e.g., Claude Code, OpenClaw) demonstrate strong performance on complex tasks without heavy abstraction layers by operating directly over programmable interfaces. This work hypothesizes that additional abstraction layers may be unnecessary when stable APIs are available. It empirically tests whether minimal terminal-based coding agents, interacting directly with APIs, are sufficient for practical enterprise automation.

Methodology

The study compares three agent interaction paradigms, all using the same LLM backbone, to isolate the effect of the interaction modality:

  1. Tool-augmented (MCP) Agents: Operate through a curated set of API tools exposed via MCP servers.
  2. Web Agents: Operate through graphical interfaces using a Playwright MCP server, issuing low-level browser actions.
  3. Terminal Agents: The primary focus. A simple coding agent (StarShell) that operates through a terminal and filesystem. Instead of invoking predefined tools, it writes and executes code (primarily curl commands) to interact directly with platform APIs.

StarShell: A Terminal-Based Enterprise Agent

StarShell is a minimal coding-agent environment with two primary interfaces:

  • A terminal for executing commands.
  • A filesystem for storing artifacts (documentation, cached results, persistent "skills").

The agent operates in a REPL-style loop: it receives a task, generates commands/code to run, observes the outputs, and iteratively reasons and corrects. It discovers platform capabilities dynamically rather than relying on predefined action schemas.
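The loop described above can be sketched roughly as follows; `model` stands in for the LLM backbone, and the `DONE:` termination convention is an assumption for illustration rather than StarShell's actual protocol:

```python
import subprocess

def terminal_agent_loop(model, task: str, max_steps: int = 10) -> str:
    """Minimal REPL-style loop: ask the model for a shell command, run it,
    feed the observed output back into the transcript, and repeat until
    the model signals completion."""
    transcript = [f"TASK: {task}"]
    for _ in range(max_steps):
        action = model("\n".join(transcript))  # model: any callable prompt -> str
        if action.startswith("DONE:"):
            return action[len("DONE:"):].strip()
        result = subprocess.run(action, shell=True, capture_output=True, text=True)
        transcript.append(f"$ {action}\n{result.stdout}{result.stderr}")
    return "gave up"
```

The key property is that nothing platform-specific is hard-coded: capability discovery, error recovery, and composition all happen through generated commands and observed outputs.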

Benchmark Environments

Agents are evaluated across three production-grade enterprise platforms, representing common software categories:

| Platform | Category | Samples | MCP Tools | Doc. Pages |
| :--- | :--- | ---: | ---: | ---: |
| ServiceNow | IT Service Management | 330 | 93 | 61k |
| GitLab | Software Development Lifecycle | 192 | 107 | 2.65k |
| ERPNext | Enterprise Resource Planning | 207 | 7 | 5.41k |

Table 1: Summary statistics of the evaluation benchmark across three enterprise platforms.

Each benchmark consists of natural-language tasks requiring agents to inspect system state, retrieve information, and perform actions that modify platform records. Tasks range from simple queries to multi-step workflows.

Experimental Setup

  • Agent Implementations: All built using the OpenAI Agents SDK. Models evaluated include Claude Sonnet 4.6, Claude Opus 4.6, GPT-5.4 Thinking, and Gemini 3.1 Pro.
  • Metrics:
    • Primary: Success Rate (SR) - percentage of tasks successfully completed based on system state verification.
    • Efficiency: Inference Cost - computed from token usage of the underlying LLM.
  • Experimental Design: Two-stage evaluation:
    1. Compare the three paradigms under a minimal configuration (no docs/skills).
    2. Introduce capability modules (documentation access, skill persistence) via controlled ablations.
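As a sketch of the efficiency metric, per-task inference cost can be derived directly from token usage; the per-million-token prices in the example are hypothetical, not the rates used in the study:

```python
def task_cost(input_tokens: int, output_tokens: int,
              usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one task, computed from the backbone LLM's token usage."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1e6

# e.g. 150k input and 8k output tokens at assumed $1/M input, $5/M output:
cost = task_cost(150_000, 8_000, 1.0, 5.0)  # -> 0.19 (USD)
```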

Empirical Validation / Results

4.1 Comparing Types of Agents

The main results comparing the three agent paradigms across four LLMs are summarized below:

Table 2: Main results across agent interaction paradigms. Success rate (SR, ↑) and average cost per task (↓) for MCP, web, and terminal agents on three enterprise platforms with four backbone LLMs; each cell shows SR / cost. Bold indicates the best SR and lowest cost within each platform–model group.

| Agent | ServiceNow (330) | GitLab (192) | ERPNext (207) | Overall (729) |
| :--- | :---: | :---: | :---: | :---: |
| **Claude Sonnet 4.6** | | | | |
| MCP | 11.5% / **$0.76** | 45.2% / $0.48 | 55.6% / **$0.14** | 32.9% / **$0.51** |
| Web | 72.4% / $4.49 | **82.9%** / $0.88 | 61.8% / $3.63 | 72.2% / $3.29 |
| Terminal | **73.6%** / $0.78 | 76.5% / **$0.28** | **67.6%** / $0.46 | **72.7%** / $0.56 |
| **Claude Opus 4.6** | | | | |
| MCP | 16.1% / **$0.66** | 46.8% / $0.90 | 68.9% / **$0.17** | 39.2% / **$0.58** |
| Web | 77.6% / $4.21 | **81.9%** / $0.85 | **81.6%** / $6.49 | **79.9%** / $3.97 |
| Terminal | **79.1%** / $1.94 | 80.2% / **$0.50** | 76.8% / $0.72 | 78.7% / $1.22 |
| **GPT-5.4 Thinking (Medium)** | | | | |
| MCP | 18.5% / **$0.14** | 47.9% / $0.40 | 62.8% / **$0.21** | 38.8% / $0.23 |
| Web | 69.4% / $0.54 | **81.4%** / $0.17 | **72.5%** / $0.51 | 73.4% / $0.43 |
| Terminal | **77.0%** / $0.20 | 71.3% / **$0.13** | 70.0% / $0.24 | **73.5%** / **$0.19** |
| **Gemini 3.1 Pro** | | | | |
| MCP | 14.2% / **$0.10** | 48.9% / $0.15 | 62.8% / **$0.07** | 37.1% / $0.11 |
| Web | 62.1% / $0.68 | **84.6%** / $0.22 | 65.2% / $1.13 | 68.9% / $0.69 |
| Terminal | **78.5%** / **$0.10** | 79.8% / **$0.06** | **73.9%** / $0.10 | **77.5%** / **$0.09** |

Key Findings:

  • MCP Agents: Achieve the lowest success rates, limited by tool coverage and rigid interfaces. They are typically also among the cheapest, indicating that the bottleneck is expressivity, not efficiency.
  • Web Agents: Offer high flexibility and strong success rates (best or tied in 8/12 combinations) but at a substantial cost premium (often 4-9x more expensive than terminal agents).
  • Terminal Agents: Provide the best cost-performance tradeoff. They match or exceed web agent accuracy in 7/12 combinations while consistently costing significantly less. With Gemini 3.1 Pro, the terminal agent achieved 77.5% success at just $0.09 per task.

4.2 Parametric Knowledge vs. Documentation

Table 3: Effect of documentation access on terminal agents. Success rate and cost per task for terminal agents with and without access to official documentation; each cell shows SR / cost.

| Agent | ServiceNow (330) | GitLab (192) | ERPNext (207) | Overall (729) |
| :--- | :---: | :---: | :---: | :---: |
| **Claude Sonnet 4.6** | | | | |
| No Docs | 73.6% / $0.78 | 76.5% / $0.28 | 67.6% / $0.46 | 72.7% / $0.56 |
| With Docs | 67.3% / $1.10 | 79.1% / $0.48 | 72.3% / $0.47 | 71.8% / $0.76 |
| **Claude Opus 4.6** | | | | |
| No Docs | 79.1% / $1.94 | 80.2% / $0.50 | 76.8% / $0.72 | 78.7% / $1.22 |
| With Docs | 81.2% / $1.60 | 78.7% / $0.71 | 76.3% / $0.99 | 79.2% / $1.19 |

Overall, documentation does not provide a clear aggregate benefit, suggesting agents can often operate effectively using internalized knowledge or API discoverability. Effects vary by platform:

  • ServiceNow: Documentation can hurt performance and increase cost (agents spend time on retrieval).
  • ERPNext: Documentation can help (e.g., clarifying non-obvious field names).
  • GitLab: Little accuracy impact but increased cost.

4.3 Access to Self-generated Skills

Allowing agents to persistently store and reuse "skills" (procedures, notes) improves success rates and reduces costs.
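A minimal sketch of such skill persistence, assuming a flat directory of Markdown notes (the paper does not specify the actual on-disk layout, and the skill content shown is illustrative):

```python
from pathlib import Path

SKILLS_DIR = Path("skills")  # assumed layout, relative to the agent's workspace

def save_skill(name: str, procedure: str) -> None:
    """Persist a working procedure so later tasks can reuse it."""
    SKILLS_DIR.mkdir(exist_ok=True)
    (SKILLS_DIR / f"{name}.md").write_text(procedure)

def load_skills() -> dict[str, str]:
    """Load all stored skills for inclusion in the agent's context."""
    if not SKILLS_DIR.exists():
        return {}
    return {p.stem: p.read_text() for p in SKILLS_DIR.glob("*.md")}
```

On a new task, the agent first loads its skills; anything it had to discover by trial and error on an earlier task (an endpoint, a required field, a working query) becomes a cheap lookup instead of a repeated exploration.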

Figure 3: Skills accumulation over sequential tasks. The agent with memory (blue) accumulates reusable procedures; the baseline (black) starts fresh every time.

  • Top: Cumulative number of successful tasks.
  • Middle: Cumulative cost ($USD).
  • Bottom: Skills directory size (KB).

Key Findings (using Claude Sonnet 4.6):

  • Success Rate Improvement: Largest on ERPNext (+5.8 percentage points), moderate on ServiceNow (+3.6pp), marginal on GitLab (+1.6pp).
  • Cost Reduction: Significant on ServiceNow (43.7% less per task: $0.44 vs. $0.78) and ERPNext (16.8% less).
  • Memory Growth: Skills directory grows rapidly early on, then plateaus as patterns are learned. GitLab shows minimal growth, indicating stronger parametric knowledge of its API.

Theoretical and Practical Implications

  • Sufficiency of Simple Interfaces: The findings challenge the assumption that increasingly sophisticated agent stacks are required for enterprise automation. When platforms provide stable, expressive APIs, lightweight coding agents that interact directly with those APIs can be sufficient for a broad class of tasks.
  • Cost-Effective Automation: Terminal agents offer a superior cost-performance profile, making enterprise automation more economically viable by avoiding the overhead of GUI rendering or extensive tool engineering.
  • Flexibility over Structure: The direct programmatic interaction paradigm provides greater flexibility than curated tool registries, allowing agents to compose operations not pre-defined in schemas and recover from errors through exploration and scripting.
  • Design of Supporting Materials: The mixed results on documentation access suggest that documentation must be structured for its consumer. Human-oriented reference docs can mislead agents, while concise, task-oriented content is more helpful. This motivates the use of agent-generated or human-authored "skills" as an effective complement.
  • Hybrid Approaches: While terminal agents excel, some tasks are inherently tied to the UI (e.g., impersonation, reading rendered charts). For comprehensive coverage, hybrid agents with both terminal and browser access may be optimal, especially with stronger base models capable of selecting the right tool per subtask.
  • Safety and Control: Terminal agents operate with broad execution capabilities, underscoring the need for API-level security controls (permissions, auditing) as a complementary layer to the agent interface.

Conclusion

This work demonstrates that minimal coding agents operating through a terminal and filesystem, interacting directly with platform APIs, are both effective and efficient for practical enterprise automation. They match or outperform more complex GUI-driven and tool-augmented architectures while maintaining significantly lower costs.

The results suggest that enterprise automation may benefit more from exposing stable programmable interfaces than from introducing additional abstraction layers. When such interfaces exist, lightweight agents can dynamically discover and compose functionality without extensive, task-specific tooling.

Future Directions:

  • Developing benchmarks for long-horizon, cross-platform coordination.
  • Expanding evaluation to additional enterprise verticals (IT ops, HR, security, finance).
  • Exploring hybrid agent architectures that optimally combine programmatic and UI-based interaction.
  • Incorporating safety, reliability, and access control layers on top of the minimal agent abstraction.