Summary

Overview

  • Conceptual Framing: The paper introduces the concept of "Code as Agent Harness", reframing code from a generated artifact into the operational substrate for executable, verifiable, and stateful AI agent systems. This shifts the focus from producing correct programs to understanding how code supports reliable closed-loop agentic behavior.
  • Structured Taxonomy: The survey organizes the literature into three connected layers: Harness Interface (code for reasoning, acting, environment modeling), Harness Mechanisms (planning, memory, tool use, control, optimization), and Scaling the Harness (multi-agent orchestration over code).
  • Applications Across Domains: The paper connects the taxonomy to five real-world application domains: Coding Assistants, GUI/OS Agents, Embodied Agents, Scientific Discovery, and Personalization, demonstrating the tangible impact of the code-as-harness paradigm.
  • Open Challenges: It outlines critical open problems in harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments.
  • Unified Roadmap: By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

Introduction and Theoretical Foundation

Recent Large Language Models (LLMs) have demonstrated strong capabilities in understanding and generating code. However, the paper argues for a paradigm shift: in emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. The paper frames this shift through the lens of agent harnesses.

An agent harness refers to the software layer that surrounds an LLM with tools, APIs, sandboxes, memory, validators, permission boundaries, execution loops, and feedback channels, thereby turning a stateless model into a functional agent capable of long-running task execution.
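The definition above can be made concrete with a minimal sketch. It is illustrative only and not taken from the paper: the `Harness` class, the `scripted_model` stand-in for an LLM, and the `word_count` tool are all hypothetical names. It shows a stateless callable wrapped with a tool registry, a simple permission check, memory, and a bounded execution loop.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A "model" here is any stateless callable: (task, memory) -> (tool_name, argument).
# Returning the tool name "finish" ends the loop. All names are hypothetical.
Model = Callable[[str, List[str]], Tuple[str, str]]


@dataclass
class Harness:
    model: Model
    tools: Dict[str, Callable[[str], str]]
    max_steps: int = 5
    memory: List[str] = field(default_factory=list)

    def run(self, task: str) -> str:
        for _ in range(self.max_steps):
            tool_name, arg = self.model(task, self.memory)
            if tool_name == "finish":
                return arg
            if tool_name not in self.tools:
                # Permission boundary: unregistered tools are refused, and the
                # refusal is fed back to the model through memory.
                self.memory.append(f"error: tool '{tool_name}' is not permitted")
                continue
            result = self.tools[tool_name](arg)  # act
            self.memory.append(f"{tool_name}({arg}) -> {result}")  # feedback channel
        return "step budget exhausted"


def scripted_model(task: str, memory: List[str]) -> Tuple[str, str]:
    # Stand-in for an LLM: call one tool, then finish with the observation.
    if not memory:
        return ("word_count", task)
    return ("finish", memory[-1])


harness = Harness(
    model=scripted_model,
    tools={"word_count": lambda s: str(len(s.split()))},
)
answer = harness.run("count the words here")
print(answer)
```

The model itself keeps no state between calls; everything that persists across steps lives in the harness, which is the point the definition makes.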

The authors clarify three coupled elements of long-running agentic systems:

  1. Model-internal capabilities: The model's reasoning, perception, planning, etc.
  2. System-provided harness infrastructure: Predefined tools, APIs, sandboxes, memory systems, etc.
  3. Agent-initiated code artifacts: Interactive code objects that agents create, execute, observe, revise, persist, and share within the task execution loop (e.g., regression tests, temporary tools, DSL programs).

The central thesis is "code as agent harness": code as the executable and inspectable medium through which agents reason, act, and adapt. This view centers on agent-initiated code artifacts and how model capabilities construct and evolve them through interaction with harness infrastructure.

The survey is organized around three connected layers, as shown in Figure 1:

  1. Harness Interface (§2): Code connects agents to reasoning, action, and environment modeling.
  2. Harness Mechanisms (§3): Planning, memory, tool use, feedback-driven control, and optimization for long-horizon execution.
  3. Scaling the Harness (§4): Shared code artifacts support multi-agent coordination, review, and verification.

Methodology

This is a survey paper that synthesizes and organizes existing literature up to 2026. The methodology is based on a taxonomic analysis, categorizing representative methods, systems, and applications within the proposed "code as agent harness" framework.

The core methodological approach is to analyze the literature through the lens of harness engineering, distinguishing between:

  • Model-internal capabilities
  • System-provided harness infrastructure
  • Agent-initiated code artifacts

The paper systematically reviews and classifies works across the three-layer taxonomy, providing:

  • Conceptual Framing and definitions.
  • Roadmaps and Figures (e.g., Figures 1, 3, 4, 10, and 12) visualizing the taxonomy and chronological development.
  • Representative Tables (e.g., Tables 1 through 11) summarizing key systems and their mechanisms.
  • Analysis of Application Domains and Open Problems.

Empirical Validation / Results

The paper does not present new experimental results but synthesizes findings from numerous cited works across the taxonomy. Key empirical insights and validated trends from the literature are summarized below.

Harness Interface: Code for Reasoning, Acting, and Environment Modeling

  • Code for Reasoning: Methods like Program-of-Thoughts (PoT) [6], PAL [7], and Chain of Code (CoC) [8] demonstrate that delegating computation to executable programs substantially improves reliability over pure language-based reasoning. Systems like NExT [30] and CodePRM [31] show that iterative, execution-grounded reasoning loops further enhance performance.
  • Code for Acting: In embodied settings, SayCan [9] and Code-as-Policies (CaP) [10] show that code serves as an effective interface between high-level intent and grounded, executable robot policies. Voyager [32] demonstrates that a persistent, evolving code-based skill library enables open-ended task performance.
  • Code for Environment: Benchmarks like SWE-bench [5] and AgentBench [12] validate that executable environments (using unit tests, sandbox feedback) provide objective, verifiable evaluation signals. Code2World [38] shows that representing GUI state as renderable HTML enables precise environment modeling.
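The "code for reasoning" idea in the bullets above, where the model writes a program and an interpreter does the arithmetic, can be sketched as follows. This is a simplified illustration in the spirit of program-aided reasoning, not the implementation of PoT, PAL, or CoC. The function `run_generated_program`, the `answer` variable convention, and the hard-coded `generated` string (standing in for model output) are assumptions.

```python
import math


def run_generated_program(source: str, result_var: str = "answer"):
    """Execute model-written source and return the variable it assigns."""
    # NOTE: emptying __builtins__ is NOT a security sandbox. A real harness
    # would run untrusted code in an isolated process or container.
    namespace = {"__builtins__": {}, "math": math}
    exec(source, namespace)
    return namespace[result_var]


# Stand-in for model output on the question:
# "A tank holds 240 L and drains at 7.5 L/min. How many whole minutes
#  until it is empty?"
generated = """
capacity = 240
rate = 7.5
answer = math.ceil(capacity / rate)
"""

result = run_generated_program(generated)
print(result)
```

The model is responsible only for translating the question into a program; the numeric result comes from the interpreter, so it does not depend on the model's in-context arithmetic.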

Harness Mechanisms: Planning, Memory, Tool Use, Control, and Optimization

  • Planning: Self-Planning [40] and WebAgent [41] show that linear decomposition improves web automation. CodePlan [42] demonstrates that structure-grounded planning via dependency graphs enhances repository-level editing coherence.
  • Memory: SWE-agent [57] and CodeMem [45] show that structured working memory management is critical for multi-step repository repair. RepoCoder [47] and CodeRAG [187] validate that semantic memory retrieval aligned with repository structure improves generation quality.
  • Tool Use: ToolCoder [19] demonstrates that API search tools effectively ground generation in external knowledge. SWE-agent [57] shows that environment-interaction tools (shell, editor) are essential for real-world software engineering tasks.
  • Control (Plan–Execute–Verify, or PEV, Loop): AgentCoder [50] and Self-Debugging [243] validate that test-based verification feedback drives effective iterative repair. Systems like OpenHands [250] demonstrate that sandboxed execution and permissioned state transitions are necessary for safe, long-horizon tasks.
  • Harness Optimization: AutoHarness [14] and Meta-Harness [13] provide evidence that the harness itself can be synthesized and optimized, leading to improved agent reliability.
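The test-driven control loop described in the bullets above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the algorithm of AgentCoder or Self-Debugging: `verify`, `pev_loop`, and `scripted_proposer` are hypothetical names, and the proposer is a scripted stand-in for a model that revises its candidate after seeing a failure message.

```python
from typing import Callable, List, Optional, Tuple

TestCase = Tuple[tuple, object]  # (positional args, expected return value)


def verify(source: str, func_name: str, tests: List[TestCase]) -> Optional[str]:
    """Execute a candidate and return None on success, else a failure message."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # a real harness would sandbox this step
        func = namespace[func_name]
        for args, expected in tests:
            got = func(*args)
            if got != expected:
                return f"{func_name}{args} returned {got!r}, expected {expected!r}"
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


def pev_loop(
    propose: Callable[[Optional[str]], str],
    func_name: str,
    tests: List[TestCase],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Optional[str]]]:
    feedback: Optional[str] = None
    history: List[Optional[str]] = []
    for _ in range(max_rounds):
        source = propose(feedback)                   # plan: write or revise a candidate
        feedback = verify(source, func_name, tests)  # execute and verify
        history.append(feedback)
        if feedback is None:
            return source, history
    return None, history


def scripted_proposer(feedback: Optional[str]) -> str:
    # Stand-in for a model: the first attempt has a bug, the revision fixes it.
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


source, history = pev_loop(scripted_proposer, "add", [((2, 3), 5), ((0, 0), 0)])
print(history)
```

The tests act as a deterministic sensor: the loop stops only when verification passes or the round budget is spent, and each failure message becomes the input to the next proposal.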

Scaling the Harness: Multi-Agent Orchestration over Code

  • Multi-Agent Collaboration: Systems like ChatDev [330], MetaGPT [55], and AgentCoder [50] empirically show that role specialization (planner, coder, tester, reviewer) and structured interaction over shared code artifacts outperform single-agent baselines on complex software tasks.
  • Execution Feedback and Synchronization: L2MAC [344] demonstrates that explicit blackboard-style shared state management mitigates context-window limitations. EvoMAC [328] shows that feedback-driven DAG restructuring can adapt multi-agent topology for better performance.
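One way to picture shared state among several agents is a blackboard with version checks, sketched below. This is a generic optimistic-concurrency illustration and is not claimed to be how L2MAC or EvoMAC manage state. `Blackboard`, `StaleWriteError`, and the file key `utils.py` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class StaleWriteError(Exception):
    """Raised when an agent writes against an outdated version of an entry."""


@dataclass
class Blackboard:
    # key -> (version, content); a missing key reads as version 0, empty content
    _entries: Dict[str, Tuple[int, str]] = field(default_factory=dict)

    def read(self, key: str) -> Tuple[int, str]:
        return self._entries.get(key, (0, ""))

    def write(self, key: str, value: str, expected_version: int) -> int:
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise StaleWriteError(
                f"{key}: expected v{expected_version}, found v{current_version}"
            )
        self._entries[key] = (current_version + 1, value)
        return current_version + 1


board = Blackboard()
v_coder, _ = board.read("utils.py")   # coder agent reads version 0
v_tester, _ = board.read("utils.py")  # tester agent also reads version 0

board.write("utils.py", "def f(): return 1", v_coder)  # coder commits first
try:
    board.write("utils.py", "def f(): return 2", v_tester)  # tester is now stale
    conflict = False
except StaleWriteError:
    conflict = True
print(conflict, board.read("utils.py"))
```

Rejecting the stale write forces the second agent to re-read the artifact before editing it, which keeps agents from silently overwriting each other's work.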

Applications

  • Coding Assistants: SWE-agent [57] achieves competitive results on the SWE-bench repository repair benchmark. LingmaAgent [372] reports resolving 16.9% of in-house cloud issues fully autonomously in a production deployment.
  • GUI/OS Agents: OSWorld [396] and AndroidWorld [391] provide rigorous evaluation showing agents can complete hundreds of real OS tasks. Production systems like Claude Computer Use [427] demonstrate the feasibility of deploying such agents.
  • Scientific Discovery: Coscientist [62] autonomously plans and executes real chemical synthesis experiments. AI Scientist [438] generates complete ML research papers, including code and figures, from high-level goals.
  • Embodied Agents: Voyager [32] exhibits continual skill acquisition in Minecraft. CaP [10] successfully controls physical robots using generated Python policies.
  • Personalization: Agent4Rec [467] and iAgent [468] simulate interactive recommendation sessions, showing the potential of agentic loops over static ranking.

Theoretical and Practical Implications

Theoretical Implications

  1. Unified Framework: The "code as agent harness" concept provides a unifying theoretical lens to analyze diverse agentic systems, connecting reasoning, action, environment, and multi-agent coordination through the executable medium of code.
  2. Shift in Focus: The framework relocates the bottleneck of autonomy from model-internal reasoning ability to the reliability of the harness, the system that connects model outputs to long-horizon actions and persistent state.
  3. Formalization of Harness State: The paper advances the idea of harness state as a first-class citizen, encompassing working memory, repository evidence, execution traces, and multi-agent beliefs, which must be managed and synchronized.
  4. Cybernetics of Agent Control: The Plan–Execute–Verify (PEV) loop is framed as a cybernetic control process, where the harness acts as a governor, using deterministic sensors (tests, analyzers) to regulate agent trajectory over executable program state.

Practical Implications

  1. Harness Engineering as a Discipline: The survey legitimizes "harness engineering" as a critical practice for building reliable AI agents, on par with model training and prompt engineering.
  2. Design Blueprints: The three-layer taxonomy and analysis of mechanisms provide practical design blueprints for developers building code-centric agent systems in areas like coding assistants, automation, and scientific discovery.
  3. Evaluation Standards: It highlights the need for new evaluation paradigms that measure harness-level properties (trajectory efficiency, verification strength, safety compliance) beyond final task success.
  4. Safety and Governance: The discussion on permission tiers, sandboxes, and human-in-the-loop gates provides a practical roadmap for implementing safety and accountability in autonomous agent deployments.
  5. Interoperability and Protocols: The emphasis on shared code substrates and protocols like the Model Context Protocol (MCP) points toward standards for tool and state interoperability across different agent platforms.
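The permission tiers and human-in-the-loop gates mentioned in the safety item above could look roughly like the following sketch. The tier names, the `ACTION_TIERS` table, and the `authorize` function are assumptions made for illustration; the paper does not prescribe this design.

```python
from enum import IntEnum
from typing import Callable


class Tier(IntEnum):
    READ_ONLY = 0        # inspect files, list directories
    SANDBOX_WRITE = 1    # modify state inside an isolated workspace
    EXTERNAL_EFFECT = 2  # anything visible outside the sandbox


# Hypothetical mapping from action names to the tier each one requires.
ACTION_TIERS = {
    "read_file": Tier.READ_ONLY,
    "run_tests": Tier.SANDBOX_WRITE,
    "deploy": Tier.EXTERNAL_EFFECT,
}


def authorize(action: str, granted: Tier, approve: Callable[[str], bool]) -> str:
    """Return 'allow', 'allow_with_approval', or 'deny' for a requested action."""
    required = ACTION_TIERS.get(action)
    if required is None:
        return "deny"  # unknown actions are denied by default
    if required <= granted:
        return "allow"
    # The action exceeds the agent's tier: escalate to a human approver.
    return "allow_with_approval" if approve(action) else "deny"


def cautious_human(action: str) -> bool:
    # Stand-in for a real approval prompt; this reviewer rejects everything.
    return False


decisions = {
    name: authorize(name, Tier.SANDBOX_WRITE, cautious_human)
    for name in ("read_file", "run_tests", "deploy", "format_disk")
}
print(decisions)
```

Denying unknown actions by default and routing anything above the granted tier through an approver keeps the gate's behavior predictable even when the agent proposes actions the harness designer did not anticipate.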

Conclusion

The paper concludes that code is the unifying harness for executable, verifiable, and stateful AI agent systems. By externalizing reasoning, grounding action, representing environment state, and coordinating multiple agents, code transforms LLMs from stateless text generators into functional, long-horizon actors.

The key takeaways are:

  • The harness interface makes reasoning executable, action programmable, and environment state inspectable.
  • Harness mechanisms (planning, memory, tool use, control, optimization) sustain agents over long execution and revision.
  • Scaling the harness via multi-agent orchestration over shared code artifacts enables complex, collaborative problem-solving.

The survey outlines a future research agenda centered on open problems in harness-level evaluation, semantic verification, self-evolving harnesses, transactional shared state, human-in-the-loop safety, and multimodal systems. The ultimate goal is to develop a "science of harness engineering" to build agentic systems that are not only capable but also reliable, inspectable, and governed.