OneManCompany (OMC): A Framework for Organising Heterogeneous Agents as a Real-World Company

Summary (Overview)

  • Core Innovation: Introduces the OneManCompany (OMC) framework, which elevates multi-agent systems to an organisational level. It treats AI agent workforces as self-governing companies with structured coordination, managed lifecycles, and experience-driven evolution.
  • Key Abstraction: Proposes the Talent-Container Architecture, which decouples an agent's portable identity (Talent) from its execution runtime (Container), enabling heterogeneous agent backends (e.g., LangGraph, Claude Code) to interoperate under a unified organisational layer.
  • Dynamic Coordination: Develops the Explore-Execute-Review (E²R) Tree Search, a hierarchical decision loop that dynamically decomposes tasks, executes them via agents, and reviews outcomes, providing formal guarantees on termination and deadlock freedom.
  • Persistent Improvement: Implements dual-level self-evolution mechanisms. At the individual level, agents refine their working principles; at the organisational level, project retrospectives update Standard Operating Procedures (SOPs), and a formal HR pipeline (performance reviews, PIP, offboarding) ensures accountability.
  • Empirical Performance: On the PRDBench software development benchmark, OMC achieves an 84.67% success rate, surpassing the state-of-the-art by 15.48 percentage points. Cross-domain case studies (content generation, game dev, audiobook production, research survey) demonstrate its generality.

Introduction and Theoretical Foundation

Recent advances in LLMs have created highly capable individual AI agents through modular skills and tool integrations. However, these skills operate within a single agent and do not address how multiple agents should work together. Existing multi-agent systems (e.g., CrewAI, AutoGen) are limited by:

  • Brittle, fixed team structures that cannot adapt to novel projects.
  • Tight coupling between agents and their specific runtimes, preventing interoperability.
  • Session-bound learning where improvements do not persist across projects.
  • A lack of formal guarantees on coordination and convergence.

The paper identifies a fundamental research gap: the absence of a principled organisational layer that governs how a workforce of agents is assembled, coordinated, and improved over time, decoupled from individual agent capabilities.

Definition 1 (AI Organisation): a self-governing system of heterogeneous agents with structured coordination, managed lifecycles, and experience-driven evolution.

This perspective introduces an organisation-level abstraction, distinct from capability-level (skills) or interaction-level (multi-agent systems) abstractions. It answers: "How should a workforce of agents be structured and managed to achieve complex goals?" The authors argue this is analogous to organisation design in human companies, where structure is decoupled from individual employee knowledge, enabling generalisation across domains.

Methodology

OMC is designed to mirror a real company's operations, built on three core pillars: the Talent-Container architecture, the E²R tree search (together with its DAG-based execution layer), and self-evolution mechanisms.

1. Talent-Container Architecture & Digital Talent Market

This pillar addresses workforce management and heterogeneous agent interoperability.

  • Core Abstractions:
    • Talent: A portable agent identity package encapsulating role, prompts, skills, tools, and working principles. It defines who the agent is.
    • Container: The execution environment (runtime backend) that hosts a Talent, abstracting over heterogeneous backends (LangGraph, Claude CLI, scripts). It defines where the agent runs.
    • Employee: The composition Employee = Talent + Container, a fully managed AI agent.
  • Organisational Interfaces: The Container exposes six typed interfaces to the platform, standardising agent-platform interaction (analogous to an OS kernel):
    1. Execution: execute(task, ctx) → (result, cost)
    2. Task: Manages per-employee queue with mutual exclusion.
    3. Event: Organisational publish/subscribe event bus.
    4. Storage: Handles persistent memory.
    5. Context: Assembles the execution prompt from the Talent's identity and memory.
    6. Lifecycle: Applies pre-/post-execution hooks for validation and self-improvement.
# Algorithm 1: Talent Assembly (simplified)
Require: employee e with Container V_e, Talent τ_e, task node v
Ensure: result r, cost c
1: if e has a running task then
2:    Enqueue(v) and return  # Mutual exclusion
3: end if
4: ctx ← AssembleContext(τ_e.role, τ_e.principles, GetGuidance(e), GetMemory(e))
5: v, ctx ← PreHook(v, ctx)  # Guardrails, validation
6: r, c ← V_e.Execute(v.desc, ctx, τ_e.tools)  # Dispatch to backend
7: PostHook(e, v, r)  # Self-reflection, principle updates
8: Publish(ε(v, Processing → Completed, e, t_now))
9: return r, c
  • Digital Talent Market: A community-driven marketplace supplying verified, ready-to-deploy Talent packages. It supports three sourcing channels:
    1. Community-contributed Talents: Open-source agent packages.
    2. AI-recommended assembly: An AI engine assembles skills from the web into Talents.
    3. Internal promotion: High-performing employees' profiles are packaged and shared back to the Market.
  • Recruitment: When a project lacks a required capability, the HR agent triggers a recruitment pipeline from the Market.

Table 1: Skills vs. Talents

| Aspect | Skills & Skill Markets | Talents & Talent Markets |
| --- | --- | --- |
| Level | Inside one agent | Across a team of agents |
| What it is | Small reusable tools/functions | Full agents with roles, tools, and behaviour |
| Purpose | Make one agent more capable | Build and run a team to solve tasks |
| Runtime | Tied to one system/framework | Can run across different systems |
| Flexibility | Usually fixed before execution | Can be added, replaced, or reconfigured on the fly |
| Lifecycle | No clear lifecycle | Managed lifecycle (hire, evaluate, replace) |
| Analogy | Software libraries (APIs) | Employees and job markets |

2. Explore-Execute-Review (E²R) Tree Search

This pillar models project execution as a search over organisational strategies, unifying planning, execution, and evaluation.

  • Tree Structure: The search operates over a dynamic tree $\mathcal{T} = (V, E_{tree}, E_{dep})$.
    • Nodes $v \in V$ represent organisational states, each carrying $(d_v, e_v, \phi_v, r_v, c_v, \mathcal{W}, \mathcal{R})$, where $d_v$ is the task description, $e_v$ the assigned employee, $\phi_v$ the status, $r_v$ the result, $c_v$ the cost, $\mathcal{W}$ the workforce state, and $\mathcal{R}$ the resource state.
    • Edges: Decomposition edges $E_{tree}$ form a strict tree (parent-child relationships). Dependency edges $E_{dep}$ encode execution ordering constraints. The combined graph must be a Directed Acyclic Graph (DAG).
  • Three-Stage Loop:
    1. Explore (Strategy Selection): A supervising agent (policy $\pi$) selects a strategy: how to decompose the current task and whom to assign. The composite operation creates a child node: $v_{new} = \Delta(v, e, d, D)$.
    2. Execute (Work Carried Out): Assigned employees execute their tasks via their internal function: $(r_v, c_v) = f_{e_v}(d_v)$.
    3. Review (Quality Signal Propagation): A reviewer evaluates the result $r_v$ and produces a quality signal $q_v \in \{\text{accept}, \text{reject}\}$, which propagates bottom-up: $\mathbf{g}(v) = \langle q_v, c_v, \phi_v \rangle, \forall v \in \text{path}(\text{leaf}, \text{root})$. A reject triggers re-entry into the Explore stage for that subtree.

3. DAG-based Task Execution & Guarantees

The E²R tree search is complemented by a formal DAG execution layer that provides reliability guarantees.

  • AND-Semantics for Completion: A node $v$ is resolved recursively: $$\text{resolved}(v) = \begin{cases} \phi_v \in \{\text{accepted}, \text{finished}\} & \text{if $v$ is a leaf} \\ \forall v' \in \text{children}(v) \setminus S: \text{resolved}(v') & \text{otherwise} \end{cases}$$ where $S$ is a set of system node types. This ensures completion propagates bottom-up from all leaves.
  • Task Lifecycle Finite State Machine (FSM): Each node follows an FSM (see Figure 5) with states $\Phi = \{\text{pending}, \text{processing}, \text{holding}, \text{completed}, \text{accepted}, \text{failed}, \text{blocked}, \text{finished}, \text{cancelled}\}$. The critical design is the explicit review gate (completed → accepted), preventing unverified results from propagating.
  • Scheduling & Guarantees: A node becomes executable (ready) when all its dependencies are satisfied (accepted/finished). The system enforces seven invariants, including DAG acyclicity, mutual exclusion per employee, schedule idempotency, review termination limits, and cascade completeness for cancellation. These guarantee termination and deadlock freedom under bounded retry and resource constraints.
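The AND-semantics recursion and the readiness rule can be expressed directly in code. A minimal sketch, assuming illustrative field names (`TaskNode`, `deps`, `system`); the full state machine, retry limits, and the remaining invariants are omitted.

```python
from dataclasses import dataclass, field

TERMINAL = {"accepted", "finished"}  # terminal statuses that satisfy dependencies

@dataclass
class TaskNode:
    status: str = "pending"
    children: list["TaskNode"] = field(default_factory=list)  # E_tree edges
    deps: list["TaskNode"] = field(default_factory=list)      # E_dep edges
    system: bool = False                                      # system node types S

def resolved(v: TaskNode) -> bool:
    """AND-semantics: a leaf is resolved when terminal; an inner node only
    when every non-system child is resolved."""
    if not v.children:
        return v.status in TERMINAL
    return all(resolved(c) for c in v.children if not c.system)

def ready(v: TaskNode) -> bool:
    """Scheduling rule: a pending node is executable once every dependency
    has reached an accepted/finished status."""
    return v.status == "pending" and all(d.status in TERMINAL for d in v.deps)
```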

4. Self-Evolution Mechanisms

This pillar enables persistent improvement at both individual and organisational levels.

  • Individual-Level Evolution: Agents maintain auto-updating profiles. Improvement is triggered by:
    • CEO one-on-ones: Structured self-reflection based on CEO feedback, updating the agent's working principles.
    • Post-task review: The agent reviews its own execution trace and appends lessons to its progress log.
  • Organisation-Level Evolution:
    • Project Retrospectives: Upon project completion, the COO aggregates self-assessments and objective signals to produce individual feedback and update organisational Standard Operating Procedures (SOPs).
    • HR Performance Pipeline: Every three projects, the HR agent conducts formal performance reviews. Employees failing three consecutive reviews enter a Performance Improvement Plan (PIP). Failure under PIP triggers automated offboarding, closing the loop with the Talent Market for replacement.
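The HR pipeline's review logic reduces to a small state tracker. This is an illustrative sketch of the rules stated above (three consecutive failed reviews trigger a PIP; failure under PIP triggers offboarding); the class and return labels are assumptions, not the paper's API.

```python
from dataclasses import dataclass

@dataclass
class HRRecord:
    """Tracks one employee's standing in the HR performance pipeline."""
    consecutive_fails: int = 0
    on_pip: bool = False
    offboarded: bool = False

    def review(self, passed: bool) -> str:
        if passed:
            self.consecutive_fails = 0   # a passed review resets the streak
            self.on_pip = False
            return "ok"
        if self.on_pip:                  # failure under PIP → automated offboarding
            self.offboarded = True
            return "offboard"
        self.consecutive_fails += 1
        if self.consecutive_fails >= 3:  # three consecutive fails → PIP
            self.on_pip = True
            return "pip"
        return "warning"
```

Offboarding then closes the loop with the Talent Market, which supplies a replacement Talent.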

Empirical Validation / Results

1. Quantitative Evaluation on PRDBench

OMC was evaluated on PRDBench, a benchmark of 50 project-level software development tasks defined by Product Requirement Documents (PRDs).

  • Setup: Zero-shot, single-attempt (DEV mode). The founding team (Gemini 2.1 Flash Lite Preview) was supplemented by three Talents recruited from the Market: a Software Engineer (Claude Code), a Software Architect (Claude Code), and a Code Reviewer.
  • Primary Metric: Success Rate (% of tasks successfully completed).

Table 2: Performance Comparison on PRDBench

| Agent Type | Method | Success Rate (%) | Cost ($) |
| --- | --- | --- | --- |
| Minimal | GPT-5.2 | 62.49 | - |
| Minimal | Claude-4.5 | 69.19 | - |
| Commercial | Claude Code | 56.65 | - |
| Multi-agent | Ours (OMC) | 84.67 | 345.59 |
Note: Cost data for baselines was not reported. OMC cost is ~$6.91 per task.

Result: OMC achieves a state-of-the-art success rate of 84.67%, surpassing the best baseline by 15.48 percentage points.

2. Cross-Domain Case Studies

Four case studies demonstrate OMC's generality and key capabilities.

  1. Dynamic Team Assembly for Content Generation:

    • Task: Produce a weekly trend summary of AI Agent GitHub repos and email it.
    • Process: CEO prompt → EA/HR recruit Researcher (GPT-4o) & Writer (Claude Sonnet 4) → COO coordinates task tree → Execution & delivery.
    • Outcome: Fully autonomous execution; verified, accurate article delivered in <10 mins for ~$4.48.
  2. Game Development with Human-in-the-Loop:

    • Task: Create a polished street-fighting web game.
    • Process: Recruit Game Developer (Claude Sonnet 4) & Art Designer (Gemini 2.5) → Initial build → Human evaluator rejects due to sprite sheet issue → System creates a new sprite-slicing skill for the Art Designer → Re-execution → Successful delivery.
    • Outcome: Demonstrates iterative feedback-driven re-exploration and runtime skill creation.
  3. Audiobook Development (Cross-Modal Coordination):

    • Task: Produce an illustrated audiobook retelling Peaky Blinders with animal characters.
    • Process: Recruit Novel Writer & AV Producer (Gemini 3.1 Pro) → Sequential pipeline: scriptwriting → scene illustration (16 scenes) → voice-over synthesis → video assembly.
    • Outcome: Successful cross-modal coordination across heterogeneous agents for ~$1.57.
  4. Automated Research Survey:

    • Task: Survey "world models for embodied AI (2021-2026)" and propose three research ideas.
    • Process: Recruit three domain specialists from Talent Market → Parallel literature review (35 papers) → Synthesis of failure modes → Generation of three grounded research ideas.
    • Outcome: Complete survey with mind map and novel, actionable research proposals produced in <1 hour for $16.26.

Table 3: Autonomously Generated Research Ideas from Survey

| Idea | Problem Addressed | Target Venue | Key Technique |
| --- | --- | --- | --- |
| HiTeWM | Compounding prediction error beyond 15 steps | NeurIPS/ICLR | Two-level (fast 50Hz + slow 2Hz) architecture with uncertainty-gated re-grounding |
| PhysWM | Physical implausibility in video-based WMs | ICML/CoRL | Differentiable physics constraints injected into latent dynamics |
| MAWM | Sim-to-real domain shift + overconfident hallucination | CoRL/ICLR | Meta-learning across sim domains + conformal prediction for calibrated uncertainty |

Theoretical and Practical Implications

  • Bridging a Research Gap: OMC introduces the missing organisational layer for AI workforces, providing a domain-agnostic structure for coordination, resource allocation, and improvement, decoupled from individual agent knowledge.
  • Formal Guarantees: The DAG-based execution layer with FSM and AND-semantics provides provable termination and deadlock freedom under dynamic task decomposition, addressing a critical weakness in existing multi-agent systems.
  • Practical Generality: The human-company analogy and the decoupled Talent-Container architecture make the framework applicable across diverse domains (software, content, creative, research), as evidenced by the case studies.
  • Ecosystem Enablement: The Digital Talent Market creates a community-driven supply chain for agent capabilities, enabling on-demand recruitment and fostering an ecosystem of reusable, verified agent identities.
  • Cost-Performance Trade-off: While OMC incurs multi-agent coordination overhead, its high success rate on complex, project-level tasks justifies the cost for scenarios where correctness is paramount. The framework includes an adaptive dispatch mode to route simple tasks to a single agent.

Conclusion

The paper argues that AI organisation design is a crucial next stage for multi-agent systems. The OneManCompany (OMC) framework demonstrates that principles from human organisational management can be successfully transferred to coordinate heterogeneous AI agent workforces.

Core contributions:

  1. A Talent-Container architecture that decouples agent identity from runtime, enabling heterogeneous interoperability.
  2. An Explore-Execute-Review tree search with a DAG-based execution layer for dynamic, reliable task coordination with formal guarantees.
  3. Dual-level self-evolution mechanisms (individual reflection, organisational SOPs, HR pipeline) for persistent improvement.
  4. A Digital Talent Market for on-demand, community-driven agent recruitment.

Empirical results show OMC achieves state-of-the-art performance on complex software development tasks and demonstrates compelling generality across multiple domains. Future work includes large-scale evaluation on non-coding benchmarks, ablation studies of the self-evolution components, and growth of the Talent Market ecosystem.