Summary (Overview)

  • This paper frames the transition of Large Language Models (LLMs) from conversational Chatbots toward persistent, autonomous Digital Colleagues along two tightly coupled dimensions: (1) cognitive-core evolution (from "fast thinking" next-token prediction to "slow thinking" reasoning via inference-time computation, Chain-of-Thought, and reinforcement learning) and (2) tool-augmented task execution (from ad hoc tool-calling Agents to persistent OpenClaw-style workspace systems).
  • The key enabling mechanism is the "Workspace + Skill" paradigm, where a persistent digital workspace (files, terminals, browsers, logs, permissions) provides stateful context for tasks, and reusable, parameterizable skills package procedures, scripts, checks, and safety constraints—turning episodic tool use into durable, inspectable work.
  • Data and evaluation paradigms shift accordingly: from instruction-response pairs and static answer correctness benchmarks toward state–action–observation trajectories and task-closure evaluation (whether the system reaches the intended final state under reproducible, auditable, safe conditions).
  • The survey identifies structural bottlenecks in current systems: long-horizon reliability, memory and state management, safety and governance, and human–AI collaboration ethics. It outlines a roadmap toward self-evolving AI ecosystems where models, workspaces, tools, skills, memories, evaluators, and governance mechanisms continuously convert operational experience into reusable assets.

Introduction and Theoretical Foundation

The paper is motivated by a fundamental transformation: LLMs are no longer expected only to generate better answers; they must reliably transform user intent into completed work in open-ended digital environments. Early progress was driven by scaling autoregressive Transformers and instruction-aligned chat interfaces, compressing broad world knowledge into fluent single-pass responses (the Chatbot era). More recently, the frontier has shifted toward models that deliberate over difficult problems, invoke tools, interact with environments, and coordinate multi-step workflows (the Thinking LLM, Agent, and OpenClaw eras).

The central question redefines the human–AI relationship: from conversational answers (Chatbot) to persistent, stateful, and collaborative work (Digital Colleague). The paper organizes the field along two complementary dimensions:

  1. Cognitive Core Evolution – How models generate, understand, and reason. This spans two eras:

    • Chatbot Era: LLMs behave like fast "System-1" generators; they compress parametric knowledge and produce fluent responses but struggle with deep reasoning, verification, and long-horizon consistency.
    • Thinking LLM Era: Models leverage inference-time computation, Chain-of-Thought prompting, reflection, process supervision, and reinforcement learning to support slower, more deliberate, and more reliable problem solving.
  2. Tool-Augmented Task Execution – How a stronger cognitive core acts in dynamic external environments. This also spans two eras:

    • Agent Era: LLMs move from passive dialogue into active systems that call APIs, browse websites, write code, manipulate files, and collaborate with other agents in an environment–action–feedback loop. However, these early agents remain fragile: incorrect action formats, missing observations, failed tool calls, or unrecovered intermediate errors can derail the entire trajectory.
    • OpenClaw Era: Tool use is embedded into persistent workspaces with files, terminals, browsers, logs, permissions, reusable skills, and verification procedures, enabling agents to maintain context, monitor progress, recover from failures, and verify final workspace states.

The "Workspace + Skill" Thesis

Within this two-dimensional framework, the key thesis is that Workspace + Skill provides the mechanism that turns chatbot-style interaction into durable digital-colleague work. A Workspace is a persistent digital environment for AI operations, including files, terminals, browsers, editors, repositories, calendars, documents, databases, and domain-specific applications. A Skill is a reusable, parameterizable procedure for completing tasks, including planning, tool sequencing, intermediate checks, error recovery, and validation. Together, they move LLMs beyond episodic responses and atomic tool calls: the workspace provides state, memory, evidence, and consequences, while the skill provides reusable operational knowledge.
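To make the division of labor concrete, here is a minimal sketch of the two concepts, assuming Python 3.9+. The class names, fields, and the rename_file skill are all hypothetical illustrations, not the data model of OpenClaw or any other system the paper surveys. The workspace carries state and evidence (files and a log); the skill carries a parameterized procedure with an intermediate check and a final validation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Workspace:
    """Persistent state: a file map plus an append-only log (illustrative only)."""
    files: dict[str, str] = field(default_factory=dict)
    log: list[str] = field(default_factory=list)


@dataclass
class Skill:
    """A reusable, parameterizable procedure with a final validation step."""
    name: str
    run: Callable[[Workspace, dict], None]
    validate: Callable[[Workspace, dict], bool]

    def __call__(self, ws: Workspace, **params) -> bool:
        ws.log.append(f"start {self.name} {params}")
        self.run(ws, params)
        ok = self.validate(ws, params)
        ws.log.append(f"end {self.name} ok={ok}")
        return ok


def _rename(ws: Workspace, p: dict) -> None:
    # Intermediate check: only act when the source file actually exists.
    if p["src"] in ws.files:
        ws.files[p["dst"]] = ws.files.pop(p["src"])


def _renamed(ws: Workspace, p: dict) -> bool:
    # Validation: the destination exists and the source is gone.
    return p["dst"] in ws.files and p["src"] not in ws.files


rename_file = Skill("rename_file", _rename, _renamed)

ws = Workspace(files={"notes.txt": "hello"})
first = rename_file(ws, src="notes.txt", dst="archive.txt")
second = rename_file(ws, src="missing.txt", dst="other.txt")
```

The same skill object is reused with different parameters, and both calls leave entries in the workspace log, so a failed run (the second call) is visible afterwards rather than silently lost.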

Methodology

The survey reviews the field through four parts, each synthesizing a broad body of recent research:

  • Part I: Evolution of the LLM Cognitive Core – Traces the transition from scaling-driven language generation (GPT-3, GPT-4, PaLM, LLaMA, Mixtral) and parametric knowledge compression, through behavioral alignment (FLAN, InstructGPT, RLHF, DPO) and multimodal expansion (GPT-4o, Gemini, LLaVA, InternVL), to the Thinking LLM era characterized by long Chain-of-Thought, inference-time scaling (OpenAI o1, DeepSeek-R1), and reinforcement-learning-driven reasoning (GRPO, DAPO, Dr. GRPO, CISPO, GSPO, SAPO). The methodology analyzes training paradigms from supervised fine-tuning on CoT data to pure RL with verifiable rewards.

  • Part II: Evolution of Tool-Augmented Task Execution – Examines the Agent era's core capabilities: perception (HuggingGPT, Visual ChatGPT, CogAgent, ShowUI), planning (CoT, ToT, GoT, Reflexion, Self-Refine, RAP), memory (Generative Agents, MemoryBank, Mem0, MEM1, Memory-R1), and tool use (Toolformer, PAL, Gorilla, ToolLLM, ToolACE, MCP). It then analyzes structural bottlenecks (fragmented perception, ephemeral tool invocation, brittleness, absence of long-term task closure) that motivate the transition to OpenClaw-style systems. The OpenClaw era is characterized by persistent workspaces, skill-based task closure (OpenClaw, OpenHands, SWE-agent, Voyager, Anthropic Agent Skills), and new challenges in evaluation, reliability, and governance.

  • Part III: Why Workspace + Skill Is the Key Leap – Argues through two complementary dimensions: (1) workspace as the execution substrate for agentic work (from ephemeral tool calls to persistent state, from answer generation to authorized work delegation); (2) skills as reusable procedures (from ad-hoc prompts to composable capability packages, from skill libraries to integrated digital workers). A case study of OpenClaw illustrates the four-stage execution loop: interpret intent → retrieve skills → execute actions → verify final state. Limitations including skill brittleness, overfitting, workspace contamination, and security risks are discussed.

  • Part IV: Data & Evaluation Paradigm Shifts – Analyzes how data evolves from static knowledge corpora and instruction–response pairs (Chatbot), to Chain-of-Thought and process reward data (Thinking LLM), to state–action–observation trajectories (Agent/OpenClaw). Evaluation correspondingly shifts from final-output accuracy, to process judgment (LLM-as-judge, process reward models), to task-closure rate and workspace-level capability/safety metrics. Representative benchmarks (MMLU, GSM8K, MATH, GPQA, SWE-bench, WebArena, OSWorld, ClawsBench, ClawBench, ATBench-Claw, ClawSafety) and model results are tabulated.
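The difference between answer-level and task-closure evaluation can be sketched in a few lines. The example below is an illustration under stated assumptions: the toy environment, the "write:<file>:<text>" and "delete:<file>" action strings, and all function names are hypothetical, and none of them come from the benchmarks listed above. It records a state–action–observation trajectory and then judges the run by diffing the final workspace state against the intended one.

```python
from dataclasses import dataclass


@dataclass
class Step:
    state: dict        # workspace snapshot before the action
    action: str        # action string issued by the agent
    observation: str   # what the environment reported back


def apply(state: dict, action: str) -> tuple:
    """Toy environment: returns (new_state, observation) without mutating state."""
    new_state = dict(state)
    kind, _, rest = action.partition(":")
    if kind == "write":
        name, _, text = rest.partition(":")
        new_state[name] = text
        return new_state, f"wrote {name}"
    if kind == "delete":
        if rest in new_state:
            del new_state[rest]
            return new_state, f"deleted {rest}"
        return new_state, f"error: {rest} not found"
    return new_state, f"error: unknown action {kind}"


def rollout(initial: dict, actions: list) -> tuple:
    """Run the actions in order, logging one Step per action."""
    trajectory, state = [], dict(initial)
    for action in actions:
        next_state, observation = apply(state, action)
        trajectory.append(Step(state, action, observation))
        state = next_state
    return trajectory, state


def final_state_diff(actual: dict, expected: dict) -> dict:
    """Map each differing key to its (actual, expected) pair."""
    keys = sorted(set(actual) | set(expected))
    return {k: (actual.get(k), expected.get(k))
            for k in keys if actual.get(k) != expected.get(k)}


def task_closed(actual: dict, expected: dict) -> bool:
    """Task closure: the final state matches the intended state exactly."""
    return not final_state_diff(actual, expected)


trajectory, final = rollout(
    {"draft.txt": "v1"},
    ["write:report.txt:done", "delete:draft.txt"],
)
closed = task_closed(final, {"report.txt": "done"})
```

Because the initial state and the action list are explicit, the run can be replayed, and the trajectory provides the audit trail that a single final answer would not.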

Empirical Validation / Results

The paper synthesizes extensive empirical evidence from benchmarks and system evaluations:

Cognitive Core Evolution

  • Inference-time scaling: With sufficient inference-time compute, a 1B model can surpass a 405B model on mathematical benchmarks (see, e.g., s1 and LIMO).
  • Effectiveness of RLVR: Rule-based answer matching plus format checking proved most effective for training reasoning models (DeepSeek-R1). Process reward models (PRMs) offer finer-grained supervision but face annotation cost and reward hacking challenges.
  • Hybrid training workflows: DeepSeek-R1's four-stage pipeline (cold-start SFT, reasoning RL, rejection sampling SFT, general RL) and Qwen3's refined version (Long CoT cold-start, reasoning RL, thinking-mode fusion, general RL) represent the most complete documented workflows.
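As a rough illustration of "rule-based answer matching plus format checking", the sketch below scores a completion with two simple rules. It is not DeepSeek-R1's reward code: the <think>/<answer> tag template, the exact-string match, and the format_weight parameter are assumptions chosen for the example.

```python
import re

# Assumed output template: reasoning inside <think>, final answer inside <answer>.
TEMPLATE = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(completion: str) -> float:
    """1.0 when the completion follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 when the extracted answer equals the reference string, else 0.0."""
    match = TEMPLATE.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def total_reward(completion: str, reference: str, format_weight: float = 0.5) -> float:
    """Verifiable reward: correctness plus a weighted format bonus (weight is hypothetical)."""
    return accuracy_reward(completion, reference) + format_weight * format_reward(completion)


good = total_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4")
wrong = total_reward("<think>guess</think><answer>5</answer>", "4")
unformatted = total_reward("4", "4")
```

Neither rule needs a learned judge, which is what makes the reward verifiable; a process reward model would instead score the intermediate steps inside the reasoning span.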

Tool-Augmented Task Execution

  • Agent benchmarks: Success rates decay super-linearly with complexity and horizon length. GPT-4 achieved only 14% success on WebArena; SWE-bench results show systematic failure patterns.
  • OpenClaw systems: Representative models achieve task success rates of 60–80% on Claw-Eval and ClawsBench, but unsafe action rates remain a concern (13–23% for state-of-the-art models). Safety benchmarks (ClawSafety) show attack success rates ranging from 40% to 75% across models.
  • Time horizon growth: The 50% time horizon of frontier AI agents (the length of task, measured in human completion time, that an agent completes with 50% reliability) has grown exponentially, from seconds in 2019 to over 12 hours in early 2026.
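One simple way to see why long horizons are hard is to compound per-step reliability. The sketch below assumes every step succeeds independently with the same probability, so an n-step task succeeds with probability p^n; this independence model is an illustrative assumption, not a result from the paper, and the function names are hypothetical.

```python
import math


def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps succeed, assuming independent steps."""
    return p_step ** n_steps


def steps_at_half(p_step: float) -> float:
    """Number of steps at which the chain's success probability falls to 50%."""
    return math.log(0.5) / math.log(p_step)


hundred_step = chain_success(0.99, 100)
half_point = steps_at_half(0.99)
```

Under this model an agent that is 99% reliable per step completes a 100-step task only about 37% of the time and drops to even odds near 69 steps, which is the intuition behind measuring progress by the task length solved at 50% reliability.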

Data and Evaluation Paradigm Shifts

  • Stage I (final-output): Table 8 shows MMLU scores up to 94.0%, GSM8K up to 98.1%, and MATH up to 90.2% for leading models such as GPT-5.4 and Claude Opus 4.6.
  • Stage II (process-level): Table 9 shows process verification metrics (ProcessBench, PRMBench) for models such as GPT-5, Gemini 2.5 Pro, and DeepSeek-R1, with varying performance on step correctness and error identification.
  • Stage III (task closure): Table 10 shows SWE-bench Verified scores up to 80.8%, OSWorld-Verified up to 75.0%, and WebArena-Verified up to 67.3% for frontier models such as Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro.
  • Stage IV (workspace evaluation): Table 11 shows Claw-Eval scores up to 70.8%, ClawsBench task success rates up to 63.0%, and ClawSafety attack success rates as low as 40.0%.

Theoretical and Practical Implications

Theoretical Implications

  1. Reframing intelligence: The paper argues that reasoning capability in LLMs is not solely a function of model scale but is increasingly realized through inference-time computation and external scaffolding (workspaces, skills, tools). This challenges the view that intelligence is purely encapsulated in neural weights.
  2. Embodied cognition for AI: The "Workspace + Skill" paradigm provides a digital analogue to embodied cognition: an agent's intelligence is shaped by its persistent environment, tools, and reusable procedures. This connects AI research to cognitive science frameworks (e.g., CoALA's mapping of working memory, episodic memory, procedural memory).
  3. Beyond-gradient learning: The paper introduces the concept that learning can occur outside neural parameters—through the evolution of prompts, contexts, harnesses, skills, tests, memories, and governance policies. This complements gradient-based training and suggests a broader definition of "learning" for AI systems.

Practical Implications

  1. System design: The frontier shifts from model-centric improvements to ecosystem engineering. Reliable autonomous agents require persistent workspaces, state management, skill provenance, permission control, repeated-execution reliability, audit trails, and safety enforcement—all infrastructure concerns, not just model concerns.
  2. Evaluation methodology: Evaluation must move from static accuracy metrics to task-closure verification in realistic environments. This requires reproducible initial states, trajectory logs, replayable actions, final-state diffs, and safety checks—demanding substantial infrastructure investment.
  3. Deployment considerations: Production deployment of digital colleagues requires governance mechanisms: permission boundaries, sandboxing, rollback, audit logs, and human-in-the-loop oversight for high-risk actions. The paper identifies data sovereignty, privacy, and enterprise asset boundaries as core architectural requirements.
  4. Skill ecosystem management: As skills become shareable assets, the field must develop skill registries with provenance tracking, dependency management, versioning, security review, and deprecation policies—analogous to mature software ecosystems.
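The governance requirements in item 3 (permission boundaries, audit logs, and human-in-the-loop oversight for high-risk actions) can be sketched as a small gate in front of every action. This is a minimal illustration assuming Python 3.9+; the Gate class, the action names, and the HIGH_RISK set are hypothetical, and a real deployment would also need sandboxing and rollback, which are omitted here.

```python
from dataclasses import dataclass, field

# Hypothetical set of actions that always require human sign-off.
HIGH_RISK = {"delete", "send_email", "deploy"}


@dataclass
class Gate:
    """Checks each requested action against a permission boundary and logs the decision."""
    allowed: set[str]
    audit: list[dict] = field(default_factory=list)

    def request(self, action: str, target: str, approved_by_human: bool = False) -> bool:
        if action not in self.allowed:
            decision = "denied: outside permission boundary"
        elif action in HIGH_RISK and not approved_by_human:
            decision = "held: needs human approval"
        else:
            decision = "allowed"
        # Every request is recorded, including the ones that were refused.
        self.audit.append({"action": action, "target": target, "decision": decision})
        return decision == "allowed"


gate = Gate(allowed={"read", "write", "delete"})
can_read = gate.request("read", "report.txt")
can_delete = gate.request("delete", "report.txt")
can_delete_approved = gate.request("delete", "report.txt", approved_by_human=True)
can_deploy = gate.request("deploy", "prod", approved_by_human=True)
```

Human approval unlocks a high-risk action only inside the permission boundary: the deploy request is still denied because deploy was never granted, and all four decisions end up in the audit list.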

Conclusion

The paper frames the shift from Chatbot to Digital Colleague as a transition from conversational answers to persistent, governed work. This transition proceeds along two dimensions:

  • Cognitively, LLMs advance from next-token "fast thinking" (Chatbot era) to Thinking LLMs that leverage inference-time computation, Chain-of-Thought, reflection, process supervision, and reinforcement learning for more deliberate and reliable cognition.
  • Executionally, they progress from ad hoc tool-calling Agents to OpenClaw-style workspace systems with persistent workspaces, reusable skills, verification loops, and governance.

The "Workspace + Skill" paradigm provides the mechanism for this transition: workspaces supply stateful context for tasks, while skills package reusable procedures for repeatable work. Data and evaluation shift accordingly—from instruction-response pairs and static answer correctness toward state–action–observation trajectories and task-closure verification in sandboxed, auditable environments.

Despite impressive progress, the paper identifies major structural bottlenecks: long-horizon reliability, memory and context management, safety and governance, and human–AI collaboration ethics. The future vision is self-evolving AI ecosystems where models, workspaces, tools, skills, memories, evaluators, and governance mechanisms continuously convert operational experience into validated, reusable assets—enabling AI systems to move beyond reactive chatbots toward adaptive digital colleagues that accumulate experience and improve the environments they inhabit.

Key principle: Every consequential action should be capable of becoming evidence, and every useful piece of evidence should become a governed improvement—validated, versioned, auditable, reversible, and deployed under explicit permission boundaries.
