MetaClaw: A Continual Meta-Learning Framework for Deployed LLM Agents

Summary (Overview)

  • Problem: Deployed LLM agents are typically static, becoming misaligned as user task distributions evolve over time. Existing adaptation methods (memory-based, skill-based, RL-based) operate in isolation and fail to handle continuous, real-world deployment constraints like service downtime and stale reward signals.
  • Core Solution: MetaClaw, a continual meta-learning framework that unifies skill-driven fast adaptation (gradient-free, immediate) with opportunistic policy optimization (gradient-based, deferred) to enable agents to learn and evolve during normal usage.
  • Key Mechanisms:
    • Skill-Driven Fast Adaptation: An LLM "evolver" analyzes failure trajectories to synthesize new behavioral skill instructions, which are injected into the prompt immediately with zero service downtime.
    • Opportunistic Policy Optimization: Reinforcement Learning (RL) with a Process Reward Model (PRM) updates model weights via Cloud LoRA fine-tuning, triggered only during user-inactive windows by the Opportunistic Meta-Learning Scheduler (OMLS).
    • Skill Generation Versioning: Enforces a strict separation between support data (failures for skill evolution) and query data (post-adaptation trajectories for RL) to prevent stale reward contamination.
  • Main Results:
    • On the MetaClaw-Bench (934 questions, 44 simulated workdays), skill-driven adaptation improved accuracy by up to 32.2% relative. The full MetaClaw pipeline advanced Kimi-K2.5 from 21.4% to 40.6% accuracy (vs. GPT-5.2 baseline of 41.1%) and achieved an 8.25× gain in end-to-end task completion.
    • On AutoResearchClaw (23-stage autonomous research pipeline), skill injection alone improved the composite robustness score by 18.3%, demonstrating cross-domain generalization.

Introduction and Theoretical Foundation

Large Language Model (LLM) agents excel at complex tasks but face a critical deployment challenge: they are typically trained once and remain static, while real-world user needs and task distributions evolve continuously (e.g., on platforms like OpenClaw). This leads to performance degradation on tasks underrepresented during initial training.

Existing adaptation approaches have significant limitations:

  1. Memory-based methods store raw conversation trajectories but fail to distill transferable behavioral knowledge.
  2. Skill-based methods compress experience into reusable instructions but treat the skill library as a static database, disconnected from weight optimization.
  3. RL-based methods update model weights but often operate offline and ignore the data validity problem: trajectories collected under an old behavioral context provide stale rewards that contaminate gradient updates.

Key Insight: Two complementary timescales of adaptation exist and are mutually reinforcing. Fast adaptation (seconds) can distill behavioral heuristics from failures into prompt-based skills. Slow adaptation (minutes/hours) can improve the underlying policy via gradient-based optimization. A better policy yields more informative failures for skill synthesis, and richer skills produce higher-reward trajectories for policy optimization.

MetaClaw is framed as a continual meta-learning problem. The agent's behavior is defined by a meta-model:

$$M = (\theta, S)$$

where $\theta$ represents the parameters of the base LLM policy and $S = \{s_1, s_2, \dots, s_K\}$ is a library of skill instructions. Given a task $\tau$, actions are generated as:

$$a \sim \pi_\theta(\cdot \mid \tau, \mathrm{Retrieve}(S, \tau))$$

The goal is to continuously improve $M$ over a non-stationary task stream, learning to adapt better to new tasks as they arrive.
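To make the notation concrete, here is a minimal sketch of the meta-model, not the paper's implementation: the word-overlap `retrieve` and the prompt layout are illustrative assumptions standing in for whatever retrieval the real system uses.

```python
from dataclasses import dataclass, field

@dataclass
class MetaModel:
    """Meta-model M = (theta, S): base policy weights plus a skill library."""
    theta: dict                                       # stand-in for the LLM policy parameters
    skills: list[str] = field(default_factory=list)   # S = {s_1, ..., s_K}

    def retrieve(self, task: str, k: int = 2) -> list[str]:
        """Toy Retrieve(S, tau): rank skills by word overlap with the task."""
        def overlap(skill: str) -> int:
            return len(set(skill.lower().split()) & set(task.lower().split()))
        return sorted(self.skills, key=overlap, reverse=True)[:k]

    def build_prompt(self, task: str) -> str:
        """Actions are sampled from pi_theta conditioned on the task plus retrieved skills."""
        return "\n".join(["# Skills:"] + self.retrieve(task) + ["# Task:", task])
```

Because only the prompt depends on $S$, updating the skill library changes behavior on the very next task without touching $\theta$.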

Methodology

MetaClaw's architecture improves the meta-model $M$ through two coordinated loops (see Algorithm 1).

1. Skill-Driven Fast Adaptation

This is a gradient-free process that evolves the skill library $S$. When failure trajectories accumulate in the support set $D_g^{\text{sup}}$ for skill generation $g$, a skill evolver LLM $E$ synthesizes new instructions:

$$S_{g+1} = S_g \cup E(S_g, D_g^{\text{sup}})$$

The skill generation index $g$ is then incremented. New skills are instantly available via prompt injection for subsequent tasks, requiring no parameter updates and causing zero downtime. The skill library serves a dual role: a meta-parameter accumulating long-term knowledge, and an adaptation basis for instant task specialization.
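The union update above can be sketched as follows. This is an assumption-level stub: in deployment `evolver` would prompt the evolver LLM $E$ with the failure trajectories, whereas here it is any callable with the same shape.

```python
def evolve_skills(skills: list[str], support_failures: list[str], evolver) -> list[str]:
    """S_{g+1} = S_g union E(S_g, D_g^sup): gradient-free skill-library update.

    `evolver` stands in for the LLM evolver E: any callable mapping
    (current skills, failure trajectories) to new skill instructions.
    """
    merged = list(skills)
    for s in evolver(skills, support_failures):
        if s not in merged:   # set union: keep existing skills, add only novel ones
            merged.append(s)
    return merged
```

The union semantics matter: existing skills are never overwritten, so earlier adaptations survive each evolution step.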

2. Opportunistic Policy Optimization

This gradient-based process updates the policy parameters $\theta$ using RL. It is deferred and triggered by the Opportunistic Meta-Learning Scheduler (OMLS), which monitors three idle signals:

  1. Sleep Window: User-configured hours.
  2. System Inactivity: No keyboard/mouse input for a threshold period.
  3. Calendar Occupancy: Google Calendar event detection.
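The source does not specify how the three signals combine; the sketch below assumes any one signal suffices to open an optimization window (a plain OR), with all names and thresholds illustrative.

```python
import datetime

def is_idle_window(now: datetime.datetime,
                   sleep_start: datetime.time, sleep_end: datetime.time,
                   seconds_since_input: float, inactivity_threshold_s: float,
                   in_calendar_event: bool) -> bool:
    """OMLS trigger sketch: defer policy optimization until an idle signal holds.

    Handles sleep windows that wrap past midnight (e.g. 23:00-07:00).
    """
    if sleep_start > sleep_end:   # window crosses midnight
        in_sleep = now.time() >= sleep_start or now.time() < sleep_end
    else:
        in_sleep = sleep_start <= now.time() < sleep_end
    inactive = seconds_since_input >= inactivity_threshold_s
    return in_sleep or inactive or in_calendar_event
```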

Policy optimization updates $\theta$ over the query data buffer $B$ containing post-adaptation trajectories:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta\, \mathbb{E}_{(\tau,\xi,g')\sim B}\big[R(\pi_\theta(\cdot \mid \tau, S_{g'}))\big]$$

where $g' \le g^*$ is the skill generation under which the trajectory was collected, and $R$ is a Process Reward Model (PRM) score. Updates are performed via Cloud LoRA fine-tuning.
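The production system applies this update via Cloud LoRA fine-tuning; purely to illustrate the update rule itself, here is a minimal REINFORCE-style sketch in which the PRM and the policy gradient are caller-supplied stubs (all names are assumptions).

```python
import numpy as np

def policy_update(theta, batch, prm_score, grad_log_prob, lr=1e-4):
    """One REINFORCE-style step of theta <- theta + lr * grad E[R(pi_theta)].

    `prm_score` plays the role of R (the PRM); `grad_log_prob` stands in for
    grad_theta log pi_theta(trajectory). Both are illustrative stubs.
    """
    grad = np.zeros_like(theta)
    for traj in batch:
        # Score-function estimator: reward-weighted log-probability gradient.
        grad += prm_score(traj) * grad_log_prob(theta, traj)
    return theta + lr * grad / max(len(batch), 1)
```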

3. Skill Generation Versioning

Versioning is the critical mechanism that maintains the support-query separation: each collected trajectory is stamped with its skill generation index $g_i$.

  • Support Data ($D_g^{\text{sup}}$): Trajectories collected under $S_g$ that trigger skill evolution to $S_{g+1}$. Used by the skill evolver and discarded from the RL buffer.
  • Query Data ($D_{g+1}^{\text{qry}}$): Trajectories collected after $S_{g+1}$ takes effect. Only these are used for policy optimization.

When the skill generation advances from $g$ to $g+1$, all samples with version $\le g$ are flushed from the training buffer, preventing optimization against stale rewards.
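A generation-stamped buffer implementing this flush rule might look like the following sketch (class and field names are assumptions, not the paper's code):

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    task: str
    reward: float
    generation: int   # skill generation g_i stamped at collection time

class VersionedBuffer:
    """RL training buffer that enforces the support/query separation."""

    def __init__(self) -> None:
        self.current_gen = 0
        self.items: list[Trajectory] = []

    def add(self, traj: Trajectory) -> None:
        self.items.append(traj)

    def advance_generation(self) -> None:
        """Called when S_g evolves to S_{g+1}: flush every sample with version <= g."""
        self.current_gen += 1
        self.items = [t for t in self.items if t.generation >= self.current_gen]
```

Trajectories that triggered the evolution (support data) are exactly the ones flushed, so gradients are only ever computed on trajectories produced under the current behavioral context.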

Empirical Validation / Results

Experimental Setup

  • Benchmarks:
    • MetaClaw-Bench: A new continual agentic benchmark with 934 questions over 44 simulated workdays, split into Part I (execution-heavy file-check tasks) and Part II (rule-based transformation tasks).
    • AutoResearchClaw: A 23-stage autonomous research pipeline for open-ended, long-horizon evaluation.
  • Models & Conditions: Evaluated on GPT-5.2 and Kimi-K2.5 under three conditions:
    1. Baseline: Base model with no adaptation.
    2. MetaClaw (Skills): Skill-driven fast adaptation only.
    3. MetaClaw (Full): Full pipeline (skills + RL), evaluated for Kimi-K2.5 only.

Key Results

Table 1: Main results on MetaClaw-Bench

| Part | Model     | Condition         | Acc. (%) | Compl. (%) |
|------|-----------|-------------------|----------|------------|
| I    | GPT-5.2   | Baseline          | 41.1     | 14.7       |
| I    | GPT-5.2   | MetaClaw (Skills) | 44.0     | 17.1       |
| I    | Kimi-K2.5 | Baseline          | 21.4     | 2.0        |
| I    | Kimi-K2.5 | MetaClaw (Skills) | 28.3     | 2.0        |
| I    | Kimi-K2.5 | MetaClaw (Full)   | 40.6     | 16.5       |
| II   | GPT-5.2   | Baseline          | 44.9     | 58.4       |
| II   | GPT-5.2   | MetaClaw (Skills) | 49.1     | 67.5       |
| II   | Kimi-K2.5 | Baseline          | 21.1     | 18.2       |
| II   | Kimi-K2.5 | MetaClaw (Skills) | 26.9     | 33.8       |
| II   | Kimi-K2.5 | MetaClaw (Full)   | 39.6     | 51.9       |
  • Skill-driven adaptation provided consistent gains for both models (e.g., +32.2% relative accuracy for Kimi-K2.5 on Part I).
  • MetaClaw (Full) delivered the largest gains, nearly closing the performance gap between Kimi-K2.5 and the baseline GPT-5.2, with an 8.25× improvement in end-to-end task completion on Part I.
  • Skills alone improved multi-choice reasoning but not complex file-check execution; weight-level RL updates were necessary for reliable end-to-end completion.

Table 2: MetaClaw (Skills-Only) on AutoResearchClaw

| Metric                          | Baseline | + MetaClaw (Skills) | Relative Change |
|---------------------------------|----------|---------------------|-----------------|
| Stage retry rate (↓)            | 10.5%    | 7.9%                | ↓ 24.8%         |
| Refine cycle count (↓)          | 2.0      | 1.2                 | ↓ 40.0%         |
| Pipeline stage completion (↑)   | 18 / 19  | 19 / 19             | ↑ 5.3%          |
| Composite robustness score (↑)  | 0.714    | 0.845               | ↑ 18.3%         |
  • Demonstrates cross-domain generalization; skill injection improved robustness in a complex, open-ended pipeline without any gradient updates.

Analysis

  • Per-Day Trends: MetaClaw's advantage was most pronounced in mid-difficulty tasks (days 11-22), where procedural knowledge learned from failures was most applicable.
  • Task-Type Breakdown: Skills improved multi-choice accuracy, while RL was crucial for boosting file-check completion rates.
  • Skill Library Analysis: Evolved skills clustered around cross-cutting behavioral heuristics (e.g., temporal format compliance, backup-before-modify protocols), explaining their generalizability.
  • Case Studies: Illustrated the complementary roles: a single distilled skill could resolve a compliance error instantly, while complex execution reliability required weight-level RL updates.

Theoretical and Practical Implications

  • Theoretical: MetaClaw formalizes continual adaptation for LLM agents as a meta-learning problem with explicit support-query separation. It bridges discrete, prompt-based adaptation with continuous, parameter-based optimization.
  • Practical:
    • Zero-Downtime Deployment: Skill injection allows immediate improvement without interrupting service.
    • Resource Efficiency: The proxy-based architecture and opportunistic scheduling enable evolution without requiring local GPUs.
    • Compensating for Model Capability: The framework can help less-capable models (e.g., Kimi-K2.5) approach the performance of frontier models (e.g., GPT-5.2) through continuous learning.
    • General Applicability: The mechanism generalizes from structured CLI tasks to open-ended, multi-stage agentic workflows.

Conclusion

MetaClaw presents a foundational framework for creating LLM agents that genuinely learn and evolve through real-world use. By unifying fast skill adaptation and slow policy optimization within a continual meta-learning paradigm, it addresses key deployment challenges: non-stationary task distributions, service downtime, and stale reward contamination. Evaluations demonstrate consistent performance improvements and effective cross-domain generalization.

Future Directions: The current idle-window detection relies on user configuration, which may not generalize to all environments. Future work could explore more autonomous scheduling and further integrate the skill and parameter evolution loops. MetaClaw establishes a principled path toward more adaptive and resilient personal AI assistants.