MMSkills: Towards Multimodal Skills for General Visual Agents

Summary (Overview)

  • Introduces Multimodal Skill Packages: A novel representation for reusable skills in visual agents, combining textual procedures, runtime state cards, and multi-view visual evidence into state-conditioned procedural knowledge.
  • Proposes an Automated Skill Generator: An agentic pipeline that transforms public, non-evaluation interaction trajectories into multimodal skill packages through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing.
  • Develops Branch-Loaded Utilization: A runtime mechanism where a temporary branch inspects selected skill evidence, aligns it with the live environment, and returns structured guidance to the main agent, avoiding context pollution and visual over-anchoring.
  • Demonstrates Consistent Performance Gains: Experiments across GUI (OSWorld, macOSWorld) and game (VAB-Minecraft, Super Mario Bros) benchmarks show MMSkills improve both frontier and smaller multimodal agents over no-skill and text-only skill baselines.
  • Shifts Agent Behavior: MMSkills reduce low-level action load, suppress repetitive trajectories, and improve completion awareness, moving agents from exploratory trial-and-error towards grounded, state-aware execution.

Introduction and Theoretical Foundation

Reusable skills are a core abstraction for building capable agents, often stored as textual prompts, executable code, or learned routines. However, for visual agents, procedural knowledge is inherently multimodal. Success depends not only on what operation to perform, but also on recognizing the relevant visual state, interpreting visual evidence of progress or failure, and deciding what to do next. Text-only skills become verbose yet underspecified, while raw demonstrations are lengthy and instance-specific.

This gap motivates the need for multimodal procedural knowledge: reusable guidance that binds action procedures to the visual evidence and state-dependent decisions required for applying them. The paper formalizes three central challenges:

  1. Representation: What should a multimodal skill package contain?
  2. Generation: How can such packages be derived from public, non-evaluation interaction data?
  3. Utilization: How can an agent consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots?

MMSkills is proposed as a framework to address these challenges, with the core thesis that external multimodal procedural knowledge complements model-internal priors, especially for weaker visual agents.

Methodology

1. Multimodal Skill Package Representation

Each MMSkill is a state-conditioned procedure package represented as:

$$M = (D, P, S, K)$$

where:

  • $D$: Compact descriptor.
  • $P$: Reusable textual procedure.
  • $S = \{S_j\}_{j=1}^m$: Set of runtime state cards.
  • $K = \{K_j\}_{j=1}^m$: Set of keyframe bundles aligned with the state cards.

A runtime state card $S_j$ is an agent-facing node that links a point in the procedure to decision rules:

$$S_j = (\mathrm{when\_to\_use}_j, \mathrm{when\_not\_to\_use}_j, \mathrm{visible\_cues}_j, \mathrm{verification\_cue}_j, V_j)$$

where $V_j = \mathrm{available\_views}_j$.

Each key state is grounded by a multi-view keyframe bundle. Let $V = \{\mathrm{full\_frame}, \mathrm{focus\_crop}, \mathrm{before}, \mathrm{after}\}$. Then:

$$K_j = \{ K^v_j : v \in V_j,\ v \in V \}$$

The full_frame preserves global context, focus_crop localizes the visual cue, and optional before/after views expose useful transitions.
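The package structure above can be sketched as a pair of Python dataclasses. This is a minimal illustration, not the authors' implementation: the class names, the representation of the procedure as a list of steps, and the use of image paths for keyframes are all assumptions made here. Only the field names of the state card and the four view types come from the text.

```python
from dataclasses import dataclass

# View types named in the text: V = {full_frame, focus_crop, before, after}.
VIEW_TYPES = frozenset({"full_frame", "focus_crop", "before", "after"})


@dataclass
class StateCard:
    """Runtime state card S_j; field names follow the tuple in the text."""
    when_to_use: str
    when_not_to_use: str
    visible_cues: list
    verification_cue: str
    available_views: frozenset  # V_j, a subset of VIEW_TYPES

    def __post_init__(self):
        self.available_views = frozenset(self.available_views)
        unknown = self.available_views - VIEW_TYPES
        if unknown:
            raise ValueError(f"unknown view types: {sorted(unknown)}")


@dataclass
class SkillPackage:
    """M = (D, P, S, K), with S and K aligned by index j."""
    descriptor: str    # D
    procedure: list    # P, as ordered textual steps (assumed format)
    state_cards: list  # S, a list of StateCard
    keyframes: list    # K, one dict per card: view type -> image path (assumed)

    def __post_init__(self):
        if len(self.state_cards) != len(self.keyframes):
            raise ValueError("S and K must have the same length m")
        for card, bundle in zip(self.state_cards, self.keyframes):
            # K_j = {K_j^v : v in V_j}: one keyframe per available view.
            if set(bundle) != set(card.available_views):
                raise ValueError("keyframe bundle must match available_views")


# Hypothetical example values, for illustration only.
card = StateCard(
    when_to_use="Export dialog is open",
    when_not_to_use="No document is loaded",
    visible_cues=["Export button is highlighted"],
    verification_cue="Exported file appears in the target folder",
    available_views={"full_frame", "focus_crop"},
)
skill = SkillPackage(
    descriptor="Export a document as PDF",
    procedure=["Open the File menu", "Choose Export", "Confirm"],
    state_cards=[card],
    keyframes=[{"full_frame": "k0_full.png", "focus_crop": "k0_crop.png"}],
)
```

Validating the alignment between each card's available views and its keyframe bundle at construction time keeps the constraint $v \in V_j$ explicit rather than implicit.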

2. Skill Generator from Public Trajectories

An automated, meta-skill-guided pipeline $G_F$ transforms a public trajectory pool $T_d$ for domain $d$ into a skill library $M_d$:

$$G_F: T_d \mapsto M_d$$

The pipeline comprises five phases:

$$T_d \xrightarrow[\text{Phase 0}]{\text{embed+cluster}} C_d \xrightarrow[\text{Phase 1}]{\text{cluster plan}} A_d \xrightarrow[\text{Phase 2}]{\text{merge}} R_d \xrightarrow[\text{Phase 3}]{\text{text draft}} \tilde{M}_d \xrightarrow[\text{Phase 4}]{\text{image ground+audit}} M_d$$

For a merged skill $r \in R_d$, finalization is:

$$\tilde{M}_r = (D_r, P_r, \tilde{S}_r, \tilde{K}_r) \xrightarrow{\text{ground+audit}} M_r = (D_r, P_r, S_r, K_r)$$

The generator uses a conservative visual grounding policy, adding views only for state recognition, transition comparison, or completion verification.
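The five-phase flow can be illustrated as a chain of functions, each consuming the previous phase's artifact. The sketch below is a toy under strong assumptions: the actual pipeline is agentic and guided by meta-skills, whereas every phase here is a deterministic stand-in, and the dictionary keys (`workflow`, `steps`, `views`, `purpose`) and purpose labels are hypothetical names chosen for illustration. The only element taken directly from the text is the conservative policy of keeping views solely for state recognition, transition comparison, or completion verification.

```python
from collections import defaultdict

# The three purposes for which the text says views may be added.
ALLOWED_PURPOSES = {"state_recognition", "transition_comparison",
                    "completion_verification"}


def phase0_cluster(trajectories):
    """T_d -> C_d. Stand-in: group by a 'workflow' label, not embeddings."""
    clusters = defaultdict(list)
    for traj in trajectories:
        clusters[traj["workflow"]].append(traj)
    return dict(clusters)


def phase1_plan(clusters):
    """C_d -> A_d. Stand-in: plan one candidate skill per cluster."""
    return [{"name": name, "sources": trajs} for name, trajs in clusters.items()]


def phase2_merge(plans):
    """A_d -> R_d. Stand-in: merge plans whose names match ignoring case."""
    merged = {}
    for plan in plans:
        key = plan["name"].lower()
        entry = merged.setdefault(key, {"name": key, "sources": []})
        entry["sources"].extend(plan["sources"])
    return list(merged.values())


def phase3_draft(merged):
    """R_d -> draft library. Stand-in: reuse the shortest source's steps."""
    drafts = []
    for skill in merged:
        shortest = min(skill["sources"], key=lambda t: len(t["steps"]))
        candidates = [v for t in skill["sources"] for v in t.get("views", [])]
        drafts.append({"descriptor": skill["name"],
                       "procedure": shortest["steps"],
                       "candidate_views": candidates})
    return drafts


def phase4_ground_and_audit(drafts):
    """Draft library -> M_d. Keep only views that serve an allowed purpose."""
    return [{"descriptor": d["descriptor"],
             "procedure": d["procedure"],
             "views": [v for v in d["candidate_views"]
                       if v["purpose"] in ALLOWED_PURPOSES]}
            for d in drafts]


def generate_skills(trajectories):
    """G_F as the composition of Phases 0 through 4."""
    artifact = trajectories
    for phase in (phase0_cluster, phase1_plan, phase2_merge,
                  phase3_draft, phase4_ground_and_audit):
        artifact = phase(artifact)
    return artifact


# Hypothetical trajectory pool.
pool = [
    {"workflow": "Export PDF", "steps": ["open menu", "export", "confirm"],
     "views": [{"type": "focus_crop", "purpose": "state_recognition"},
               {"type": "full_frame", "purpose": "decoration"}]},
    {"workflow": "export pdf", "steps": ["press shortcut", "confirm"],
     "views": []},
]
library = generate_skills(pool)
```

In this toy run the two near-duplicate workflows merge into one skill, and the view tagged with a purpose outside the allowed set is dropped during grounding.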

3. Branch-Loaded Multimodal Skills Agent

To avoid the pitfalls of direct skill loading (context pressure, visual over-anchoring), MMSkills uses a branch-loading mechanism. The main agent can either act directly or consult a temporary skill branch:

$$\text{direct}: A_t = \pi_{\text{main}}(O_t, H_t, C_I), \quad \text{branch}: G_t = \text{Branch}(O_t, H_t, M_t), \quad A_t = \pi_{\text{main}}(O_t, H_t, C_I, G_t)$$

The branch output is a structured guidance tuple:

$$G_t = (\text{applicable}_t, \text{subgoal}_t, \text{plan}_t, \mathrm{do\_not\_do}_t, \text{verify}_t)$$

The branch operates in two stages:

  1. Gated View Selection: Selects relevant state cards and view types based on the live observation: $(J_t, R_t) = \text{SelectViews}(O_t, H_{t-1}, P_t, S_t)$, with $V_t = \{ K^v_j : j \in J_t, v \in R_{t,j} \}$.
  2. Branch Planning: Aligns selected evidence with the live state and returns $G_t$: $G_t = \text{PlanBranch}(O_t, H_{t-1}, P_t, \{S_j : j \in J_t\}, V_t)$.

The main agent uses $G_t$ as decision support but chooses the final grounded action from the live screenshot.
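The control flow of branch loading can be sketched as follows. This is a toy under stated assumptions, not the paper's implementation: observations are plain strings and cue matching is done by substring, whereas a real branch would query a multimodal model on the live screenshot. The function names `agent_step` and `toy_policy`, the dictionary layout of the skill, and the preference for `focus_crop` are hypothetical. The `context` argument stands in for $C_I$, which the text does not define, and is passed through unchanged.

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    """G_t = (applicable, subgoal, plan, do_not_do, verify)."""
    applicable: bool
    subgoal: str
    plan: list
    do_not_do: list
    verify: str


def select_views(observation, history, skill):
    """Stage 1 stand-in: return {j: [view types]} for matching state cards."""
    selected = {}
    for j, card in enumerate(skill["state_cards"]):
        if any(cue in observation for cue in card["visible_cues"]):
            views = card["available_views"]
            # Assumed preference: focus_crop if present, else the first view.
            selected[j] = ["focus_crop"] if "focus_crop" in views else views[:1]
    return selected


def plan_branch(observation, history, skill, selected, evidence):
    """Stage 2 stand-in: build G_t from the first selected card.

    The toy ignores `evidence`; a real branch would compare those keyframes
    with the live screenshot before committing to a plan.
    """
    if not selected:
        return Guidance(False, "", [], [], "")
    card = skill["state_cards"][min(selected)]
    return Guidance(True, card["when_to_use"], list(skill["procedure"]),
                    [card["when_not_to_use"]], card["verification_cue"])


def agent_step(observation, history, context, skill, main_policy, consult):
    """Direct mode calls the policy alone; branch mode adds only G_t."""
    guidance = None
    if consult:
        selected = select_views(observation, history, skill)
        # V_t = {K_j^v : j in J_t, v in R_{t,j}}; it never leaves the branch.
        evidence = [skill["keyframes"][j][v]
                    for j, views in selected.items() for v in views]
        guidance = plan_branch(observation, history, skill, selected, evidence)
    return main_policy(observation, history, context, guidance)


def toy_policy(observation, history, context, guidance):
    """Hypothetical main policy: follow applicable guidance, else explore."""
    if guidance is not None and guidance.applicable:
        return f"follow: {guidance.plan[0]}"
    return "explore"


# Hypothetical skill in dictionary form.
skill = {
    "procedure": ["click Export", "confirm"],
    "state_cards": [{
        "when_to_use": "Export dialog is open",
        "when_not_to_use": "No document is loaded",
        "visible_cues": ["Export dialog"],
        "verification_cue": "File appears in the target folder",
        "available_views": ["full_frame", "focus_crop"],
    }],
    "keyframes": [{"full_frame": "k0_full.png", "focus_crop": "k0_crop.png"}],
}

branch_action = agent_step("Export dialog visible", [], "task", skill,
                           toy_policy, consult=True)
direct_action = agent_step("Export dialog visible", [], "task", skill,
                           toy_policy, consult=False)
```

The design point the sketch tries to capture is isolation: the keyframes are read only inside the branch, so the main policy receives the compact guidance tuple rather than the reference images.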

Empirical Validation / Results

Experiments were conducted across GUI (OSWorld, macOSWorld) and game (VAB-Minecraft, Super Mario Bros) benchmarks, evaluating frontier and smaller multimodal models under no-skill, text-only skill, and MMSkills conditions.

RQ1: Overall Performance on GUI and Game Tasks

OSWorld Results (Table 1): MMSkills improved overall success rates across all model families.

  • Gemini 3.1 Pro: 44.08% → 50.11%
  • Gemini 3 Flash: 36.65% → 47.97%
  • Qwen3-VL-235B: 21.34% → 39.17%
  • Qwen3-VL-8B-Instruct: 10.78% → 25.40%
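To make the size of these improvements easier to compare, the short script below derives absolute gains (in percentage points) and relative gains from the four OSWorld pairs listed above. It uses only the numbers reported in this section; the variable names are illustrative.

```python
# (before, after) overall OSWorld success rates in percent, as listed above.
osworld = {
    "Gemini 3.1 Pro": (44.08, 50.11),
    "Gemini 3 Flash": (36.65, 47.97),
    "Qwen3-VL-235B": (21.34, 39.17),
    "Qwen3-VL-8B-Instruct": (10.78, 25.40),
}

gains = {}
for model, (before, after) in osworld.items():
    absolute = after - before            # percentage points
    relative = 100.0 * absolute / before  # percent of the starting rate
    gains[model] = (absolute, relative)
    print(f"{model}: +{absolute:.2f} pp, +{relative:.1f}% relative")

# Prints:
# Gemini 3.1 Pro: +6.03 pp, +13.7% relative
# Gemini 3 Flash: +11.32 pp, +30.9% relative
# Qwen3-VL-235B: +17.83 pp, +83.6% relative
# Qwen3-VL-8B-Instruct: +14.62 pp, +135.6% relative
```

The largest absolute gain is for Qwen3-VL-235B, while the largest relative gain is for Qwen3-VL-8B-Instruct, whose success rate more than doubles.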

Auxiliary Benchmark Results (Table 2): Gains transferred beyond OSWorld.

  • macOSWorld: MMSkills improved results on the completed large-model runs (e.g., Gemini 3 Flash: 55.94% → 65.73%).
  • VAB-Minecraft: Consistent gains in success rate and average score across all models.
  • Super Mario Bros: Higher total performance and reward under MMSkills.

Key Finding: External multimodal procedural knowledge is especially valuable for weaker visual agents, compensating for limited model-internal priors.

RQ2: Ablations of Skill Content and Branch Loading

Skill Package Components (Figure 3A): The complete MMSkills package (with state cards and images) performed best. Removing state cards weakened state discrimination; removing images preserved decision rules but removed visual grounding evidence.

Branch Loading Mechanism (Figure 3B):

  • Direct-full loading hurt performance due to context pollution.
  • Branch loading provided clear gains, even for text-only skills.
  • The full two-stage design (branch loading + view selection) performed best.

RQ3: Skill Usage and Interaction Dynamics (Table 3)

  • Invocation: MMSkills were invoked more often than text-only skills (e.g., Qwen3-VL-235B on OSWorld: 37.50% → 65.28%), suggesting multimodal evidence makes external knowledge easier to recognize as relevant.
  • Trajectory Length: MMSkills reduced average steps per task, while text-only skills sometimes added overhead. This indicates multimodal skills help agents find shorter, more efficient paths.
  • Selected Views: focus_crop views dominated selected visual evidence, with full_frame, before, and after views providing supplementary context when needed.

RQ4: Behavioral Shift Analysis (Figures 4 & 6)

  • Lower Action Load: MMSkills reduced the mean number of low-level primitives (clicks, keyboard actions) per task.
  • Reduced Repetition: Exact repeated actions and the longest same-mode run both decreased substantially (e.g., Qwen3-VL-235B exact repeats: 21.8% → 6.2%).
  • Improved Completion Awareness: DONE actions increased, indicating state cards and verification cues help agents decide when a task is complete.
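The repetition measures above could be computed along the following lines. This summary does not give the paper's exact definitions, so the sketch assumes two plausible ones: an exact repeat is an action identical to the one immediately before it, and a same-mode run is a maximal stretch of consecutive actions of the same type. The `(mode, argument)` action encoding is also an assumption.

```python
from itertools import groupby


def exact_repeat_rate(actions):
    """Fraction of transitions where an action equals its predecessor."""
    if len(actions) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(actions, actions[1:]) if prev == cur)
    return repeats / (len(actions) - 1)


def longest_same_mode_run(actions):
    """Length of the longest run of consecutive actions sharing a mode."""
    runs = (sum(1 for _ in group)
            for _, group in groupby(actions, key=lambda action: action[0]))
    return max(runs, default=0)


# Hypothetical trajectory: each action is (mode, argument).
trajectory = [
    ("click", (10, 20)),
    ("click", (10, 20)),   # exact repeat of the previous action
    ("click", (30, 40)),   # same mode, different target
    ("type", "hello"),
    ("done", None),
]

repeat_rate = exact_repeat_rate(trajectory)      # 1 repeat in 4 transitions
longest_run = longest_same_mode_run(trajectory)  # three consecutive clicks
```

Under these assumed definitions the toy trajectory has a repeat rate of 0.25 and a longest same-mode run of 3.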

Key Finding: MMSkills reshape agent behavior from exploratory trial-and-error towards grounded, state-aware execution.

Theoretical and Practical Implications

  • Theoretical: Formalizes multimodal procedural knowledge as a crucial component for visual agent competence, bridging the gap between abstract text procedures and instance-specific visual demonstrations.
  • Representation: Introduces a structured, state-conditioned skill package that explicitly binds procedures to the decision rules in state cards (when_to_use, verification_cue) and to visual evidence (keyframe bundles).
  • Generation: Demonstrates that usable multimodal skills can be automatically mined from public interaction data, not requiring hand-crafted examples or evaluation-task trajectories.
  • Utilization: Branch loading addresses practical LLM limitations (long-context reliability, visual over-anchoring) by isolating evidence inspection from action generation.
  • Practical: Provides a scalable framework to enhance visual agents across diverse domains (GUI, games). The gains for smaller models suggest MMSkills can democratize access to capable visual automation.

Conclusion

MMSkills presents a comprehensive framework for representing, generating, and utilizing reusable multimodal skills for visual agents. By coupling textual procedures with runtime state cards and multi-view visual evidence, and employing a branch-loaded consultation mechanism, MMSkills consistently improve agent performance across benchmarks and model families. The work demonstrates that external multimodal procedural knowledge effectively complements model-internal priors, leading to more efficient, reliable, and state-aware visual agents.

Limitations and Future Work: The approach depends on source-trajectory coverage, may inherit errors from the generation pipeline, and incurs extra inference cost from branch loading. Future directions include extending MMSkills to broader embodied or safety-critical settings, which will require stronger verification and online skill repair mechanisms.