ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control
Summary (Overview)
- Core Contribution: Proposes ExoActor, a novel framework that leverages large-scale third-person (exocentric) video generation models as a unified interface for modeling and executing interaction-rich humanoid behaviors. It bypasses the need for task-specific robot data collection.
- Key Insight: High-level task instructions and scene context are used to generate plausible "imagined demonstration" videos. These videos implicitly encode coordinated interactions between robot, environment, and objects.
- End-to-End Pipeline: The framework consists of three stages: 1) Video Generation (with robot-to-human embodiment transfer), 2) Motion Estimation (extracting 3D human kinematics and hand poses), and 3) Motion Execution (using a general motion tracking controller).
- Demonstrated Feasibility: The implemented system shows the approach can generate and execute diverse tasks (navigation, coarse interaction, fine manipulation) on a real Unitree G1 humanoid robot in new scenarios without additional real-world training.
Introduction and Theoretical Foundation
Enabling fluent, interaction-rich behaviors for humanoid robots operating in unstructured environments remains a fundamental challenge. Current systems struggle to jointly model spatial context, temporal dynamics, robot actions, and task intent at scale, often failing to generalize beyond controlled training settings.
ExoActor addresses this by bridging generative video modeling and humanoid control. The theoretical foundation is based on the observation that large-scale video generation models (trained on vast human-centric data) exhibit strong generalization capabilities for visual dynamics. ExoActor repurposes these models to synthesize third-person "action plans" that serve as high-level, interaction-aware behavioral priors.
The framework decouples high-level interaction modeling (handled by the video model) from low-level control (handled by a motion tracker). This allows the system to leverage the implicit knowledge and generalization power of pretrained video models while remaining compatible with established control frameworks, eliminating the need for expensive task-specific robot data collection.
Methodology
The ExoActor pipeline is a three-stage process that converts a task instruction and an initial observation into executable robot behaviors; a structural sketch follows the stage list below.
2.1 Overall Design
- Video Generation: Synthesize task-consistent third-person videos depicting the execution process.
- Motion Estimation: Convert generated videos into structured 3D human motion representations.
- Motion Execution: Translate estimated motions into dynamically feasible robot control commands.
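The three stages compose sequentially. Below is a minimal Python sketch of that composition; the function and type names (run_pipeline, MotionReference, and the stage stubs) are illustrative placeholders, not the authors' actual API.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class MotionReference:
    """Interaction-aware motion: per-frame body pose plus discrete hand states."""
    body_poses: Any   # e.g., SMPL joint rotations and global positions per frame
    hand_states: Any  # e.g., open / half-open / closed per frame, per hand

def generate_exocentric_video(instruction: str, observation: Any) -> Any:
    """Stage 1: embodiment transfer plus text/image-conditioned video generation."""
    raise NotImplementedError("call the image editor and video model here")

def estimate_motion(video: Any) -> MotionReference:
    """Stage 2: whole-body (GENMO-style) and hand (WiLoR-style) estimation."""
    raise NotImplementedError("run the monocular motion and hand estimators here")

def execute(motion: MotionReference) -> None:
    """Stage 3: feed the reference, window by window, to the tracking policy."""
    raise NotImplementedError("invoke the motion-tracking controller here")

def run_pipeline(instruction: str, observation: Any) -> None:
    video = generate_exocentric_video(instruction, observation)
    motion = estimate_motion(video)
    execute(motion)
```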
2.2 Third-Person Video-Action Generation
This stage faces an embodiment mismatch: video models are trained on human data, not robot data. The solution is a two-stage process:
- Robot-to-Human Embodiment Transfer: The initial robot observation is transformed into a human-like representation using a prompt-based image editor (e.g., Gemini 3.1 Pro). This step strictly preserves the original scene layout, camera viewpoint, body pose, orientation, and scale (see Fig. 3 & Appendix prompt in Fig. 12).
- Task-to-Action Decomposition & Generation:
- Decomposition: A high-level instruction is decomposed into a temporally ordered action chain using GPT-5.4 (prompt in Fig. 11). Example: "Pick up the box" → approach → bend down → grasp → lift → stand.
- Prompt Construction: The action chain is grounded in the initial observation to create a scene-aware description (prompt in Fig. 13); a minimal sketch of these two steps follows this list.
- Video Generation: Using a structured prompt template (see Appendix Fig. 14 & 15), an off-the-shelf model (primarily Kling) generates the final video. The template enforces fixed camera views, scene consistency, and natural motion.
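To make the decomposition and prompt-construction steps concrete, the sketch below hardcodes the action chain from the example above in place of an LLM call; the function names, prompt wording, and scene description are hypothetical and do not reproduce the templates in Figs. 11 and 13.

```python
from typing import List

def decompose_task(instruction: str) -> List[str]:
    """Return a temporally ordered action chain for the instruction (LLM stub)."""
    # An LLM would normally produce this; the example mirrors the one in the text.
    if instruction == "Pick up the box":
        return ["approach", "bend down", "grasp", "lift", "stand"]
    raise NotImplementedError("query the decomposition LLM here")

def build_video_prompt(action_chain: List[str], scene_description: str) -> str:
    """Ground the action chain in the observed scene for the video model."""
    steps = ", then ".join(action_chain)
    return (
        f"A person in the following scene: {scene_description}. "
        f"They {steps}. Fixed third-person camera, consistent scene layout, "
        f"natural human motion."
    )

prompt = build_video_prompt(
    decompose_task("Pick up the box"),
    "a cardboard box on the floor in front of the person",
)
print(prompt)
```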
2.3 Interaction-Aware Motion Estimation
The goal is to recover executable kinematics from the generated pixel-level videos.
- Whole-body Motion Estimation: Uses GENMO, a diffusion-based model, to estimate 3D human motion from monocular video. The output is a sequence of SMPL parameters $\{(\theta_t, p_t)\}_{t=1}^{T}$, where $\theta_t$ denotes the joint rotations and $p_t$ the global position at frame $t$.
- Hand Motion Estimation: Uses WiLoR frame-by-frame to estimate bilateral hand poses and discrete interaction states $s_t \in \{\text{open}, \text{half-open}, \text{closed}\}$ for each hand.
- The final interaction-aware motion representation combines both streams: $\{(\theta_t, p_t, s_t^{L}, s_t^{R})\}_{t=1}^{T}$ (a data-structure sketch follows this list).
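The following is a minimal data-structure sketch of this interaction-aware representation; the array shapes, field names, and integer encoding of the hand states are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from enum import IntEnum
import numpy as np

class HandState(IntEnum):
    OPEN = 0
    HALF_OPEN = 1
    CLOSED = 2

@dataclass
class InteractionAwareMotion:
    joint_rotations: np.ndarray   # (T, J, 3): per-frame SMPL joint rotations
    global_position: np.ndarray   # (T, 3): per-frame root translation
    left_hand_state: np.ndarray   # (T,): HandState values for the left hand
    right_hand_state: np.ndarray  # (T,): HandState values for the right hand

    def __len__(self) -> int:
        return self.joint_rotations.shape[0]
```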
2.4 General Motion Tracking Deployment
The estimated motions lack dynamics awareness, so a motion tracking controller, SONIC, is used to "physics-filter" the kinematic references. The policy takes the current robot state and a window of upcoming reference motions and produces stable, dynamically feasible controls.
For the hands, the estimated discrete states are mapped to the robot's 7-DoF joint targets (compatible with the Dex3-1 hands). The system omits explicit retargeting, prioritizing spatial accuracy over smoothness.
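For concreteness, here is a hedged sketch of mapping the discrete hand states to per-hand joint targets; the normalized target values and the simple lookup are illustrative assumptions, not the robot's actual calibration.

```python
import numpy as np

NUM_HAND_DOF = 7  # per-hand DoF assumed for a Dex3-1-style hand

# Assumed joint-target presets per discrete state, normalized to [0, 1] of joint range.
HAND_STATE_TARGETS = {
    0: np.zeros(NUM_HAND_DOF),       # open
    1: np.full(NUM_HAND_DOF, 0.5),   # half-open
    2: np.ones(NUM_HAND_DOF),        # closed
}

def hand_targets(left_state: int, right_state: int) -> np.ndarray:
    """Concatenate left and right joint targets for the current frame."""
    return np.concatenate(
        [HAND_STATE_TARGETS[left_state], HAND_STATE_TARGETS[right_state]]
    )
```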
Empirical Validation / Results
3.1 Task Categorization
Tasks are categorized by difficulty: B (Easy) - basic navigation; A (Moderate) - navigation plus coarse interaction; S (Challenging) - fine-grained, multi-step manipulation.
3.2 Case Studies
- Success Cases: The system successfully executed tasks across all levels (Figs. 6, 7, 8), including approaching objects, sitting, lifting boxes, and placing bottles into baskets. For S-tier tasks, small supporting bases were sometimes needed to compensate for hand height inaccuracies.
- Failure Modes:
- Video Generation: Hallucination of objects/scale (Fig. 9), inconsistent actions, implausible physics.
- Motion Estimation: Inaccuracies from occlusion (Fig. 10), unreliable rear viewpoints, incorrect wrist orientation (e.g., vertical grasp estimated as horizontal).
- Execution: Mismatches in hand height or movement distance due to physical constraints.
3.3 Ablation Studies
- Video Models: Kling 3 was chosen over Veo 3.1 and Wan 2.6 due to superior coherence and physical plausibility.
- Retargeting: Omitting explicit retargeting (GMR, OmniRetarget) preserved spatial alignment better for navigation and manipulation, despite some added noise.
- Camera Viewpoint: Back-to-front views work better for navigation; front-facing views work better for manipulation.
- Motion Estimation: GENMO was chosen over CRISP because it offers comparable quality with higher efficiency and stability.
3.4 Latency Analysis
The pipeline is offline. Key module runtimes are summarized below:
| Module | Measured per | Avg. Time (s) |
|---|---|---|
| Robot-to-Human Embodiment Transfer | request | 10.7 |
| Task-to-Action Decomposition & Prompt Construction | request | 2.5 |
| Task- & Environment-Generalizable Video Generation | second of video | 13.2 |
| Whole-body Motion Estimation | second of video | 2.9 |
| Hand Motion Estimation | second of video | 16.4 |
Table 1: Average runtime of different components in the ExoActor pipeline. Video generation and hand motion estimation are the primary bottlenecks.
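Under these averages, end-to-end latency scales roughly linearly with clip length. The small calculation below, assuming the per-request and per-video-second costs simply add, estimates about 338 s for a 10-second clip.

```python
PER_REQUEST = 10.7 + 2.5              # embodiment transfer + decomposition/prompting
PER_VIDEO_SECOND = 13.2 + 2.9 + 16.4  # video generation + body + hand estimation

def pipeline_latency(video_seconds: float) -> float:
    """Approximate offline pipeline latency for a clip of the given length."""
    return PER_REQUEST + PER_VIDEO_SECOND * video_seconds

print(pipeline_latency(10.0))  # ~338.2 s for a 10-second clip
```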
Theoretical and Practical Implications
- Scalable Behavior Synthesis: ExoActor demonstrates a new paradigm for generating humanoid behaviors by leveraging the generalization of internet-scale video models, reducing reliance on robot-specific data.
- Modular and Extensible Design: The decoupled pipeline allows independent improvement of video generation, motion estimation, or control components.
- Highlighting Critical Challenges: The work exposes systemic limitations that must be addressed for robust real-world application:
- Physical Realism in Generation: Video models prioritize visual plausibility over physical executability, creating a fundamental bottleneck.
- Video-to-Motion Translation: Noisy motion estimation, especially for fine-grained wrist poses, propagates errors to execution.
- Open-Loop Execution: The current offline, open-loop pipeline lacks adaptability to dynamic environments or execution errors.
Conclusion
ExoActor presents an end-to-end framework that uses exocentric video generation as an interface for modeling and executing interaction-rich humanoid control. It shows the feasibility of translating "imagined" video demonstrations into real-world robot behaviors without task-specific data.
Future directions critical for advancement include:
- Closed-loop, scene-aware control integrating the motion reference with online perception.
- Physically realistic video generation with stronger causality and kinematic priors.
- Streaming, real-time imagination and execution for adaptability.
- First-person to third-person generation to relax the need for external cameras.
- Improved video-to-motion translation, especially for wrist poses.
- Robot-centric video generation models that natively understand robotic embodiments.
- Standardized benchmarks for evaluating the full video-driven control pipeline.
ExoActor provides a promising step towards scalable humanoid systems that use generative models to imagine, structure, and execute complex physical interactions.