WildActor: Unconstrained Identity-Preserving Video Generation - Summary
Overview
- Proposes Actor-18M, a large-scale human video dataset of 1.6M videos with 18M corresponding identity-consistent reference images, designed to overcome viewpoint bias and enable learning of view-invariant human representations.
- Introduces WildActor, a framework for any-view conditioned human video generation, featuring an Asymmetric Identity-Preserving Attention (AIPA) mechanism to prevent identity leakage/pose-locking and an Identity-Aware 3D RoPE (I-RoPE) for token separation.
- Develops a Viewpoint-Adaptive Monte Carlo Sampling strategy that dynamically re-weights reference images during training to encourage complementary viewpoint coverage and balanced manifold learning.
- Establishes Actor-Bench, a comprehensive evaluation benchmark, and demonstrates that WildActor outperforms existing methods in maintaining full-body identity consistency under challenging viewpoint changes, motions, and long-form narratives.
Introduction and Theoretical Foundation
Production-ready human video generation requires digital actors to maintain strictly consistent identities across shots, viewpoints, and motions—a principle known as "physical permanence" in cinematography. Current diffusion-based video generation models often suffer from identity drift (changing facial/body features) or pose-locking/copy-paste artifacts (subjects appearing rigid). Prior methods are limited by being face-centric (causing "floating head" hallucinations) or relying on naive full-image injection. A critical bottleneck is the lack of large-scale datasets capturing humans under diverse, unconstrained viewpoints and environments. This paper addresses these challenges by curating a novel dataset and proposing a generation framework that decouples identity information from backbone representations to achieve robust, view-invariant human synthesis.
Methodology
1. The Actor-18M Dataset
A large-scale dataset constructed to provide dense, identity-consistent supervision.
- Collection & Filtering: 1.6M single-person videos are collected and filtered using facial similarity (Deng et al., 2019) and dense point tracking to ensure subject consistency.
- Construction Pipeline: Comprises three subsets:
- Actor-18M-A (View-Aug): Synthesizes view-transformed face/body images from six angles per subject using a multi-angle image editing model to mitigate frontal-view bias and "pose-locking".
- Actor-18M-B (Attr-Aug): Applies attribute-conditioned image editing (environments, lighting, expressions, motions) to diversify backgrounds and styles while preserving identity.
- Actor-18M-C (3-View): Provides canonical three-view (front, side, back) identity anchors generated from high-visibility frames.
- Statistics: Key statistics showing the mitigation of frontal bias through augmentation:
Table 1: Detailed statistics of Actor-18M (abridged; F/S/B denote front/side/back viewpoint shares).
| Subset | Region | Source | Quantity | Viewpoint Distribution (%) |
|---|---|---|---|---|
| A | Body | Self-Crop | 1.64M | F:62.8 / S:36.6 / B:0.6 |
|  | Body | View-Aug | 8.73M | F:26.2 / S:71.6 / B:2.2 |
| Total | Body | Generated | 8.98M | F:27.3 / S:70.5 / B:2.2 |
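The collection-stage identity filter mentioned above (facial similarity following Deng et al., 2019) can be sketched roughly as below. This is a minimal numpy illustration, not the paper's pipeline: the embedding extractor, the anchor-frame comparison scheme, and the similarity threshold are all assumptions.

```python
import numpy as np

def is_identity_consistent(face_embs, sim_thresh=0.5):
    """Sketch of a facial-similarity filter for data curation.

    face_embs  : (T, D) per-frame face embeddings (e.g., ArcFace-style)
    sim_thresh : minimum cosine similarity to the anchor frame (assumed value)
    Keeps a video only if every frame's face matches the first frame's face.
    """
    # L2-normalize so dot products become cosine similarities
    embs = face_embs / np.linalg.norm(face_embs, axis=1, keepdims=True)
    sims = embs @ embs[0]  # similarity of each frame vs. anchor frame 0
    return bool(np.all(sims >= sim_thresh))
```

In the actual pipeline this check is combined with dense point tracking to reject videos where the tracked subject changes mid-clip.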
2. The WildActor Framework
A framework for any-view conditioned human video generation, built on a latent video DiT trained with Rectified Flow (RF). The RF objective is:
$$\mathcal{L}_{\mathrm{RF}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\!\left[\, w(t)\,\big\| v_\theta(x_t, t, c) - (\epsilon - x_0) \big\|_2^2 \,\right],$$
where $x_t = (1 - t)\,x_0 + t\,\epsilon$, $x_0$ is the latent video, $\epsilon \sim \mathcal{N}(0, I)$ is noise, $t \in [0, 1]$, $w(t)$ is a weighting function, and $c$ aggregates the text prompt and reference images.
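As a concrete reading of the objective, here is a minimal numpy sketch of the Rectified Flow loss under the linear interpolation path; the conditioning $c$ is omitted and all shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def rf_loss(v_theta, x0, eps, t, w):
    """Monte Carlo estimate of the Rectified Flow objective.

    v_theta : model predicting the velocity target (eps - x0)
    x0      : clean video latents, shape (B, ...)
    eps     : Gaussian noise, same shape as x0
    t       : timesteps in [0, 1], shape (B,)
    w       : per-timestep weighting function w(t)
    """
    t_b = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over latent dims
    x_t = (1.0 - t_b) * x0 + t_b * eps           # linear interpolation path
    target = eps - x0                            # RF velocity target
    err = v_theta(x_t, t) - target
    per_sample = (err ** 2).reshape(len(t), -1).mean(axis=1)
    return float((w(t) * per_sample).mean())

# Sanity check: an oracle that outputs the exact velocity drives the loss to 0.
rng = np.random.default_rng(0)
x0, eps = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
t = rng.uniform(size=2)
oracle = lambda x_t, t: eps - x0
assert rf_loss(oracle, x0, eps, t, w=lambda t: np.ones_like(t)) == 0.0
```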
Core Components:
- Asymmetric Identity-Preserving Attention (AIPA): Enforces a unidirectional information flow.
- Reference-only LoRA: Lightweight LoRA modules are applied exclusively to reference tokens. For reference tokens $z_r$, the projections are $Q_r = (W_Q + B_Q A_Q)\,z_r$, $K_r = (W_K + B_K A_K)\,z_r$, $V_r = (W_V + B_V A_V)\,z_r$, where $A_{\ast}, B_{\ast}$ are learnable low-rank LoRA parameters. Video tokens use the frozen backbone weights.
- Asymmetric Attention Flow: Video tokens query a unified identity representation aggregated from reference tokens, but reference tokens do not attend to noisy video latents. Keys and values are concatenated across both token sets: $K = [K_v; K_r]$, $V = [V_v; V_r]$, while queries are formed from video tokens only.
- Identity-Aware 3D RoPE (I-RoPE): Assigns distinct spatio-temporal coordinates to separate reference tokens from video tokens, preventing positional ambiguity. Reference tokens receive fixed temporal offsets outside the video's time range and spatially shifted coordinates beyond the video frame grid, so the two token sets occupy disjoint positional regions.
- Viewpoint-Adaptive Monte Carlo Sampling: A training strategy that dynamically re-weights reference image sampling probabilities. After sampling a reference $r_i$, the weights of candidates within its angular neighborhood $\mathcal{N}(r_i)$ are decayed: $w_j \leftarrow \lambda\, w_j$ for all $j \in \mathcal{N}(r_i)$, where $\lambda \in (0, 1)$ is a decay factor. This encourages the model to observe complementary viewpoints.
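To make the asymmetric flow concrete, here is a single-head numpy sketch of AIPA with reference-only LoRA: queries come only from video tokens, keys/values concatenate both token sets, and the LoRA update touches only the reference branch. Head splitting, I-RoPE, and all shapes and scales are simplifications, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aipa_attention(z_vid, z_ref, W_q, W_k, W_v, A_k, B_k, A_v, B_v):
    """Single-head sketch of Asymmetric Identity-Preserving Attention.

    z_vid : (N_v, d) noisy video tokens  -> frozen backbone projections
    z_ref : (N_r, d) reference tokens    -> backbone + reference-only LoRA
    A_*   : (r, d) and B_* : (d, r) low-rank LoRA factors (reference branch only)
    """
    Q = z_vid @ W_q                            # queries from video tokens only
    K_vid, V_vid = z_vid @ W_k, z_vid @ W_v    # frozen weights for video tokens
    K_ref = z_ref @ (W_k + B_k @ A_k)          # LoRA-adapted reference keys
    V_ref = z_ref @ (W_v + B_v @ A_v)          # LoRA-adapted reference values
    K = np.concatenate([K_vid, K_ref], axis=0)
    V = np.concatenate([V_vid, V_ref], axis=0)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V                            # only video tokens are updated
```

Because reference tokens never appear on the query side, no gradient path lets noisy video latents rewrite the identity representation, which is the mechanism the paper credits for avoiding identity leakage.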
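The viewpoint-adaptive sampling loop can be sketched as follows; the angular neighborhood radius and decay value are illustrative assumptions, not numbers from the paper:

```python
import numpy as np

def sample_references(angles, n_refs, neighborhood=30.0, decay=0.5, seed=0):
    """Sketch of Viewpoint-Adaptive Monte Carlo Sampling.

    angles       : viewpoint (yaw) angle in degrees per candidate reference
    n_refs       : number of references drawn for one training sample
    neighborhood : angular radius defining "similar" viewpoints (assumed value)
    decay        : multiplicative weight decay for neighbors (assumed value)
    """
    rng = np.random.default_rng(seed)
    angles = np.asarray(angles, dtype=float)
    weights = np.ones(len(angles))
    chosen = []
    for _ in range(n_refs):
        p = weights / weights.sum()
        i = rng.choice(len(angles), p=p)
        chosen.append(i)
        # circular angular distance, then decay weights of near-duplicates
        diff = np.abs((angles - angles[i] + 180.0) % 360.0 - 180.0)
        weights[diff <= neighborhood] *= decay
    return chosen
```

Decaying the neighbors of each pick steers later draws toward complementary viewpoints, which is how the strategy balances coverage of the viewpoint manifold during training.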
Empirical Validation / Results
Evaluation Setup: Actor-Bench
- Settings: Evaluates 75 subjects across three conditioning settings: canonical three-view, arbitrary viewpoint, and in-the-wild.
- Axes: (1) Sequential Narrative: Coherent 3-prompt storylines. (2) Contextual Generalization: Single prompts with diverse environments/viewpoints/motions.
- Metrics: Body Consistency (VLM-based), Face Identity Preservation (ArcFace cosine similarity), Semantic Alignment (ViCLIP & VLM-based).
Quantitative Results
Table 2: Quantitative comparisons on Actor-Bench.
| Method | Params | Face Identity ↑ | Body Consistency ↑ | Semantic Alignment (VLM) ↑ |
|---|---|---|---|---|
| *Sequential Narrative* | | | | |
| T2V → I2V (w/o Ref) | 5B | 0.320 | 0.450 | 0.613 |
| WildActor (w/ Ref) | 5B | 0.548 | 0.925 | 0.893 |
| *Contextual Generalization* | | | | |
| VACE | 14B | 0.485 | 0.582 | 0.667 |
| Stand-In | 14B | 0.510 | 0.416 | 0.600 |
| Vidu Q2* | – | 0.565 | 0.905 | 0.880 |
| Kling 1.6* | – | 0.558 | 0.885 | 0.867 |
| WildActor | 5B | 0.559 | 0.952 | 0.920 |
WildActor achieves state-of-the-art body consistency and semantic alignment, outperforming larger open-source and commercial models, especially in challenging viewpoint scenarios.
Ablation Studies
Table 3: Ablation of dataset & sampling strategy (Body Consistency).
| Setting | Front ↑ | Side ↑ | Back ↑ | Average ↑ |
|---|---|---|---|---|
| Raw-Crop | 0.885 | 0.725 | 0.680 | 0.802 |
| Random Sampling | 0.915 | 0.840 | 0.785 | 0.865 |
| Viewpoint-Adaptive | 0.958 | 0.952 | 0.937 | 0.952 |
The proposed dataset and adaptive sampling strategy are crucial for achieving robustness across all viewpoints.
Table 4: Ablation of model components.
| Setting | AIPA | I-RoPE | Face ID ↑ | Body Cons. ↑ | Sem. Align (VLM) ↑ |
|---|---|---|---|---|---|
| Full-Attn | ✗ | ✓ | 0.515 | 0.890 | 0.610 |
| w/ AIPA only | ✓ | ✗ | 0.542 | 0.825 | 0.895 |
| WildActor | ✓ | ✓ | 0.559 | 0.952 | 0.920 |
Both AIPA (prevents semantic conflict) and I-RoPE (ensures structural coherence) are essential for optimal performance.
Theoretical and Practical Implications
- Theoretical: Introduces a principled approach to decoupling static identity information from dynamic scene generation in diffusion models via asymmetric attention flows and token-space separation (I-RoPE). The viewpoint-adaptive sampling strategy provides a method for encouraging balanced coverage of data manifolds.
- Practical: Enables the generation of long-form, human-centric narratives with consistent digital actors, which is a critical step towards production-ready video synthesis for applications in film, animation, gaming, and virtual reality. The release of the Actor-18M dataset and Actor-Bench benchmark facilitates further research in identity-preserving generation.
Conclusion
This work addresses the core challenge of identity-consistent human video generation under unconstrained viewpoints and motions. The contributions are threefold: 1) the Actor-18M dataset mitigates viewpoint bias and provides rich supervision; 2) the WildActor framework, with its AIPA mechanism and I-RoPE, robustly preserves full-body identity without sacrificing motion quality or prompt adherence; 3) the Actor-Bench evaluation shows superior performance over existing methods. Future directions may include extending the framework to multi-subject interactions and achieving finer-grained control over facial expressions and gestures.