WildActor: Unconstrained Identity-Preserving Video Generation - Summary

Summary (Overview)

  • Proposes Actor-18M, a large-scale human video dataset of 1.6M videos with 18M corresponding identity-consistent reference images, designed to overcome viewpoint bias and enable learning of view-invariant human representations.
  • Introduces WildActor, a framework for any-view conditioned human video generation, featuring an Asymmetric Identity-Preserving Attention (AIPA) mechanism to prevent identity leakage/pose-locking and an Identity-Aware 3D RoPE (I-RoPE) for token separation.
  • Develops a Viewpoint-Adaptive Monte Carlo Sampling strategy that dynamically re-weights reference images during training to encourage complementary viewpoint coverage and balanced manifold learning.
  • Establishes Actor-Bench, a comprehensive evaluation benchmark, and demonstrates that WildActor outperforms existing methods in maintaining full-body identity consistency under challenging viewpoint changes, motions, and long-form narratives.

Introduction and Theoretical Foundation

Production-ready human video generation requires digital actors to maintain strictly consistent identities across shots, viewpoints, and motions—a principle known as "physical permanence" in cinematography. Current diffusion-based video generation models often suffer from identity drift (changing facial/body features) or pose-locking/copy-paste artifacts (subjects appearing rigid). Prior methods are limited by being face-centric (causing "floating head" hallucinations) or relying on naive full-image injection. A critical bottleneck is the lack of large-scale datasets capturing humans under diverse, unconstrained viewpoints and environments. This paper addresses these challenges by curating a novel dataset and proposing a generation framework that decouples identity information from backbone representations to achieve robust, view-invariant human synthesis.

Methodology

1. The Actor-18M Dataset

A large-scale dataset constructed to provide dense, identity-consistent supervision.

  • Collection & Filtering: 1.6M single-person videos are collected and filtered using facial similarity (Deng et al., 2019) and dense point tracking to ensure subject consistency.
  • Construction Pipeline: Comprises three subsets:
    • Actor-18M-A (View-Aug): Synthesizes view-transformed face/body images from six angles per subject using a multi-angle image editing model to mitigate frontal-view bias and "pose-locking".
    • Actor-18M-B (Attr-Aug): Applies attribute-conditioned image editing (environments, lighting, expressions, motions) to diversify backgrounds and styles while preserving identity.
    • Actor-18M-C (3-View): Provides canonical three-view (front, side, back) identity anchors generated from high-visibility frames.
  • Statistics: Key statistics showing the mitigation of frontal bias through augmentation:

Table 1: Detailed statistics of Actor-18M (Abridged).

Subset   Region   Source      Quantity   Viewpoint Distribution (%)
A        Body     Self-Crop   1.64M      F: 62.8 / S: 36.6 / B: 0.6
         Body     View-Aug    8.73M      F: 26.2 / S: 71.6 / B: 2.2
Total    Body     Generated   8.98M      F: 27.3 / S: 70.5 / B: 2.2
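The facial-similarity filtering used during collection (ArcFace-style embeddings) could be sketched as follows; the function name, the mean-anchor heuristic, and the threshold value are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def is_identity_consistent(face_embs, threshold=0.5):
    """Keep a clip only if every frame's face embedding stays close to
    the clip's mean identity direction (cosine-similarity filter).
    The mean-anchor heuristic and threshold are illustrative only."""
    embs = np.asarray(face_embs, dtype=float)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # unit vectors
    anchor = embs.mean(axis=0)                                  # mean identity
    anchor = anchor / np.linalg.norm(anchor)
    return bool((embs @ anchor).min() >= threshold)             # worst frame decides
```

A clip whose frames all embed near one identity passes; a clip containing a divergent face (e.g. a subject swap) fails the minimum-similarity check.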

2. The WildActor Framework

A framework for any-view conditioned human video generation, built on a latent video DiT trained with Rectified Flow (RF). The RF objective is:

L_{\text{RF}} := \mathbb{E}_{t, z_0, \epsilon} \left[ w(t) \, \| v_\theta(z_t, t, C_{\text{ctx}}) - (\epsilon - z_0) \|_2^2 \right]

where $z_t := (1 - t) z_0 + t \epsilon$, $z_0$ is the latent video, $\epsilon$ is noise, $t \in [0, 1]$, $w(t)$ is a weighting function, and $C_{\text{ctx}} = \{C_{\text{txt}}, I^f, I^b\}$ aggregates the text prompt and reference images.
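A single Rectified Flow training step under these definitions could look like the following sketch (numpy stand-in; `rf_loss` and the toy velocity model are hypothetical, and the conditioning $C_{\text{ctx}}$ is omitted for brevity):

```python
import numpy as np

def rf_loss(velocity_model, z0, w, rng):
    """One Rectified Flow step: draw noise and a timestep, form the
    linear interpolant z_t, and regress the predicted velocity onto
    the target (eps - z0). Conditioning C_ctx is omitted here."""
    eps = rng.standard_normal(z0.shape)   # Gaussian noise
    t = rng.uniform(0.0, 1.0)             # timestep t in [0, 1]
    z_t = (1.0 - t) * z0 + t * eps        # interpolation path
    target = eps - z0                     # RF velocity target
    v = velocity_model(z_t, t)            # predicted velocity
    return w(t) * float(np.mean((v - target) ** 2))
```

In training, `velocity_model` would be the latent video DiT $v_\theta$ and the loss would be averaged over a batch of latents.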

Core Components:

  • Asymmetric Identity-Preserving Attention (AIPA): Enforces a unidirectional information flow.
    1. Reference-only LoRA: Lightweight LoRA modules are applied exclusively to reference tokens. For reference tokens $c \in \{f_{\text{face}}, f_{\text{body}}\}$, the projections are $q_c, k_c, v_c = (W_{Q,K,V} + \Delta W^{\text{ref}}_{Q,K,V})\, c$, where $\Delta W^{\text{ref}}_{Q,K,V}$ are learnable LoRA parameters. Video tokens use the frozen backbone weights.
    2. Asymmetric Attention Flow: Video tokens query a unified identity representation $C_{\text{ref}}$ aggregated from the reference tokens, but reference tokens do not attend to the noisy video latents. Keys and values are concatenated: $K = [z_t; C_{\text{ref}}]$, $V = [z_t; C_{\text{ref}}]$.
  • Identity-Aware 3D RoPE (I-RoPE): Assigns distinct spatio-temporal coordinates to separate reference tokens from video tokens, preventing ambiguity. Reference tokens are assigned fixed temporal offsets ($T + \Delta_f$, $T + \Delta_b$) and shifted spatial coordinates starting from $(H_{\text{max}}, W_{\text{max}})$.
  • Viewpoint-Adaptive Monte Carlo Sampling: A training strategy that dynamically re-weights reference image sampling probabilities. After sampling a reference $x^*$, the weights of candidates within its angular neighborhood $|\theta_{x^*} - \theta_{x_j}| < \delta$ are decayed as $w_j \leftarrow w_j \cdot \gamma$, where $\gamma < 1$ is a decay factor. This encourages the model to observe complementary viewpoints.
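The asymmetric attention flow with reference-only LoRA can be sketched for a single attention head as follows (all names, shapes, and the LoRA-delta representation are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aipa(z_t, c_ref, W, dW_ref, d):
    """Asymmetric attention sketch (single head). Video tokens use the
    frozen weights W; reference tokens use W + dW_ref (a LoRA-style
    delta). Video queries see [video; reference] keys and values, while
    reference tokens never attend to the noisy video latents."""
    Wq, Wk, Wv = W
    dWq, dWk, dWv = dW_ref
    # Video projections: frozen backbone weights only
    qv, kv, vv = z_t @ Wq, z_t @ Wk, z_t @ Wv
    # Reference projections: backbone + reference-only LoRA delta
    qr = c_ref @ (Wq + dWq)
    kr = c_ref @ (Wk + dWk)
    vr = c_ref @ (Wv + dWv)
    # Video stream: K = [z_t; C_ref], V = [z_t; C_ref]
    K = np.concatenate([kv, kr], axis=0)
    V = np.concatenate([vv, vr], axis=0)
    video_out = softmax(qv @ K.T / np.sqrt(d)) @ V
    # Reference stream: self-attention only (unidirectional flow)
    ref_out = softmax(qr @ kr.T / np.sqrt(d)) @ vr
    return video_out, ref_out
```

Because the reference stream never sees `z_t`, its output is identical no matter what video latents are passed in, which is exactly the leakage-prevention property the asymmetry is meant to provide.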

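The viewpoint-adaptive sampling loop can be sketched as follows; the `delta` and `gamma` values are illustrative, and the angular distance is simplified (no wrap-around at 360°):

```python
import numpy as np

def sample_references(thetas, n_refs, delta=30.0, gamma=0.5, rng=None):
    """Draw reference indices one at a time; after each draw, decay the
    weights of all candidates within `delta` degrees of the chosen
    viewpoint by `gamma`, steering later draws toward complementary
    views. delta/gamma values here are illustrative only."""
    if rng is None:
        rng = np.random.default_rng()
    thetas = np.asarray(thetas, dtype=float)
    weights = np.ones(len(thetas))
    chosen = []
    for _ in range(n_refs):
        probs = weights / weights.sum()
        i = int(rng.choice(len(thetas), p=probs))
        chosen.append(i)
        near = np.abs(thetas - thetas[i]) < delta  # angular neighborhood
        weights[near] *= gamma                     # w_j <- w_j * gamma
    return chosen
```

After a frontal reference is drawn, all near-frontal candidates are down-weighted, so subsequent draws are biased toward side and back views.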
Empirical Validation / Results

Evaluation Setup: Actor-Bench

  • Settings: Evaluates 75 subjects across three conditioning settings: canonical three-view, arbitrary viewpoint, and in-the-wild.
  • Axes: (1) Sequential Narrative: Coherent 3-prompt storylines. (2) Contextual Generalization: Single prompts with diverse environments/viewpoints/motions.
  • Metrics: Body Consistency (VLM-based), Face Identity Preservation (ArcFace cosine similarity), Semantic Alignment (ViCLIP & VLM-based).

Quantitative Results

Table 2: Quantitative comparisons on Actor-Bench.

Method                 Params   Face Identity ↑   Body Consistency ↑   Semantic Alignment (VLM) ↑
Sequential Narrative
T2V → I2V (w/o Ref)    5B       0.320             0.450                0.613
WildActor (w/ Ref)     5B       0.548             0.925                0.893
Contextual Generalization
VACE                   14B      0.485             0.582                0.667
Stand-In               14B      0.510             0.416                0.600
Vidu Q2*                        0.565             0.905                0.880
Kling 1.6*                      0.558             0.885                0.867
WildActor              5B       0.559             0.952                0.920

WildActor achieves state-of-the-art body consistency and semantic alignment, outperforming larger open-source and commercial models, especially in challenging viewpoint scenarios.

Ablation Studies

Table 3: Ablation of dataset & sampling strategy (Body Consistency).

Setting              Front ↑   Side ↑   Back ↑   Average ↑
Raw-Crop             0.885     0.725    0.680    0.802
Random Sampling      0.915     0.840    0.785    0.865
Viewpoint-Adaptive   0.958     0.952    0.937    0.952

The proposed dataset and adaptive sampling strategy are crucial for achieving robustness across all viewpoints.

Table 4: Ablation of model components.

Setting        AIPA   I-RoPE   Face ID ↑   Body Cons. ↑   Sem. Align (VLM) ↑
Full-Attn      ✗      ✗        0.515       0.890          0.610
w/ AIPA only   ✓      ✗        0.542       0.825          0.895
WildActor      ✓      ✓        0.559       0.952          0.920

Both AIPA (prevents semantic conflict) and I-RoPE (ensures structural coherence) are essential for optimal performance.

Theoretical and Practical Implications

  • Theoretical: Introduces a principled approach to decoupling static identity information from dynamic scene generation in diffusion models via asymmetric attention flows and token-space separation (I-RoPE). The viewpoint-adaptive sampling strategy provides a method for encouraging balanced coverage of data manifolds.
  • Practical: Enables the generation of long-form, human-centric narratives with consistent digital actors, which is a critical step towards production-ready video synthesis for applications in film, animation, gaming, and virtual reality. The release of the Actor-18M dataset and Actor-Bench benchmark facilitates further research in identity-preserving generation.

Conclusion

This work addresses the core challenge of identity-consistent human video generation under unconstrained viewpoints and motions. The contributions are threefold: 1) the Actor-18M dataset mitigates viewpoint bias and provides rich supervision; 2) the WildActor framework, with its AIPA mechanism and I-RoPE, robustly preserves full-body identity without sacrificing motion quality or prompt adherence; 3) the Actor-Bench evaluation shows superior performance over existing methods. Future directions may include extending the framework to multi-subject interactions and achieving finer-grained control over facial expressions and gestures.