WildActor: Unconstrained Identity-Preserving Video Generation - Summary

Summary (Overview)

  • Proposes Actor-18M, a large-scale human video dataset of 1.6M videos with 18M corresponding identity-consistent reference images, designed to overcome viewpoint bias and enable learning of view-invariant human representations.
  • Introduces WildActor, a framework for any-view conditioned human video generation, featuring an Asymmetric Identity-Preserving Attention (AIPA) mechanism to prevent identity leakage/pose-locking and an Identity-Aware 3D RoPE (I-RoPE) for token separation.
  • Develops a Viewpoint-Adaptive Monte Carlo Sampling strategy that dynamically re-weights reference images during training to encourage complementary viewpoint coverage and balanced manifold learning.
  • Establishes Actor-Bench, a comprehensive evaluation benchmark, and demonstrates that WildActor outperforms existing methods in maintaining full-body identity consistency under challenging viewpoint changes, motions, and long-form narratives.

Introduction and Theoretical Foundation

Production-ready human video generation requires digital actors to maintain strictly consistent identities across shots, viewpoints, and motions—a principle known as "physical permanence" in cinematography. Current diffusion-based video generation models often suffer from identity drift (changing facial/body features) or pose-locking/copy-paste artifacts (subjects appearing rigid). Prior methods are limited by being face-centric (causing "floating head" hallucinations) or relying on naive full-image injection. A critical bottleneck is the lack of large-scale datasets capturing humans under diverse, unconstrained viewpoints and environments. This paper addresses these challenges by curating a novel dataset and proposing a generation framework that decouples identity information from backbone representations to achieve robust, view-invariant human synthesis.

Methodology

1. The Actor-18M Dataset

A large-scale dataset constructed to provide dense, identity-consistent supervision.

  • Collection & Filtering: 1.6M single-person videos are collected and filtered using facial similarity (Deng et al., 2019) and dense point tracking to ensure subject consistency.
  • Construction Pipeline: Comprises three subsets:
    • Actor-18M-A (View-Aug): Synthesizes view-transformed face/body images from six angles per subject using a multi-angle image editing model to mitigate frontal-view bias and "pose-locking".
    • Actor-18M-B (Attr-Aug): Applies attribute-conditioned image editing (environments, lighting, expressions, motions) to diversify backgrounds and styles while preserving identity.
    • Actor-18M-C (3-View): Provides canonical three-view (front, side, back) identity anchors generated from high-visibility frames.
  • Statistics: Key statistics showing the mitigation of frontal bias through augmentation:

Table 1: Detailed statistics of Actor-18M (Abridged).

Subset   Region   Source      Quantity   Viewpoint Distribution (%)
A        Body     Self-Crop   1.64M      F: 62.8 / S: 36.6 / B: 0.6
         Body     View-Aug    8.73M      F: 26.2 / S: 71.6 / B: 2.2
Total    Body     Generated   8.98M      F: 27.3 / S: 70.5 / B: 2.2
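The facial-similarity filtering used during collection (ArcFace-style embeddings) could be sketched as follows; the function name, the mean-anchor heuristic, and the threshold value are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def is_identity_consistent(face_embs, threshold=0.5):
    """Keep a clip only if every frame's face embedding stays close to
    the clip's mean identity direction (cosine-similarity filter).
    The mean-anchor heuristic and threshold are illustrative only."""
    embs = np.asarray(face_embs, dtype=float)
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # unit vectors
    anchor = embs.mean(axis=0)                                  # mean identity
    anchor = anchor / np.linalg.norm(anchor)
    return bool((embs @ anchor).min() >= threshold)             # worst frame decides
```

A clip whose frames all embed near one identity passes; a clip containing a divergent face (e.g. a subject swap) fails the minimum-similarity check.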

2. The WildActor Framework

A framework for any-view conditioned human video generation, built on a latent video DiT trained with Rectified Flow (RF). The RF objective is:

L_{\text{RF}} := \mathbb{E}_{t, z_0, \epsilon} \left[ w(t) \, \| v_\theta(z_t, t, C_{\text{ctx}}) - (\epsilon - z_0) \|_2^2 \right]

where $z_t := (1 - t) z_0 + t \epsilon$, $z_0$ is the latent video, $\epsilon$ is noise, $t \in [0, 1]$, $w(t)$ is a weighting function, and $C_{\text{ctx}} = \{C_{\text{txt}}, I^f, I^b\}$ aggregates the text prompt and reference images.
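A single Rectified Flow training step under these definitions could look like the following sketch (numpy stand-in; `rf_loss` and the toy velocity model are hypothetical, and the conditioning $C_{\text{ctx}}$ is omitted for brevity):

```python
import numpy as np

def rf_loss(velocity_model, z0, w, rng):
    """One Rectified Flow step: draw noise and a timestep, form the
    linear interpolant z_t, and regress the predicted velocity onto
    the target (eps - z0). Conditioning C_ctx is omitted here."""
    eps = rng.standard_normal(z0.shape)   # Gaussian noise
    t = rng.uniform(0.0, 1.0)             # timestep t in [0, 1]
    z_t = (1.0 - t) * z0 + t * eps        # interpolation path
    target = eps - z0                     # RF velocity target
    v = velocity_model(z_t, t)            # predicted velocity
    return w(t) * float(np.mean((v - target) ** 2))
```

In training, `velocity_model` would be the latent video DiT $v_\theta$ and the loss would be averaged over a batch of latents.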

Core Components:

  • Asymmetric Identity-Preserving Attention (AIPA): Enforces a unidirectional information flow.
    1. Reference-only LoRA: Lightweight LoRA modules are applied exclusively to reference tokens. For reference tokens $c \in \{f_{\text{face}}, f_{\text{body}}\}$, the projections are $q_c, k_c, v_c = (W_{Q,K,V} + \Delta W^{\text{ref}}_{Q,K,V})\, c$, where $\Delta W^{\text{ref}}_{Q,K,V}$ are learnable LoRA parameters. Video tokens use the frozen backbone weights.
    2. Asymmetric Attention Flow: Video tokens query a unified identity representation $C_{\text{ref}}$ aggregated from the reference tokens, but reference tokens do not attend to the noisy video latents. Keys and values are concatenated: $K = [z_t; C_{\text{ref}}]$, $V = [z_t; C_{\text{ref}}]$.
  • Identity-Aware 3D RoPE (I-RoPE): Assigns distinct spatio-temporal coordinates to separate reference tokens from video tokens, preventing ambiguity. Reference tokens are assigned fixed temporal offsets ($T + \Delta_f$, $T + \Delta_b$) and shifted spatial coordinates starting from $(H_{\text{max}}, W_{\text{max}})$.
  • Viewpoint-Adaptive Monte Carlo Sampling: A training strategy that dynamically re-weights reference image sampling probabilities. After sampling a reference $x^*$, the weights of candidates within its angular neighborhood $|\theta_{x^*} - \theta_{x_j}| < \delta$ are decayed as $w_j \leftarrow w_j \cdot \gamma$, where $\gamma < 1$ is a decay factor. This encourages the model to observe complementary viewpoints.
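The asymmetric attention flow with reference-only LoRA can be sketched for a single attention head as follows (all names, shapes, and the LoRA-delta representation are illustrative, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aipa(z_t, c_ref, W, dW_ref, d):
    """Asymmetric attention sketch (single head). Video tokens use the
    frozen weights W; reference tokens use W + dW_ref (a LoRA-style
    delta). Video queries see [video; reference] keys and values, while
    reference tokens never attend to the noisy video latents."""
    Wq, Wk, Wv = W
    dWq, dWk, dWv = dW_ref
    # Video projections: frozen backbone weights only
    qv, kv, vv = z_t @ Wq, z_t @ Wk, z_t @ Wv
    # Reference projections: backbone + reference-only LoRA delta
    qr = c_ref @ (Wq + dWq)
    kr = c_ref @ (Wk + dWk)
    vr = c_ref @ (Wv + dWv)
    # Video stream: K = [z_t; C_ref], V = [z_t; C_ref]
    K = np.concatenate([kv, kr], axis=0)
    V = np.concatenate([vv, vr], axis=0)
    video_out = softmax(qv @ K.T / np.sqrt(d)) @ V
    # Reference stream: self-attention only (unidirectional flow)
    ref_out = softmax(qr @ kr.T / np.sqrt(d)) @ vr
    return video_out, ref_out
```

Because the reference stream never sees `z_t`, its output is identical no matter what video latents are passed in, which is exactly the leakage-prevention property the asymmetry is meant to provide.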

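The viewpoint-adaptive sampling loop can be sketched as follows; the `delta` and `gamma` values are illustrative, and the angular distance is simplified (no wrap-around at 360°):

```python
import numpy as np

def sample_references(thetas, n_refs, delta=30.0, gamma=0.5, rng=None):
    """Draw reference indices one at a time; after each draw, decay the
    weights of all candidates within `delta` degrees of the chosen
    viewpoint by `gamma`, steering later draws toward complementary
    views. delta/gamma values here are illustrative only."""
    if rng is None:
        rng = np.random.default_rng()
    thetas = np.asarray(thetas, dtype=float)
    weights = np.ones(len(thetas))
    chosen = []
    for _ in range(n_refs):
        probs = weights / weights.sum()
        i = int(rng.choice(len(thetas), p=probs))
        chosen.append(i)
        near = np.abs(thetas - thetas[i]) < delta  # angular neighborhood
        weights[near] *= gamma                     # w_j <- w_j * gamma
    return chosen
```

After a frontal reference is drawn, all near-frontal candidates are down-weighted, so subsequent draws are biased toward side and back views.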
Empirical Validation / Results

Evaluation Setup: Actor-Bench

  • Settings: Evaluates 75 subjects across three conditioning settings: canonical three-view, arbitrary viewpoint, and in-the-wild.
  • Axes: (1) Sequential Narrative: Coherent 3-prompt storylines. (2) Contextual Generalization: Single prompts with diverse environments/viewpoints/motions.
  • Metrics: Body Consistency (VLM-based), Face Identity Preservation (ArcFace cosine similarity), Semantic Alignment (ViCLIP & VLM-based).

Quantitative Results

Table 2: Quantitative comparisons on Actor-Bench.

Method                 Params   Face Identity ↑   Body Consistency ↑   Semantic Alignment (VLM) ↑
Sequential Narrative
T2V → I2V (w/o Ref)    5B       0.320             0.450                0.613
WildActor (w/ Ref)     5B       0.548             0.925                0.893
Contextual Generalization
VACE                   14B      0.485             0.582                0.667
Stand-In               14B      0.510             0.416                0.600
Vidu Q2*                        0.565             0.905                0.880
Kling 1.6*                      0.558             0.885                0.867
WildActor              5B       0.559             0.952                0.920

WildActor achieves state-of-the-art body consistency and semantic alignment, outperforming larger open-source and commercial models, especially in challenging viewpoint scenarios.

Ablation Studies

Table 3: Ablation of dataset & sampling strategy (Body Consistency).

Setting              Front ↑   Side ↑   Back ↑   Average ↑
Raw-Crop             0.885     0.725    0.680    0.802
Random Sampling      0.915     0.840    0.785    0.865
Viewpoint-Adaptive   0.958     0.952    0.937    0.952

The proposed dataset and adaptive sampling strategy are crucial for achieving robustness across all viewpoints.

Table 4: Ablation of model components.

Setting        AIPA   I-RoPE   Face ID ↑   Body Cons. ↑   Sem. Align (VLM) ↑
Full-Attn      ✗      ✗        0.515       0.890          0.610
w/ AIPA only   ✓      ✗        0.542       0.825          0.895
WildActor      ✓      ✓        0.559       0.952          0.920

Both AIPA (prevents semantic conflict) and I-RoPE (ensures structural coherence) are essential for optimal performance.

Theoretical and Practical Implications

  • Theoretical: Introduces a principled approach to decoupling static identity information from dynamic scene generation in diffusion models via asymmetric attention flows and token-space separation (I-RoPE). The viewpoint-adaptive sampling strategy provides a method for encouraging balanced coverage of data manifolds.
  • Practical: Enables the generation of long-form, human-centric narratives with consistent digital actors, which is a critical step towards production-ready video synthesis for applications in film, animation, gaming, and virtual reality. The release of the Actor-18M dataset and Actor-Bench benchmark facilitates further research in identity-preserving generation.

Conclusion

This work addresses the core challenge of identity-consistent human video generation under unconstrained viewpoints and motions. The contributions are threefold: 1) the Actor-18M dataset mitigates viewpoint bias and provides rich supervision; 2) the WildActor framework, with its AIPA mechanism and I-RoPE, robustly preserves full-body identity without sacrificing motion quality or prompt adherence; 3) the Actor-Bench evaluation shows superior performance over existing methods. Future directions may include extending the framework to multi-subject interactions and achieving finer-grained control over facial expressions and gestures.