# WildActor: Unconstrained Identity-Preserving Video Generation

> WildActor introduces a framework for identity-preserving human video generation using a novel dataset and asymmetric attention to prevent identity drift across viewpoints.

- **Source:** [arXiv](https://arxiv.org/abs/2603.00586)
- **Published:** 2026-03-10
- **Permalink:** https://picx.dev/p/88HQe9
- **Whiteboard:** https://picx.dev/p/88HQe9/image

## Summary

## Summary (Overview)
*   **Proposes Actor-18M**, a large-scale human video dataset of 1.6M videos with 18M corresponding identity-consistent reference images, designed to overcome viewpoint bias and enable learning of view-invariant human representations.
*   **Introduces WildActor**, a framework for any-view conditioned human video generation, featuring an **Asymmetric Identity-Preserving Attention (AIPA)** mechanism to prevent identity leakage/pose-locking and an **Identity-Aware 3D RoPE (I-RoPE)** for token separation.
*   **Develops a Viewpoint-Adaptive Monte Carlo Sampling** strategy that dynamically re-weights reference images during training to encourage complementary viewpoint coverage and balanced manifold learning.
*   **Establishes Actor-Bench**, a comprehensive evaluation benchmark, and demonstrates that WildActor outperforms existing methods in maintaining full-body identity consistency under challenging viewpoint changes, motions, and long-form narratives.

## Introduction and Theoretical Foundation
Production-ready human video generation requires digital actors to keep a strictly consistent identity across shots, viewpoints, and motions, a principle known as "physical permanence" in cinematography. Current diffusion-based video generation models often suffer from **identity drift** (facial or body features change over time) or **pose-locking/copy-paste artifacts** (subjects appear rigid, frozen in the pose of the reference image). Prior methods are either face-centric, which causes "floating head" hallucinations, or rely on naive full-image injection. A critical bottleneck is the lack of large-scale datasets that capture humans under diverse, unconstrained viewpoints and environments. This paper addresses these challenges by curating a new dataset and proposing a generation framework that decouples identity information from backbone representations to achieve robust, view-invariant human synthesis.

## Methodology

### 1. The Actor-18M Dataset
A large-scale dataset constructed to provide dense, identity-consistent supervision.
*   **Collection & Filtering**: 1.6M single-person videos are collected and filtered using facial similarity (Deng et al., 2019) and dense point tracking to ensure subject consistency.
*   **Construction Pipeline**: Comprises three subsets:
    *   **Actor-18M-A (View-Aug)**: Synthesizes view-transformed face/body images from six angles per subject using a multi-angle image editing model to mitigate frontal-view bias and "pose-locking".
    *   **Actor-18M-B (Attr-Aug)**: Applies attribute-conditioned image editing (environments, lighting, expressions, motions) to diversify backgrounds and styles while preserving identity.
    *   **Actor-18M-C (3-View)**: Provides canonical three-view (front, side, back) identity anchors generated from high-visibility frames.
*   **Statistics**: Table 1 shows how view augmentation shifts the body-crop viewpoint distribution away from the frontal bias of the raw crops (F = front, S = side, B = back):

**Table 1: Detailed statistics of Actor-18M (Abridged).**
| Subset | Region | Source | Quantity | Viewpoint Distribution (%) |
| :--- | :--- | :--- | :--- | :--- |
| **A** | Body | Self-Crop | 1.64M | F:62.8 / S:36.6 / B:0.6 |
| | Body | View-Aug | 8.73M | F:26.2 / S:71.6 / B:2.2 |
| **Total** | Body | Generated | 8.98M | F:27.3 / **S:70.5** / B:2.2 |

### 2. The WildActor Framework
A framework for any-view conditioned human video generation, built on a latent video DiT trained with Rectified Flow (RF). The RF objective is:

$$
L_{\text{RF}} := \mathbb{E}_{t, z_0, \epsilon} \left[ w(t) \| v_\theta(z_t, t, C_{\text{ctx}}) - (\epsilon - z_0) \|_2^2 \right]
$$

where $z_t := (1 - t) z_0 + t \epsilon$, $z_0$ is the latent video, $\epsilon$ is noise, $t \in [0,1]$, $w(t)$ is a weighting function, and $C_{\text{ctx}} = \{C_{\text{txt}}, I^f, I^b\}$ aggregates the text prompt and the face and body reference images.
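
To make the objective concrete, the following is a minimal NumPy sketch of a single-sample Monte Carlo estimate of $L_{\text{RF}}$. It illustrates the formula under stated assumptions and is not the paper's training code: the function name `rectified_flow_loss`, the `v_theta` callable, the uniform sampling of $t$, and the default constant weighting $w(t) = 1$ are all hypothetical choices made here.

```python
import numpy as np


def rectified_flow_loss(v_theta, z0, ctx, rng, weight_fn=lambda t: 1.0):
    """Single-sample estimate of the RF objective for one latent z0.

    v_theta   : callable (z_t, t, ctx) -> predicted velocity, same shape as z0
                (hypothetical stand-in for the video DiT).
    ctx       : conditioning passed through to v_theta (text, reference images).
    weight_fn : w(t); a constant 1 is an assumption, not the paper's choice.
    """
    t = rng.uniform(0.0, 1.0)              # t in [0, 1] (assumed uniform)
    eps = rng.standard_normal(z0.shape)    # Gaussian noise
    z_t = (1.0 - t) * z0 + t * eps         # linear interpolation between data and noise
    target = eps - z0                      # velocity target of the straight path
    pred = v_theta(z_t, t, ctx)
    return weight_fn(t) * np.sum((pred - target) ** 2)  # squared L2 norm
```

Because $z_t - z_0 = t(\epsilon - z_0)$, a predictor that returns $(z_t - z_0)/t$ reproduces the target exactly and drives this loss to zero, which is a convenient sanity check for an implementation.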

**Core Components:**
*   **Asymmetric Identity-Preserving Attention (AIPA)**: Enforces a unidirectional information flow.
    1.  **Reference-only LoRA**: Lightweight LoRA modules are applied exclusively to reference tokens. For reference tokens $c \in \{f_{\text{face}}, f_{\text{body}}\}$, projections are:
        $$q_c, k_c, v_c = (W_{Q,K,V} + \Delta W^{\text{ref}}_{Q,K,V}) c$$
        where $\Delta W^{\text{ref}}_{Q,K,V}$ are learnable LoRA parameters. Video tokens use frozen backbone weights.
    2.  **Asymmetric Attention Flow**: Video tokens query a unified identity representation $C_{\text{ref}}$ aggregated from reference tokens, but reference tokens do not attend to noisy video latents. Keys and Values are concatenated: $K = [z_t; C_{\text{ref}}]$, $V = [z_t; C_{\text{ref}}]$.
*   **Identity-Aware 3D RoPE (I-RoPE)**: Assigns distinct spatio-temporal coordinates to separate reference tokens from video tokens, preventing ambiguity. Reference tokens are assigned fixed temporal offsets ($T+\Delta_f$, $T+\Delta_b$) and shifted spatial coordinates starting from $(H_{\text{max}}, W_{\text{max}})$.
*   **Viewpoint-Adaptive Monte Carlo Sampling**: A training strategy that dynamically re-weights reference image sampling probabilities. After sampling a reference $x^*$, the weights of candidates within its angular neighborhood $|\theta_{x^*} - \theta_{x_j}| < \delta$ are decayed:
    $$w_j \leftarrow w_j \cdot \gamma$$
    where $\gamma < 1$ is a decay factor. This encourages the model to observe complementary viewpoints.
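
The asymmetric attention flow of AIPA can be sketched as a single-head attention in NumPy. This is a simplified illustration, not the paper's implementation: the function names are hypothetical, tokens are stored as rows (so projections are written `x @ W` rather than $Wc$), the LoRA updates are passed as already-formed delta matrices, and letting reference tokens attend among themselves is an assumption, since the description above only states that they do not attend to the noisy video latents.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aipa_attention(video, ref, W_q, W_k, W_v, dW_q, dW_k, dW_v):
    """video: (Nv, d) noisy video tokens; ref: (Nr, d) reference tokens.

    W_*  : (d, d) frozen backbone projections.
    dW_* : (d, d) reference-only LoRA deltas (hypothetical, e.g. a low-rank product).
    """
    d = video.shape[1]
    # Video tokens use the frozen backbone weights only.
    q_v, k_v, v_v = video @ W_q, video @ W_k, video @ W_v
    # Reference tokens use frozen weights plus the LoRA delta.
    q_r = ref @ (W_q + dW_q)
    k_r = ref @ (W_k + dW_k)
    v_r = ref @ (W_v + dW_v)
    # Video queries see keys/values from both video and reference tokens.
    K = np.concatenate([k_v, k_r], axis=0)
    V = np.concatenate([v_v, v_r], axis=0)
    out_video = softmax(q_v @ K.T / np.sqrt(d)) @ V
    # Reference queries see reference keys/values only (assumed self-attention),
    # so nothing flows from the noisy video latents into the identity tokens.
    out_ref = softmax(q_r @ k_r.T / np.sqrt(d)) @ v_r
    return out_video, out_ref
```

In this sketch the reference outputs do not depend on `video` at all, while the video outputs do depend on `ref`, which is the unidirectional flow the mechanism is meant to enforce.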
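
The viewpoint-adaptive re-weighting rule can be illustrated with a short NumPy sketch. The details here are assumptions made for illustration: the function name `viewpoint_adaptive_sample`, uniform initial weights, angles given as scalar degrees, and sampling with replacement are not specified in the description above. The sketch uses the plain difference $|\theta_{x^*} - \theta_{x_j}|$ and does not handle angular wraparound.

```python
import numpy as np


def viewpoint_adaptive_sample(angles, k, delta, gamma, rng):
    """Draw k reference indices, decaying weights near each drawn viewpoint.

    angles : per-candidate viewpoint angle (degrees, assumed scalar).
    delta  : angular neighborhood radius.
    gamma  : decay factor in [0, 1).
    Returns the chosen indices and the final (unnormalized) weights.
    """
    angles = np.asarray(angles, dtype=float)
    weights = np.ones(len(angles))          # assumed uniform initialization
    chosen = []
    for _ in range(k):
        probs = weights / weights.sum()     # normalize into sampling probabilities
        idx = int(rng.choice(len(angles), p=probs))
        chosen.append(idx)
        # Decay every candidate within delta of the sampled viewpoint
        # (this includes the sampled candidate itself).
        near = np.abs(angles - angles[idx]) < delta
        weights[near] *= gamma
    return chosen, weights
```

With a small `gamma`, later draws are pushed toward viewpoints that have not been covered yet; in the limiting case `gamma = 0` with well-separated candidates, no viewpoint is drawn twice.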

## Empirical Validation / Results

### Evaluation Setup: Actor-Bench
*   **Settings**: Evaluates 75 subjects across three conditioning settings: canonical three-view, arbitrary viewpoint, and in-the-wild.
*   **Axes**: (1) **Sequential Narrative**: Coherent 3-prompt storylines. (2) **Contextual Generalization**: Single prompts with diverse environments/viewpoints/motions.
*   **Metrics**: Body Consistency (VLM-based), Face Identity Preservation (ArcFace cosine similarity), Semantic Alignment (ViCLIP & VLM-based).
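
The face metric reduces to a cosine similarity between embedding vectors. The sketch below assumes the embeddings have already been extracted by a face recognition model such as ArcFace and that per-frame similarities are averaged over the clip. The function name `face_identity_score` and the averaging step are assumptions made for illustration, since the aggregation is not specified here.

```python
import numpy as np


def face_identity_score(ref_emb, frame_embs):
    """Mean cosine similarity between a reference embedding and frame embeddings.

    ref_emb    : (d,) embedding of the reference face.
    frame_embs : (n_frames, d) embeddings of faces detected in generated frames.
    """
    ref = ref_emb / np.linalg.norm(ref_emb)
    frames = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    # Dot product of unit vectors is the cosine similarity; average over frames.
    return float(np.mean(frames @ ref))
```

Scores lie in $[-1, 1]$, with higher values indicating that the generated face stays closer to the reference identity.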

### Quantitative Results
**Table 2: Quantitative comparisons on Actor-Bench.**
| Method | Params | Face Identity ↑ | Body Consistency ↑ | Semantic Alignment (VLM) ↑ |
| :--- | :--- | :--- | :--- | :--- |
| **Sequential Narrative** | | | | |
| T2V → I2V (w/o Ref) | 5B | 0.320 | 0.450 | 0.613 |
| WildActor (w/ Ref) | 5B | **0.548** | **0.925** | **0.893** |
| **Contextual Generalization** | | | | |
| VACE | 14B | 0.485 | 0.582 | 0.667 |
| Stand-In | 14B | 0.510 | 0.416 | 0.600 |
| Vidu Q2* | – | 0.565 | 0.905 | 0.880 |
| Kling 1.6* | – | 0.558 | 0.885 | 0.867 |
| **WildActor** | **5B** | **0.559** | **0.952** | **0.920** |

*WildActor achieves state-of-the-art body consistency and semantic alignment, outperforming larger open-source and commercial models, especially in challenging viewpoint scenarios.*

### Ablation Studies
**Table 3: Ablation of dataset & sampling strategy (Body Consistency).**
| Setting | Front ↑ | Side ↑ | Back ↑ | Average ↑ |
| :--- | :--- | :--- | :--- | :--- |
| Raw-Crop | 0.885 | 0.725 | 0.680 | 0.802 |
| Random Sampling | 0.915 | 0.840 | 0.785 | 0.865 |
| **Viewpoint-Adaptive** | **0.958** | **0.952** | **0.937** | **0.952** |

*The proposed dataset and adaptive sampling strategy are crucial for achieving robustness across all viewpoints.*

**Table 4: Ablation of model components.**
| Setting | AIPA | I-RoPE | Face ID ↑ | Body Cons. ↑ | Sem. Align (VLM) ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Full-Attn | ✗ | ✓ | 0.515 | 0.890 | 0.610 |
| w/ AIPA only | ✓ | ✗ | 0.542 | 0.825 | 0.895 |
| **WildActor** | ✓ | ✓ | **0.559** | **0.952** | **0.920** |

*Both AIPA (prevents semantic conflict) and I-RoPE (ensures structural coherence) are essential for optimal performance.*

## Theoretical and Practical Implications
*   **Theoretical**: Introduces a principled approach to decoupling static identity information from dynamic scene generation in diffusion models via asymmetric attention flows and token-space separation (I-RoPE). The viewpoint-adaptive sampling strategy provides a method for encouraging balanced coverage of data manifolds.
*   **Practical**: Enables the generation of long-form, human-centric narratives with consistent digital actors, which is a critical step towards production-ready video synthesis for applications in film, animation, gaming, and virtual reality. The release of the **Actor-18M** dataset and **Actor-Bench** benchmark facilitates further research in identity-preserving generation.

## Conclusion
This work addresses the core challenge of identity-consistent human video generation under unconstrained viewpoints and motions. The contributions are threefold: 1) the **Actor-18M** dataset mitigates viewpoint bias and provides rich supervision; 2) the **WildActor** framework, with its **AIPA** mechanism and **I-RoPE**, robustly preserves full-body identity without sacrificing motion quality or prompt adherence; 3) the **Actor-Bench** evaluation shows superior performance over existing methods. Future directions may include extending the framework to multi-subject interactions and achieving finer-grained control over facial expressions and gestures.

---

_Markdown view of https://picx.dev/p/88HQe9, served by PicX — AI-generated visual whiteboard summaries of research papers._
