Summary (Overview)

  • SCAIL-2 proposes an end-to-end conditioning paradigm for controlled character animation, bypassing intermediate representations (pose skeletons, masks) and directly using driving video as input.
  • It introduces MotionPair-60K, a synthetic heterogeneous dataset of motion-transfer pairs covering animation and replacement tasks, curated via an agentic editing loop and reverse-driving training.
  • The framework unifies multiple sub-tasks (single/multi-character animation, character replacement) using in-context mask conditioning and mode-specific shifted RoPE for soft guidance.
  • A novel Bias-Aware DPO post-training mechanism refines fine-grained motion capture, especially in hand regions affected by synthetic data errors.
  • Extensive experiments show SCAIL-2 outperforms state-of-the-art methods in cross-identity motion fidelity, environment integration, and multi-character interactions, with strong zero-shot generalization.

Introduction and Theoretical Foundation

Controlled character animation aims to transfer motion from a driving sequence to a reference character. Prior approaches rely on intermediate representations: pose skeletons (e.g., from off-the-shelf estimators) or masked backgrounds for environment affordance. These intermediates suffer from information loss: skeletons are ambiguous under complex interactions (e.g., occlusions), while masks limit body shape adaptability. End-to-end conditioning directly provides the driving context as visual input, preserving occlusions, environments, and fine-grained details. However, this paradigm requires paired data where different characters perform the same motion in the same or different environments – such data is scarce.

SCAIL-2 addresses this by synthesizing paired data from pose-driven models and a replacement generator, then using a reverse driving scheme: synthetic video serves as driving input, while the original real video serves as denoising target. This avoids introducing artifacts from the generator. The paper unifies sub-tasks (Animation Mode: character in original background; Replacement Mode: character in driving background) via a decomposition into three learning objectives:

  • O1 (Motion Binding): Extract motion from driving video and route to bound target characters.
  • O2 (Environment Weaving): Use prescribed environment source (reference or driving) for coherent composition.
  • O3 (Universal Transfer): Disentangle pose from identity for any-to-any motion transfer.

Methodology

3.1 Preliminary

Given a latent video diffusion model (based on Wan2.1 I2V), the forward diffusion process corrupts the latent $z_0$ over $T$ timesteps:

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t; \sqrt{1-\beta_t}\, z_{t-1}, \beta_t I\right) \tag{1}$$

The denoising model $\epsilon_\theta(z_t, t, c)$ is trained to recover the noise, conditioned on an auxiliary input $c$:

$$\mathcal{L} = \mathbb{E}_{z_t, \epsilon \sim \mathcal{N}(0,I)} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c) \|_2^2 \right] \tag{2}$$
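As a concrete illustration of Eqs. (1) and (2), the following minimal NumPy sketch draws $z_t$ using the standard closed form obtained by composing the single-step kernel of Eq. (1), and evaluates the squared-error term inside Eq. (2) for one sample. The linear $\beta_t$ schedule, the function names, and the toy latent shape are assumptions made for illustration; none of them is taken from the paper.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def q_sample(z0, step, alpha_bar, eps):
    """Sample z_t given z_0 by composing Eq. (1) over steps 0..step.

    alpha_bar[step] is the product of (1 - beta_s) over the first step + 1
    steps, so z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps.
    """
    a = alpha_bar[step]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps


def noise_prediction_loss(eps, eps_pred):
    """Single-sample estimate of Eq. (2): squared L2 distance to the true noise."""
    return float(np.sum((eps - eps_pred) ** 2))


rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))      # toy latent, shape is arbitrary
eps = rng.standard_normal(z0.shape)      # the noise the model should predict
alpha_bar = make_alpha_bar(1000)
zt = q_sample(z0, 999, alpha_bar, eps)   # almost pure noise at the last step
```

In training, `eps_pred` would come from the denoiser $\epsilon_\theta(z_t, t, c)$; here it is left as an input so that the sketch stays independent of any model.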

For end-to-end conditioning, the driving video is encoded directly by the VAE, $z_{\text{driv}} = \mathcal{E}(y)$, bypassing explicit pose extraction.

3.2 End-to-end Data Synthesis

An Animation Synthetic Loop generates a synthetic video $\tilde{y}$ from a driving video $y$ and a reference image $I$ using a generator $\mathcal{G}$:

$$\tilde{y} = \mathcal{G}(y, I) \tag{3}$$

The pipeline uses an agentic loop comprising a Candidate Selector, a Prompt Weaver, a Quality Checker, and a multi-reference image generation model (Google DeepMind) to produce plausible reference images. For replacement data, a renderer-trained model (MoCha) is used. Multi-character animation pairs are substituted with multi-character replacement data, which is more tractable to synthesize. The resulting dataset, MotionPair-60K, has an animation:replacement ratio of about 3:1. Training uses reverse driving: the synthetic video $\tilde{y}$ is the driving input and the real video $y$ is the target, alongside a reference frame $I$ taken from $y$.
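The role swap in reverse driving can be sketched as follows. This is only an illustration of which video plays which role: the `TrainingSample` container, the function name, and the `ref_index` parameter (which frame of $y$ serves as the reference) are hypothetical and not specified in the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TrainingSample:
    driving: np.ndarray    # conditioning video fed to the model
    target: np.ndarray     # video the model learns to denoise
    reference: np.ndarray  # reference frame I


def make_reverse_driving_sample(real_video, synthetic_video, ref_index=0):
    """Pair a real video y with its synthetic counterpart y~ = G(y, I).

    The synthetic video drives the model and the real video is the target,
    so generator artifacts appear only on the conditioning side.
    Videos are arrays of shape (frames, height, width, channels).
    """
    if real_video.shape[0] != synthetic_video.shape[0]:
        raise ValueError("paired videos must have the same number of frames")
    return TrainingSample(
        driving=synthetic_video,
        target=real_video,
        reference=real_video[ref_index],  # frame choice is an assumption
    )
```

Because the target is always real footage, the model is never trained to reproduce generator output, which is the stated motivation for the scheme.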

3.3 Model Design

Architecture: In-Context Driving design. Condition tokens are concatenated to the sequence being denoised: the input is $[z_{\text{ref}}; z_t; z_{\text{driv}}]$, with $z_{\text{driv}}$ given a fixed spatial offset $\Delta W$.
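A minimal sketch of the in-context concatenation, assuming each group has already been flattened into a `(num_tokens, dim)` array. The function name and the returned slice are hypothetical, and reading the prediction only from the $z_t$ positions is an assumption about how such designs are typically used, not a detail confirmed by the paper.

```python
import numpy as np


def build_in_context_sequence(z_ref, z_t, z_driv):
    """Concatenate [z_ref; z_t; z_driv] along the token axis.

    Returns the joint sequence and the slice covering the z_t tokens,
    i.e. the positions whose outputs would be read as the prediction.
    """
    sequence = np.concatenate([z_ref, z_t, z_driv], axis=0)
    start = z_ref.shape[0]
    return sequence, slice(start, start + z_t.shape[0])


z_ref = np.zeros((2, 4))       # toy reference tokens
z_t = np.ones((6, 4))          # toy noisy target tokens
z_driv = np.full((6, 4), 2.0)  # toy driving tokens
sequence, target_slice = build_in_context_sequence(z_ref, z_t, z_driv)
```

The spatial offset $\Delta W$ does not appear here because it acts on the positional coordinates of the tokens rather than on their order in the sequence.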

In-Context Mask Conditioning: Adds 1 channel as an environment switch (Animation vs Replacement) and $K$ channels as binding slots describing the motion-character binding. Masks are derived from the reference and driving sequences using SAM3, not from ground truth. The masks provide enhanced guidance without altering the visual context.
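One way the $1 + K$ mask channels could be laid out is sketched below. The channel order, the encoding of the environment switch as a constant 0/1 plane, and the function name are assumptions for illustration; in the paper the per-character masks come from SAM3, which is replaced here by plain arrays.

```python
import numpy as np


def build_mask_channels(binding_masks, replacement_mode, num_slots):
    """Stack one environment-switch channel and K binding-slot channels.

    binding_masks: non-empty list of (T, H, W) binary arrays, one per
    bound character. Unused slots stay at zero.
    Returns an array of shape (1 + num_slots, T, H, W).
    """
    if not binding_masks:
        raise ValueError("at least one binding mask is required")
    if len(binding_masks) > num_slots:
        raise ValueError("more characters than binding slots")
    t, h, w = binding_masks[0].shape
    channels = np.zeros((1 + num_slots, t, h, w), dtype=np.float32)
    # Channel 0: environment switch (0 = Animation, 1 = Replacement).
    channels[0] = 1.0 if replacement_mode else 0.0
    # Channels 1..K: one slot per bound character.
    for k, mask in enumerate(binding_masks):
        channels[1 + k] = mask
    return channels
```

With two characters and `num_slots=4`, the result has five channels, of which the last two are all zeros.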

Mode-Specific Shifted RoPE: Different temporal/spatial RoPE coordinates for Animation and Replacement modes to model their differences. Table 1 summarizes coordinates:

| Mode | Tokens | $t$ | $h$ | $w$ |
|---|---|---|---|---|
| Animation | $z_{\text{ref}}$ | $0$ | $[0, H_v)$ | $[0, W_v)$ |
| Animation | $z_t$ | $[1, T_v]$ | $[0, H_v)$ | $[0, W_v)$ |
| Animation | $z_{\text{driv}}$ | $[1, T_v]$ | $[0, H_v)$ | $[\Delta W, \Delta W+W_v)$ |
| Replacement | $z_{\text{ref}}$ | $0$ | $[\Delta^H_{\text{ref}}, \Delta^H_{\text{ref}}+H_v)$ | $[0, W_v)$ |
| Replacement | $z_t$ | $[0, T_v-1]$ | $[0, H_v)$ | $[0, W_v)$ |
| Replacement | $z_{\text{driv}}$ | $[0, T_v-1]$ | $[0, H_v)$ | $[\Delta W, \Delta W+W_v)$ |
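The coordinate assignment in Table 1 can be written as a small lookup. The sketch below returns every range as a half-open `(start, stop)` pair, so the closed temporal interval $[1, T_v]$ becomes `(1, T_v + 1)`. The function name and the example values of `delta_w` and `delta_h_ref` are hypothetical, since the offsets' actual values are not given in this summary.

```python
def rope_ranges(mode, t_v, h_v, w_v, delta_w, delta_h_ref):
    """Return {token_group: (t_range, h_range, w_range)} following Table 1.

    All ranges are half-open (start, stop) pairs of integer coordinates.
    """
    if mode == "animation":
        t_video = (1, t_v + 1)  # closed interval [1, T_v]
        h_ref = (0, h_v)
    elif mode == "replacement":
        t_video = (0, t_v)      # closed interval [0, T_v - 1]
        h_ref = (delta_h_ref, delta_h_ref + h_v)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return {
        "z_ref": ((0, 1), h_ref, (0, w_v)),
        "z_t": (t_video, (0, h_v), (0, w_v)),
        "z_driv": (t_video, (0, h_v), (delta_w, delta_w + w_v)),
    }


animation = rope_ranges("animation", 4, 2, 3, delta_w=10, delta_h_ref=5)
replacement = rope_ranges("replacement", 4, 2, 3, delta_w=10, delta_h_ref=5)
```

Read this way, the driving tokens always sit at a width offset from the target tokens, while the reference frame is separated from the video tokens by time in Animation Mode and is shifted in height in Replacement Mode, where it shares $t = 0$ with the first video frame.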

3.4 Post Training: Bias-Aware DPO

To mitigate errors from synthetic data (especially in hand regions), a preference dataset is constructed:

  • Given a driving video $y$, a pose estimator $P$, and the generator $\mathcal{G}$, synthesize $r = \mathcal{G}(P(y), R)$ and $s = \mathcal{G}(P(y), S)$, which share the same pose but use different reference images.
  • The negative sample $r^-$ is obtained by one more round of error propagation:
$$r^- = \mathcal{G}\left(P''\left(\mathcal{G}\left(P'(y), R\right)\right), R\right) \tag{6}$$

Preference tuple: $(s, R_1, r, r^-)$, where $(s, R_1)$ are the conditioning inputs, $r$ is the preferred sample, and $r^-$ is the less preferred one. DPO-based optimization is used.
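The construction above can be sketched with placeholder callables. Two caveats apply. First, `pose_estimator` and `generator` are hypothetical stand-ins, and the same estimator is used for both $P'$ and $P''$ of Eq. (6), which is an assumption. Second, `dpo_loss` is the standard DPO objective on log-likelihoods, shown only to make the role of the preferred and less preferred samples concrete; the paper's diffusion-specific, bias-aware formulation is not given in this summary.

```python
import math


def build_preference_tuple(y, ref_r, ref_s, pose_estimator, generator):
    """Return (s, R, r, r_neg) from a driving video y.

    r_neg re-estimates the pose from r and regenerates, so it carries
    one extra round of estimator and generator error, as in Eq. (6).
    """
    s = generator(pose_estimator(y), ref_s)      # driving input
    r = generator(pose_estimator(y), ref_r)      # preferred sample
    r_neg = generator(pose_estimator(r), ref_r)  # less preferred sample
    return s, ref_r, r, r_neg


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log(sigmoid(beta * margin)).

    logp_* come from the model being tuned and ref_logp_* from the frozen
    reference model; w is the preferred sample and l the less preferred one.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss equals $\log 2$ when the tuned model and the reference model agree, and decreases as the tuned model assigns relatively more likelihood to $r$ than to $r^-$.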

Empirical Validation / Results

Quantitative Evaluation

Cross-Identity Human Evaluation (Figs. 5-7): SCAIL-2 wins against open-source models (SCAIL, Wan-Animate) and proprietary Kling 3.0 in motion consistency, physical plausibility, and identity consistency for single-character, multi-character, and replacement tasks. For multi-character, it achieves 90% win rate in identity isolation against Wan-Animate.

Pose-Driven Metrics (Table 2): On Studio-Bench (single-character split):

| Method | SSIM ↑ | PSNR ↑ | LPIPS ↓ | FVD ↓ |
|---|---|---|---|---|
| Ours + SAM3D-Body Mesh | 0.6453 | 19.09 | 0.2231 | 287.11 |
| Ours + NLF-Pose Skeleton | 0.6370 | 18.76 | 0.2285 | 282.85 |
| SCAIL + SAM3D-Body Skeleton | 0.6407 | 19.08 | 0.2212 | 309.63 |
| SCAIL + NLF-Pose Skeleton | 0.6378 | 19.08 | 0.2212 | 312.79 |
| Wan-Animate | 0.6340 | 18.62 | 0.2269 | 305.31 |

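SCAIL-2 driven zero-shot by the SAM3D-Body mesh attains the best SSIM and PSNR, and both SCAIL-2 variants attain lower FVD than the SCAIL and Wan-Animate baselines (the lowest FVD, 282.85, comes from the NLF-Pose skeleton variant), while SCAIL retains a marginally better LPIPS. The paper presents this as evidence for the advantage of end-to-end information extraction.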
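For readers who want to relate the PSNR column to pixel error, a minimal sketch of the standard PSNR definition follows. The function name and the assumption that frames are scaled to $[0, 1]$ are illustrative; the paper's exact evaluation code is not described in this summary.

```python
import numpy as np


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_value ** 2 / mse))


target = np.zeros((8, 8, 3))
pred = np.full((8, 8, 3), 0.1)  # uniform error of 10% of the peak value
value = psnr(pred, target)      # about 20 dB
```

Since PSNR equals $20 \log_{10}$ of the ratio between the peak value and the root-mean-square error, a PSNR near 19 dB corresponds to a root-mean-square error of roughly 11% of the peak value.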

Video-Bench Evaluation (Table 3): On X-dance, SCAIL-2 achieves best Imaging Quality (4.43) and Appearance Consistency (4.38).

Qualitative Evaluation (Fig. 8)

SCAIL-2 produces accurate motions with superior identity consistency, precise human-object interactions (e.g., handling a ball), and natural environment integration. In replacement mode, it outperforms MoCha and Wan-Animate at handling crossing crowds and avoiding artifacts.

Ablation Studies (Fig. 9)

  • Driving modes: End-to-end driving outperforms pose-driven for complex interactions (e.g., fighting).
  • Network modules: Environment switch and Mode-Specific RoPE are essential for unifying modes. Binding slots help maintain identity under pedestrian overlap.
  • Data composition: Animation data and replacement data show synergy; removal of one degrades performance on cross-body-shape or overlap scenarios.
  • Bias-Aware DPO (Fig. 10): Outperforms base model and SFT variant in hand detail; also refines mouth/shoulders.

Theoretical and Practical Implications

Theoretical: The paper demonstrates that end-to-end conditioning, by preserving full visual information, overcomes limitations of intermediate representations in complex scenarios. The reverse-driving training paradigm effectively decouples motion extraction from environment rendering, allowing models to benefit from synthetic data without inheriting generator artifacts. The unification via mask conditioning and RoPE provides a principled way to share optimization across tasks.

Practical: SCAIL-2 enables robust character animation for production: it handles non-human driving sources, complex multi-character interactions, and character replacement with natural environment integration. The open-source release of model weights and synthetic data facilitates further research. The Bias-Aware DPO offers a practical method to refine fine-grained motion from imperfect synthetic data.

Conclusion

SCAIL-2 presents an end-to-end framework for controlled character animation that unifies multiple sub-tasks. Key contributions: an end-to-end conditioning paradigm bypassing intermediates, the MotionPair-60K synthetic dataset, in-context mask conditioning and mode-specific RoPE for task unification, and Bias-Aware DPO for fine-grained motion refinement. Extensive experiments show state-of-the-art performance in cross-identity motion following, environment integration, and multi-character interactions. Limitations include dependency on synthetic data quality; future work could extend to lip-syncing and facial expressions. The framework is positioned to benefit from advances in data synthesis.
