Visual Summary | SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

Summary (Overview)

SCAIL-2 proposes an end-to-end conditioning paradigm for controlled character animation, bypassing intermediate representations (pose skeletons, masks) and directly using driving video as input.
It introduces MotionPair-60K, a synthetic heterogeneous dataset of motion-transfer pairs covering animation and replacement tasks, curated via an agentic editing loop and reverse-driving training.
The framework unifies multiple sub-tasks (single/multi-character animation, character replacement) using in-context mask conditioning and mode-specific shifted RoPE for soft guidance.
A novel Bias-Aware DPO post-training mechanism refines fine-grained motion capture, especially in hand regions affected by synthetic data errors.
Extensive experiments show SCAIL-2 outperforms state-of-the-art methods in cross-identity motion fidelity, environment integration, and multi-character interactions, with strong zero-shot generalization.

Introduction and Theoretical Foundation

Controlled character animation aims to transfer motion from a driving sequence to a reference character. Prior approaches rely on intermediate representations: pose skeletons (e.g., from off-the-shelf estimators) or masked backgrounds for environment affordance. These intermediates suffer from information loss: skeletons are ambiguous under complex interactions (e.g., occlusions), while masks limit body shape adaptability. End-to-end conditioning directly provides the driving context as visual input, preserving occlusions, environments, and fine-grained details. However, this paradigm requires paired data where different characters perform the same motion in the same or different environments – such data is scarce.

SCAIL-2 addresses this by synthesizing paired data with pose-driven models and a replacement generator, then applying a reverse driving scheme: the synthetic video serves as the driving input, while the original real video serves as the denoising target. Because the model is only ever trained to output real video, generator artifacts stay on the conditioning side and are not learned as outputs. The paper unifies the sub-tasks (Animation Mode: character in the original background; Replacement Mode: character in the driving background) by decomposing them into three learning objectives:

O1 (Motion Binding): Extract motion from driving video and route to bound target characters.
O2 (Environment Weaving): Use prescribed environment source (reference or driving) for coherent composition.
O3 (Universal Transfer): Disentangle pose from identity for any-to-any motion transfer.

Methodology

3.1 Preliminary

Given a latent video diffusion model (based on Wan2.1 I2V), the forward diffusion process corrupts latent $z_0$ over $T$ timesteps:

$$q(z_t | z_{t-1}) = \mathcal{N}\left(z_t; \sqrt{1-\beta_t} z_{t-1}, \beta_t I\right) \tag{1}$$

The denoising model $\epsilon_\theta(z_t, t, c)$ is trained to predict the added noise, conditioned on an auxiliary input $c$:

$$\mathcal{L} = \mathbb{E}_{z_t, \epsilon \sim \mathcal{N}(0,I)} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c) \|_2^2 \right] \tag{2}$$
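
Equations (1) and (2) can be illustrated with a minimal sketch in plain Python. The linear schedule, its endpoint values, and all function names are assumptions made for illustration; the source does not state which noise schedule is used. Latents are flat lists of floats here, standing in for the real video latents.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=2e-2):
    """Hypothetical linear noise schedule beta_1..beta_T (not given in the source)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def forward_step(z_prev, beta_t, rng):
    """One step of Eq. (1): z_t = sqrt(1 - beta_t) * z_{t-1} + sqrt(beta_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z_prev]
    z_t = [math.sqrt(1.0 - beta_t) * z + math.sqrt(beta_t) * e
           for z, e in zip(z_prev, eps)]
    return z_t, eps

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 1..t (t is 1-indexed)."""
    prod = 1.0
    for b in betas[:t]:
        prod *= 1.0 - b
    return prod

def noise_to_t(z0, betas, t, rng):
    """Sample z_t directly from z_0 using the closed form implied by chaining Eq. (1)."""
    a = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return z_t, eps

def eps_loss(eps, eps_pred):
    """Eq. (2) for a single sample: squared L2 distance between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))
```

In training, `eps_pred` would come from the denoiser $\epsilon_\theta(z_t, t, c)$, and the loss would be averaged over sampled latents, noise draws, and timesteps.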

For end-to-end conditioning, the driving video $y$ is encoded directly by the VAE encoder, $z_{\text{driv}} = \mathcal{E}(y)$, bypassing explicit pose extraction.

3.2 End-to-end Data Synthesis

An Animation Synthetic Loop generates a synthetic video $\tilde{y}$ from a driving video $y$ and a reference image $I$ using a generator $\mathcal{G}$:

$$\tilde{y} = \mathcal{G}(y, I) \tag{3}$$

The pipeline uses an agentic loop with a Candidate Selector, a Prompt Weaver, a Quality Checker, and a multi-reference image generation model (Google DeepMind) to produce plausible reference images. Replacement data is generated with a renderer-trained model (MoCha). Because multi-character animation pairs are hard to synthesize, multi-character replacement data, which is more tractable, stands in for them. The resulting dataset, MotionPair-60K, has an animation-to-replacement ratio of roughly 3:1. Training uses reverse driving: the synthetic video $\tilde{y}$ is the driving input and the real video $y$ is the denoising target, alongside a reference frame $I$ taken from $y$.
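
The reverse-driving assignment of roles can be sketched as follows. This is an illustrative sketch only: `TrainingSample`, `make_reverse_driving_sample`, and the choice of the first frame as the reference are hypothetical, and `generator` is a stand-in callable for $\mathcal{G}$. The source reuses the symbol $I$ both for the image fed to the generator and for the reference frame taken from the real video; the sketch keeps those two roles under separate names.

```python
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]   # stand-in for one image or video frame
Video = List[Frame]   # stand-in for a video clip

@dataclass(frozen=True)
class TrainingSample:
    driving: Video     # conditioning input: the synthetic video
    target: Video      # denoising target: the original real video
    reference: Frame   # reference frame taken from the real video

def make_reverse_driving_sample(
    real_video: Video,
    new_identity: Frame,
    generator: Callable[[Video, Frame], Video],
    ref_index: int = 0,   # assumption: which real frame serves as the reference
) -> TrainingSample:
    """Synthesize a re-identified clip (Eq. 3), then swap the usual roles."""
    synthetic = generator(real_video, new_identity)
    return TrainingSample(
        driving=synthetic,                  # the generator output only conditions
        target=real_video,                  # the model learns to output real video
        reference=real_video[ref_index],
    )
```

Since `target` is always the real clip, any artifacts in `synthetic` affect only what the model reads, not what it is trained to produce.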

3.3 Model Design

Architecture: an In-Context Driving design in which condition tokens are concatenated to the sequence being denoised. The input is $[z_{\text{ref}}; z_t; z_{\text{driv}}]$, with $z_{\text{driv}}$ given a fixed spatial offset $\Delta W$.

In-Context Mask Conditioning: Adds one channel as an environment switch (Animation vs. Replacement) and $K$ channels as binding slots that describe which motion is bound to which character. Masks are derived from the reference and driving sequences using SAM3, not from ground truth. The masks provide additional guidance without altering the visual context.
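
One way to picture the $1 + K$ mask channels for a single frame is the sketch below. The encoding is an assumption: the source does not say how the environment switch is represented, so the sketch uses a constant plane (0 for animation, 1 for replacement) and leaves unused binding slots at zero. The function name and arguments are hypothetical, and masks are plain nested lists instead of SAM3 outputs.

```python
from typing import List, Sequence

Mask = List[List[int]]  # H x W binary mask, 1 where the character is visible

def build_mask_channels(mode: str, character_masks: Sequence[Mask],
                        num_slots: int, height: int, width: int) -> List[Mask]:
    """Pack one environment-switch channel followed by `num_slots` binding slots."""
    if mode not in ("animation", "replacement"):
        raise ValueError(f"unknown mode: {mode}")
    if len(character_masks) > num_slots:
        raise ValueError("more characters than binding slots")
    for mask in character_masks:
        if len(mask) != height or any(len(row) != width for row in mask):
            raise ValueError("mask shape does not match (height, width)")

    # Assumed encoding: constant plane, 0 = animation, 1 = replacement.
    switch_value = 1 if mode == "replacement" else 0
    switch = [[switch_value] * width for _ in range(height)]

    # Slot k carries the mask of character k; remaining slots stay empty.
    slots = [[row[:] for row in mask] for mask in character_masks]
    while len(slots) < num_slots:
        slots.append([[0] * width for _ in range(height)])
    return [switch] + slots
```

Using the same slot index for a character in the reference and for its motion source in the driving video is what would express the binding; the visual tokens themselves are left untouched.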

Mode-Specific Shifted RoPE: Different temporal/spatial RoPE coordinates for Animation and Replacement modes to model their differences. Table 1 summarizes coordinates:

Token	t	h	w
Animation Mode
$z_{\text{ref}}$	0	$[0, H_v)$	$[0, W_v)$
$z_t$	$[1, T_v]$	$[0, H_v)$	$[0, W_v)$
$z_{\text{driv}}$	$[1, T_v]$	$[0, H_v)$	$[\Delta W, \Delta W+W_v)$
Replacement Mode
$z_{\text{ref}}$	0	$[\Delta^H_{\text{ref}}, \Delta^H_{\text{ref}}+H_v)$	$[0, W_v)$
$z_t$	$[0, T_v-1]$	$[0, H_v)$	$[0, W_v)$
$z_{\text{driv}}$	$[0, T_v-1]$	$[0, H_v)$	$[\Delta W, \Delta W+W_v)$
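
The coordinate assignment in Table 1 can be written as a small lookup, sketched below. The function name, argument names, and dictionary layout are hypothetical; only the index ranges come from the table. Half-open intervals such as $[0, H_v)$ map to Python `range` objects, and the closed interval $[1, T_v]$ becomes `range(1, T_v + 1)`.

```python
from typing import Dict

Coords = Dict[str, Dict[str, range]]

def rope_coords(mode: str, T_v: int, H_v: int, W_v: int,
                delta_w: int, delta_h_ref: int = 0) -> Coords:
    """Return (t, h, w) RoPE index ranges per token group, following Table 1."""
    if mode == "animation":
        t_video = range(1, T_v + 1)      # video frames start after the reference at t = 0
        ref_h = range(0, H_v)
    elif mode == "replacement":
        t_video = range(0, T_v)          # video frames share t = 0 with the reference
        ref_h = range(delta_h_ref, delta_h_ref + H_v)  # reference is shifted in height
    else:
        raise ValueError(f"unknown mode: {mode}")

    return {
        "z_ref":  {"t": range(0, 1), "h": ref_h,        "w": range(0, W_v)},
        "z_t":    {"t": t_video,     "h": range(0, H_v), "w": range(0, W_v)},
        "z_driv": {"t": t_video,     "h": range(0, H_v),
                   "w": range(delta_w, delta_w + W_v)},  # driving tokens offset in width
    }
```

In both modes the driving tokens share time and height indices with the denoised tokens and differ only by the width offset, so each denoised frame is positionally aligned with its driving frame. The two modes differ in where the reference sits: ahead of the video in time (animation) or at the same time index but shifted in height (replacement).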

3.4 Post Training: Bias-Aware DPO

To mitigate errors from synthetic data (especially in hand regions), a preference dataset is constructed:

Given a driving video $y$, a pose estimator $P$, and a generator $\mathcal{G}$, synthesize $r = \mathcal{G}(P(y), R)$ and $s = \mathcal{G}(P(y), S)$, which share the same pose but use different reference images $R$ and $S$.
The negative sample $r^-$ is obtained by one more round of error propagation, that is, by re-estimating pose from an already synthesized video and generating again:

$$r^- = \mathcal{G}\left(P''\left(\mathcal{G}\left(P'(y), R\right)\right), R\right) \tag{6}$$

Preference tuple: $(s, R_1, r, r^-)$, where $(s, R_1)$ are the conditioning inputs, $r$ is the preferred output, and $r^-$ is the less preferred one. Optimization is DPO-based.
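
The two ingredients above, the extra round of error propagation in Eq. (6) and a DPO-style objective, can be sketched as follows. The source does not spell out how the bias-aware variant modifies DPO, so `dpo_loss` below is the standard DPO objective on scalar log-likelihoods, used purely as a placeholder; `beta` and all function names are assumptions, and the pose estimators and generator are stand-in callables.

```python
import math
from typing import Callable, TypeVar

V = TypeVar("V")    # a video
Po = TypeVar("Po")  # a pose sequence
Im = TypeVar("Im")  # a reference image

def synthesize_negative(y: V, R: Im,
                        pose_prime: Callable[[V], Po],
                        pose_double_prime: Callable[[V], Po],
                        generator: Callable[[Po, Im], V]) -> V:
    """Eq. (6): generate once, re-estimate pose from the result, generate again."""
    first_pass = generator(pose_prime(y), R)
    return generator(pose_double_prime(first_pass), R)

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    `w` is the preferred sample (r) and `l` the less preferred one (r^-).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == softplus(-m), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss decreases as the trained model assigns relatively more likelihood to $r$ than to $r^-$ compared with the frozen reference model, which is how preference for the sample with less accumulated error would be expressed.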

Empirical Validation / Results

Quantitative Evaluation

Cross-Identity Human Evaluation (Figs. 5-7): SCAIL-2 wins against open-source models (SCAIL, Wan-Animate) and the proprietary Kling 3.0 in motion consistency, physical plausibility, and identity consistency for single-character, multi-character, and replacement tasks. For multi-character animation, it achieves a 90% win rate in identity isolation against Wan-Animate.

Pose-Driven Metrics (Table 2): On Studio-Bench (single-character split):

Method	SSIM ↑	PSNR ↑	LPIPS ↓	FVD ↓
Ours + SAM3D-Body Mesh	0.6453	19.09	0.2231	287.11
Ours + NLF-Pose Skeleton	0.6370	18.76	0.2285	282.85
SCAIL + SAM3D-Body Skeleton	0.6407	19.08	0.2212	309.63
SCAIL + NLF-Pose Skeleton	0.6378	19.08	0.2212	312.79
Wan-Animate	0.6340	18.62	0.2269	305.31

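SCAIL-2 with the SAM3D-Body mesh (zero-shot) achieves the best SSIM and PSNR, while SCAIL-2 with the NLF-Pose skeleton achieves the best FVD (282.85 vs. 287.11); both SCAIL-2 variants beat SCAIL and Wan-Animate on FVD. LPIPS is competitive, with SCAIL marginally lower at 0.2212. The paper takes this as evidence for the advantage of end-to-end information extraction.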
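
For readers unfamiliar with the PSNR column, the standard definition is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, sketched below on flat lists of pixel values. The function name and the default value range of $[0, 1]$ are assumptions; the source does not state the evaluation settings behind Table 2.

```python
import math
from typing import Sequence

def psnr(pred: Sequence[float], target: Sequence[float], max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel arrays."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf   # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_val ** 2 / mse)
```

Because the scale is logarithmic, a gap of 0.01 dB, as between 19.09 and 19.08 in the table, corresponds to a very small difference in mean squared error.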

Video-Bench Evaluation (Table 3): On X-dance, SCAIL-2 achieves best Imaging Quality (4.43) and Appearance Consistency (4.38).

Qualitative Evaluation (Fig. 8)

SCAIL-2 produces accurate motions with superior identity consistency, precise human-object interactions (e.g., handling a ball), and natural environment integration. In replacement mode, it outperforms MoCha and Wan-Animate at handling crossing crowds and avoiding artifacts.

Ablation Studies (Fig. 9)

Driving modes: End-to-end driving outperforms pose-driven for complex interactions (e.g., fighting).
Network modules: Environment switch and Mode-Specific RoPE are essential for unifying modes. Binding slots help maintain identity under pedestrian overlap.
Data composition: Animation data and replacement data show synergy; removal of one degrades performance on cross-body-shape or overlap scenarios.
Bias-Aware DPO (Fig. 10): Outperforms base model and SFT variant in hand detail; also refines mouth/shoulders.

Theoretical and Practical Implications

Theoretical: The paper demonstrates that end-to-end conditioning, by preserving full visual information, overcomes limitations of intermediate representations in complex scenarios. The reverse-driving training paradigm effectively decouples motion extraction from environment rendering, allowing models to benefit from synthetic data without inheriting generator artifacts. The unification via mask conditioning and RoPE provides a principled way to share optimization across tasks.

Practical: SCAIL-2 enables robust character animation for production: it handles non-human driving sources, complex multi-character interactions, and character replacement with natural environment integration. The open-source release of model weights and synthetic data facilitates further research. The Bias-Aware DPO offers a practical method to refine fine-grained motion from imperfect synthetic data.

Conclusion

SCAIL-2 presents an end-to-end framework for controlled character animation that unifies multiple sub-tasks. Key contributions: an end-to-end conditioning paradigm bypassing intermediates, the MotionPair-60K synthetic dataset, in-context mask conditioning and mode-specific RoPE for task unification, and Bias-Aware DPO for fine-grained motion refinement. Extensive experiments show state-of-the-art performance in cross-identity motion following, environment integration, and multi-character interactions. Limitations include dependency on synthetic data quality; future work could extend to lip-syncing and facial expressions. The framework is positioned to benefit from advances in data synthesis.