Summary (Overview)
- SCAIL-2 proposes an end-to-end conditioning paradigm for controlled character animation, bypassing intermediate representations (pose skeletons, masks) and directly using driving video as input.
- It introduces MotionPair-60K, a synthetic heterogeneous dataset of motion-transfer pairs covering animation and replacement tasks, curated via an agentic editing loop and reverse-driving training.
- The framework unifies multiple sub-tasks (single/multi-character animation, character replacement) using in-context mask conditioning and mode-specific shifted RoPE for soft guidance.
- A novel Bias-Aware DPO post-training mechanism refines fine-grained motion capture, especially in hand regions affected by synthetic data errors.
- Extensive experiments show SCAIL-2 outperforms state-of-the-art methods in cross-identity motion fidelity, environment integration, and multi-character interactions, with strong zero-shot generalization.
Introduction and Theoretical Foundation
Controlled character animation aims to transfer motion from a driving sequence to a reference character. Prior approaches rely on intermediate representations: pose skeletons (e.g., from off-the-shelf estimators) or masked backgrounds for environment affordance. These intermediates suffer from information loss: skeletons are ambiguous under complex interactions (e.g., occlusions), while masks limit body shape adaptability. End-to-end conditioning directly provides the driving context as visual input, preserving occlusions, environments, and fine-grained details. However, this paradigm requires paired data where different characters perform the same motion in the same or different environments – such data is scarce.
SCAIL-2 addresses this by synthesizing paired data from pose-driven models and a replacement generator, then training with a reverse driving scheme: the synthetic video serves as the driving input, while the original real video serves as the denoising target. Because the target is always real footage, the model does not learn to reproduce the generator's artifacts. The paper unifies two sub-tasks (Animation Mode, where the character is placed in the reference background, and Replacement Mode, where the character is placed in the driving background) by decomposing them into three learning objectives:
- O1 (Motion Binding): Extract motion from driving video and route to bound target characters.
- O2 (Environment Weaving): Use prescribed environment source (reference or driving) for coherent composition.
- O3 (Universal Transfer): Disentangle pose from identity for any-to-any motion transfer.
Methodology
3.1 Preliminary
Given a latent video diffusion model (based on Wan2.1 I2V), the forward diffusion process corrupts the video latent with noise over timesteps.
The denoising model is trained to recover that noise, conditioned on an auxiliary input.
For end-to-end conditioning, the driving video is encoded directly by the VAE, bypassing explicit pose extraction.
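For orientation, the three steps above are often written as follows in the textbook noise-prediction (DDPM-style) formulation. This is an assumed standard form, not the paper's own equations, and flow-matching backbones use an analogous but different interpolation and target. The symbols $z_0$ (clean latent), $\epsilon$ (Gaussian noise), $\bar{\alpha}_t$ (noise schedule), $c$ (auxiliary condition), $\mathcal{E}$ (VAE encoder), and $V_d$ (driving video) are chosen here purely for illustration.

```latex
% Forward process: corrupt the clean latent z_0 at timestep t (assumed DDPM form)
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

% Training objective: predict the injected noise given the condition c
\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2 \right]

% End-to-end conditioning: the driving video is encoded by the VAE directly
z_d = \mathcal{E}(V_d)
```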
3.2 End-to-end Data Synthesis
An Animation Synthetic Loop uses a generator to produce a synthetic video from a driving video and a reference image.
The pipeline uses an agentic loop with a Candidate Selector, a Prompt Weaver, a Quality Checker, and a multi-reference image generation model (Google DeepMind) to produce plausible reference images. For replacement data, a renderer-trained model (MoCha) is used. Multi-character animation pairs are substituted with multi-character replacement data, which is more tractable to produce. The resulting dataset, MotionPair-60K, has an animation-to-replacement ratio of roughly 3:1. Training uses reverse driving: the synthetic video is the driving input and the real video is the denoising target, alongside a reference frame.
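The reverse-driving assignment described above can be sketched in a few lines. This is a minimal illustration, not the authors' data loader: the `MotionPair` and `TrainingSample` types, their field names, and the file names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MotionPair:
    """One synthesized pair (hypothetical schema, for illustration only)."""
    real_video: str        # identifier of the original real clip
    synthetic_video: str   # clip produced by the generator with the same motion
    reference_image: str   # reference frame for the character in the target clip
    task: str              # "animation" or "replacement"


@dataclass(frozen=True)
class TrainingSample:
    driving: str    # conditioning input that carries the motion
    target: str     # clip the model learns to denoise
    reference: str  # appearance reference for the target character
    task: str


def to_reverse_driving_sample(pair: MotionPair) -> TrainingSample:
    # Reverse driving: the synthetic clip is only ever an input,
    # and the real clip is the denoising target.
    return TrainingSample(
        driving=pair.synthetic_video,
        target=pair.real_video,
        reference=pair.reference_image,
        task=pair.task,
    )


pair = MotionPair("real.mp4", "synthetic.mp4", "ref.png", "animation")
sample = to_reverse_driving_sample(pair)
```

Keeping the synthetic clip on the conditioning side means any generator artifacts appear only in the input, never in the supervision signal.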
3.3 Model Design
Architecture: In-Context Driving design. Condition tokens are concatenated to the sequence being denoised to form the model input, and one component of that input carries a fixed spatial offset.
In-Context Mask Conditioning: Adds one channel as an environment switch (Animation vs. Replacement) and a set of further channels as binding slots that describe which motion is bound to which character. The masks are derived from the reference and driving sequences using SAM3, not from ground truth. They provide additional guidance without altering the visual context.
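One way to picture the mask channels is as a small stack of binary planes. The sketch below is an assumption-laden illustration: the encoding of the environment switch as a constant 0/1 plane, the function name `build_mask_channels`, and the `num_slots` parameter are hypothetical, since the source does not state the exact layout or slot count.

```python
from typing import List

Mask = List[List[int]]  # H x W binary mask


def build_mask_channels(
    mode: str,
    slot_masks: List[Mask],
    num_slots: int,
    height: int,
    width: int,
) -> List[Mask]:
    """Stack one environment-switch plane and `num_slots` binding-slot planes."""
    if mode not in ("animation", "replacement"):
        raise ValueError(f"unknown mode: {mode}")
    if len(slot_masks) > num_slots:
        raise ValueError("more character masks than binding slots")

    # Assumed encoding: constant plane, 0 for animation and 1 for replacement.
    switch_value = 0 if mode == "animation" else 1
    switch = [[switch_value] * width for _ in range(height)]

    # Copy the provided character masks, then pad unused slots with zeros.
    slots = [[row[:] for row in mask] for mask in slot_masks]
    for _ in range(num_slots - len(slot_masks)):
        slots.append([[0] * width for _ in range(height)])

    return [switch] + slots


channels = build_mask_channels(
    "replacement", [[[1, 0], [0, 0]]], num_slots=3, height=2, width=2
)
```

In a real pipeline the character masks would come from a segmentation model rather than being written by hand.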
Mode-Specific Shifted RoPE: Animation and Replacement modes use different temporal and spatial RoPE coordinates to model their differences. Table 1 summarizes the coordinates along the t, h, and w axes for Animation Mode and Replacement Mode; the only entry that survives for each mode here is 0.
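Why shifting RoPE coordinates acts as a soft signal can be seen in a toy example. The sketch below is generic rotary-embedding arithmetic on a single 2D feature pair, not the paper's coordinate scheme; the positions, the shift of 7, and the frequency `theta` are arbitrary illustrative values.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def rope_rotate(vec: Vec2, position: float, theta: float = 0.1) -> Vec2:
    """Rotate a 2D feature pair by an angle proportional to its position."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (x * c - y * s, x * s + y * c)


def attention_score(q: Vec2, q_pos: float, k: Vec2, k_pos: float) -> float:
    """Dot product of rotated query and key; depends only on k_pos - q_pos."""
    qr = rope_rotate(q, q_pos)
    kr = rope_rotate(k, k_pos)
    return qr[0] * kr[0] + qr[1] * kr[1]


q, k = (1.0, 0.0), (1.0, 0.0)

# Shifting both tokens by the same amount leaves the score unchanged.
base = attention_score(q, 2, k, 5)
both_shifted = attention_score(q, 12, k, 15)

# Shifting only the key's group changes the relative offset and the score.
key_shifted = attention_score(q, 2, k, 5 + 7)
```

Because the score depends only on relative position, assigning one token group a different coordinate origin per mode changes how it relates to the other tokens without touching the token content.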
3.4 Post Training: Bias-Aware DPO
To mitigate errors inherited from synthetic data (especially in hand regions), a preference dataset is constructed:
- Given a driving video, a pose estimator, and a generator, two videos are synthesized with the same pose but different reference images.
- The negative sample is obtained by one more round of error propagation.
Each preference tuple consists of the conditioning inputs, a preferred sample, and a less-preferred sample. DPO-based optimization is then applied.
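For reference, the generic DPO objective on a single preference tuple looks like the sketch below. It shows only the standard DPO loss on log-likelihoods; the bias-aware part of the paper's method is not specified in this summary and is not reproduced here. Diffusion variants commonly replace exact log-likelihoods with a denoising-error surrogate, and `beta` is an illustrative value.

```python
import math


def dpo_loss(
    logp_win: float,
    logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss: -log(sigmoid(beta * (policy margin - reference margin)))."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(m)) == log(1 + exp(-m)), evaluated in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# No preference learned yet: policy matches the reference, loss is log(2).
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)

# Policy favors the preferred sample more than the reference does: lower loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss falls as the policy raises the preferred sample's likelihood relative to the less-preferred one, measured against the frozen reference model.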
Empirical Validation / Results
Quantitative Evaluation
Cross-Identity Human Evaluation (Figs. 5-7): SCAIL-2 is preferred over open-source models (SCAIL, Wan-Animate) and the proprietary Kling 3.0 in motion consistency, physical plausibility, and identity consistency across single-character, multi-character, and replacement tasks. On multi-character inputs, it achieves a 90% win rate in identity isolation against Wan-Animate.
Pose-Driven Metrics (Table 2): On Studio-Bench (single-character split):
| Method | SSIM ↑ | PSNR ↑ | LPIPS ↓ | FVD ↓ |
|---|---|---|---|---|
| Ours + SAM3D-Body Mesh | 0.6453 | 19.09 | 0.2231 | 287.11 |
| Ours + NLF-Pose Skeleton | 0.6370 | 18.76 | 0.2285 | 282.85 |
| SCAIL + SAM3D-Body Skeleton | 0.6407 | 19.08 | 0.2212 | 309.63 |
| SCAIL + NLF-Pose Skeleton | 0.6378 | 19.08 | 0.2212 | 312.79 |
| Wan-Animate | 0.6340 | 18.62 | 0.2269 | 305.31 |
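In Table 2, SCAIL-2 with the SAM3D-Body mesh (zero-shot) achieves the best SSIM (0.6453) and PSNR (19.09), while SCAIL-2 with the NLF-Pose skeleton achieves the best FVD (282.85). Both SCAIL-2 variants have lower FVD than every baseline, whereas the SCAIL baselines keep a slight edge on LPIPS (0.2212). These results are presented as evidence for the advantage of end-to-end information extraction.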
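As a reminder of what the PSNR column in Table 2 measures, here is a minimal sketch of the standard definition on flattened pixel values. It assumes 8-bit intensities (`max_value=255`) and is not the benchmark's evaluation code; the function name and inputs are illustrative.

```python
import math
from typing import Sequence


def psnr(
    reference: Sequence[float],
    generated: Sequence[float],
    max_value: float = 255.0,
) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(generated) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


identical = psnr([10, 20], [10, 20])
worst_case = psnr([0, 0], [255, 255])
off_by_one = psnr([0, 0, 0, 0], [1, 1, 1, 1])
```

Higher is better, and because the scale is logarithmic, the gaps of a few tenths of a dB between methods in Table 2 correspond to small differences in mean squared error.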
Video-Bench Evaluation (Table 3): On X-dance, SCAIL-2 achieves best Imaging Quality (4.43) and Appearance Consistency (4.38).
Qualitative Evaluation (Fig. 8)
SCAIL-2 produces accurate motions with superior identity consistency, precise human-object interactions (e.g., handling a ball), and natural environment integration. In replacement mode, it outperforms MoCha and Wan-Animate at handling crossing crowds and avoiding artifacts.
Ablation Studies (Fig. 9)
- Driving modes: End-to-end driving outperforms pose-driven for complex interactions (e.g., fighting).
- Network modules: Environment switch and Mode-Specific RoPE are essential for unifying modes. Binding slots help maintain identity under pedestrian overlap.
- Data composition: Animation data and replacement data show synergy; removal of one degrades performance on cross-body-shape or overlap scenarios.
- Bias-Aware DPO (Fig. 10): Outperforms base model and SFT variant in hand detail; also refines mouth/shoulders.
Theoretical and Practical Implications
Theoretical: The paper demonstrates that end-to-end conditioning, by preserving full visual information, overcomes limitations of intermediate representations in complex scenarios. The reverse-driving training paradigm effectively decouples motion extraction from environment rendering, allowing models to benefit from synthetic data without inheriting generator artifacts. The unification via mask conditioning and RoPE provides a principled way to share optimization across tasks.
Practical: SCAIL-2 enables robust character animation for production: it handles non-human driving sources, complex multi-character interactions, and character replacement with natural environment integration. The open-source release of model weights and synthetic data facilitates further research. The Bias-Aware DPO offers a practical method to refine fine-grained motion from imperfect synthetic data.
Conclusion
SCAIL-2 presents an end-to-end framework for controlled character animation that unifies multiple sub-tasks. Key contributions: an end-to-end conditioning paradigm bypassing intermediates, the MotionPair-60K synthetic dataset, in-context mask conditioning and mode-specific RoPE for task unification, and Bias-Aware DPO for fine-grained motion refinement. Extensive experiments show state-of-the-art performance in cross-identity motion following, environment integration, and multi-character interactions. Limitations include dependency on synthetic data quality; future work could extend to lip-syncing and facial expressions. The framework is positioned to benefit from advances in data synthesis.