# SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

> SCAIL-2 introduces end-to-end video conditioning for character animation, achieving state-of-the-art motion fidelity and multi-character interactions without pose skeletons.

- **Source:** [arXiv](https://arxiv.org/abs/2606.10804)
- **Published:** 2026-06-11
- **Permalink:** https://picx.dev/p/yW29jv
- **Whiteboard:** https://picx.dev/p/yW29jv/image

## Summary

- **SCAIL-2** proposes an **end-to-end conditioning paradigm** for controlled character animation, bypassing intermediate representations (pose skeletons, masks) and directly using driving video as input.
- It introduces **MotionPair-60K**, a synthetic heterogeneous dataset of motion-transfer pairs covering animation and replacement tasks, curated via an agentic editing loop and reverse-driving training.
- The framework unifies multiple sub-tasks (single/multi-character animation, character replacement) using **in-context mask conditioning** and **mode-specific shifted RoPE** for soft guidance.
- A novel **Bias-Aware DPO** post-training mechanism refines fine-grained motion capture, especially in hand regions affected by synthetic data errors.
- Extensive experiments show SCAIL-2 outperforms state-of-the-art methods in cross-identity motion fidelity, environment integration, and multi-character interactions, with strong zero-shot generalization.

## Introduction and Theoretical Foundation

Controlled character animation aims to transfer motion from a driving sequence to a reference character. Prior approaches rely on intermediate representations: pose skeletons (e.g., from off-the-shelf estimators) or masked backgrounds for environment affordance. These intermediates suffer from information loss: skeletons are ambiguous under complex interactions (e.g., occlusions), while masks limit body shape adaptability. End-to-end conditioning directly provides the driving context as visual input, preserving occlusions, environments, and fine-grained details. However, this paradigm requires paired data where different characters perform the same motion in the same or different environments – such data is scarce.

SCAIL-2 addresses this by synthesizing paired data from pose-driven models and a replacement generator, then training with a **reverse driving** scheme: the synthetic video serves as the driving input, while the original real video serves as the denoising target. Because the model is only ever supervised on real video, generator artifacts appear in the conditioning input rather than in the output it learns to produce. The paper unifies the sub-tasks (Animation Mode: character in the original background; Replacement Mode: character in the driving background) by decomposing them into three learning objectives:
- **O1 (Motion Binding):** Extract motion from driving video and route to bound target characters.
- **O2 (Environment Weaving):** Use prescribed environment source (reference or driving) for coherent composition.
- **O3 (Universal Transfer):** Disentangle pose from identity for any-to-any motion transfer.

## Methodology

### 3.1 Preliminary

Given a latent video diffusion model (based on Wan2.1 I2V), the forward diffusion process corrupts latent $z_0$ over $T$ timesteps:

$$
q(z_t | z_{t-1}) = \mathcal{N}\left(z_t; \sqrt{1-\beta_t} z_{t-1}, \beta_t I\right)
\tag{1}
$$

The denoising model $\epsilon_\theta(z_t, t, c)$ is trained to predict the added noise $\epsilon$, conditioned on the auxiliary input $c$:

$$
\mathcal{L} = \mathbb{E}_{z_t, \epsilon \sim \mathcal{N}(0,I)} \left[ \| \epsilon - \epsilon_\theta(z_t, t, c) \|_2^2 \right]
\tag{2}
$$
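
As a minimal sketch of Eqs. (1) and (2), the snippet below noises a toy latent and evaluates the noise-prediction loss for a placeholder denoiser. It samples $z_t$ directly from $z_0$ using the standard closed-form marginal implied by Eq. (1), $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t}(1-\beta_s)$. The function names, the toy schedule, and the use of a flat Python list in place of a video latent are illustrative assumptions, not details from the paper.

```python
import math
import random

def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 1..t (t is 1-indexed)."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod

def noise_latent(z0, betas, t, rng):
    """Sample z_t from z_0 via the closed-form marginal implied by Eq. (1)."""
    a_bar = alpha_bar(betas, t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e
          for z, e in zip(z0, eps)]
    return zt, eps

def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise, as in Eq. (2)."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def zero_denoiser(zt, t, cond):
    """Hypothetical stand-in for epsilon_theta: always predicts zero noise."""
    return [0.0 for _ in zt]

rng = random.Random(0)
betas = [0.01 * (i + 1) for i in range(10)]  # toy linear schedule, T = 10 (assumed)
z0 = [1.0, -0.5, 0.25, 2.0]                  # toy latent (assumed)
zt, eps = noise_latent(z0, betas, t=5, rng=rng)
loss = noise_prediction_loss(eps, zero_denoiser(zt, 5, cond=None))
```

In a real training loop the placeholder would be the video diffusion backbone, and `cond` would carry the reference and driving latents.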

For end-to-end conditioning, the driving video is directly encoded via VAE: $z_{\text{driv}} = \mathcal{E}(y)$, bypassing explicit pose extraction.

### 3.2 End-to-end Data Synthesis

An **Animation Synthetic Loop** generates synthetic video $\tilde{y}$ from driving video $y$ and reference image $I$ using generator $\mathcal{G}$:

$$
\tilde{y} = \mathcal{G}(y, I)
\tag{3}
$$

The pipeline runs an agentic loop, composed of a Candidate Selector, a Prompt Weaver, a Quality Checker, and a multi-reference image generation model (Google DeepMind), to produce plausible reference images. For replacement data, a renderer-trained model (MoCha) is used. Multi-character animation pairs are substituted with multi-character replacement data, which is more tractable to synthesize. The resulting dataset, **MotionPair-60K**, has an animation-to-replacement ratio of roughly 3:1. Training uses reverse driving: the synthetic video $\tilde{y}$ is the driving input and the real video $y$ is the denoising target, alongside a reference frame $I$ taken from $y$.
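
The reverse-driving assignment can be sketched as a small data-preparation step. Everything here is hypothetical scaffolding: `TrainingSample`, `make_reverse_driving_sample`, and the choice of frame 0 as the reference are assumptions for illustration, and frames are represented by placeholder strings rather than image tensors.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingSample:
    driving: List[str]  # frames fed to the model as driving context
    target: List[str]   # frames the model learns to denoise toward
    reference: str      # reference image of the target character

def make_reverse_driving_sample(real_video, synthetic_video, ref_index=0):
    """Pair a real video y with its synthetic counterpart y~ in reverse order.

    The synthetic video becomes the driving input and the real video the
    target, so generator artifacts never appear in the supervised output.
    ref_index is an assumption; the summary only says the reference frame
    comes from the real video.
    """
    if len(real_video) != len(synthetic_video):
        raise ValueError("real and synthetic videos must have equal length")
    return TrainingSample(
        driving=list(synthetic_video),
        target=list(real_video),
        reference=real_video[ref_index],
    )

sample = make_reverse_driving_sample(["r0", "r1", "r2"], ["s0", "s1", "s2"])
```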

### 3.3 Model Design

**Architecture:** In-Context Driving design: condition tokens are concatenated with the sequence being denoised, so the input is $[z_{\text{ref}}; z_t; z_{\text{driv}}]$, with $z_{\text{driv}}$ placed at a fixed spatial offset $\Delta W$.

**In-Context Mask Conditioning:** Adds 1 channel as environment switch (Animation vs Replacement) and $K$ channels as **binding slots** describing motion-character binding. Masks are derived from reference and driving sequences using SAM3, not from ground truth. The masks provide enhanced guidance without altering visual context.
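
One way to picture the extra $1 + K$ channels is as a stack of binary planes. The sketch below is an assumption-laden illustration: encoding the environment switch as a constant 0/1 plane, assigning characters to slots in list order, and the function name `build_mask_condition` are all hypothetical, and the masks are hand-written rather than produced by SAM3.

```python
def build_mask_condition(height, width, character_masks, num_slots, replacement_mode):
    """Return a (1 + num_slots) x H x W nested list of 0/1 values.

    Channel 0 is the environment switch (assumed encoding: 0 = Animation,
    1 = Replacement). Channels 1..K are binding slots, one per character;
    unused slots are left as all zeros.
    """
    if len(character_masks) > num_slots:
        raise ValueError("more characters than binding slots")
    switch_value = 1 if replacement_mode else 0
    channels = [[[switch_value] * width for _ in range(height)]]
    for slot in range(num_slots):
        if slot < len(character_masks):
            mask = character_masks[slot]
            channels.append([[1 if mask[r][c] else 0 for c in range(width)]
                             for r in range(height)])
        else:
            channels.append([[0] * width for _ in range(height)])
    return channels

# One character occupying the top-left pixel of a 2x3 frame, K = 2 slots.
cond = build_mask_condition(2, 3, [[[1, 0, 0], [0, 0, 0]]],
                            num_slots=2, replacement_mode=True)
```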

**Mode-Specific Shifted RoPE:** Different temporal/spatial RoPE coordinates for Animation and Replacement modes to model their differences. Table 1 summarizes coordinates:

| | t | h | w |
|---|---|---|---|
| **Animation Mode** | | | |
| $z_{\text{ref}}$ | 0 | $[0, H_v)$ | $[0, W_v)$ |
| $z_t$ | $[1, T_v]$ | $[0, H_v)$ | $[0, W_v)$ |
| $z_{\text{driv}}$ | $[1, T_v]$ | $[0, H_v)$ | $[\Delta W, \Delta W+W_v)$ |
| **Replacement Mode** | | | |
| $z_{\text{ref}}$ | 0 | $[\Delta^H_{\text{ref}}, \Delta^H_{\text{ref}}+H_v)$ | $[0, W_v)$ |
| $z_t$ | $[0, T_v-1]$ | $[0, H_v)$ | $[0, W_v)$ |
| $z_{\text{driv}}$ | $[0, T_v-1]$ | $[0, H_v)$ | $[\Delta W, \Delta W+W_v)$ |
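
The coordinate assignment in Table 1 can be written as a small lookup. This is a sketch under stated assumptions: the function name `rope_ranges`, the half-open `(start, stop)` convention, and the numeric values chosen for $\Delta W$ and $\Delta^H_{\text{ref}}$ are illustrative, since the summary does not give their actual values.

```python
def rope_ranges(mode, T_v, H_v, W_v, delta_w, delta_h_ref=0):
    """Return {token_group: (t_range, h_range, w_range)} following Table 1.

    Every range is a half-open (start, stop) pair, so the inclusive
    temporal range [1, T_v] is returned as (1, T_v + 1).
    """
    if mode == "animation":
        ref_h = (0, H_v)
        video_t = (1, T_v + 1)   # [1, T_v]: video frames follow the reference
    elif mode == "replacement":
        ref_h = (delta_h_ref, delta_h_ref + H_v)  # reference shifted in height
        video_t = (0, T_v)       # [0, T_v - 1]: video shares t = 0 with the reference
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {
        "z_ref": ((0, 1), ref_h, (0, W_v)),
        "z_t": (video_t, (0, H_v), (0, W_v)),
        "z_driv": (video_t, (0, H_v), (delta_w, delta_w + W_v)),
    }

# Hypothetical sizes and offsets, chosen only to make the ranges concrete.
anim = rope_ranges("animation", T_v=4, H_v=8, W_v=6, delta_w=6)
repl = rope_ranges("replacement", T_v=4, H_v=8, W_v=6, delta_w=6, delta_h_ref=8)
```

The two modes differ only in where the reference sits: Animation separates it from the video in time, while Replacement aligns it in time and shifts it in height.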

### 3.4 Post Training: Bias-Aware DPO

To mitigate errors from synthetic data (especially in hand regions), a preference dataset is constructed:
- Given driving video $y$, pose estimator $P$, generator $\mathcal{G}$, synthesize $r = \mathcal{G}(P(y), R)$ and $s = \mathcal{G}(P(y), S)$ with same pose but different reference images.
- The negative sample $r^-$ is obtained by running one more round of error propagation through the pose estimator and generator:

$$
r^- = \mathcal{G}\left(P''\left(\mathcal{G}\left(P'(y), R\right)\right), R\right)
\tag{6}
$$

The preference tuple is $(s, R_1, r, r^-)$, where $(s, R_1)$ are the conditioning inputs, $r$ is the preferred sample, and $r^-$ is the less preferred one. Optimization is DPO-based.
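
For orientation, the generic DPO objective on a single preference pair looks like the sketch below. This is the standard formulation, not the paper's Bias-Aware variant: the summary does not specify how likelihoods are computed for the diffusion model or how the bias-aware term enters, so the log-likelihood inputs, `beta`, and the function names are assumptions.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one preference pair.

    logp_* are log-likelihoods of the preferred (r) and less preferred (r^-)
    samples under the policy; ref_logp_* are the same quantities under a
    frozen reference model. The loss is -log(sigmoid(margin)).
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return softplus(-margin)

# Hypothetical log-likelihoods: the policy favors the preferred sample
# more strongly than the reference model does, so the loss drops below log 2.
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss equals $\log 2$ when the policy and reference agree, and decreases as the policy raises the preferred sample's likelihood relative to the less preferred one.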

## Empirical Validation / Results

### Quantitative Evaluation

**Cross-Identity Human Evaluation (Figs. 5-7):** SCAIL-2 wins against open-source models (SCAIL, Wan-Animate) and the proprietary Kling 3.0 in motion consistency, physical plausibility, and identity consistency across single-character, multi-character, and replacement tasks. On the multi-character task, it achieves a 90% win rate in identity isolation against Wan-Animate.

**Pose-Driven Metrics (Table 2):** On Studio-Bench (single-character split):

| Method | SSIM ↑ | PSNR ↑ | LPIPS ↓ | FVD ↓ |
|---|---|---|---|---|
| **Ours + SAM3D-Body Mesh** | 0.6453 | 19.09 | 0.2231 | 287.11 |
| **Ours + NLF-Pose Skeleton** | 0.6370 | 18.76 | 0.2285 | 282.85 |
| SCAIL + SAM3D-Body Skeleton | 0.6407 | 19.08 | 0.2212 | 309.63 |
| SCAIL + NLF-Pose Skeleton | 0.6378 | 19.08 | 0.2212 | 312.79 |
| Wan-Animate | 0.6340 | 18.62 | 0.2269 | 305.31 |

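SCAIL-2 with the SAM3D-Body mesh (zero-shot) achieves the best SSIM and PSNR, while SCAIL-2 with the NLF-Pose skeleton achieves the best FVD (282.85). Both SCAIL-2 variants improve FVD over every baseline, and their LPIPS stays close to the best baseline value (0.2212), consistent with the claimed advantage of end-to-end information extraction.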

**Video-Bench Evaluation (Table 3):** On X-dance, SCAIL-2 achieves best Imaging Quality (4.43) and Appearance Consistency (4.38).

### Qualitative Evaluation (Fig. 8)

SCAIL-2 produces accurate motions with superior identity consistency, precise human-object interactions (e.g., handling a ball), and natural environment integration. In replacement mode, it outperforms MoCha and Wan-Animate at handling crossing crowds and avoiding artifacts.

### Ablation Studies (Fig. 9)

- **Driving modes:** End-to-end driving outperforms pose-driven for complex interactions (e.g., fighting).
- **Network modules:** Environment switch and Mode-Specific RoPE are essential for unifying modes. Binding slots help maintain identity under pedestrian overlap.
- **Data composition:** Animation data and replacement data show synergy; removal of one degrades performance on cross-body-shape or overlap scenarios.
- **Bias-Aware DPO (Fig. 10):** Outperforms base model and SFT variant in hand detail; also refines mouth/shoulders.

## Theoretical and Practical Implications

**Theoretical:** The paper demonstrates that end-to-end conditioning, by preserving full visual information, overcomes limitations of intermediate representations in complex scenarios. The reverse-driving training paradigm effectively decouples motion extraction from environment rendering, allowing models to benefit from synthetic data without inheriting generator artifacts. The unification via mask conditioning and RoPE provides a principled way to share optimization across tasks.

**Practical:** SCAIL-2 enables robust character animation for production: it handles non-human driving sources, complex multi-character interactions, and character replacement with natural environment integration. The open-source release of model weights and synthetic data facilitates further research. The Bias-Aware DPO offers a practical method to refine fine-grained motion from imperfect synthetic data.

## Conclusion

SCAIL-2 presents an end-to-end framework for controlled character animation that unifies multiple sub-tasks. Key contributions: an end-to-end conditioning paradigm bypassing intermediates, the MotionPair-60K synthetic dataset, in-context mask conditioning and mode-specific RoPE for task unification, and Bias-Aware DPO for fine-grained motion refinement. Extensive experiments show state-of-the-art performance in cross-identity motion following, environment integration, and multi-character interactions. Limitations include dependency on synthetic data quality; future work could extend to lip-syncing and facial expressions. The framework is positioned to benefit from advances in data synthesis.
