# MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

> MACE-Dance introduces a cascaded framework that decouples motion and appearance, using a diffusion-based motion expert and a kinematic-aesthetic fine-tuned appearance expert to achieve state-of-the-art music-driven dance video generation.

- **Source:** [arXiv](https://arxiv.org/abs/2512.18181)
- **Published:** 2026-05-12
- **Permalink:** https://picx.dev/p/rZzfeY
- **Whiteboard:** https://picx.dev/p/rZzfeY/image

## Summary (Overview)
- **Framework**: Introduces MACE-Dance, a cascaded Mixture-of-Experts (MoE) framework for music-driven dance video generation, decoupling motion and appearance synthesis.
- **Motion Expert**: Uses a Diffusion Model with a BiMamba-Transformer hybrid architecture and Guidance-Free Training (GFT) strategy to generate kinematically plausible and artistically expressive 3D dance motion from music.
- **Appearance Expert**: Adopts a decoupled Kinematic–Aesthetic fine-tuning strategy on Wan-Animate to synthesize high-fidelity videos from 3D motion and a reference image.
- **Dataset & Protocol**: Curates a large-scale dataset (MA-Data) and introduces a motion–appearance evaluation protocol to benchmark the task.
- **Performance**: Achieves state-of-the-art (SOTA) performance in music-driven dance video generation, as well as in the subtasks of music-driven 3D dance generation and pose-driven image animation.

## Introduction and Theoretical Foundation
Music-driven dance video generation is a timely research direction with the rise of online dance platforms and advances in AIGC. The task faces two key challenges: **(1)** generating kinematically plausible and artistically expressive dance motions, and **(2)** achieving high-fidelity visual appearance with strong spatiotemporal consistency. Existing approaches from related domains (music-driven 3D dance generation, pose-driven image animation, audio-driven talking-head synthesis) are not readily transferable due to fundamental mismatches. Research on music-driven dance video generation itself remains limited and often fails to capture the inherently 3D nature of dance, compromising motion and appearance quality.

**MACE-Dance** addresses these challenges through a **cascaded expert** design. The **Motion Expert** generates 3D motion from music, enforcing kinematic plausibility and artistic expressiveness. The **Appearance Expert** synthesizes videos conditioned on this 3D motion and a reference image, preserving visual identity and spatiotemporal coherence. Crucially, the framework uses **3D SMPL parameters** as the intermediate representation, rather than 2D keypoints, for three reasons:
1. **Richer spatial fidelity**: Preserves full-body geometric structure, including global translation and orientation.
2. **Cleaner supervision**: Disentangles pose from camera viewpoint and subject-specific appearance.
3. **Better robustness**: More robust to self-occlusion and viewpoint variation.

## Methodology
The overall objective is to synthesize dance videos $D \in \mathbb{R}^{T \times H \times W \times 3}$ given a music sequence $M \in \mathbb{R}^{T \times C_m}$ and a reference image $I \in \mathbb{R}^{H \times W \times 3}$.
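
The cascade can be read as a composition of two functions with the SMPL sequence as the interface between them. The sketch below only checks tensor shapes: both experts are stand-in stubs, and `motion_expert`, `appearance_expert`, `C_M`, and `SMPL_DIM` are hypothetical names and sizes chosen for illustration, not values from the paper.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
T, H, W = 8, 16, 16   # frames, height, width
C_M = 35              # music feature channels (assumed)
SMPL_DIM = 75         # per-frame SMPL parameter size (assumed)

def motion_expert(music: np.ndarray) -> np.ndarray:
    """Stub: maps music (T, C_m) to an SMPL sequence (T, SMPL_DIM)."""
    assert music.ndim == 2
    return np.zeros((music.shape[0], SMPL_DIM))

def appearance_expert(motion: np.ndarray, ref_image: np.ndarray) -> np.ndarray:
    """Stub: maps motion (T, SMPL_DIM) and an image (H, W, 3) to video (T, H, W, 3)."""
    assert ref_image.ndim == 3 and ref_image.shape[-1] == 3
    # Placeholder "video": the reference image repeated once per frame.
    return np.repeat(ref_image[None], motion.shape[0], axis=0)

def mace_dance(music: np.ndarray, ref_image: np.ndarray) -> np.ndarray:
    motion = motion_expert(music)                 # stage 1: music -> 3D motion
    return appearance_expert(motion, ref_image)   # stage 2: motion + image -> video

video = mace_dance(np.zeros((T, C_M)), np.zeros((H, W, 3)))
print(video.shape)
```

The point of the interface is that the second stage never sees the music: everything it needs about timing and pose has to be carried by the motion sequence.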

### Motion Expert
**Generative Strategy**: Uses a Diffusion Model with Guidance-Free Training (GFT).
- The forward noising process in DDPM is defined as:
$$q(z_t | x) \sim \mathcal{N}(\sqrt{\bar{\alpha}_t} x, (1 - \bar{\alpha}_t)I)$$
where $\bar{\alpha}_t \in (0,1)$ are constants following a monotonically decreasing schedule.
- With music conditioning $c$, the model learns to estimate $\hat{x}_\theta(z_t, t, c) \approx x$. GFT instead supervises a mixed prediction $x_\beta$, which replaces the plain conditional estimate in the losses:
$$x_\beta = \beta \hat{x}_\theta(z_t, t, c, \beta) + (1 - \beta) \text{sg}[\hat{x}_\theta(z_t, t, \emptyset, 1)]$$
where $\emptyset$ denotes the unconditional setting, $\text{sg}$ is the stop-gradient operator, and $\beta \in [0,1]$ is a temperature parameter; at $\beta = 1$ the expression reduces to the conditional estimate.
- The training loss combines reconstruction, 3D joint, velocity, and foot contact losses:
$$\mathcal{L} = \lambda_{\text{rec}}\mathcal{L}_{\text{rec}} + \lambda_{\text{joint}}\mathcal{L}_{\text{joint}} + \lambda_{\text{vel}}\mathcal{L}_{\text{vel}} + \lambda_{\text{foot}}\mathcal{L}_{\text{foot}}$$
where:
$$\mathcal{L}_{\text{rec}} = \mathbb{E}[\| x_\beta - x \|_2^2], \quad \mathcal{L}_{\text{joint}} = \mathbb{E}[\| FK(x_\beta) - FK(x) \|_2^2],$$
$$\mathcal{L}_{\text{vel}} = \mathbb{E}[\| FK(x_\beta)' - FK(x)' \|_2^2], \quad \mathcal{L}_{\text{foot}} = \mathbb{E}[\| FK(x_\beta)' \cdot \hat{b} \|_2^2]$$
$FK(\cdot)$ is the forward kinematics function, the prime denotes the frame-to-frame velocity, and $\hat{b}$ is the predicted binary foot contact label.
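
The pieces above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions rather than the authors' implementation: velocities are approximated by finite differences, the kinematic losses take joint positions (the outputs of $FK$) directly, the contact label is passed in as a mask, and mean squared error is used, which matches the squared L2 norm up to a constant factor. All function names are hypothetical.

```python
import numpy as np

def q_sample(x, alpha_bar_t, rng):
    """Draw z_t ~ N(sqrt(alpha_bar_t) * x, (1 - alpha_bar_t) * I)."""
    noise = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * noise

def gft_target(x_cond, x_uncond, beta):
    """x_beta = beta * x_cond + (1 - beta) * sg[x_uncond].

    NumPy has no autograd, so stop-gradient is only conceptual here; in an
    autodiff framework x_uncond would be detached before mixing.
    """
    return beta * x_cond + (1.0 - beta) * x_uncond

def kinematic_losses(joints_pred, joints_true, contact):
    """Joint, velocity and foot-contact terms computed on FK outputs.

    joints_*: (T, J, 3) joint positions, i.e. FK(x_beta) and FK(x).
    contact:  (T-1, J) binary mask, 1 where a joint should be static.
    """
    vel_pred = np.diff(joints_pred, axis=0)   # finite-difference velocity
    vel_true = np.diff(joints_true, axis=0)
    l_joint = np.mean((joints_pred - joints_true) ** 2)
    l_vel = np.mean((vel_pred - vel_true) ** 2)
    l_foot = np.mean((vel_pred * contact[..., None]) ** 2)  # penalize sliding
    return l_joint, l_vel, l_foot

rng = np.random.default_rng(0)
x = np.ones((4, 3))
z_t = q_sample(x, alpha_bar_t=0.5, rng=rng)
x_beta = gft_target(x_cond=x, x_uncond=np.zeros_like(x), beta=0.25)
```

The total loss would then be the weighted sum of the reconstruction term and the three kinematic terms, with the $\lambda$ weights left as hyperparameters.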

**Model Architecture**: Adopts a **BiMamba–Transformer hybrid backbone**.
- **BiMamba** captures intra-modal local dependencies in music or dance. The Selective State Space Model (Mamba) dynamics are:
$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \quad y_t = C_t h_t$$
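where $\bar{A}_t, \bar{B}_t, C_t$ are input-dependent (selective). The discrete parameters are obtained from the continuous parameters $A$, $B$ and a step size $\Delta$ by zero-order-hold discretization: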
$$\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$$
- **Transformer** models cross-modal global context via cross-attention:
$$\text{Attention} = \text{softmax}\left(\frac{Q_d \cdot K_m^T}{\sqrt{C}}\right) V_m$$
where motion features are queries and music features provide keys/values.
- The architecture enables **non-autoregressive** generation of entire sequences, improving efficiency and avoiding exposure bias.
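
A compact NumPy sketch of the two building blocks may help make the equations concrete. It assumes a diagonal $A$ stored as a vector and a single input channel, shares one parameter set between the forward and backward scans, and merges the two directions by summation; these are simplifying assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def zoh_discretize(delta, A, B):
    """Zero-order hold for a diagonal A stored as a vector.

    A_bar = exp(delta*A);  B_bar = (delta*A)^-1 (exp(delta*A) - 1) * delta*B
    """
    dA = delta * A
    A_bar = np.exp(dA)
    B_bar = (A_bar - 1.0) / dA * (delta * B)
    return A_bar, B_bar

def ssm_scan(x, delta, A, B, C):
    """h_t = A_bar_t h_{t-1} + B_bar_t x_t,  y_t = C_t . h_t  (one channel).

    x, delta: (T,);  A: (N,);  B, C: (T, N) input-dependent parameters.
    """
    h = np.zeros_like(A)
    y = np.empty_like(x)
    for t in range(len(x)):
        A_bar, B_bar = zoh_discretize(delta[t], A, B[t])
        h = A_bar * h + B_bar * x[t]
        y[t] = C[t] @ h
    return y

def bidirectional_scan(x, delta, A, B, C):
    """Forward scan plus a scan over the time-reversed sequence."""
    fwd = ssm_scan(x, delta, A, B, C)
    bwd = ssm_scan(x[::-1], delta[::-1], A, B[::-1], C[::-1])[::-1]
    return fwd + bwd

def cross_attention(Q_d, K_m, V_m):
    """softmax(Q_d K_m^T / sqrt(C)) V_m: dance queries, music keys/values."""
    scores = Q_d @ K_m.T / np.sqrt(Q_d.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V_m

T, N = 3, 2
x = np.array([0.0, 1.0, 0.0])
y = bidirectional_scan(x, np.full(T, 0.5), np.array([-1.0, -2.0]),
                       np.ones((T, N)), np.ones((T, N)))
```

Because every output frame in the bidirectional scan depends on both past and future inputs, the whole sequence can be produced in one pass, which is what makes non-autoregressive generation possible.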

### Appearance Expert
Built upon **Wan-Animate**, with a **decoupled Kinematic–Aesthetic fine-tuning strategy** to adapt it for dance video generation.

**Model Architecture**:
- A **3D-to-2D Motion Projector** converts the SMPL sequence from the Motion Expert into 2D keypoints (using pyrender and ViTPose) for Wan-Animate.
- **Kinematic Stage**: Fine-tunes only the **Body Adapter** (freezing other components) to strengthen kinematic conditioning and motion adherence.
- **Aesthetic Stage**: Attaches **LoRA** adapters to each DiT block for aesthetic refinement, enhancing texture fidelity and stylistic consistency while preserving pretrained priors. LoRA updates the weight matrix $W_0$ as:
$$W = W_0 + \Delta W = W_0 + AB$$
where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ are low-rank matrices ($r \ll m, n$).
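
To illustrate why the low-rank update is cheap, the following sketch applies $W_0 + AB$ to an input and compares parameter counts. The sizes are arbitrary, and initializing one factor to zero so that training starts from the pretrained weights is common LoRA practice rather than a detail stated in this summary.

```python
import numpy as np

def lora_forward(x, W0, A, B):
    """y = x (W0 + A B), computed without materializing the m x n update."""
    return x @ W0 + (x @ A) @ B

m, n, r = 64, 48, 4                      # illustrative sizes with r << m, n
rng = np.random.default_rng(0)
W0 = rng.standard_normal((m, n))         # frozen pretrained weight
A = rng.standard_normal((m, r)) * 0.01   # trainable low-rank factor
B = np.zeros((r, n))                     # trainable, zero init so W == W0 at start

x = rng.standard_normal((2, m))
y = lora_forward(x, W0, A, B)

full_params = m * n          # parameters in the dense weight
lora_params = r * (m + n)    # parameters in A and B together
print(full_params, lora_params)   # 3072 448
```

With these sizes the adapter trains roughly 15% of the parameters of the dense matrix, and the ratio shrinks further as $m$ and $n$ grow relative to $r$.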

## Empirical Validation / Results

### Dataset: MA-Data
A large-scale dance video dataset curated for benchmarking.
- **Size**: 70k clips of 5–10 seconds each (116 hours total), spanning over 20 dance genres.
- **Composition**:
    1. **3D-rendered data (motion-centric)**: 20k clips (28 hours) derived from FineDance, rendered from 3D professional dancer motions.
    2. **In-the-wild internet data (appearance-centric)**: 50k clips (88 hours) collected from TikTok/YouTube, emphasizing visual appearance.
- **Test set**: 200 5-second clips from high-engagement videos across multiple genres.
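
As a quick consistency check on the figures above, the subset totals and the implied average clip lengths can be computed directly. The clip counts are the rounded values quoted in this summary, so the averages are approximate.

```python
subsets = {
    "3D-rendered": {"clips": 20_000, "hours": 28},
    "in-the-wild": {"clips": 50_000, "hours": 88},
}

total_clips = sum(s["clips"] for s in subsets.values())
total_hours = sum(s["hours"] for s in subsets.values())
print(total_clips, total_hours)   # 70000 116

# Average clip length in seconds for each subset and overall.
mean_seconds = {k: s["hours"] * 3600 / s["clips"] for k, s in subsets.items()}
overall_mean = total_hours * 3600 / total_clips
for name, sec in mean_seconds.items():
    print(f"{name}: {sec:.2f} s")
print(f"overall: {overall_mean:.2f} s")
```

The implied averages (about 5.0 s, 6.3 s, and 6.0 s overall) all fall inside the stated 5–10 second range, so the clip counts and durations are mutually consistent.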

### Evaluation Protocol
A **motion–appearance** evaluation protocol.
- **Motion dimension**: Assesses fidelity, diversity, and synchronization from a Human-Kinematics perspective using 2D keypoints extracted by ViTPose.
    - **Metrics**: FID$_{k}$, FID$_{g}$, DIV$_{k}$, DIV$_{g}$, Beat Alignment Score (BAS).
- **Appearance dimension**: Uses **VBench** dance-specific metrics.
    - **Metrics**: Imaging Quality (IQ), Aesthetic Quality (AQ), Subject Consistency (SC), Background Consistency (BC), Motion Smoothness (MS), Temporal Flickering (TF).
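
This summary does not spell out how BAS is computed. One formulation commonly used in the music-to-dance literature scores each music beat by a Gaussian of its distance to the nearest motion beat; the sketch below follows that formulation as an assumption. The function name, the choice to average over music beats (some works average over motion beats instead), and the tolerance `sigma` are all hypothetical, and beat positions are taken as given, in frames.

```python
import math

def beat_alignment_score(music_beats, motion_beats, sigma=3.0):
    """Mean over music beats of exp(-d^2 / (2 sigma^2)), where d is the
    distance in frames to the nearest motion beat.

    sigma is an assumed tolerance, not a value taken from the summary.
    """
    if not music_beats or not motion_beats:
        return 0.0
    total = 0.0
    for t_music in music_beats:
        d = min(abs(t_music - t_motion) for t_motion in motion_beats)
        total += math.exp(-(d ** 2) / (2.0 * sigma ** 2))
    return total / len(music_beats)

perfect = beat_alignment_score([10, 20, 30], [10, 20, 30])
shifted = beat_alignment_score([10, 20, 30], [13, 23, 33])
print(perfect, shifted)
```

Under this definition a score of 1 means every music beat coincides with a motion beat, and the score decays smoothly as motion beats drift away from the music.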

### Quantitative Results

**Table 1: Quantitative comparison on MA-Data in Music-Driven Dance Video Generation**
| Method | IQ ↑ | AQ ↑ | SC ↑ | BC ↑ | MS ↑ | TF ↑ | FID$_k$ ↓ | FID$_g$ ↓ | DIV$_k$ ↑ | DIV$_g$ ↑ | BAS ↑ |
|--------|------|------|------|------|------|------|-----------|-----------|-----------|-----------|-------|
| Ground Truth | 67.12 | 53.51 | 91.86 | 92.97 | 98.20 | 96.88 | – | – | 9.24 | 5.31 | 0.526 |
| Hallo2 | 62.64 | 50.79 | 92.48 | 93.84 | 98.30 | 96.56 | 16.55 | 1.29 | 8.11 | 5.47 | 0.505 |
| WAN-S2V | 64.10 | 50.20 | 92.30 | 93.40 | 98.20 | 96.70 | 18.90 | 1.45 | 7.60 | 5.44 | 0.485 |
| Echomimic-V3 | 63.20 | 49.00 | 91.90 | 93.10 | 98.05 | 96.40 | 19.60 | 1.32 | 7.20 | 4.60 | 0.460 |
| EDGE | 63.05 | 49.70 | 91.79 | 93.30 | 98.64 | 97.10 | 21.77 | 1.39 | 9.08 | 5.74 | 0.498 |
| Lodge | 63.69 | 49.22 | 91.67 | 92.98 | 98.46 | 97.05 | 18.73 | 1.49 | 8.87 | 5.71 | 0.474 |
| MEGA | 66.14 | 49.89 | 92.95 | 94.13 | 97.45 | 96.32 | 18.98 | 1.65 | 8.78 | 5.59 | 0.513 |
| **MACE-Dance** | **65.35** | **51.79** | **93.97** | **94.57** | **98.46** | **97.10** | **16.46** | **0.28** | **9.74** | **6.34** | **0.523** |

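**Among the generated results, MACE-Dance scores best on AQ, SC, BC, and all five motion metrics, and ties EDGE on TF; MEGA is higher on IQ (66.14 vs. 65.35) and EDGE on MS (98.64 vs. 98.46).**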

**Table 2: Quantitative comparison on FineDance in Music-Driven 3D Dance Generation**
| Method | FID$_k$ ↓ | FID$_g$ ↓ | FSR ↓ | DIV$_k$ ↑ | DIV$_g$ ↑ | BAS ↑ | FPS ↑ |
|--------|-----------|-----------|--------|-----------|-----------|-------|-------|
| Ground Truth | – | – | 0.216 | 9.94 | 7.54 | 0.201 | – |
| FACT | 113.38 | 97.05 | 0.284 | 3.36 | 6.37 | 0.183 | 29 |
| MNET | 104.71 | 90.31 | 0.394 | 3.12 | 6.14 | 0.186 | 26 |
| Bailando | 82.81 | 28.17 | 0.188 | 7.74 | 6.25 | 0.202 | 188 |
| EDGE | 94.34 | 50.38 | 0.200 | 8.13 | 6.45 | 0.212 | 119 |
| Lodge | 50.00 | 35.52 | 0.028 | 5.67 | 4.96 | 0.226 | 224 |
| MEGA | 50.00 | 13.02 | 0.243 | 6.23 | 6.27 | 0.226 | 238 |
| **Motion Expert (Full)** | **17.83** | **25.09** | **0.210** | **10.30** | **8.09** | **0.229** | **770** |

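**The Motion Expert achieves the best FID$_k$, DIV$_k$, DIV$_g$, BAS, and throughput (770 FPS) among the compared methods, while MEGA attains a lower FID$_g$ (13.02 vs. 25.09) and Lodge, Bailando, and EDGE report lower FSR.**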

**Table 3: Quantitative comparison on MA-Data in Pose-Driven Image Animation**
| Method | FVD ↓ | SSIM ↑ | LPIPS ↓ | PSNR ↑ |
|--------|-------|-------|--------|-------|
| Animate-Anyone | 515.26 | 0.648 | 0.091 | 19.65 |
| Magic-Animate | 1032.06 | 0.311 | 0.207 | 14.00 |
| Wan-Animate | 332.82 | 0.707 | 0.078 | 21.11 |
| w/o. Kinematic stage | 328.91 | 0.596 | 0.107 | 18.69 |
| w/o. Aesthetic stage | 445.93 | 0.563 | 0.121 | 17.89 |
| **Appearance Expert** | **274.94** | **0.739** | **0.066** | **22.40** |

**The Appearance Expert achieves SOTA performance on all four metrics, and removing either fine-tuning stage degrades every metric relative to the full model.**

### Qualitative Analysis
- **Effect Comparison**: MACE-Dance generates videos with kinematically plausible and artistically expressive motion, and spatiotemporally coherent appearance, outperforming baselines like Hallo2, EDGE, Lodge, MEGA, WAN-S2V, and Echomimic-V3 (see Fig. 3).
- **Cross-Genre Generation**: Effectively generates distinct genre-specific motions (Uyghur, Dunhuang, Dai, K-Pop, Popping) as shown in Fig. 4.
- **Long-Sequence Generation**: Produces coherent long-sequence dance videos (up to 30 seconds) thanks to the BiMamba–Transformer hybrid and pose-driven relay rendering (see Fig. 6).

### Ablation Studies
**Motion Expert Ablation (Table 2)**:
- **BiMamba → Mamba**: Removes bidirectional context, degrading dance quality metrics (FID$_k$=65.10, FID$_g$=51.74) though efficiency improves (1044 FPS). Generated dances become simpler.
- **BiMamba → Transformer**: Undermines non-autoregressive generation, and the output collapses to in-place jitter. Metrics drop severely (FID$_k$=104.93, FID$_g$=114.42).
- **GFT → CFG**: Replacing GFT with classifier-free guidance leads to modest decline in metrics and lower generation efficiency.

**Appearance Expert Ablation (Table 3 & Fig. 7)**:
- **w/o. Kinematic Stage**: Leads to modest metric decline and noticeable kinematic errors/motion blur.
- **w/o. Aesthetic Stage**: Causes substantial degradation and obvious ghosting artifacts.
- The full Appearance Expert outperforms the Wan-Animate baseline on all four metrics.

**Motion Representation (2D vs. 3D) (Table 4)**:
| Representation | Motion FID$_k$ ↓ | Motion FID$_g$ ↓ | Motion DIV$_k$ ↑ | Motion DIV$_g$ ↑ | Motion BAS ↑ | Video AQ ↑ | Video SC ↑ | Video FID ↓ | Video BAS ↑ |
|---------------|-----------|-----------|-----------|-----------|-------|------|------|-------|-------|
| 2D | 22.8 | 8.6 | 6.12 | 5.24 | 0.527 | 51.86 | 91.84 | 23.73 | 0.496 |
| 3D | 19.5 | 4.1 | 8.87 | 5.92 | 0.543 | 51.79 | 93.97 | 16.46 | 0.523 |

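**The 3D representation outperforms 2D on every motion metric and on three of the four final-video metrics (SC, FID, BAS); AQ is essentially tied, with 2D marginally higher (51.86 vs. 51.79).**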

**Role of Each Expert (Cross-Composition Analysis, Table 5)**:
| Method | AQ ↑ | SC ↑ | FID ↓ | BAS ↑ |
|--------|------|------|-------|-------|
| w/o.ME (EDGE + Our AE) | 50.21 | 92.10 | 20.84 | 0.499 |
| w/o.AE (Our ME + Wan-Animate) | 50.36 | 91.42 | 17.92 | 0.519 |
| **Ours (Full MACE-Dance)** | **51.79** | **93.97** | **16.46** | **0.523** |

**Both experts contribute positively; the Motion Expert strengthens music-motion alignment (BAS), and the Appearance Expert improves visual quality (AQ, SC).**

**Comparison with Video Foundation Models (Table 6)**:
| Method | AQ ↑ | SC ↑ | FID ↓ | BAS ↑ |
|--------|------|------|-------|-------|
| CogVideoX1.5-5B | 50.38 | 89.92 | 22.47 | 0.477 |
| WAN2.2-5B | 53.22 | 90.77 | 17.53 | 0.452 |
| **Ours** | **51.79** | **93.97** | **16.46** | **0.523** |

**MACE-Dance achieves better overall performance, particularly in SC, FID, and BAS, indicating stronger music-motion alignment and subject consistency; WAN2.2-5B scores higher on AQ (53.22 vs. 51.79).**

## Theoretical and Practical Implications
- **Task Decoupling**: The cascaded expert design effectively isolates motion semantics from visual appearance, reducing the complexity of learning a direct music-to-video mapping. The explicit 3D motion representation suppresses spurious cross-modal correlations and provides an interpretable intermediate interface.
- **Architectural Innovations**: The BiMamba–Transformer hybrid backbone combines local dependency modeling (BiMamba) with global cross-modal context (Transformer), enabling high-quality, non-autoregressive sequence generation.

---

_Markdown view of https://picx.dev/p/rZzfeY, served by PicX — AI-generated visual whiteboard summaries of research papers._
