MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Summary (Overview)

Cascaded Expert Framework: Proposes MACE-Dance, a novel framework for music-driven dance video generation that decomposes the task into two cascaded experts: a Motion Expert for generating 3D dance motion from music, and an Appearance Expert for synthesizing high-fidelity video from the motion and a reference image.
SOTA Performance: Achieves state-of-the-art (SOTA) performance on the curated MA-Data dataset for the full task, as well as on the FineDance dataset (3D dance generation) and the MA-Data dataset (pose-driven image animation) for its individual expert components.
Advanced Motion Generation: The Motion Expert uses a BiMamba-Transformer hybrid architecture within a Diffusion Model and employs Guidance-Free Training (GFT), enabling non-autoregressive, high-quality, and efficient 3D motion generation.
Specialized Appearance Synthesis: The Appearance Expert, built upon Wan-Animate, uses a decoupled Kinematic–Aesthetic fine-tuning strategy to adapt the model specifically for the complex patterns in dance videos, enhancing motion adherence and visual quality.
New Benchmark Resources: Introduces a large-scale dataset (MA-Data) and a comprehensive motion–appearance evaluation protocol to better benchmark the music-driven dance video generation task.

Introduction and Theoretical Foundation

The task of generating dance videos from music is compelling due to the popularity of online dance platforms and advances in AIGC. However, it faces two core challenges: generating kinematically plausible and artistically expressive dance motions, and achieving high-fidelity visual appearance with strong spatiotemporal consistency.

Existing approaches are not directly transferable:

Music-driven 3D dance generation focuses on motion but neglects realistic visual appearance and human-scene interaction.
Pose-driven image animation requires manual pose design, which is time-consuming.
Audio-driven talking-head synthesis focuses on upper-body gestures, not complex full-body dance motion.
The limited prior work on music-driven dance video generation often fails to capture the inherently 3D nature of dance, compromising both motion and appearance quality.

Theoretical Motivation: MACE-Dance addresses these gaps through a cascaded Mixture-of-Experts (MoE) design. This decoupling isolates motion semantics from visual appearance, reducing the complexity of learning a direct music-to-video mapping. Crucially, it uses 3D SMPL parameters as the intermediate representation instead of 2D keypoints, providing richer spatial fidelity, cleaner supervision (view-invariant), and better robustness to occlusion.

Methodology

3.1 Overview

Given a music sequence $M \in \mathbb{R}^{T \times C_m}$ and a reference image $I \in \mathbb{R}^{H \times W \times 3}$, the goal is to synthesize a dance video $D \in \mathbb{R}^{T \times H \times W \times 3}$.

Motion Expert (ME): Transforms $M$ into a 3D motion sequence $X \in \mathbb{R}^{T \times C_x}$ (SMPL parameters).
Appearance Expert (AE): Synthesizes the final video $D$ conditioned on $X$ and $I$.
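
The cascade can be pictured at the level of tensor shapes. The sketch below is only an interface illustration: both experts are placeholder functions, and the sizes `C_m = 35` and `C_x = 72` are hypothetical values chosen for the example, not taken from the source.

```python
import numpy as np

def motion_expert(music: np.ndarray, c_x: int = 72) -> np.ndarray:
    """Placeholder Motion Expert: (T, C_m) music features -> (T, C_x) motion."""
    num_frames = music.shape[0]
    return np.zeros((num_frames, c_x), dtype=np.float32)

def appearance_expert(motion: np.ndarray, ref_image: np.ndarray) -> np.ndarray:
    """Placeholder Appearance Expert: repeats the reference image once per frame."""
    num_frames = motion.shape[0]
    return np.repeat(ref_image[None], num_frames, axis=0)

def mace_dance(music: np.ndarray, ref_image: np.ndarray) -> np.ndarray:
    """Cascade: music -> 3D motion -> video, with the video conditioned on motion and image."""
    motion = motion_expert(music)
    return appearance_expert(motion, ref_image)

T, C_m, H, W = 8, 35, 16, 16  # hypothetical sizes
music = np.zeros((T, C_m), dtype=np.float32)
ref_image = np.zeros((H, W, 3), dtype=np.float32)
video = mace_dance(music, ref_image)
print(video.shape)  # (8, 16, 16, 3)
```

The point of the decomposition is that the only thing passed between the two stages is the motion sequence $X$, so either expert can be swapped or evaluated in isolation.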

3.2 Motion Expert

Generative Strategy:

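Based on DDPM. The forward noising process is defined as: $q(z_t \mid x) = \mathcal{N}\left(z_t; \sqrt{\bar{\alpha}_t}\, x, (1 - \bar{\alpha}_t)\mathbf{I}\right) \tag{1}$ where $\bar{\alpha}_t$ is the cumulative noise-schedule coefficient and $\mathbf{I}$ is the identity matrix (not the reference image $I$).
Employs Guidance-Free Training (GFT) instead of Classifier-Free Guidance (CFG). The optimization target $x_\beta$ is: $x_\beta = \beta \hat{x}_\theta(z_t, t, c, \beta) + (1 - \beta) \text{sg}[\hat{x}_\theta(z_t, t, \emptyset, 1)] \tag{3}$ where $c$ is the conditioning signal, $\emptyset$ is the empty (unconditional) condition, $\text{sg}[\cdot]$ denotes stop-gradient, and $\beta \in [0,1]$ is a temperature parameter provided as conditioning. Values of $\beta$ near 0 favor fidelity; values near 1 favor diversity.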
The overall training loss $\mathcal{L}$ is a weighted sum: $\mathcal{L} = \lambda_{\text{rec}}\mathcal{L}_{\text{rec}} + \lambda_{\text{joint}}\mathcal{L}_{\text{joint}} + \lambda_{\text{vel}}\mathcal{L}_{\text{vel}} + \lambda_{\text{foot}}\mathcal{L}_{\text{foot}} \tag{5}$ with losses for reconstruction, 3D joint positions, velocity, and foot contact.
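
The three ingredients above (forward noising, the GFT blend, and the weighted loss) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the linear noise schedule, the stand-in network outputs, the MSE form of the reconstruction term, and the loss weights are all hypothetical. NumPy has no autograd, so the stop-gradient is only noted in a comment.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, b_start=1e-4, b_end=0.02):
    """Cumulative product alpha_bar_t for a linear noise schedule (assumed)."""
    noise_betas = np.linspace(b_start, b_end, num_steps)
    return np.cumprod(1.0 - noise_betas)

def q_sample(x, t, alpha_bar, rng):
    """Draw z_t ~ N(sqrt(alpha_bar_t) * x, (1 - alpha_bar_t) * I), as in Eq. (1)."""
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar[t]) * x + np.sqrt(1.0 - alpha_bar[t]) * eps

def gft_mix(x_cond, x_uncond, temp):
    """Eq. (3): blend conditional and unconditional predictions.
    In an autograd framework, x_uncond would be wrapped in a stop-gradient."""
    return temp * x_cond + (1.0 - temp) * x_uncond

def weighted_loss(terms, weights):
    """Eq. (5): weighted sum of named loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x = rng.standard_normal((16, 72))  # toy clean motion clip (T=16, C_x=72, hypothetical)
z_t = q_sample(x, t=500, alpha_bar=alpha_bar, rng=rng)

# Stand-ins for the two network outputs (no model is defined here).
x_cond, x_uncond = x + 0.1, x - 0.1
x_beta = gft_mix(x_cond, x_uncond, temp=0.5)

rec = float(np.mean((x_beta - x) ** 2))  # assumed MSE reconstruction term
loss = weighted_loss({"rec": rec, "vel": 0.0}, {"rec": 1.0, "vel": 0.5})
```

Setting `temp=1.0` in `gft_mix` returns the conditional prediction unchanged, which is the no-guidance end of the temperature range.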

Model Architecture:

BiMamba-Transformer Hybrid Backbone: An $L_m$-layer BiMamba processes music features to capture intra-modal local dependencies. The dance generator consists of $L_d$ stacked blocks, each containing:
1. A BiMamba for motion feature processing.
2. FiLM modulation with a fused $t$–$\beta$ embedding.
3. A Transformer for cross-modal global context via attention between motion (query) and music (key/value): $\text{Attention} = \text{softmax}\left(\frac{Q_d \cdot K_m^T}{\sqrt{C}}\right) V_m \tag{8}$
4. A second FiLM layer.
Selective State Space Model (Mamba): For a time step $t$, the state evolves as: $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \quad y_t = C_t h_t \tag{6}$ with discretized parameters $\bar{A}$ and $\bar{B}$.
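
The building blocks listed above can be illustrated with a small NumPy sketch: the recurrence of Eq. (6), a bidirectional wrapper, FiLM modulation, and the cross-attention of Eq. (8). Several simplifications are assumptions made for the example: a single input channel, a diagonal state transition, summing the forward and backward scans, and the plain `scale * h + shift` form of FiLM. The source does not specify how BiMamba merges the two directions.

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C_out):
    """Eq. (6) with per-step parameters and a diagonal transition (assumed).
    x: (T,), A_bar / B_bar / C_out: (T, N). Returns y: (T,)."""
    num_steps, state_dim = A_bar.shape
    h = np.zeros(state_dim)
    y = np.empty(num_steps)
    for t in range(num_steps):
        h = A_bar[t] * h + B_bar[t] * x[t]  # h_t = A_t h_{t-1} + B_t x_t
        y[t] = C_out[t] @ h                 # y_t = C_t h_t
    return y

def bidirectional_scan(x, A_bar, B_bar, C_out):
    """Run the scan forward and on the time-reversed input, then sum (assumed merge)."""
    fwd = ssm_scan(x, A_bar, B_bar, C_out)
    bwd = ssm_scan(x[::-1], A_bar[::-1], B_bar[::-1], C_out[::-1])[::-1]
    return fwd + bwd

def film(h, scale, shift):
    """FiLM: feature-wise scale and shift predicted from a conditioning embedding."""
    return scale * h + shift

def cross_attention(Q_d, K_m, V_m):
    """Eq. (8): motion queries attend to music keys/values."""
    dim = Q_d.shape[-1]
    scores = Q_d @ K_m.T / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V_m

rng = np.random.default_rng(0)
T, N, dim = 6, 4, 8  # hypothetical sizes
x = rng.standard_normal(T)
A_bar = rng.uniform(0.5, 0.9, (T, N))
B_bar = rng.standard_normal((T, N))
C_out = rng.standard_normal((T, N))
y = bidirectional_scan(x, A_bar, B_bar, C_out)

motion = rng.standard_normal((T, dim))
music = rng.standard_normal((T, dim))
context = cross_attention(film(motion, 1.0, 0.0), music, music)
```

The scan touches each time step once, so its cost grows linearly with sequence length, whereas the attention step builds a full motion-by-music score matrix.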

3.3 Appearance Expert

Built upon Wan-Animate, with a 3D-to-2D Motion Projector to convert the SMPL sequence $X$ into 2D keypoints for conditioning.
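
The source does not describe the internals of the 3D-to-2D Motion Projector. As one plausible illustration, the sketch below applies a standard pinhole camera projection to 3D joint positions. It assumes the SMPL parameters have already been converted to joints in camera coordinates (that step needs the SMPL body model and is omitted), and the focal length and image center are hypothetical values.

```python
import numpy as np

def project_joints(joints_3d, focal=1000.0, center=(256.0, 256.0)):
    """Pinhole projection of 3D joints to pixel coordinates.
    joints_3d: (T, J, 3) in camera coordinates with z > 0. Returns (T, J, 2)."""
    x, y, z = joints_3d[..., 0], joints_3d[..., 1], joints_3d[..., 2]
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=-1)

# One frame, two joints: one on the optical axis, one shifted 1 unit along x.
joints = np.array([[[0.0, 0.0, 5.0], [1.0, 0.0, 5.0]]])
keypoints_2d = project_joints(joints)
print(keypoints_2d[0])  # [[256. 256.] [456. 256.]]
```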

Decoupled Fine-tuning Strategy:

Kinematic Stage: Fine-tunes only the Body Adapter to strengthen kinematic conditioning and motion adherence for dance.
Aesthetic Stage: Freezes kinematic pathways and attaches LoRA adapters to each DiT block for parameter-efficient aesthetic refinement. For a pre-trained weight matrix $W_0$, LoRA updates it as: $W = W_0 + \Delta W = W_0 + AB \tag{9}$ where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ are low-rank matrices ($r \ll m, n$).
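
The LoRA update in Eq. (9) can be checked numerically. The sketch below uses the shapes from the text; the sizes `m`, `n`, `r` and the zero initialization of one factor (a common practice, so that fine-tuning starts from $W_0$) are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4  # hypothetical sizes with r << m, n

W0 = rng.standard_normal((m, n))        # frozen pre-trained weight
A = rng.standard_normal((m, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, n))                    # zero init (assumed), so W == W0 at the start

def lora_forward(x, W0, A, B):
    """Apply W = W0 + A @ B without materializing the full update matrix."""
    return x @ W0 + (x @ A) @ B

x = rng.standard_normal((2, m))
out = lora_forward(x, W0, A, B)

full_params = m * n        # 3072 entries in W0
lora_params = r * (m + n)  # 448 trainable entries in A and B
print(full_params, lora_params)  # 3072 448
```

With these sizes the adapter trains about 15% as many parameters as the full matrix, and the ratio shrinks further as `m` and `n` grow relative to `r`.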

Empirical Validation / Results

4.1 Dataset: MA-Data

A large-scale dataset curated for this task.

Size: 70k clips (5-10s each), totaling 116 hours, spanning over 20 dance genres.
Composition:
- 3D-rendered data (20k clips, 28h): Motion-centric, derived from FineDance.
- In-the-wild internet data (50k clips, 88h): Appearance-centric, from TikTok/YouTube.

4.2 Evaluation Protocol

A motion–appearance protocol:

Motion Dimension: Evaluated on 2D keypoints extracted by ViTPose.
- Fidelity/Diversity: FID and DIV in kinetic (k) and geometric (g) feature spaces.
- Synchronization: Beat Alignment Score (BAS).
Appearance Dimension: Uses selected VBench metrics.
- Quality: Imaging Quality (IQ), Aesthetic Quality (AQ).
- Consistency: Subject Consistency (SC), Background Consistency (BC).
- Temporal Quality: Motion Smoothness (MS), Temporal Flickering (TF).
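
The source does not spell out which variant of the Beat Alignment Score it uses. As a sketch of one commonly used definition, the code below scores each motion beat by a Gaussian of its distance to the nearest music beat and averages the result. The value `sigma = 3` frames and the detection of motion beats as local minima of mean joint speed are assumptions for the example.

```python
import numpy as np

def motion_beats_from_keypoints(kpts):
    """kpts: (T, J, 2). Returns indices (into the per-frame speed array)
    where the mean joint speed has a strict local minimum."""
    speed = np.linalg.norm(np.diff(kpts, axis=0), axis=-1).mean(axis=-1)  # (T-1,)
    is_min = (speed[1:-1] < speed[:-2]) & (speed[1:-1] < speed[2:])
    return np.flatnonzero(is_min) + 1

def beat_alignment_score(motion_beats, music_beats, sigma=3.0):
    """Mean over motion beats of exp(-d^2 / (2 sigma^2)), where d is the
    distance in frames to the nearest music beat."""
    motion_beats = np.asarray(motion_beats, dtype=float)
    music_beats = np.asarray(music_beats, dtype=float)
    if motion_beats.size == 0 or music_beats.size == 0:
        return 0.0
    d = np.abs(motion_beats[:, None] - music_beats[None, :]).min(axis=1)
    return float(np.mean(np.exp(-d ** 2 / (2.0 * sigma ** 2))))

perfect = beat_alignment_score([10, 20], [10, 20])  # 1.0
offset = beat_alignment_score([13], [10])           # exp(-0.5), about 0.607
```

Under this definition the score lies in (0, 1] whenever both beat lists are non-empty, and it decays smoothly as motion beats drift away from music beats.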

4.3 Quantitative Comparisons

Table 1: Music-Driven Dance Video Generation (MA-Data)

Method	IQ ↑	AQ ↑	SC ↑	BC ↑	MS ↑	TF ↑	FID $_k$ ↓	FID $_g$ ↓	DIV $_k$ ↑	DIV $_g$ ↑	BAS ↑
Ground Truth	67.12	53.51	91.86	92.97	98.20	96.88	–	–	9.24	5.31	0.526
Hallo2	62.64	50.79	92.48	93.84	98.30	96.56	16.55	1.29	8.11	5.47	0.505
WAN-S2V	64.10	50.20	92.30	93.40	98.20	96.70	18.90	1.45	7.60	5.44	0.485
Echomimic-V3	63.20	49.00	91.90	93.10	98.05	96.40	19.60	1.32	7.20	4.60	0.460
EDGE	63.05	49.70	91.79	93.30	98.64	97.10	21.77	1.39	9.08	5.74	0.498
Lodge	63.69	49.22	91.67	92.98	98.46	97.05	18.73	1.49	8.87	5.71	0.474
MEGA	66.14	49.89	92.95	94.13	97.45	96.32	18.98	1.65	8.78	5.59	0.513
MACE-Dance	65.35	51.79	93.97	94.57	98.46	97.10	16.46	0.28	9.74	6.34	0.523

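MACE-Dance achieves the best score on 9 of the 11 metrics (tying EDGE on TF), trailing only on IQ (MEGA, 66.14 vs. 65.35) and MS (EDGE, 98.64 vs. 98.46). It excels in particular in motion quality (low FID, high DIV/BAS) and subject/background consistency.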

Table 2: Music-Driven 3D Dance Generation (FineDance)

Method	FID $_k$ ↓	FID $_g$ ↓	FSR ↓	DIV $_k$ ↑	DIV $_g$ ↑	BAS ↑	FPS ↑
Ground Truth	–	–	0.216	9.94	7.54	0.201	–
FACT	113.38	97.05	0.284	3.36	6.37	0.183	29
Bailando	82.81	28.17	0.188	7.74	6.25	0.202	188
EDGE	94.34	50.38	0.200	8.13	6.45	0.212	119
Lodge	50.00	35.52	0.028	5.67	4.96	0.226	224
MEGA	50.00	13.02	0.243	6.23	6.27	0.226	238
Motion Expert (Full)	17.83	25.09	0.210	10.30	8.09	0.229	770

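The Motion Expert achieves the best FID $_k$, DIV $_k$, DIV $_g$, BAS, and FPS among the compared methods, with notably high diversity and generation efficiency (770 FPS vs. 238 for the next fastest). It is competitive rather than best on the remaining two metrics: MEGA attains a lower FID $_g$ (13.02 vs. 25.09), and several baselines report a lower FSR (e.g., Lodge at 0.028 vs. 0.210).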

Table 3: Pose-Driven Image Animation (MA-Data)

Method	FVD ↓	SSIM ↑	LPIPS ↓	PSNR ↑
Animate-Anyone	515.26	0.648	0.091	19.65
Magic-Animate	1032.06	0.311	0.207	14.00
Wan-Animate	332.82	0.707	0.078	21.11
w/o. Kinematic stage	328.91	0.596	0.107	18.69
w/o. Aesthetic stage	445.93	0.563	0.121	17.89
Appearance Expert	274.94	0.739	0.066	22.40

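The Appearance Expert with the full fine-tuning strategy achieves the best score on all four metrics, and removing either stage degrades SSIM, LPIPS, and PSNR below the Wan-Animate baseline.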

4.4 Qualitative Analysis & Ablation Studies

Qualitative Superiority: MACE-Dance generates videos with kinematically plausible, expressive motion and spatiotemporally coherent appearance, outperforming baselines that show blur, artifacts, or simplistic motion (Fig. 3).
Cross-Genre & Long-Sequence Generation: Effectively generates distinct genre-specific motions (Uyghur, Dunhuang, Dai, K-Pop, Popping) and coherent long sequences (Fig. 4, 6).
Ablation - Motion Expert Architecture (Tab. 2):
- BiMamba → Mamba: Degrades dance quality metrics.
- BiMamba → Transformer: Collapses to poor motion, though some metrics (BAS, FSR) appear artificially better.
- GFT → CFG: Leads to a decline in most metrics and reduces generation efficiency.
Ablation - Appearance Expert Stages (Tab. 3, Fig. 7):
- Removing the Kinematic Stage causes kinematic errors and motion blur.
- Removing the Aesthetic Stage leads to substantial degradation and ghosting artifacts.
3D vs. 2D Motion Representation (Table 4): The first five metric columns are motion metrics; the last four are final-video metrics.

Representation	FID $_k$ ↓	FID $_g$ ↓	DIV $_k$ ↑	DIV $_g$ ↑	BAS ↑	AQ ↑	SC ↑	FID ↓ (video)	BAS ↑ (video)
2D	22.8	8.6	6.12	5.24	0.527	51.86	91.84	23.73	0.496
3D	19.5	4.1	8.87	5.92	0.543	51.79	93.97	16.46	0.523

The 3D representation outperforms 2D on every motion and final-video metric except AQ, where the two are nearly identical (51.79 vs. 51.86).

Role of Each Expert (Table 5): Cross-composition shows both experts contribute positively, with the Motion Expert strengthening motion alignment (BAS) and the Appearance Expert improving visual quality (AQ, SC).

Theoretical and Practical Implications

Theoretical Implications:

Validates the effectiveness of a cascaded, task-decoupled approach for complex video generation tasks, isolating motion semantics from appearance.
Demonstrates the superiority of 3D motion as an interpretable, robust intermediate representation over 2D keypoints for bridging modalities.
Shows the advantages of hybrid architectures (BiMamba-Transformer) and training strategies (GFT) for generating long, coherent, and expressive sequences in diffusion models.

Practical Implications:

Provides a practical framework for automating dance video creation, which is highly relevant for content creation on platforms like TikTok and YouTube.
Introduces a new benchmark (MA-Data dataset and evaluation protocol) that will facilitate future research in this domain.
The efficient non-autoregressive generation and parameter-efficient fine-tuning strategies make the approach more feasible for real-world applications.

Conclusion

MACE-Dance presents a novel and effective framework for music-driven dance video generation by cascading a Motion Expert and an Appearance Expert. The Motion Expert, with its BiMamba-Transformer hybrid architecture and GFT strategy, generates high-quality 3D dance motion. The Appearance Expert, via a decoupled Kinematic–Aesthetic fine-tuning strategy, synthesizes visually coherent and high-fidelity videos. The framework is supported by a new large-scale dataset and evaluation protocol, on which it achieves state-of-the-art performance.

Future Work: Extending the framework with textual descriptions for more interactive generation and improving system-level efficiency for low-latency authoring and real-time feedback.