MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Summary (Overview)

  • Cascaded Expert Framework: Proposes MACE-Dance, a novel framework for music-driven dance video generation that decomposes the task into two cascaded experts: a Motion Expert for generating 3D dance motion from music, and an Appearance Expert for synthesizing high-fidelity video from the motion and a reference image.
  • SOTA Performance: Achieves state-of-the-art (SOTA) performance on the curated MA-Data dataset for the full task, as well as on the FineDance dataset (3D dance generation) and the MA-Data dataset (pose-driven image animation) for its individual expert components.
  • Advanced Motion Generation: The Motion Expert uses a BiMamba-Transformer hybrid architecture within a Diffusion Model and employs Guidance-Free Training (GFT), enabling non-autoregressive, high-quality, and efficient 3D motion generation.
  • Specialized Appearance Synthesis: The Appearance Expert, built upon Wan-Animate, uses a decoupled Kinematic–Aesthetic fine-tuning strategy to adapt the model specifically for the complex patterns in dance videos, enhancing motion adherence and visual quality.
  • New Benchmark Resources: Introduces a large-scale dataset (MA-Data) and a comprehensive motion–appearance evaluation protocol to better benchmark the music-driven dance video generation task.

Introduction and Theoretical Foundation

The task of generating dance videos from music is compelling due to the popularity of online dance platforms and advances in AIGC. However, it faces two core challenges: generating kinematically plausible and artistically expressive dance motions, and achieving high-fidelity visual appearance with strong spatiotemporal consistency.

Existing approaches are not directly transferable:

  • Music-driven 3D dance generation focuses on motion but neglects realistic visual appearance and human-scene interaction.
  • Pose-driven image animation requires manual pose design, which is time-consuming.
  • Audio-driven talking-head synthesis focuses on upper-body gestures, not complex full-body dance motion.
  • Limited prior work on music-driven dance video generation often fails to capture the inherently 3D nature of dance, compromising both motion and appearance quality.

Theoretical Motivation: MACE-Dance addresses these gaps through a cascaded Mixture-of-Experts (MoE) design. This decoupling isolates motion semantics from visual appearance, reducing the complexity of learning a direct music-to-video mapping. Crucially, it uses 3D SMPL parameters as the intermediate representation instead of 2D keypoints, providing richer spatial fidelity, cleaner supervision (view-invariant), and better robustness to occlusion.

Methodology

3.1 Overview

Given a music sequence $M \in \mathbb{R}^{T \times C_m}$ and a reference image $I \in \mathbb{R}^{H \times W \times 3}$, the goal is to synthesize a dance video $D \in \mathbb{R}^{T \times H \times W \times 3}$.

  1. Motion Expert (ME): Transforms $M$ into a 3D motion sequence $X \in \mathbb{R}^{T \times C_x}$ (SMPL parameters).
  2. Appearance Expert (AE): Synthesizes the final video $D$ conditioned on $X$ and $I$.
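The two-stage cascade can be sketched as a pair of callables, where each expert is a stand-in for a trained model. The class names, the illustrative SMPL dimension `C_x = 139`, and the placeholder outputs are all assumptions for shape-checking, not the paper's implementation:

```python
import numpy as np

class MotionExpert:
    """Stage 1 (hypothetical stub): music features (T, C_m) -> SMPL parameters (T, C_x)."""
    def __call__(self, music: np.ndarray) -> np.ndarray:
        T, _ = music.shape
        C_x = 139  # illustrative SMPL parameter dimension (assumption)
        return np.zeros((T, C_x))

class AppearanceExpert:
    """Stage 2 (hypothetical stub): motion (T, C_x) + reference image (H, W, 3) -> video (T, H, W, 3)."""
    def __call__(self, motion: np.ndarray, ref_image: np.ndarray) -> np.ndarray:
        T = motion.shape[0]
        H, W, _ = ref_image.shape
        # Placeholder: repeat the reference image; a real model renders each frame.
        return np.broadcast_to(ref_image, (T, H, W, 3)).copy()

def mace_dance(music: np.ndarray, ref_image: np.ndarray) -> np.ndarray:
    motion = MotionExpert()(music)                 # music -> 3D motion
    return AppearanceExpert()(motion, ref_image)   # motion + image -> video

video = mace_dance(np.zeros((8, 35)), np.zeros((4, 4, 3)))
assert video.shape == (8, 4, 4, 3)
```

The point of the stub is the interface: appearance never sees the raw music, only the intermediate 3D motion, which is what decouples the two learning problems.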

3.2 Motion Expert

Generative Strategy:

  • Based on DDPM. The forward noising process is defined as: $q(z_t \mid x) \sim \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\, x,\ (1 - \bar{\alpha}_t) I\big)$ (1)
  • Employs Guidance-Free Training (GFT) instead of Classifier-Free Guidance (CFG). The optimization target $x_\beta$ is: $x_\beta = \beta\, \hat{x}_\theta(z_t, t, c, \beta) + (1 - \beta)\, \mathrm{sg}[\hat{x}_\theta(z_t, t, \emptyset, 1)]$ (3), where $\beta \in [0, 1]$ is a temperature parameter provided as conditioning: values near 0 favor fidelity, values near 1 favor diversity.
  • The overall training loss $\mathcal{L}$ is a weighted sum: $\mathcal{L} = \lambda_{\text{rec}}\mathcal{L}_{\text{rec}} + \lambda_{\text{joint}}\mathcal{L}_{\text{joint}} + \lambda_{\text{vel}}\mathcal{L}_{\text{vel}} + \lambda_{\text{foot}}\mathcal{L}_{\text{foot}}$ (5), with losses for reconstruction, 3D joint positions, velocity, and foot contact.
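The GFT blending in Eq. (3) is straightforward to express in code. This is a minimal sketch: `model(z_t, t, cond, beta)` is a hypothetical signature, and the stop-gradient `sg[·]` on the unconditional branch is implemented with `torch.no_grad()`:

```python
import torch

def gft_target(model, z_t, t, cond, beta: float):
    """Guidance-Free Training target (Eq. 3, sketch).

    Blends the conditional prediction at temperature beta with a
    stop-gradient unconditional prediction; beta in [0, 1].
    """
    x_cond = model(z_t, t, cond, beta)        # conditional branch, gradients flow
    with torch.no_grad():                     # sg[.]: no gradients through this branch
        x_uncond = model(z_t, t, None, 1.0)   # unconditional prediction at beta = 1
    return beta * x_cond + (1.0 - beta) * x_uncond
```

Because the unconditional branch carries no gradient, training pushes only the conditional prediction, which is what lets GFT drop the separate guidance pass at sampling time.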

Model Architecture:

  • BiMamba-Transformer Hybrid Backbone: An $L_m$-layer BiMamba processes music features to capture intra-modal local dependencies. The dance generator consists of $L_d$ stacked blocks, each containing:
    1. A BiMamba for motion feature processing.
    2. FiLM modulation with a fused $t$–$\beta$ embedding.
    3. A Transformer for cross-modal global context via attention between motion (query) and music (key/value): $\text{Attention} = \text{softmax}\!\left(\frac{Q_d K_m^T}{\sqrt{C}}\right) V_m$ (8)
    4. A second FiLM layer.
  • Selective State Space Model (Mamba): For a time step $t$, the state evolves as: $h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \quad y_t = C_t h_t$ (6), with discretized parameters $\bar{A}$ and $\bar{B}$.
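The recurrence in Eq. (6) can be sketched as a sequential scan. The shapes here are illustrative assumptions (diagonal $\bar{A}$, scalar input per step); real Mamba implementations fuse this loop into a parallel scan kernel:

```python
import numpy as np

def selective_scan(A_bar, B_bar, C, x):
    """Linear recurrence of Eq. (6): h_t = A_t h_{t-1} + B_t x_t, y_t = C_t h_t.

    A_bar, B_bar, C: per-step parameters of shape (T, N) -- the 'selective'
    part is that they are computed from the input. x: (T,) scalar inputs.
    """
    T, N = A_bar.shape
    h = np.zeros(N)
    ys = []
    for t in range(T):
        h = A_bar[t] * h + B_bar[t] * x[t]   # diagonal A: elementwise state update
        ys.append(C[t] @ h)                  # readout y_t = C_t h_t
    return np.array(ys)
```

A BiMamba variant would run this scan once forward and once on the time-reversed sequence, then combine the two output streams, which is what gives the motion branch bidirectional context.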

3.3 Appearance Expert

Built upon Wan-Animate, with a 3D-to-2D Motion Projector that converts the SMPL sequence $X$ into 2D keypoints for conditioning.
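The projector's core operation is a camera projection of the posed 3D joints. A minimal sketch, assuming a simple pinhole camera with hypothetical intrinsics `f`, `cx`, `cy` (the paper does not specify the camera model):

```python
import numpy as np

def project_joints(joints_3d, f=1000.0, cx=256.0, cy=256.0):
    """Hypothetical 3D-to-2D motion projector: pinhole projection of SMPL
    joint positions (T, J, 3) in camera space to 2D keypoints (T, J, 2)."""
    x, y, z = joints_3d[..., 0], joints_3d[..., 1], joints_3d[..., 2]
    u = f * x / z + cx   # perspective divide by depth, then shift to pixel center
    v = f * y / z + cy
    return np.stack([u, v], axis=-1)
```

Keeping the 3D motion as the canonical representation and projecting only at the conditioning interface is what preserves view-invariant supervision upstream.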

Decoupled Fine-tuning Strategy:

  1. Kinematic Stage: Fine-tunes only the Body Adapter to strengthen kinematic conditioning and motion adherence for dance.
  2. Aesthetic Stage: Freezes the kinematic pathways and attaches LoRA adapters to each DiT block for parameter-efficient aesthetic refinement. For a pre-trained weight matrix $W_0$, LoRA updates it as: $W = W_0 + \Delta W = W_0 + AB$ (9), where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ are low-rank matrices ($r \ll m, n$).
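Eq. (9) corresponds to wrapping a frozen linear layer with two small trainable matrices. A minimal sketch (the `LoRALinear` class, the `alpha` scaling, and the zero-init of $B$ are standard LoRA conventions, not details from the paper):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Low-rank adapter around a frozen linear layer: W = W0 + A B (Eq. 9, sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weight W0
            p.requires_grad = False
        m, n = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(m, r) * 0.01)  # A in R^{m x r}
        self.B = nn.Parameter(torch.zeros(r, n))         # B in R^{r x n}; zero-init => W starts at W0
        self.scale = alpha / r

    def forward(self, x):
        delta = self.A @ self.B               # low-rank update dW = A B, rank r
        return self.base(x) + self.scale * x @ delta.T
```

Only $A$ and $B$ receive gradients, so each DiT block gains $r(m+n)$ trainable parameters instead of $mn$, which is what makes the aesthetic stage parameter-efficient.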

Empirical Validation / Results

4.1 Dataset: MA-Data

A large-scale dataset curated for this task.

  • Size: 70k clips (5-10s each), totaling 116 hours, spanning over 20 dance genres.
  • Composition:
    • 3D-rendered data (20k clips, 28h): Motion-centric, derived from FineDance.
    • In-the-wild internet data (50k clips, 88h): Appearance-centric, from TikTok/YouTube.

4.2 Evaluation Protocol

A motion–appearance protocol:

  • Motion Dimension: Evaluated on 2D keypoints extracted by ViTPose.
    • Fidelity/Diversity: FID and DIV in kinetic (k) and geometric (g) feature spaces.
    • Synchronization: Beat Alignment Score (BAS).
  • Appearance Dimension: Uses selected VBench metrics.
    • Quality: Imaging Quality (IQ), Aesthetic Quality (AQ).
    • Consistency: Subject Consistency (SC), Background Consistency (BC).
    • Temporal Quality: Motion Smoothness (MS), Temporal Flickering (TF).
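Of the motion metrics above, BAS is the simplest to state. A sketch under an assumed (but common) formulation: average, over kinematic beats, a Gaussian score of the distance to the nearest music beat; the exact σ and beat-extraction procedure used by the paper are not given here:

```python
import numpy as np

def beat_alignment_score(motion_beats, music_beats, sigma=3.0):
    """Beat Alignment Score (assumed formulation): for each kinematic beat,
    score exp(-d^2 / (2 sigma^2)) against the nearest music beat, then average.
    Beats are frame indices; higher = better music-motion synchronization."""
    music_beats = np.asarray(music_beats, dtype=float)
    scores = []
    for b in motion_beats:
        d = np.min(np.abs(music_beats - b))   # distance to nearest music beat
        scores.append(np.exp(-(d ** 2) / (2 * sigma ** 2)))
    return float(np.mean(scores))
```

Perfectly aligned beats score 1.0; the score decays smoothly as motion beats drift off the musical grid, so it rewards near-misses rather than only exact hits.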

4.3 Quantitative Comparisons

Table 1: Music-Driven Dance Video Generation (MA-Data)

| Method | IQ ↑ | AQ ↑ | SC ↑ | BC ↑ | MS ↑ | TF ↑ | FID$_k$ ↓ | FID$_g$ ↓ | DIV$_k$ ↑ | DIV$_g$ ↑ | BAS ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ground Truth | 67.12 | 53.51 | 91.86 | 92.97 | 98.20 | 96.88 | – | – | 9.24 | 5.31 | 0.526 |
| Hallo2 | 62.64 | 50.79 | 92.48 | 93.84 | 98.30 | 96.56 | 16.55 | 1.29 | 8.11 | 5.47 | 0.505 |
| WAN-S2V | 64.10 | 50.20 | 92.30 | 93.40 | 98.20 | 96.70 | 18.90 | 1.45 | 7.60 | 5.44 | 0.485 |
| Echomimic-V3 | 63.20 | 49.00 | 91.90 | 93.10 | 98.05 | 96.40 | 19.60 | 1.32 | 7.20 | 4.60 | 0.460 |
| EDGE | 63.05 | 49.70 | 91.79 | 93.30 | 98.64 | 97.10 | 21.77 | 1.39 | 9.08 | 5.74 | 0.498 |
| Lodge | 63.69 | 49.22 | 91.67 | 92.98 | 98.46 | 97.05 | 18.73 | 1.49 | 8.87 | 5.71 | 0.474 |
| MEGA | 66.14 | 49.89 | 92.95 | 94.13 | 97.45 | 96.32 | 18.98 | 1.65 | 8.78 | 5.59 | 0.513 |
| MACE-Dance | 65.35 | 51.79 | 93.97 | 94.57 | 98.46 | 97.10 | 16.46 | 0.28 | 9.74 | 6.34 | 0.523 |

MACE-Dance achieves SOTA performance on most metrics, particularly excelling in motion quality (low FID, high DIV/BAS) and subject/background consistency.

Table 2: Music-Driven 3D Dance Generation (FineDance)

| Method | FID$_k$ ↓ | FID$_g$ ↓ | FSR ↓ | DIV$_k$ ↑ | DIV$_g$ ↑ | BAS ↑ | FPS ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Ground Truth | – | – | 0.216 | 9.94 | 7.54 | 0.201 | – |
| FACT | 113.38 | 97.05 | 0.284 | 3.36 | 6.37 | 0.183 | 29 |
| Bailando | 82.81 | 28.17 | 0.188 | 7.74 | 6.25 | 0.202 | 188 |
| EDGE | 94.34 | 50.38 | 0.200 | 8.13 | 6.45 | 0.212 | 119 |
| Lodge | 50.00 | 35.52 | 0.028 | 5.67 | 4.96 | 0.226 | 224 |
| MEGA | 50.00 | 13.02 | 0.243 | 6.23 | 6.27 | 0.226 | 238 |
| Motion Expert (Full) | 17.83 | 25.09 | 0.210 | 10.30 | 8.09 | 0.229 | 770 |

The Motion Expert achieves SOTA or competitive results, with notably high diversity and generation efficiency (FPS).

Table 3: Pose-Driven Image Animation (MA-Data)

| Method | FVD ↓ | SSIM ↑ | LPIPS ↓ | PSNR ↑ |
| :--- | :---: | :---: | :---: | :---: |
| Animate-Anyone | 515.26 | 0.648 | 0.091 | 19.65 |
| Magic-Animate | 1032.06 | 0.311 | 0.207 | 14.00 |
| Wan-Animate | 332.82 | 0.707 | 0.078 | 21.11 |
| w/o Kinematic stage | 328.91 | 0.596 | 0.107 | 18.69 |
| w/o Aesthetic stage | 445.93 | 0.563 | 0.121 | 17.89 |
| Appearance Expert | 274.94 | 0.739 | 0.066 | 22.40 |

The Appearance Expert with the full fine-tuning strategy achieves SOTA performance.

4.4 Qualitative Analysis & Ablation Studies

  • Qualitative Superiority: MACE-Dance generates videos with kinematically plausible, expressive motion and spatiotemporally coherent appearance, outperforming baselines that show blur, artifacts, or simplistic motion (Fig. 3).
  • Cross-Genre & Long-Sequence Generation: Effectively generates distinct genre-specific motions (Uyghur, Dunhuang, Dai, K-Pop, Popping) and coherent long sequences (Fig. 4, 6).
  • Ablation - Motion Expert Architecture (Tab. 2):
    • BiMamba → Mamba: Degrades dance quality metrics.
    • BiMamba → Transformer: Collapses to poor motion, though some metrics (BAS, FSR) artificially increase.
    • GFT → CFG: Leads to decline in most metrics and reduces generation efficiency.
  • Ablation - Appearance Expert Stages (Tab. 3, Fig. 7):
    • Removing the Kinematic Stage causes kinematic errors and motion blur.
    • Removing the Aesthetic Stage leads to substantial degradation and ghosting artifacts.
  • 3D vs. 2D Motion Representation (Table 4):

| Representation | FID$_k$ ↓ | FID$_g$ ↓ | DIV$_k$ ↑ | DIV$_g$ ↑ | BAS ↑ | AQ ↑ | SC ↑ | FID ↓ | BAS ↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| 2D | 22.8 | 8.6 | 6.12 | 5.24 | 0.527 | 51.86 | 91.84 | 23.73 | 0.496 |
| 3D | 19.5 | 4.1 | 8.87 | 5.92 | 0.543 | 51.79 | 93.97 | 16.46 | 0.523 |

The 3D representation consistently outperforms 2D across both motion and final video metrics.

  • Role of Each Expert (Table 5): Cross-composition shows both experts contribute positively, with the Motion Expert strengthening motion alignment (BAS) and the Appearance Expert improving visual quality (AQ, SC).

Theoretical and Practical Implications

Theoretical Implications:

  • Validates the effectiveness of a cascaded, task-decoupled approach for complex video generation tasks, isolating motion semantics from appearance.
  • Demonstrates the superiority of 3D motion as an interpretable, robust intermediate representation over 2D keypoints for bridging modalities.
  • Shows the advantages of hybrid architectures (BiMamba-Transformer) and training strategies (GFT) for generating long, coherent, and expressive sequences in diffusion models.

Practical Implications:

  • Provides a practical framework for automating dance video creation, which is highly relevant for content creation on platforms like TikTok and YouTube.
  • Introduces a new benchmark (MA-Data dataset and evaluation protocol) that will facilitate future research in this domain.
  • The efficient non-autoregressive generation and parameter-efficient fine-tuning strategies make the approach more feasible for real-world applications.

Conclusion

MACE-Dance presents a novel and effective framework for music-driven dance video generation by cascading a Motion Expert and an Appearance Expert. The Motion Expert, with its BiMamba-Transformer hybrid architecture and GFT strategy, generates high-quality 3D dance motion. The Appearance Expert, via a decoupled Kinematic–Aesthetic fine-tuning strategy, synthesizes visually coherent and high-fidelity videos. The framework is supported by a new large-scale dataset and evaluation protocol, on which it achieves state-of-the-art performance.

Future Work: Extending the framework with textual descriptions for more interactive generation and improving system-level efficiency for low-latency authoring and real-time feedback.