MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

Summary (Overview)

  • Framework: Introduces MACE-Dance, a cascaded Mixture-of-Experts (MoE) framework for music-driven dance video generation, decoupling motion and appearance synthesis.
  • Motion Expert: Uses a Diffusion Model with a BiMamba-Transformer hybrid architecture and Guidance-Free Training (GFT) strategy to generate kinematically plausible and artistically expressive 3D dance motion from music.
  • Appearance Expert: Adopts a decoupled Kinematic–Aesthetic fine-tuning strategy on Wan-Animate to synthesize high-fidelity videos from 3D motion and a reference image.
  • Dataset & Protocol: Curates a large-scale dataset (MA-Data) and introduces a motion–appearance evaluation protocol to benchmark the task.
  • Performance: Achieves state-of-the-art (SOTA) performance in music-driven dance video generation, as well as in the subtasks of music-driven 3D dance generation and pose-driven image animation.

Introduction and Theoretical Foundation

Music-driven dance video generation is a timely research direction with the rise of online dance platforms and advances in AIGC. The task faces two key challenges: (1) generating kinematically plausible and artistically expressive dance motions, and (2) achieving high-fidelity visual appearance with strong spatiotemporal consistency. Existing approaches from related domains (music-driven 3D dance generation, pose-driven image animation, audio-driven talking-head synthesis) are not readily transferable due to fundamental mismatches. Research on music-driven dance video generation itself remains limited and often fails to capture the inherently 3D nature of dance, compromising motion and appearance quality.

MACE-Dance addresses these challenges through a cascaded expert design. The Motion Expert generates 3D motion from music, enforcing kinematic plausibility and artistic expressiveness. The Appearance Expert synthesizes videos conditioned on this 3D motion and a reference image, preserving visual identity and spatiotemporal coherence. Crucially, the framework uses 3D SMPL parameters as the intermediate representation, rather than 2D keypoints, for three reasons:

  1. Richer spatial fidelity: Preserves full-body geometric structure, including global translation and orientation.
  2. Cleaner supervision: Disentangles pose from camera viewpoint and subject-specific appearance.
  3. Better robustness: More robust to self-occlusion and viewpoint variation.

Methodology

The overall objective is to synthesize dance videos $D \in \mathbb{R}^{T \times H \times W \times 3}$ given a music sequence $M \in \mathbb{R}^{T \times C_m}$ and a reference image $I \in \mathbb{R}^{H \times W \times 3}$.
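
The cascade can be read as a composition of two functions with fixed tensor shapes. The sketch below only illustrates that interface and is not the authors' implementation: `motion_expert`, `appearance_expert`, the toy sizes, and the 75-dimensional SMPL layout (24 axis-angle joint rotations plus a root translation) are all assumptions made for illustration.

```python
import numpy as np

# Toy sizes; every value here is an illustrative assumption.
T, H, W, C_M = 8, 64, 48, 35
SMPL_DIM = 24 * 3 + 3  # assumed layout: 24 axis-angle rotations + root translation


def motion_expert(music: np.ndarray) -> np.ndarray:
    """Placeholder Motion Expert: music (T, C_m) -> 3D motion (T, SMPL_DIM)."""
    assert music.ndim == 2
    return np.zeros((music.shape[0], SMPL_DIM), dtype=np.float32)


def appearance_expert(motion: np.ndarray, ref_image: np.ndarray) -> np.ndarray:
    """Placeholder Appearance Expert: (motion, image) -> video (T, H, W, 3)."""
    # Repeats the reference image along time; a real model would animate it.
    return np.repeat(ref_image[None, ...], motion.shape[0], axis=0)


def generate_dance_video(music: np.ndarray, ref_image: np.ndarray):
    """Cascade: music -> 3D motion -> video. The motion is the only interface."""
    motion = motion_expert(music)
    video = appearance_expert(motion, ref_image)
    return motion, video


music = np.random.default_rng(0).standard_normal((T, C_M)).astype(np.float32)
ref_image = np.zeros((H, W, 3), dtype=np.float32)
motion, video = generate_dance_video(music, ref_image)
print(motion.shape, video.shape)
```

Because the second stage only sees the motion tensor and the reference image, either expert can be swapped for a baseline without touching the other, which is what the cross-composition analysis later in this summary relies on.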

Motion Expert

Generative Strategy: Uses a Diffusion Model with Guidance-Free Training (GFT).

  • The forward noising process in DDPM is defined as:

$$q(z_t \mid x) \sim \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\, x,\ (1 - \bar{\alpha}_t) I\right)$$

where $\bar{\alpha}_t \in (0,1)$ are constants following a monotonically decreasing schedule.

  • With music conditioning $c$, the model learns to estimate $\hat{x}_\theta(z_t, t, c) \approx x$. GFT establishes $x_\beta$ as the new optimization target:

$$x_\beta = \beta\, \hat{x}_\theta(z_t, t, c, \beta) + (1 - \beta)\, \text{sg}[\hat{x}_\theta(z_t, t, \emptyset, 1)]$$

where $\emptyset$ denotes the unconditional setting, $\text{sg}$ is the stop-gradient operator, and $\beta \in [0,1]$ is a temperature parameter.

  • The training loss combines reconstruction, 3D joint, velocity, and foot contact losses:

$$\mathcal{L} = \lambda_{\text{rec}}\mathcal{L}_{\text{rec}} + \lambda_{\text{joint}}\mathcal{L}_{\text{joint}} + \lambda_{\text{vel}}\mathcal{L}_{\text{vel}} + \lambda_{\text{foot}}\mathcal{L}_{\text{foot}}$$

where:

$$\mathcal{L}_{\text{rec}} = \mathbb{E}[\| x_\beta - x \|_2^2], \quad \mathcal{L}_{\text{joint}} = \mathbb{E}[\| FK(x_\beta) - FK(x) \|_2^2],$$

$$\mathcal{L}_{\text{vel}} = \mathbb{E}[\| FK(x_\beta)' - FK(x)' \|_2^2], \quad \mathcal{L}_{\text{foot}} = \mathbb{E}[\| FK(x_\beta)' \cdot \hat{b} \|_2^2]$$

$FK(\cdot)$ is the forward kinematic function, and $\hat{b}$ is the predicted binary foot contact label.
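
The three steps above (noising, the GFT target, and the weighted loss) can be sketched numerically. This is a minimal NumPy illustration under stated assumptions, not the paper's training code: `toy_fk` is a hypothetical stand-in for forward kinematics that simply reshapes features into joint positions, the network outputs are faked by adding noise to `x`, velocities are finite differences, expectations are replaced by means, and all loss weights are set to 1.

```python
import numpy as np

rng = np.random.default_rng(0)


def forward_noise(x, alpha_bar_t, rng):
    """Sample z_t ~ N(sqrt(alpha_bar_t) * x, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps


def gft_target(x_hat_cond, x_hat_uncond, beta):
    """x_beta = beta * x_hat(c, beta) + (1 - beta) * sg[x_hat(empty, 1)].

    NumPy has no autodiff, so sg[.] is a no-op here; in an autodiff
    framework the unconditional branch would be detached from the graph.
    """
    return beta * x_hat_cond + (1.0 - beta) * x_hat_uncond


def toy_fk(x):
    """Hypothetical FK stand-in: (T, J*3) features -> (T, J, 3) positions."""
    return x.reshape(x.shape[0], -1, 3)


def motion_loss(x_beta, x, contact, foot_idx, weights):
    """Weighted sum of reconstruction, joint, velocity, and foot terms."""
    j_pred, j_true = toy_fk(x_beta), toy_fk(x)
    v_pred = np.diff(j_pred, axis=0)  # finite-difference joint velocity
    v_true = np.diff(j_true, axis=0)
    l_rec = np.mean((x_beta - x) ** 2)
    l_joint = np.mean((j_pred - j_true) ** 2)
    l_vel = np.mean((v_pred - v_true) ** 2)
    # Penalise predicted foot velocity on frames labelled as in contact.
    l_foot = np.mean((v_pred[:, foot_idx] * contact[:-1, :, None]) ** 2)
    return (weights["rec"] * l_rec + weights["joint"] * l_joint
            + weights["vel"] * l_vel + weights["foot"] * l_foot)


T, J = 16, 4
x = rng.standard_normal((T, J * 3))                # clean motion
z_t = forward_noise(x, alpha_bar_t=0.5, rng=rng)   # noised input
x_hat_c = x + 0.1 * rng.standard_normal(x.shape)   # fake conditional output
x_hat_u = x + 0.3 * rng.standard_normal(x.shape)   # fake unconditional output
x_beta = gft_target(x_hat_c, x_hat_u, beta=0.7)
contact = (rng.random((T, 2)) > 0.5).astype(float)  # stand-in for b_hat
weights = {"rec": 1.0, "joint": 1.0, "vel": 1.0, "foot": 1.0}
total = motion_loss(x_beta, x, contact, foot_idx=[2, 3], weights=weights)
print(float(total))
```

Note that the foot term depends only on the predicted velocity and the contact labels, so it does not vanish when the prediction equals the ground truth unless the contact feet are actually still.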

Model Architecture: Adopts a BiMamba–Transformer hybrid backbone.

  • BiMamba captures intra-modal local dependencies in music or dance. The Selective State Space Model (Mamba) dynamics are:

$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \quad y_t = C_t h_t$$

where $\bar{A}_t, \bar{B}_t, C_t$ are dynamically updated. The state transitions are:

$$\bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B$$

  • Transformer models cross-modal global context via cross-attention:

$$\text{Attention} = \text{softmax}\left(\frac{Q_d K_m^T}{\sqrt{C}}\right) V_m$$

where motion features are queries and music features provide keys/values.

  • The architecture enables non-autoregressive generation of entire sequences, improving efficiency and avoiding exposure bias.
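
The two building blocks above can be sketched in a few lines of NumPy. This is a simplified illustration under several assumptions, not the paper's architecture: $A$ is diagonal and stored as a vector, the per-step $B_t$, $C_t$, and $\Delta_t$ are passed in as arrays instead of being computed from the input, the forward and backward scans are fused by summation, and the attention is single-head with no learned projections.

```python
import numpy as np


def discretize(A, B, delta):
    """Zero-order hold for a diagonal A stored as a vector.

    A_bar = exp(delta * A)
    B_bar = (delta * A)^-1 (exp(delta * A) - 1) * (delta * B)
    """
    dA = delta * A
    A_bar = np.exp(dA)
    B_bar = (A_bar - 1.0) / dA * (delta * B)
    return A_bar, B_bar


def ssm_scan(x, A, B, C, delta):
    """h_t = A_bar_t h_{t-1} + B_bar_t x_t,  y_t = C_t h_t.

    x: (T,) input; A: (N,) shared; B, C: (T, N) and delta: (T,) per step.
    """
    h = np.zeros(A.shape[0])
    y = np.empty(x.shape[0])
    for t in range(x.shape[0]):
        A_bar, B_bar = discretize(A, B[t], delta[t])
        h = A_bar * h + B_bar * x[t]
        y[t] = C[t] @ h
    return y


def bidirectional_scan(x, A, B, C, delta):
    """Run the scan left-to-right and right-to-left, then sum (an assumption)."""
    fwd = ssm_scan(x, A, B, C, delta)
    bwd = ssm_scan(x[::-1], A, B[::-1], C[::-1], delta[::-1])[::-1]
    return fwd + bwd


def cross_attention(Q_d, K_m, V_m):
    """softmax(Q_d K_m^T / sqrt(C)) V_m with motion queries, music keys/values."""
    scores = Q_d @ K_m.T / np.sqrt(Q_d.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V_m


rng = np.random.default_rng(0)
T, N, C_dim = 5, 3, 8
A = -np.array([1.0, 2.0, 3.0])          # negative entries keep the state stable
B, C = np.ones((T, N)), np.ones((T, N))
delta = np.full(T, 0.5)
y = bidirectional_scan(np.array([1.0, 2.0, 3.0, 2.0, 1.0]), A, B, C, delta)
out = cross_attention(rng.standard_normal((T, C_dim)),
                      rng.standard_normal((7, C_dim)),
                      rng.standard_normal((7, C_dim)))
print(y.shape, out.shape)
```

Every output frame of `bidirectional_scan` depends on both past and future inputs, and every motion query attends to the full music sequence, so the whole sequence can be produced in one pass.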

Appearance Expert

Built upon Wan-Animate, with a decoupled Kinematic–Aesthetic fine-tuning strategy to adapt it for dance video generation.

Model Architecture:

  • A 3D-to-2D Motion Projector converts the SMPL sequence from the Motion Expert into 2D keypoints (using pyrender and ViTPose) for Wan-Animate.
  • Kinematic Stage: Fine-tunes only the Body Adapter (freezing other components) to strengthen kinematic conditioning and motion adherence.
  • Aesthetic Stage: Attaches LoRA adapters to each DiT block for aesthetic refinement, enhancing texture fidelity and stylistic consistency while preserving pretrained priors. LoRA updates the weight matrix $W_0$ as:

$$W = W_0 + \Delta W = W_0 + AB$$

where $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$ are low-rank matrices ($r \ll m, n$).
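
The LoRA update can be illustrated with a small NumPy sketch. The sizes, the initialisation (one factor at zero so that $\Delta W = 0$ before training), and the helper name `lora_forward` are assumptions chosen for illustration; the summary does not state the rank or initialisation used.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4  # toy sizes with r << m, n (assumed values)

W0 = rng.standard_normal((m, n))        # frozen pretrained weight
A = 0.01 * rng.standard_normal((m, r))  # trainable low-rank factor
B = np.zeros((r, n))                    # zero init: W equals W0 at the start


def lora_forward(x, W0, A, B):
    """Compute x @ (W0 + A @ B) without materialising the m x n update."""
    return x @ W0 + (x @ A) @ B


full_params = m * n        # parameters updated by full fine-tuning
lora_params = r * (m + n)  # parameters updated by LoRA
x = rng.standard_normal((2, m))
print(full_params, lora_params, lora_forward(x, W0, A, B).shape)
```

Only `A` and `B` receive gradients, so the pretrained weight `W0` is left untouched, which matches the stated goal of preserving pretrained priors.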

Empirical Validation / Results

Dataset: MA-Data

A large-scale dance video dataset curated for benchmarking.

  • Size: 70k clips of 5–10 seconds each (116 hours total), spanning over 20 dance genres.
  • Composition:
    1. 3D-rendered data (motion-centric): 20k clips (28 hours) derived from FineDance, rendered from 3D professional dancer motions.
    2. In-the-wild internet data (appearance-centric): 50k clips (88 hours) collected from TikTok/YouTube, emphasizing visual appearance.
  • Test set: 200 5-second clips from high-engagement videos across multiple genres.
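
The reported dataset figures can be cross-checked with simple arithmetic. The sketch below treats the rounded counts (20k, 50k) and hour totals as exact, which is an assumption, and derives the mean clip length implied by each subset.

```python
# Figures as reported in the summary; rounded values are treated as exact.
subsets = {
    "3D-rendered": {"clips": 20_000, "hours": 28},
    "in-the-wild": {"clips": 50_000, "hours": 88},
}


def mean_clip_seconds(clips: int, hours: float) -> float:
    """Average clip duration in seconds implied by a clip count and hour total."""
    return hours * 3600 / clips


total_clips = sum(s["clips"] for s in subsets.values())
total_hours = sum(s["hours"] for s in subsets.values())
overall_mean = mean_clip_seconds(total_clips, total_hours)

for name, s in subsets.items():
    print(name, round(mean_clip_seconds(s["clips"], s["hours"]), 2))
print("total", total_clips, total_hours, round(overall_mean, 2))
```

The subset totals add up to the stated 70k clips and 116 hours, and the implied mean durations (roughly 5 to 6.5 seconds) fall inside the stated 5–10 second range.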

Evaluation Protocol

A motion–appearance evaluation protocol.

  • Motion dimension: Assesses fidelity, diversity, and synchronization from a Human-Kinematics perspective using 2D keypoints extracted by ViTPose.
    • Metrics: FID$_k$, FID$_g$, DIV$_k$, DIV$_g$, Beat Alignment Score (BAS).
  • Appearance dimension: Uses VBench dance-specific metrics.
    • Metrics: Imaging Quality (IQ), Aesthetic Quality (AQ), Subject Consistency (SC), Background Consistency (BC), Motion Smoothness (MS), Temporal Flickering (TF).
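
The summary does not define BAS. One formulation used in the music-to-dance literature averages, over music beats, a Gaussian of the distance to the nearest dance beat. The sketch below follows that formulation as an assumption; the function name, the frame-based units, and `sigma=3.0` are illustrative choices and may differ from the protocol used here.

```python
import numpy as np


def beat_alignment_score(music_beats, dance_beats, sigma=3.0):
    """Mean over music beats of exp(-d^2 / (2 sigma^2)).

    d is the distance (in frames) from each music beat to its nearest
    dance beat. Returns 0.0 if either beat list is empty.
    """
    music_beats = np.asarray(music_beats, dtype=float)
    dance_beats = np.asarray(dance_beats, dtype=float)
    if music_beats.size == 0 or dance_beats.size == 0:
        return 0.0
    # Pairwise |t_music - t_dance|, then nearest dance beat per music beat.
    dist = np.abs(music_beats[:, None] - dance_beats[None, :]).min(axis=1)
    return float(np.mean(np.exp(-dist ** 2 / (2.0 * sigma ** 2))))


aligned = beat_alignment_score([10, 20, 30], [10, 20, 30])
shifted = beat_alignment_score([10, 20, 30], [12, 22, 32])
print(aligned, shifted)
```

Under this definition the score lies in [0, 1], reaches 1 only when every music beat coincides with a dance beat, and decays smoothly as the dance beats drift away.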

Quantitative Results

Table 1: Quantitative comparison on MA-Data in Music-Driven Dance Video Generation

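| Method | IQ ↑ | AQ ↑ | SC ↑ | BC ↑ | MS ↑ | TF ↑ | FID$_k$ | FID$_g$ | DIV$_k$ | DIV$_g$ | BAS ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ground Truth | 67.12 | 53.51 | 91.86 | 92.97 | 98.20 | 96.88 | - | - | 9.24 | 5.31 | 0.526 |
| Hallo2 | 62.64 | 50.79 | 92.48 | 93.84 | 98.30 | 96.56 | 16.55 | 1.29 | 8.11 | 5.47 | 0.505 |
| WAN-S2V | 64.10 | 50.20 | 92.30 | 93.40 | 98.20 | 96.70 | 18.90 | 1.45 | 7.60 | 5.44 | 0.485 |
| Echomimic-V3 | 63.20 | 49.00 | 91.90 | 93.10 | 98.05 | 96.40 | 19.60 | 1.32 | 7.20 | 4.60 | 0.460 |
| EDGE | 63.05 | 49.70 | 91.79 | 93.30 | 98.64 | 97.10 | 21.77 | 1.39 | 9.08 | 5.74 | 0.498 |
| Lodge | 63.69 | 49.22 | 91.67 | 92.98 | 98.46 | 97.05 | 18.73 | 1.49 | 8.87 | 5.71 | 0.474 |
| MEGA | 66.14 | 49.89 | 92.95 | 94.13 | 97.45 | 96.32 | 18.98 | 1.65 | 8.78 | 5.59 | 0.513 |
| MACE-Dance | 65.35 | 51.79 | 93.97 | 94.57 | 98.46 | 97.10 | 16.46 | 0.28 | 9.74 | 6.34 | 0.523 |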

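MACE-Dance achieves SOTA performance overall on both motion and appearance metrics: among the generated results it has the best AQ, SC, BC, FID$_k$, FID$_g$, DIV$_k$, DIV$_g$, and BAS, and ties EDGE on TF (97.10). The exceptions are IQ, where MEGA is higher (66.14 vs. 65.35), and MS, where EDGE is higher (98.64 vs. 98.46).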

Table 2: Quantitative comparison on FineDance in Music-Driven 3D Dance Generation

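| Method | FID$_k$ | FID$_g$ | FSR ↓ | DIV$_k$ | DIV$_g$ | BAS ↑ | FPS ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ground Truth | - | - | 0.216 | 9.94 | 7.54 | 0.201 | - |
| FACT | 113.38 | 97.05 | 0.284 | 3.36 | 6.37 | 0.183 | 29 |
| MNET | 104.71 | 90.31 | 0.394 | 3.12 | 6.14 | 0.186 | 26 |
| Bailando | 82.81 | 28.17 | 0.188 | 7.74 | 6.25 | 0.202 | 188 |
| EDGE | 94.34 | 50.38 | 0.200 | 8.13 | 6.45 | 0.212 | 119 |
| Lodge | 50.00 | 35.52 | 0.028 | 5.67 | 4.96 | 0.226 | 224 |
| MEGA | 50.00 | 13.02 | 0.243 | 6.23 | 6.27 | 0.226 | 238 |
| Motion Expert (Full) | 17.83 | 25.09 | 0.210 | 10.30 | 8.09 | 0.229 | 770 |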

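The Motion Expert achieves SOTA performance overall, with the best FID$_k$ (17.83), DIV$_k$ (10.30), DIV$_g$ (8.09), BAS (0.229), and efficiency (770 FPS). It is not the best on every metric: MEGA has a lower FID$_g$ (13.02 vs. 25.09), and Lodge, Bailando, and EDGE have a lower FSR.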

Table 3: Quantitative comparison on MA-Data in Pose-Driven Image Animation

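| Method | FVD ↓ | SSIM ↑ | LPIPS ↓ | PSNR ↑ |
| --- | --- | --- | --- | --- |
| Animate-Anyone | 515.26 | 0.648 | 0.091 | 19.65 |
| Magic-Animate | 1032.06 | 0.311 | 0.207 | 14.00 |
| Wan-Animate | 332.82 | 0.707 | 0.078 | 21.11 |
| w/o. Kinematic stage | 328.91 | 0.596 | 0.107 | 18.69 |
| w/o. Aesthetic stage | 445.93 | 0.563 | 0.121 | 17.89 |
| Appearance Expert | 274.94 | 0.739 | 0.066 | 22.40 |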

The Appearance Expert achieves SOTA performance, validated by the two-stage fine-tuning ablation.

Qualitative Analysis

  • Effect Comparison: MACE-Dance generates videos with kinematically plausible and artistically expressive motion, and spatiotemporally coherent appearance, outperforming baselines like Hallo2, EDGE, Lodge, MEGA, WAN-S2V, and Echomimic-V3 (see Fig. 3).
  • Cross-Genre Generation: Effectively generates distinct genre-specific motions (Uyghur, Dunhuang, Dai, K-Pop, Popping) as shown in Fig. 4.
  • Long-Sequence Generation: Produces coherent long-sequence dance videos (up to 30 seconds) thanks to the BiMamba–Transformer hybrid and pose-driven relay rendering (see Fig. 6).

Ablation Studies

Motion Expert Ablation (Table 2):

  • BiMamba → Mamba: Removes bidirectional context, degrading dance quality metrics (FID$_k$ = 65.10, FID$_g$ = 51.74), though efficiency improves (1044 FPS). Generated dances become simpler.
  • BiMamba → Transformer: Undermines non-autoregressive generation, and the output collapses to in-place jitter. Metrics drop severely (FID$_k$ = 104.93, FID$_g$ = 114.42).
  • GFT → CFG: Replacing GFT with classifier-free guidance leads to modest decline in metrics and lower generation efficiency.

Appearance Expert Ablation (Table 3 & Fig. 7):

  • w/o. Kinematic Stage: Leads to modest metric decline and noticeable kinematic errors/motion blur.
  • w/o. Aesthetic Stage: Causes substantial degradation and obvious ghosting artifacts.
  • The full Appearance Expert outperforms the Wan-Animate baseline on all four metrics.

Motion Representation (2D vs. 3D) (Table 4):

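| Representation | FID$_k$ | FID$_g$ | DIV$_k$ | DIV$_g$ | BAS ↑ (motion) | AQ ↑ | SC ↑ | FID ↓ | BAS ↑ (video) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2D | 22.8 | 8.6 | 6.12 | 5.24 | 0.527 | 51.86 | 91.84 | 23.73 | 0.496 |
| 3D | 19.5 | 4.1 | 8.87 | 5.92 | 0.543 | 51.79 | 93.97 | 16.46 | 0.523 |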

The 3D representation outperforms 2D on every motion metric and on the final video metrics SC, FID, and BAS; the only exception is AQ, where 2D is marginally higher (51.86 vs. 51.79).

Role of Each Expert (Cross-Composition Analysis, Table 5):

| Method | AQ ↑ | SC ↑ | FID ↓ | BAS ↑ |
| --- | --- | --- | --- | --- |
| w/o. ME (EDGE + Our AE) | 50.21 | 92.10 | 20.84 | 0.499 |
| w/o. AE (Our ME + Wan-Animate) | 50.36 | 91.42 | 17.92 | 0.519 |
| Ours (Full MACE-Dance) | 51.79 | 93.97 | 16.46 | 0.523 |

Both experts contribute positively; the Motion Expert strengthens music-motion alignment (BAS), and the Appearance Expert improves visual quality (AQ, SC).

Comparison with Video Foundation Models (Table 6):

| Method | AQ ↑ | SC ↑ | FID ↓ | BAS ↑ |
| --- | --- | --- | --- | --- |
| CogVideoX1.5-5B | 50.38 | 89.92 | 22.47 | 0.477 |
| WAN2.2-5B | 53.22 | 90.77 | 17.53 | 0.452 |
| Ours | 51.79 | 93.97 | 16.46 | 0.523 |

MACE-Dance achieves better overall performance, particularly in SC, FID, and BAS, indicating stronger music-motion alignment and visual consistency, although WAN2.2-5B attains a higher AQ (53.22 vs. 51.79).

Theoretical and Practical Implications

  • Task Decoupling: The cascaded expert design effectively isolates motion semantics from visual appearance, reducing the complexity of learning a direct music-to-video mapping. The explicit 3D motion representation suppresses spurious cross-modal correlations and provides an interpretable intermediate interface.
  • Architectural Innovations: The BiMamba–Transformer hybrid backbone combines local dependency modeling (BiMamba) with global cross-modal context (Transformer), enabling high-quality, non-autoregressive sequence generation.
