MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation
Summary (Overview)
- Framework: Introduces MACE-Dance, a cascaded Mixture-of-Experts (MoE) framework for music-driven dance video generation, decoupling motion and appearance synthesis.
- Motion Expert: Uses a Diffusion Model with a BiMamba-Transformer hybrid architecture and Guidance-Free Training (GFT) strategy to generate kinematically plausible and artistically expressive 3D dance motion from music.
- Appearance Expert: Adopts a decoupled Kinematic–Aesthetic fine-tuning strategy on Wan-Animate to synthesize high-fidelity videos from 3D motion and a reference image.
- Dataset & Protocol: Curates a large-scale dataset (MA-Data) and introduces a motion–appearance evaluation protocol to benchmark the task.
- Performance: Achieves state-of-the-art (SOTA) performance in music-driven dance video generation, as well as in the subtasks of music-driven 3D dance generation and pose-driven image animation.
Introduction and Theoretical Foundation
Music-driven dance video generation has become a timely research direction with the rise of online dance platforms and advances in AI-generated content (AIGC). The task faces two key challenges: (1) generating kinematically plausible and artistically expressive dance motions, and (2) achieving high-fidelity visual appearance with strong spatiotemporal consistency. Existing approaches from related domains (music-driven 3D dance generation, pose-driven image animation, audio-driven talking-head synthesis) are not readily transferable due to fundamental mismatches with this task. Research on music-driven dance video generation itself remains limited and often fails to capture the inherently 3D nature of dance, compromising both motion and appearance quality.
MACE-Dance addresses these challenges through a cascaded expert design. The Motion Expert generates 3D motion from music, enforcing kinematic plausibility and artistic expressiveness. The Appearance Expert synthesizes videos conditioned on this 3D motion and a reference image, preserving visual identity and spatiotemporal coherence. Crucially, the framework uses 3D SMPL parameters as the intermediate representation, rather than 2D keypoints, for three reasons:
- Richer spatial fidelity: Preserves full-body geometric structure, including global translation and orientation.
- Cleaner supervision: Disentangles pose from camera viewpoint and subject-specific appearance.
- Better robustness: More robust to self-occlusion and viewpoint variation.
Methodology
The overall objective is to synthesize a dance video given a music sequence and a reference image.
Motion Expert
Generative Strategy: Uses a Diffusion Model with Guidance-Free Training (GFT).
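- The forward noising process in DDPM is defined as:
$$q(x_t \mid x_0) = \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\, I\right)$$
where $\bar{\alpha}_t \in (0, 1)$ are constants following a monotonically decreasing schedule.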
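- With music conditioning $c$, the model learns to estimate the clean motion $x_0$ from the noised sample $x_t$, i.e. $\hat{x}_\theta(x_t, t, c) \approx x_0$. GFT establishes $\tilde{x}_\theta$ as the new optimization target:
$$\tilde{x}_\theta(x_t, t, c, \beta) = \beta\, \hat{x}_\theta(x_t, t, c, \beta) + (1 - \beta)\, \mathrm{sg}\left[\hat{x}_\theta(x_t, t, \varnothing, 1)\right]$$
where $\varnothing$ denotes the unconditional setting, $\mathrm{sg}[\cdot]$ is stop-gradient, and $\beta$ is a temperature parameter.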
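- The training loss combines reconstruction, 3D joint, velocity, and foot contact losses:
$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{joint}} \mathcal{L}_{\text{joint}} + \lambda_{\text{vel}} \mathcal{L}_{\text{vel}} + \lambda_{\text{contact}} \mathcal{L}_{\text{contact}}$$
where:
$$\mathcal{L}_{\text{rec}} = \mathbb{E}_{x_0, t}\left\| x_0 - \hat{x}_\theta(x_t, t, c) \right\|_2^2$$
$$\mathcal{L}_{\text{joint}} = \frac{1}{N} \sum_{i=1}^{N} \left\| FK(x^{(i)}) - FK(\hat{x}^{(i)}) \right\|_2^2$$
$$\mathcal{L}_{\text{vel}} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left\| (x^{(i+1)} - x^{(i)}) - (\hat{x}^{(i+1)} - \hat{x}^{(i)}) \right\|_2^2$$
$$\mathcal{L}_{\text{contact}} = \frac{1}{N-1} \sum_{i=1}^{N-1} \left\| \left( FK(\hat{x}^{(i+1)}) - FK(\hat{x}^{(i)}) \right) \cdot \hat{b}^{(i)} \right\|_2^2$$
Here $x^{(i)}$ and $\hat{x}^{(i)}$ are the ground-truth and predicted poses at frame $i$ of an $N$-frame sequence, $FK(\cdot)$ is the forward kinematic function, and $\hat{b}^{(i)}$ is the predicted binary foot contact label.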
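The two diffusion ingredients above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it samples $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ (the standard DDPM closed form) and blends a conditional and an unconditional prediction with weight $\beta$, as GFT does. The cosine schedule and the function names `cosine_alpha_bar`, `forward_noise`, and `gft_blend` are assumptions made for illustration; the source does not state which schedule is used.

```python
import math
import random


def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal rate alpha_bar_t; shrinks monotonically as t grows.

    The cosine schedule is an assumption, chosen only for illustration.
    """
    def f(u):
        return math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def forward_noise(x0, t, num_steps, rng):
    """Sample x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    a_bar = cosine_alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    return [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
            for x, e in zip(x0, eps)]


def gft_blend(pred_cond, pred_uncond, beta):
    """Return beta * conditional + (1 - beta) * unconditional prediction.

    During training the unconditional branch would be wrapped in a
    stop-gradient; plain lists carry no gradient, so none is needed here.
    """
    return [beta * c + (1.0 - beta) * u for c, u in zip(pred_cond, pred_uncond)]


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                      # toy "motion" vector
x_mid = forward_noise(x0, t=500, num_steps=1000, rng=rng)
blended = gft_blend([1.0, 2.0], [0.0, 0.0], beta=0.5)   # [0.5, 1.0]
```

With `beta=1.0` the blend reduces to the conditional prediction alone, which is why a model trained this way can be sampled without a second, unconditional forward pass.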
Model Architecture: Adopts a BiMamba–Transformer hybrid backbone.
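- BiMamba captures intra-modal local dependencies in music or dance. The Selective State Space Model (Mamba) dynamics are:
$$h_t = \bar{A}_t h_{t-1} + \bar{B}_t x_t, \qquad y_t = C_t h_t$$
where $\bar{A}_t$, $\bar{B}_t$, and $C_t$ are dynamically updated from the input. The state transitions are:
$$\bar{A}_t = \exp(\Delta_t A), \qquad \bar{B}_t = (\Delta_t A)^{-1}\left(\exp(\Delta_t A) - I\right) \Delta_t B_t$$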
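- Transformer models cross-modal global context via cross-attention:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
where motion features are queries ($Q$) and music features provide keys/values ($K$, $V$).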
- The architecture enables non-autoregressive generation of entire sequences, improving efficiency and avoiding exposure bias.
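To make the division of labour concrete, the following toy sketch runs a scalar-state linear recurrence in both directions (the bidirectional idea behind BiMamba) and a single-head cross-attention in which motion features query music features. It is a simplified illustration under stated assumptions: the real Mamba block uses vector states and learns its per-step parameters from the input, whereas here the per-step `decays` and `gains` are simply passed in, and summing the two scan directions is an assumed merge rule. All function names are hypothetical.

```python
import math


def selective_scan(xs, decays, gains):
    """Scalar recurrence h_t = a_t * h_{t-1} + b_t * x_t; returns every h_t."""
    h, ys = 0.0, []
    for x, a, b in zip(xs, decays, gains):
        h = a * h + b * x
        ys.append(h)
    return ys


def bidirectional_scan(xs, decays, gains):
    """Sum a left-to-right and a right-to-left scan (merge rule is an assumption)."""
    fwd = selective_scan(xs, decays, gains)
    bwd = selective_scan(xs[::-1], decays[::-1], gains[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V for lists of equal-length feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)                                  # subtract max for stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# An impulse in the middle of the sequence spreads to both neighbours.
spread = bidirectional_scan([0.0, 1.0, 0.0], [0.5] * 3, [1.0] * 3)   # [0.5, 2.0, 0.5]

# One motion query attending over two music frames with identical keys.
mixed = cross_attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[0.0], [2.0]])  # [[1.0]]
```

A unidirectional scan would leave the first position at 0.0 in the impulse example, which illustrates why removing the backward pass costs context.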
Appearance Expert
Built upon Wan-Animate, with a decoupled Kinematic–Aesthetic fine-tuning strategy to adapt it for dance video generation.
Model Architecture:
- A 3D-to-2D Motion Projector converts the SMPL sequence from the Motion Expert into 2D keypoints (using pyrender and ViTPose) for Wan-Animate.
- Kinematic Stage: Fine-tunes only the Body Adapter (freezing other components) to strengthen kinematic conditioning and motion adherence.
- Aesthetic Stage: Attaches LoRA adapters to each DiT block for aesthetic refinement, enhancing texture fidelity and stylistic consistency while preserving pretrained priors. LoRA updates the weight matrix as:
$$W' = W + \Delta W = W + BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are low-rank matrices ($r \ll \min(d, k)$).
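The LoRA update in the Aesthetic Stage amounts to adding a low-rank product to a frozen weight matrix. The sketch below shows this with nested lists; it is illustrative only, the helper names `matmul` and `lora_weight` are hypothetical, and the `scale` factor (commonly `alpha / r` in LoRA implementations) is an assumption not stated in the source.

```python
def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_weight(w, lora_b, lora_a, scale=1.0):
    """Return W + scale * (B @ A); W stays frozen, only B and A are trained."""
    delta = matmul(lora_b, lora_a)
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


w = [[1.0, 0.0], [0.0, 1.0]]          # frozen pretrained weight (d = k = 2)
lora_b = [[1.0], [0.0]]               # d x r with rank r = 1
lora_a = [[0.0, 2.0]]                 # r x k
adapted = lora_weight(w, lora_b, lora_a)            # [[1.0, 2.0], [0.0, 1.0]]

# With B initialised to zero the adapted weight equals the pretrained one,
# so fine-tuning starts from the pretrained behaviour.
unchanged = lora_weight(w, [[0.0], [0.0]], lora_a)
```

For a `d x k` layer the adapter trains `r * (d + k)` values instead of `d * k`, which is where the parameter saving comes from when `r` is small.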
Empirical Validation / Results
Dataset: MA-Data
A large-scale dance video dataset curated for benchmarking.
- Size: 70k clips of 5–10 seconds each (116 hours total), spanning over 20 dance genres.
- Composition:
- 3D-rendered data (motion-centric): 20k clips (28 hours) derived from FineDance, rendered from 3D professional dancer motions.
- In-the-wild internet data (appearance-centric): 50k clips (88 hours) collected from TikTok/YouTube, emphasizing visual appearance.
- Test set: 200 five-second clips from high-engagement videos across multiple genres.
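As a quick consistency check on the reported dataset statistics, the snippet below sums the two subsets and derives the implied mean clip length. It uses only the figures listed above; because the hour counts are rounded, the averages are approximate. The helper name `mean_clip_seconds` is hypothetical.

```python
def mean_clip_seconds(hours, clips):
    """Average clip duration in seconds implied by total hours and clip count."""
    return hours * 3600.0 / clips


# (hours, clips) per subset, as reported for MA-Data
subsets = {"3D-rendered": (28, 20_000), "in-the-wild": (88, 50_000)}

total_hours = sum(hours for hours, _ in subsets.values())     # 116
total_clips = sum(clips for _, clips in subsets.values())     # 70000
overall_mean = mean_clip_seconds(total_hours, total_clips)    # about 5.97 s

per_subset_mean = {name: mean_clip_seconds(h, c) for name, (h, c) in subsets.items()}
```

The subset totals match the headline figures (70k clips, 116 hours), and every implied mean falls inside the stated 5–10 second range.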
Evaluation Protocol
A motion–appearance evaluation protocol.
- Motion dimension: Assesses fidelity, diversity, and synchronization from a Human-Kinematics perspective using 2D keypoints extracted by ViTPose.
- Metrics: FID$_k$, FID$_g$, DIV$_k$, DIV$_g$ (kinetic and geometric feature variants), Beat Alignment Score (BAS).
- Appearance dimension: Uses VBench dance-specific metrics.
- Metrics: Imaging Quality (IQ), Aesthetic Quality (AQ), Subject Consistency (SC), Background Consistency (BC), Motion Smoothness (MS), Temporal Flickering (TF).
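The summary lists the Beat Alignment Score without a formula. The sketch below follows one definition commonly used in the music-to-dance literature: for each music beat, take the squared distance to the nearest motion beat and average a Gaussian of it. Treat this as an assumption rather than the paper's exact protocol; the averaging direction (over music beats), the value `sigma=3.0` frames, and the use of local speed minima as motion beats are all illustrative choices, and the function names are hypothetical.

```python
import math


def motion_beats_from_speed(speed):
    """Frames where the per-frame speed is a strict local minimum."""
    return [i for i in range(1, len(speed) - 1)
            if speed[i] < speed[i - 1] and speed[i] < speed[i + 1]]


def beat_alignment_score(music_beats, motion_beats, sigma=3.0):
    """Mean over music beats of exp(-d^2 / (2 sigma^2)), d = gap to nearest motion beat."""
    if not music_beats or not motion_beats:
        return 0.0
    total = 0.0
    for tm in music_beats:
        nearest_sq = min((td - tm) ** 2 for td in motion_beats)
        total += math.exp(-nearest_sq / (2.0 * sigma ** 2))
    return total / len(music_beats)


speed = [3.0, 1.0, 2.0, 0.5, 4.0]                 # toy joint-speed curve
beats = motion_beats_from_speed(speed)            # [1, 3]
perfect = beat_alignment_score([10, 20], [10, 20])    # 1.0
offset = beat_alignment_score([10], [13])             # exp(-0.5)
```

Under this definition the score lies in (0, 1], and a three-frame offset with `sigma=3.0` already lowers a beat's contribution to about 0.61.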
Quantitative Results
Table 1: Quantitative comparison on MA-Data in Music-Driven Dance Video Generation
| Method | IQ ↑ | AQ ↑ | SC ↑ | BC ↑ | MS ↑ | TF ↑ | FID$_k$ ↓ | FID$_g$ ↓ | DIV$_k$ ↑ | DIV$_g$ ↑ | BAS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ground Truth | 67.12 | 53.51 | 91.86 | 92.97 | 98.20 | 96.88 | – | – | 9.24 | 5.31 | 0.526 |
| Hallo2 | 62.64 | 50.79 | 92.48 | 93.84 | 98.30 | 96.56 | 16.55 | 1.29 | 8.11 | 5.47 | 0.505 |
| WAN-S2V | 64.10 | 50.20 | 92.30 | 93.40 | 98.20 | 96.70 | 18.90 | 1.45 | 7.60 | 5.44 | 0.485 |
| Echomimic-V3 | 63.20 | 49.00 | 91.90 | 93.10 | 98.05 | 96.40 | 19.60 | 1.32 | 7.20 | 4.60 | 0.460 |
| EDGE | 63.05 | 49.70 | 91.79 | 93.30 | 98.64 | 97.10 | 21.77 | 1.39 | 9.08 | 5.74 | 0.498 |
| Lodge | 63.69 | 49.22 | 91.67 | 92.98 | 98.46 | 97.05 | 18.73 | 1.49 | 8.87 | 5.71 | 0.474 |
| MEGA | 66.14 | 49.89 | 92.95 | 94.13 | 97.45 | 96.32 | 18.98 | 1.65 | 8.78 | 5.59 | 0.513 |
| MACE-Dance | 65.35 | 51.79 | 93.97 | 94.57 | 98.46 | 97.10 | 16.46 | 0.28 | 9.74 | 6.34 | 0.523 |
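MACE-Dance achieves the best or tied-best result among the compared methods on most metrics: AQ, SC, BC, TF (tied with EDGE), both FID columns, both DIV columns, and BAS. MEGA retains the highest IQ (66.14 vs. 65.35) and EDGE the highest MS (98.64 vs. 98.46).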
Table 2: Quantitative comparison on FineDance in Music-Driven 3D Dance Generation
| Method | FID$_k$ ↓ | FID$_g$ ↓ | FSR ↓ | DIV$_k$ ↑ | DIV$_g$ ↑ | BAS ↑ | FPS ↑ |
|---|---|---|---|---|---|---|---|
| Ground Truth | – | – | 0.216 | 9.94 | 7.54 | 0.201 | – |
| FACT | 113.38 | 97.05 | 0.284 | 3.36 | 6.37 | 0.183 | 29 |
| MNET | 104.71 | 90.31 | 0.394 | 3.12 | 6.14 | 0.186 | 26 |
| Bailando | 82.81 | 28.17 | 0.188 | 7.74 | 6.25 | 0.202 | 188 |
| EDGE | 94.34 | 50.38 | 0.200 | 8.13 | 6.45 | 0.212 | 119 |
| Lodge | 50.00 | 35.52 | 0.028 | 5.67 | 4.96 | 0.226 | 224 |
| MEGA | 50.00 | 13.02 | 0.243 | 6.23 | 6.27 | 0.226 | 238 |
| Motion Expert (Full) | 17.83 | 25.09 | 0.210 | 10.30 | 8.09 | 0.229 | 770 |
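The Motion Expert achieves the lowest first-column FID (17.83), the highest diversity (10.30 and 8.09), the highest BAS (0.229), and by far the highest throughput (770 FPS). MEGA retains a lower second-column FID (13.02 vs. 25.09), and Lodge the lowest FSR (0.028 vs. 0.210).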
Table 3: Quantitative comparison on MA-Data in Pose-Driven Image Animation
| Method | FVD ↓ | SSIM ↑ | LPIPS ↓ | PSNR ↑ |
|---|---|---|---|---|
| Animate-Anyone | 515.26 | 0.648 | 0.091 | 19.65 |
| Magic-Animate | 1032.06 | 0.311 | 0.207 | 14.00 |
| Wan-Animate | 332.82 | 0.707 | 0.078 | 21.11 |
| w/o. Kinematic stage | 328.91 | 0.596 | 0.107 | 18.69 |
| w/o. Aesthetic stage | 445.93 | 0.563 | 0.121 | 17.89 |
| Appearance Expert | 274.94 | 0.739 | 0.066 | 22.40 |
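The Appearance Expert achieves the best score on all four metrics. The two ablation rows support the two-stage fine-tuning: removing either stage drops SSIM, LPIPS, and PSNR below the Wan-Animate baseline.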
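For readers unfamiliar with the PSNR column, the standard definition is $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$ in decibels. The sketch below computes it over flattened pixel lists; it is a minimal illustration rather than the evaluation code used for Table 3, and the 8-bit `max_value=255.0` default is an assumption about the pixel range.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(generated):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - g) ** 2 for r, g in zip(reference, generated)) / len(reference)
    if mse == 0.0:
        return math.inf            # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


ref = [100.0, 100.0, 100.0, 100.0]
gen = [110.0, 110.0, 110.0, 110.0]
score = psnr(ref, gen)             # MSE = 100, so 10 * log10(650.25)
```

Because the scale is logarithmic, the 1.29 dB gap between the Appearance Expert (22.40) and Wan-Animate (21.11) corresponds to roughly a 26% lower mean squared error.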
Qualitative Analysis
- Effect Comparison: MACE-Dance generates videos with kinematically plausible and artistically expressive motion, and spatiotemporally coherent appearance, outperforming baselines like Hallo2, EDGE, Lodge, MEGA, WAN-S2V, and Echomimic-V3 (see Fig. 3).
- Cross-Genre Generation: Effectively generates distinct genre-specific motions (Uyghur, Dunhuang, Dai, K-Pop, Popping) as shown in Fig. 4.
- Long-Sequence Generation: Produces coherent long-sequence dance videos (up to 30 seconds) thanks to the BiMamba–Transformer hybrid and pose-driven relay rendering (see Fig. 6).
Ablation Studies
Motion Expert Ablation (Table 2):
- BiMamba → Mamba: Removes bidirectional context, degrading dance quality metrics (FID$_k$=65.10, FID$_g$=51.74), although efficiency improves (1044 FPS). Generated dances become simpler.
- BiMamba → Transformer: Undermines non-autoregressive generation, causing a collapse into in-place jitter. Metrics drop severely (FID$_k$=104.93, FID$_g$=114.42).
- GFT → CFG: Replacing GFT with classifier-free guidance leads to a modest decline in metrics and lower generation efficiency.
Appearance Expert Ablation (Table 3 & Fig. 7):
- w/o. Kinematic Stage: Leads to modest metric decline and noticeable kinematic errors/motion blur.
- w/o. Aesthetic Stage: Causes substantial degradation and obvious ghosting artifacts.
- The full Appearance Expert outperforms the Wan-Animate baseline on all four metrics.
Motion Representation (2D vs. 3D) (Table 4):
| Representation | Motion FID$_k$ ↓ | Motion FID$_g$ ↓ | Motion DIV$_k$ ↑ | Motion DIV$_g$ ↑ | Motion BAS ↑ | AQ ↑ | SC ↑ | Video FID ↓ | Video BAS ↑ |
|---|---|---|---|---|---|---|---|---|---|
| 2D | 22.8 | 8.6 | 6.12 | 5.24 | 0.527 | 51.86 | 91.84 | 23.73 | 0.496 |
| 3D | 19.5 | 4.1 | 8.87 | 5.92 | 0.543 | 51.79 | 93.97 | 16.46 | 0.523 |
The 3D representation outperforms 2D on every motion metric and on every final video metric except AQ, where the two are nearly tied (51.86 vs. 51.79).
Role of Each Expert (Cross-Composition Analysis, Table 5):
| Method | AQ ↑ | SC ↑ | FID ↓ | BAS ↑ |
|---|---|---|---|---|
| w/o.ME (EDGE + Our AE) | 50.21 | 92.10 | 20.84 | 0.499 |
| w/o.AE (Our ME + Wan-Animate) | 50.36 | 91.42 | 17.92 | 0.519 |
| Ours (Full MACE-Dance) | 51.79 | 93.97 | 16.46 | 0.523 |
Both experts contribute positively; the Motion Expert strengthens music-motion alignment (BAS), and the Appearance Expert improves visual quality (AQ, SC).
Comparison with Video Foundation Models (Table 6):
| Method | AQ ↑ | SC ↑ | FID ↓ | BAS ↑ |
|---|---|---|---|---|
| CogVideoX1.5-5B | 50.38 | 89.92 | 22.47 | 0.477 |
| WAN2.2-5B | 53.22 | 90.77 | 17.53 | 0.452 |
| Ours | 51.79 | 93.97 | 16.46 | 0.523 |
MACE-Dance leads on SC, FID, and BAS, indicating stronger subject consistency and music-motion alignment, while WAN2.2-5B attains a higher AQ (53.22 vs. 51.79).
Theoretical and Practical Implications
- Task Decoupling: The cascaded expert design effectively isolates motion semantics from visual appearance, reducing the complexity of learning a direct music-to-video mapping. The explicit 3D motion representation suppresses spurious cross-modal correlations and provides an interpretable intermediate interface.
- Architectural Innovations: The BiMamba–Transformer hybrid backbone combines local dependency modeling (BiMamba) with global cross-modal context (Transformer), enabling high-quality, non-autoregressive sequence generation.