Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer
Summary (Overview)
- Proposes a unified three-stage framework (Perception–Planning–Control) for conditional motion generation that separates high-level semantic planning from low-level kinematic control.
- Introduces MoTok, a novel diffusion-based discrete motion tokenizer that decouples semantic abstraction (handled by compact tokens) from fine-grained motion reconstruction (handled by a diffusion decoder).
- Achieves superior efficiency and performance, using only 1/6 of the tokens compared to prior methods while significantly improving both fidelity (lower FID) and controllability (lower trajectory error) on text-and-trajectory generation tasks.
- Demonstrates a unique advantage: unlike prior methods, whose fidelity degrades when kinematic constraints are added, MoTok's motion fidelity improves under stronger constraints, owing to its coarse-to-fine conditioning scheme.
Introduction and Theoretical Foundation
Human motion generation is crucial for animation, robotics, and embodied agents. A central challenge is integrating fine-grained, time-varying kinematic control signals (e.g., trajectories) with high-level semantic intent (e.g., text descriptions). Existing approaches face a trade-off:
- Token-based models (e.g., VQ-VAE, MoMask) compress motion into discrete tokens for efficient sequence modeling but often entangle semantics with low-level details, requiring many tokens or hierarchical codes for faithful reconstruction. This complicates controllable generation as kinematic signals can override semantic conditioning.
- Continuous diffusion models excel at reconstructing smooth, detailed motion but operate directly on raw sequences, leading to slow inference.
The paper's core insight is a division of labor: discrete tokens should capture semantic abstraction, while diffusion models handle fine-grained reconstruction. This motivates the proposed Perception–Planning–Control paradigm and the MoTok tokenizer, which bridges the strengths of both paradigms.
Methodology
Problem Formulation
- A motion sequence is $x = (x_1, \dots, x_N)$, where $x_i \in \mathbb{R}^d$.
- It is encoded into a discrete token sequence $z = (z_1, \dots, z_M)$, where each $z_i$ indexes a codebook of size $K$. The compression ratio is $r = N / M$.
- Conditions are categorized as:
  - Global conditions $c_g$: sequence-level guidance (e.g., text).
  - Local conditions $c_l$: time-aligned kinematic signals (e.g., trajectories, keyframes).
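To make the formulation concrete, the sketch below instantiates the shapes involved. The specific sizes (196 frames, 263-dim features, codebook of 512, downsampling rate 4) are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

# Illustrative sizes only (assumptions, not the paper's exact configuration).
N, d = 196, 263        # frames N and per-frame feature dimension d
r = 4                  # temporal downsampling rate
M = N // r             # number of discrete tokens; compression ratio r = N / M
K = 512                # codebook size

x = np.random.randn(N, d)            # motion sequence x_1..x_N, each x_i in R^d
z = np.random.randint(0, K, size=M)  # token sequence z_1..z_M, each indexing the codebook
```

At a rate of 4, a 196-frame motion is planned with only 49 discrete tokens.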
MoTok: Diffusion-based Discrete Motion Tokenizer
MoTok factorizes motion representation into compact discrete codes and a diffusion decoder.
- Convolutional Encoder: extracts latent features $h = (h_1, \dots, h_M)$ with temporal downsampling.
- Vector Quantizer: discretizes the latent sequence. For a codebook $\mathcal{C} = \{e_k\}_{k=1}^{K}$:
  $\hat{e}_i = e_{k^*}$, with $k^* = \arg\min_k \lVert h_i - e_k \rVert_2$.
- Decoder with Diffusion-based Reconstruction: instead of direct regression, quantized latents are decoded into a per-frame conditioning sequence, which guides a conditional diffusion model.
  - A convolutional decoder upsamples: $c_{1:N} = \mathrm{Dec}(\hat{e}_{1:M})$.
  - A conditional diffusion denoiser predicts clean motion from noisy input: $\hat{x}_0 = D_\theta(x_t, t, c_{1:N})$.
  - The reverse diffusion process is $p_\theta(x_{t-1} \mid x_t, c_{1:N})$.
- Training Objective: combines a diffusion reconstruction loss and a VQ commitment loss,
  $\mathcal{L} = \mathcal{L}_{\text{diff}} + \beta \lVert h - \mathrm{sg}[\hat{e}] \rVert_2^2$,
  where $\mathcal{L}_{\text{diff}} = \mathbb{E}_{t,\epsilon}\,\ell\big(D_\theta(x_t, t, c_{1:N}), x_0\big)$ and $\ell$ is the Smooth-$L_1$ loss.
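The vector-quantization step above can be sketched in a few lines. This is a minimal numpy illustration of nearest-neighbour quantization and the commitment term, with made-up sizes; the paper's actual encoder, codebook training, and straight-through gradient machinery are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

M, d_h, K = 49, 64, 512              # latent length, latent dim, codebook size (illustrative)
h = rng.normal(size=(M, d_h))        # encoder latents h_1..h_M
codebook = rng.normal(size=(K, d_h)) # codebook C = {e_k}

# Nearest-neighbour quantization: k* = argmin_k ||h_i - e_k||_2
dists = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (M, K) squared distances
k_star = dists.argmin(axis=1)                                  # discrete tokens z
e_hat = codebook[k_star]                                       # quantized latents

# Commitment loss ||h - sg[e_hat]||^2 (sg = stop-gradient; a no-op in plain numpy)
commit = ((h - e_hat) ** 2).mean()
```

In MoTok, `e_hat` is then upsampled into the per-frame conditioning sequence that guides the diffusion decoder, rather than being regressed directly to motion.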
Unified Conditional Motion Generation Framework
The framework decouples planning in discrete token space from control in diffusion decoding.
- Planning (Discrete Token Space): a token generator $p_\phi(z \mid c_g, c_l)$ (supporting both discrete diffusion and autoregressive models) produces the token sequence $z$.
  - Conditions are injected via a unified interface: global condition embeddings are prepended to the token embedding sequence as prefix tokens.
  - Local condition features are added to the embeddings at the corresponding motion token positions.
  - Classifier-Free Guidance (CFG) is extended to multi-condition scenarios using alternating guidance pairs to balance semantics and control.
- Control (Diffusion Decoding): after planning, tokens are decoded by MoTok into the conditioning sequence $c_{1:N}$, and motion is synthesized via conditional diffusion. Fine-grained kinematic control is enforced during denoising using gradient-based refinement,
  $x_t \leftarrow x_t - \eta \nabla_{x_t} J(\hat{x}_0, c_l)$,
  where $J$ measures deviation from the local conditions (e.g., trajectory error).
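The multi-condition CFG scheme in the planning stage can be sketched as below. The alternation rule and guidance weights here are one plausible reading of "alternating guidance pairs", not the paper's exact schedule:

```python
import numpy as np

def cfg(pred_cond, pred_uncond, w):
    """Standard classifier-free guidance combination:
    pred_uncond + w * (pred_cond - pred_uncond)."""
    return pred_uncond + w * (pred_cond - pred_uncond)

def alternating_guidance(step, pred_text, pred_traj, pred_uncond,
                         w_text=4.0, w_traj=2.0):
    """Alternate which condition drives guidance at each sampling step,
    so neither text nor trajectory dominates. The even/odd split and the
    weights are illustrative assumptions."""
    if step % 2 == 0:
        return cfg(pred_text, pred_uncond, w_text)
    return cfg(pred_traj, pred_uncond, w_traj)
```

Even steps push the sample toward the text condition, odd steps toward the trajectory condition, balancing semantics against control across the sampling trajectory.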
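The gradient-based refinement in the control stage can likewise be sketched with a concrete choice of $J$: mean squared pelvis-trajectory error. The channel layout (`pelvis_dims`) is an assumption, and for simplicity the gradient is taken directly at the current sample rather than flowing through the denoiser's $\hat{x}_0$ prediction as in the full method:

```python
import numpy as np

def traj_loss_and_grad(x0_hat, target, pelvis_dims=(0, 1, 2)):
    """J(x0_hat, c_l): mean squared pelvis-trajectory error and its gradient.
    pelvis_dims marks which feature channels are assumed to hold pelvis xyz."""
    diff = x0_hat[:, pelvis_dims] - target       # (N, 3) deviation from target trajectory
    J = (diff ** 2).mean()
    grad = np.zeros_like(x0_hat)
    grad[:, pelvis_dims] = 2.0 * diff / diff.size
    return J, grad

def refine(x_t, target, eta=0.5, steps=20):
    """Iterate x_t <- x_t - eta * grad J (treating x0_hat ~ x_t for this sketch;
    the paper's step size and schedule are not specified here)."""
    x = x_t.copy()
    for _ in range(steps):
        _, g = traj_loss_and_grad(x, target)
        x = x - eta * g
    return x
```

Each refinement step nudges only the trajectory-relevant channels toward the target while the diffusion prior, in the full method, keeps the rest of the pose plausible.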
Empirical Validation / Results
Experiments were conducted on HumanML3D and KIT-ML datasets.
Text and Trajectory Control
MoTok significantly outperforms state-of-the-art methods (MaskControl, InterControl, CrowdMoGen) in both fidelity (FID) and controllability (Trajectory Error), using far fewer tokens.
Table 1: Controllable motion generation results on HumanML3D (Pelvis control setting).
| Method | FID ↓ | R-Precision (Top-3) ↑ | Diversity → | Traj. Err. (50cm) ↓ | Avg. Err. (m) ↓ |
|---|---|---|---|---|---|
| Real Motion | 0.002 | 0.797 | 9.503 | 0.0000 | 0.0000 |
| MaskControl [29] | 0.061 | 0.809 | 9.496 | 0.0000 | 0.0098 |
| MoTok-DDM-4 | 0.029 | 0.794 | 9.476 | 0.0000 | 0.0049 |
- Key Result: MoTok-DDM-4 uses 1/6 of the tokens of MaskControl, yet reduces FID from 0.061 to 0.029 and average error from 0.0098m to 0.0049m.
- Unique Finding: While prior methods' FID degrades when adding trajectory constraints, MoTok's FID improves (e.g., from 0.039 to 0.029), indicating trajectory acts as complementary guidance.
Text-to-Motion Generation
MoTok achieves strong performance with aggressive token compression.
Table 2: Text-to-motion results on HumanML3D (excerpt).
| Method | R-Precision (Top-3) ↑ | FID ↓ | Diversity → |
|---|---|---|---|
| Real Motion | 0.797 ± .002 | 0.002 ± .000 | 9.503 ± .065 |
| MoMask [9] | 0.807 ± .002 | 0.045 ± .002 | - |
| MoTok-DDM-2 | 0.799 ± .002 | 0.033 ± .002 | 9.523 ± .090 |
| MoTok-DDM-4 | 0.793 ± .002 | 0.039 ± .002 | 9.411 ± .078 |
- With 1/6 of the tokens, MoTok-DDM-4 achieves a lower FID (0.039) than MoMask (0.045).
- MoTok-AR-4 reduces T2M-GPT's FID by nearly threefold (0.053 vs. 0.141) under the same token budget.
Ablation Studies
Key findings from ablation studies (Table 3):
- Decoder Design: diffusion-based decoders (especially DiffusionConv with temporal convolutions) outperform plain convolutional decoders, which is crucial for generation under noise.
- Temporal Downsampling Rate: a moderate rate (2 or 4) best balances reconstruction and planning; excessive compression removes essential structure.
- Low-level Conditioning Location (Table 4): Applying constraints only during planning harms control error. Applying them only during decoding harms fidelity. The dual-path approach (both stages) is essential for optimal performance.
Table 4: Effect of low-level control injection location (DDM Planner).
| Low-level Condition | Ctrl. FID ↓ | Ctrl. Err. ↓ |
|---|---|---|
| Generator Only | 0.028 | 0.2170 |
| Token Decoder Only | 0.365 | 0.0056 |
| Generator + Token Decoder | 0.029 | 0.0049 |
Theoretical and Practical Implications
- Theoretical: Proposes a principled decomposition of motion generation into semantic planning and kinematic control, formalized by the Perception–Planning–Control paradigm. The diffusion-based tokenizer challenges the conventional VQ-VAE design by decoupling abstraction from reconstruction.
- Practical: Enables highly efficient and controllable motion generation. The compact tokens reduce computational burden for downstream planners. The framework's generator-agnostic design supports both AR and DDM backbones. The improvement in fidelity under stronger constraints is particularly valuable for applications requiring precise motion control, such as animation and robotics.
Conclusion
This work bridges discrete token-based and continuous diffusion-based motion generation. The proposed MoTok tokenizer and the Perception–Planning–Control framework enable efficient, high-fidelity motion synthesis under combined semantic and kinematic conditions. By offloading fine-grained reconstruction to diffusion decoding, MoTok achieves state-of-the-art performance with a dramatically reduced token budget, demonstrating that compact tokens and high fidelity are not mutually exclusive. Future work may explore extending this paradigm to other sequential data domains.