# Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

> MoTok uses compact tokens for semantic planning and a diffusion decoder for kinematic control, enabling motion fidelity to improve under stronger constraints with far fewer tokens than prior methods.

- **Source:** [arXiv](https://arxiv.org/abs/2603.19227)
- **Published:** 2026-03-21
- **Permalink:** https://picx.dev/p/IXtC9Y
- **Whiteboard:** https://picx.dev/p/IXtC9Y/image

## Summary

## Summary (Overview)
*   **Proposes a unified three-stage framework** (Perception–Planning–Control) for conditional motion generation that separates high-level semantic planning from low-level kinematic control.
*   **Introduces MoTok**, a novel diffusion-based discrete motion tokenizer that decouples semantic abstraction (handled by compact tokens) from fine-grained motion reconstruction (handled by a diffusion decoder).
*   **Achieves superior efficiency and performance**, using only **1/6 as many tokens** as prior methods while improving both fidelity (lower FID) and controllability (lower trajectory error) on text-and-trajectory generation tasks.
*   **Demonstrates a unique advantage**: Unlike prior methods, motion fidelity **improves** under stronger kinematic constraints rather than degrading, due to the coarse-to-fine conditioning scheme.

## Introduction and Theoretical Foundation
Human motion generation is crucial for animation, robotics, and embodied agents. A central challenge is integrating fine-grained, time-varying kinematic control signals (e.g., trajectories) with high-level semantic intent (e.g., text descriptions). Existing approaches face a trade-off:
*   **Token-based models** (e.g., VQ-VAE, MoMask) compress motion into discrete tokens for efficient sequence modeling but often entangle semantics with low-level details, requiring many tokens or hierarchical codes for faithful reconstruction. This complicates controllable generation as kinematic signals can override semantic conditioning.
*   **Continuous diffusion models** excel at reconstructing smooth, detailed motion but operate directly on raw sequences, leading to slow inference.

The paper's core insight is a **division of labor**: discrete tokens should capture semantic abstraction, while diffusion models handle fine-grained reconstruction. This motivates the proposed **Perception–Planning–Control** paradigm and the **MoTok** tokenizer, which bridges the strengths of both paradigms.

## Methodology

### Problem Formulation
*   A motion sequence is $\theta_{1:T} = \{\theta_t\}_{t=1}^T$, where $\theta_t \in \mathbb{R}^D$.
*   It is encoded into a discrete token sequence $z_{1:N} = \{z_n\}_{n=1}^N$, where $z_n \in \{1, \ldots, K\}$ indexes a codebook of size $K$. The compression ratio is $\rho = T/N$.
*   Conditions are categorized as:
    *   **Global conditions** $c_g$: Sequence-level guidance (e.g., text).
    *   **Local conditions** $c^s_{1:T}$: Time-aligned kinematic signals (e.g., trajectories, keyframes).
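
The relation $\rho = T/N$ can be made concrete with a small calculation. The sketch below is illustrative only: the 196-frame clip length and the round-up handling of non-divisible lengths are assumptions, not details taken from the paper.

```python
import math


def num_tokens(num_frames: int, downsample_rate: int) -> int:
    """Number of tokens N for a T-frame clip at compression ratio rho = T / N.

    Rounding up assumes the encoder pads clips whose length is not a
    multiple of the rate (an assumption made for this sketch).
    """
    return math.ceil(num_frames / downsample_rate)


T = 196  # hypothetical clip length in frames
for rho in (1, 2, 4):
    print(f"rho={rho}: N={num_tokens(T, rho)} tokens")
```

A higher $\rho$ gives the planner a shorter sequence to model, at the cost of pushing more of the reconstruction burden onto the decoder.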

### MoTok: Diffusion-based Discrete Motion Tokenizer
MoTok factorizes motion representation into compact discrete codes and a diffusion decoder.

1.  **Convolutional Encoder:** Extracts latent features with temporal downsampling.
    $$ \mathbf{h}_{1:N} = \mathcal{E}(\theta_{1:T}), \quad \mathbf{h}_{1:N} \in \mathbb{R}^{N \times d} $$

2.  **Vector Quantizer:** Discretizes the latent sequence. For a codebook $\mathcal{C} = \{\mathbf{c}_k\}_{k=1}^K$:
    $$ z_n = \arg\min_{k \in \{1,\ldots,K\}} \|\mathbf{h}_n - \mathbf{c}_k\|_2^2, \quad \mathbf{q}_n = \mathbf{c}_{z_n} $$

3.  **Decoder with Diffusion-based Reconstruction:** Instead of direct regression, quantized latents are decoded into a per-frame conditioning sequence, which guides a conditional diffusion model.
    *   A convolutional decoder upsamples: $\mathbf{s}_{1:T} = \mathcal{D}(\mathbf{q}_{1:N})$.
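    *   A conditional diffusion denoiser $f_\phi$ predicts the clean motion $\mathbf{x}_0$ from its noised version $\mathbf{x}_t$ (here $t$ is the diffusion timestep, not the frame index used in $\theta_t$):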
        $$ \hat{\mathbf{x}}_0 = f_\phi(\mathbf{x}_t, t, \mathbf{s}_{1:T}) $$
    *   The reverse diffusion process is $p_\phi(\mathbf{x}_{t-1} | \mathbf{x}_t, \mathbf{s}_{1:T})$.

4.  **Training Objective:** Combines a diffusion reconstruction loss and a VQ commitment loss.
    $$ \mathcal{L} = \mathcal{L}_{\text{diff}} + \lambda_{\text{commit}} \mathcal{L}_{\text{commit}} $$
    where $\mathcal{L}_{\text{diff}} = \mathbb{E}_{t,\epsilon}[\ell(\hat{\mathbf{x}}_0, \mathbf{x}_0)]$ and $\ell$ is the Smooth-$\ell_1$ loss.
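
The quantization step and the combined objective can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the shapes, the value of `lambda_commit`, and the noisy stand-in for the denoiser output $\hat{\mathbf{x}}_0$ are all hypothetical, and a trainable version would apply a stop-gradient to the codewords inside the commitment term.

```python
import numpy as np


def quantize(h, codebook):
    """Nearest-codeword lookup: z_n = argmin_k ||h_n - c_k||^2, q_n = c_{z_n}."""
    # Pairwise squared distances between latents and codewords, shape (N, K).
    d2 = ((h[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    z = d2.argmin(axis=1)
    return z, codebook[z]


def smooth_l1(pred, target, beta=1.0):
    """Smooth-l1 loss averaged over all elements."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()


def commitment_loss(h, q):
    """Mean squared distance between encoder latents and their codewords."""
    return ((h - q) ** 2).mean()


rng = np.random.default_rng(0)
N, d, K = 49, 8, 16                     # hypothetical sizes
codebook = rng.normal(size=(K, d))
h = rng.normal(size=(N, d))             # stand-in for encoder output E(theta)
z, q = quantize(h, codebook)

x0 = rng.normal(size=(196, 4))          # stand-in for clean motion
x0_hat = x0 + 0.1 * rng.normal(size=x0.shape)  # stand-in for f_phi's prediction
lambda_commit = 0.02                    # hypothetical weight
loss = smooth_l1(x0_hat, x0) + lambda_commit * commitment_loss(h, q)
print(z[:5], float(loss))
```

The brute-force distance matrix costs $O(NKd)$ memory, which is fine for a sketch but would typically be replaced by a matrix-product formulation for large codebooks.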

### Unified Conditional Motion Generation Framework
The framework decouples **planning** in discrete token space from **control** in diffusion decoding.

*   **Planning (Discrete Token Space):** A token generator, which can be either a discrete diffusion model (DDM) or an autoregressive (AR) model, produces $z_{1:N}$.
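    *   Conditions are injected via a unified interface: global-condition features $\mathbf{M}_g$ are prepended to the token embeddings (written $\mathbf{h}_n$ here), giving the input sequence: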
        $$ \mathbf{H}^0 = [\mathbf{M}_g; \mathbf{h}_1; \ldots; \mathbf{h}_N] $$
    *   Local condition features $\mathbf{M}^s_n$ are added to motion token positions.
    *   **Classifier-Free Guidance (CFG)** is extended for multi-condition scenarios using alternating guidance pairs to balance semantics and control.

*   **Control (Diffusion Decoding):** After planning, tokens are decoded by MoTok into $\mathbf{s}_{1:T}$, and motion is synthesized via conditional diffusion. **Fine-grained kinematic control is enforced during denoising** using gradient-based refinement:
    $$ \hat{\mathbf{x}}_k \leftarrow \hat{\mathbf{x}}_k - \eta \nabla_{\hat{\mathbf{x}}_k} \mathcal{L}_{\text{ctrl}}(\hat{\mathbf{x}}_k, c^s_{1:T}) $$
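    where $\hat{\mathbf{x}}_k$ is the clean-motion estimate at the current denoising step, $\eta$ is the refinement step size, and $\mathcal{L}_{\text{ctrl}}$ measures deviation from the local conditions (e.g., trajectory error).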
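
The refinement update can be illustrated with a toy control loss. The sketch below assumes $\mathcal{L}_{\text{ctrl}}$ is half the squared error on the controlled frames and channels, so its gradient is simply the masked residual. The feature layout (first three channels as the controlled position), the mask, and the step size are hypothetical. A real system would likely differentiate through whatever maps motion features to joint positions.

```python
import numpy as np


def ctrl_loss_and_grad(x_hat, target, mask, ctrl_dims):
    """Half squared error on controlled frames/channels, plus its gradient."""
    residual = np.zeros_like(x_hat)
    # Only controlled channels on controlled frames contribute to the loss.
    residual[:, ctrl_dims] = (x_hat[:, ctrl_dims] - target) * mask[:, None]
    loss = 0.5 * float((residual ** 2).sum())
    # d(loss)/d(x_hat) equals the masked residual for this quadratic loss.
    return loss, residual


def refine(x_hat, target, mask, ctrl_dims, eta=0.5, steps=1):
    """Apply x <- x - eta * grad L_ctrl(x) for a fixed number of steps."""
    x = x_hat.copy()
    for _ in range(steps):
        _, grad = ctrl_loss_and_grad(x, target, mask, ctrl_dims)
        x -= eta * grad
    return x


rng = np.random.default_rng(0)
T, D = 8, 6
x_hat = rng.normal(size=(T, D))      # stand-in for the denoiser's estimate
ctrl_dims = slice(0, 3)              # hypothetical: channels 0..2 are controlled
target = np.zeros((T, 3))            # hypothetical target trajectory
mask = np.zeros(T, dtype=bool)
mask[::2] = True                     # control every other frame

before, _ = ctrl_loss_and_grad(x_hat, target, mask, ctrl_dims)
x_ref = refine(x_hat, target, mask, ctrl_dims, eta=0.5, steps=3)
after, _ = ctrl_loss_and_grad(x_ref, target, mask, ctrl_dims)
print(before, after)
```

With this quadratic loss, each step scales the residual by $(1 - \eta)$, so uncontrolled frames and channels are left untouched while the controlled entries move steadily toward the target.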

## Empirical Validation / Results
Experiments were conducted on HumanML3D and KIT-ML datasets.

### Text and Trajectory Control
MoTok significantly outperforms state-of-the-art methods (MaskControl, InterControl, CrowdMoGen) in both fidelity (FID) and controllability (Trajectory Error), using far fewer tokens.

**Table 1: Controllable motion generation results on HumanML3D (Pelvis control setting).**
| Method | FID ↓ | R-Precision (Top-3) ↑ | Diversity → | Traj. Err. (50cm) ↓ | Avg. Err. (m) ↓ |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **Real Motion** | 0.002 | 0.797 | 9.503 | 0.0000 | 0.0000 |
| MaskControl [29] | 0.061 | **0.809** | **9.496** | 0.0000 | 0.0098 |
| **MoTok-DDM-4** | **0.029** | 0.794 | 9.476 | **0.0000** | **0.0049** |

*   **Key Result:** MoTok-DDM-4 uses **1/6 of the tokens** of MaskControl, yet reduces FID from **0.061 to 0.029** and average error from **0.0098 m to 0.0049 m**.
*   **Unique Finding:** While prior methods' FID degrades when adding trajectory constraints, MoTok's FID **improves** (e.g., from 0.039 to 0.029), indicating trajectory acts as complementary guidance.
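
As a quick check on the size of these gains, the relative reductions can be computed directly from the figures quoted above. The helper name is hypothetical; the numbers are the ones reported in this section.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional drop from baseline to improved (both lower-is-better)."""
    return (baseline - improved) / baseline


fid_vs_maskcontrol = relative_reduction(0.061, 0.029)    # MaskControl -> MoTok-DDM-4
err_vs_maskcontrol = relative_reduction(0.0098, 0.0049)  # average error in metres
fid_with_trajectory = relative_reduction(0.039, 0.029)   # MoTok without vs. with trajectory

print(f"FID vs. MaskControl:        {fid_vs_maskcontrol:.1%}")
print(f"Avg. error vs. MaskControl: {err_vs_maskcontrol:.1%}")
print(f"FID after adding trajectory: {fid_with_trajectory:.1%}")
```

This works out to roughly a 52% lower FID and a 50% lower average error than MaskControl, and roughly a 26% lower FID for MoTok once the trajectory constraint is added.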

### Text-to-Motion Generation
MoTok achieves strong performance with aggressive token compression.

**Table 2: Text-to-motion results on HumanML3D (excerpt).**
| Method | R-Precision (Top-3) ↑ | FID ↓ | Diversity → |
| :--- | :--- | :--- | :--- |
| Real Motion | 0.797 ± .002 | 0.002 ± .000 | 9.503 ± .065 |
| MoMask [9] | 0.807 ± .002 | 0.045 ± .002 | - |
| **MoTok-DDM-2** | 0.799 ± .002 | **0.033 ± .002** | 9.523 ± 0.09 |
| **MoTok-DDM-4** | 0.793 ± .002 | 0.039 ± .002 | 9.411 ± .078 |

*   With **1/6 of the tokens**, MoTok-DDM-4 achieves a lower FID (0.039) than MoMask (0.045).
*   Under the same token budget, MoTok-AR-4 lowers FID by a factor of about 2.7 relative to T2M-GPT (0.053 vs. 0.141).

### Ablation Studies
Key findings from ablation studies (Table 3):
1.  **Decoder Design:** Diffusion-based decoders (especially `DiffusionConv`, which uses temporal convolutions) outperform plain convolutional decoders, which is crucial for generation under noise.
2.  **Temporal Downsampling Rate:** A moderate rate (2 or 4) balances reconstruction and planning best. Excessive compression removes essential structure.
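3.  **Low-level Conditioning Location (Table 4):** Applying constraints **only during planning** keeps FID low (0.028) but leaves a large control error (0.2170). Applying them **only during decoding** achieves a low control error (0.0056) but degrades FID to 0.365. The **dual-path** approach (both stages) is the only setting that attains both low FID (0.029) and the lowest control error (0.0049).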

**Table 4: Effect of low-level control injection location (DDM Planner).**
| Low-level Condition | Ctrl. FID ↓ | Ctrl. Err. ↓ |
| :--- | :--- | :--- |
| Generator Only | 0.028 | 0.2170 |
| Token Decoder Only | 0.365 | 0.0056 |
| **Generator + Token Decoder** | **0.029** | **0.0049** |

## Theoretical and Practical Implications
*   **Theoretical:** Proposes a principled decomposition of motion generation into semantic planning and kinematic control, formalized by the Perception–Planning–Control paradigm. The diffusion-based tokenizer challenges the conventional VQ-VAE design by decoupling abstraction from reconstruction.
*   **Practical:** Enables **highly efficient and controllable** motion generation. The compact tokens reduce computational burden for downstream planners. The framework's generator-agnostic design supports both AR and DDM backbones. The improvement in fidelity under stronger constraints is particularly valuable for applications requiring precise motion control, such as animation and robotics.

## Conclusion
This work bridges discrete token-based and continuous diffusion-based motion generation. The proposed **MoTok** tokenizer and the **Perception–Planning–Control** framework enable efficient, high-fidelity motion synthesis under combined semantic and kinematic conditions. By offloading fine-grained reconstruction to diffusion decoding, MoTok achieves state-of-the-art performance with a dramatically reduced token budget, demonstrating that compact tokens and high fidelity are not mutually exclusive. Future work may explore extending this paradigm to other sequential data domains.
