# MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

> MolmoMotion predicts language-conditioned 3D point trajectories from videos, significantly outperforming all baselines and transferring effectively to robotics and video generation.

- **Source:** [arXiv](https://arxiv.org/abs/2606.18558)
- **Published:** 2026-06-19
- **Permalink:** https://picx.dev/p/uRJkOA
- **Whiteboard:** https://picx.dev/p/uRJkOA/image

## Summary

- **New Task**: Introduces goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object, and a language instruction, predict the future 3D trajectory of each point in metric world coordinates.
- **Large-Scale Dataset**: Builds **MolmoMotion-1M**, the largest action-described, object-grounded 3D point trajectory corpus, automatically annotated from 1.16M unconstrained videos covering 736 action verbs and 5,692 objects.
- **Benchmark**: Proposes **PointMotionBench**, a human-verified evaluation benchmark with 742 clips spanning 111 object categories and 61 motion types, using ground-truth 3D capture where available.
- **Model**: Develops **MolmoMotion**, a general motion forecasting model with two variants, autoregressive (coordinate sequences as text) and flow-matching (continuous trajectory distribution), both built on the Molmo2 VLM backbone.
- **Transfer Results**: MolmoMotion significantly outperforms all baselines on PointMotionBench; its learned motion prior transfers effectively to robot manipulation (76.3% success on MolmoSpaces vs. 56.0% baseline) and provides explicit motion control for video generation (improves over CogVideoX-5B and Wan2.2 on most metrics).

## Introduction and Theoretical Foundation

Motion forecasting is central to visual intelligence: agents must anticipate object motion for planning, reasoning, and synthesis. Prior representations (pixels, parametric 3D models, 2D point tracks) suffer from category-specific templates, entanglement with camera motion, or difficulty in direct use by downstream systems. The paper argues that **object-attached 3D points in world coordinates** satisfy three key requirements:
- **Class-agnostic**: No dependency on category-specific templates (human, hand, rigid, etc.).
- **View-stable**: Same physical motion is represented consistently across cameras and viewpoints.
- **Physically grounded**: Directly usable by downstream tasks (robotics, simulation, video generation).

The authors formalize the task as follows. Given a reference time $t_0$, the model receives $N$ user-specified 2D query points $\{q_{t_0}^n \in \mathbb{R}^2\}_{n=1}^N$ on an object, their corresponding initial 3D positions $\{p_{t_0}^n \in \mathbb{R}^3\}_{n=1}^N$ in the camera coordinate frame at $t_0$, a short history of RGB observations $I_{t_s:t_0} = \{I_{t_s}, \dots, I_{t_0}\}$, and a language description $a$ of the intended action. The goal is to predict the future 3D positions $\{\{\hat{p}_t^n \in \mathbb{R}^3\}_{t=t_0+1}^{t_0+T}\}_{n=1}^N$ in a world coordinate frame anchored at the camera at time $t_0$.
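The link between a 2D query point $q_{t_0}^n$ and its camera-frame 3D position $p_{t_0}^n$ can be illustrated with standard pinhole back-projection, which is also the operation the data pipeline uses to lift 2D tracks to 3D. This is a minimal sketch, assuming an ideal pinhole camera with no lens distortion; the intrinsics `fx`, `fy`, `cx`, `cy` and the function names are illustrative and do not come from the paper.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth along the optical axis (meters)
    to a 3D point in the camera frame. Assumes an ideal pinhole camera."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def project(point, fx, fy, cx, cy):
    """Inverse operation: project a camera-frame 3D point back to pixels."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics for a 640x480 image.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0
p_cam = backproject(420.0, 240.0, 2.0, FX, FY, CX, CY)
q_pix = project(p_cam, FX, FY, CX, CY)
```

Because the world frame is anchored at the camera at $t_0$, a point back-projected at $t_0$ is already expressed in that world frame.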

## Methodology

### Model Architecture (MolmoMotion)

All variants share a common input encoding using **Molmo2** (4B) as the vision-language backbone. The vision encoder produces image tokens $T_{\text{img}}$ from the RGB history; the action description $a$ is tokenized into language tokens $T_{\text{text}}$; and for each 2D query point $q_{t_0}^n$, a point feature $e_{\text{pt}}^n$ is obtained by bilinearly sampling the anchor-frame feature map $F_{t_0}$, with these features forming the point tokens $T_{\text{pt}}$. The concatenated tokens $C = [T_{\text{img}}, T_{\text{text}}, T_{\text{pt}}]$ are processed by the language model.
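Bilinear sampling of a feature map at a continuous query location can be sketched as follows. This is an illustrative pure-Python version, assuming the query point has already been rescaled from pixel coordinates to feature-map coordinates; the function name and the nested-list layout are assumptions, not the paper's implementation.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a C-dimensional feature at continuous (x, y).

    feature_map[row][col] is a list of C floats; x indexes columns and
    y indexes rows, both in feature-map units. Out-of-range coordinates
    are clamped to the border.
    """
    h, w = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0  # fractional offsets inside the cell
    out = []
    for c in range(len(feature_map[0][0])):
        top = (1 - wx) * feature_map[y0][x0][c] + wx * feature_map[y0][x1][c]
        bottom = (1 - wx) * feature_map[y1][x0][c] + wx * feature_map[y1][x1][c]
        out.append((1 - wy) * top + wy * bottom)
    return out
```

Sampling at sub-cell precision lets each query point receive a distinct feature even when several points fall inside the same feature-map cell.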

**Coordinate Representation**: All 3D coordinates are represented relative to the first query point at $t_0$ (anchor $p_{\text{anc}} = p_{t_0}^1$):
$$\delta_t^n = p_t^n - p_{\text{anc}}$$
Coordinates are expressed in meters.

**Autoregressive (AR) variant**: Discretizes anchor-relative coordinates into millimeter bins: $\bar{\delta}_t^n = \text{round}(1000 \cdot \delta_t^n)$ and serializes them as timestamped point-coordinate tuples (e.g., `<track coord="t0 p1 x y z p2 x y z"></track>`). The model generates future trajectory tokens $y_{1:L}$ in temporal order using a standard next-token objective. At inference, it decodes autoregressively, conditioning each future timestamp on all earlier coordinates.
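The anchor-relative encoding and millimeter discretization can be sketched in a few lines. The serializer below only mimics the shape of the example string above; the exact token layout used by the model is not specified here, so `serialize_timestep` and its output format should be read as assumptions.

```python
def to_anchor_relative(points, anchor):
    """delta^n = p^n - p_anc for every point, in meters."""
    return [tuple(p[i] - anchor[i] for i in range(3)) for p in points]


def quantize_mm(delta):
    """Round a metric offset to integer millimeter bins."""
    return tuple(int(round(1000.0 * v)) for v in delta)


def dequantize_mm(bins):
    """Map millimeter bins back to meters (error of at most 0.5 mm per axis)."""
    return tuple(b / 1000.0 for b in bins)


def serialize_timestep(t_index, points, anchor):
    """Hypothetical text form of one timestep: 't0 p1 x y z p2 x y z ...'."""
    parts = [f"t{t_index}"]
    for n, delta in enumerate(to_anchor_relative(points, anchor), start=1):
        x, y, z = quantize_mm(delta)
        parts.append(f"p{n} {x} {y} {z}")
    return '<track coord="' + " ".join(parts) + '"></track>'


anchor = (0.1, 0.2, 0.5)  # first query point at t0
points = [(0.1, 0.2, 0.5), (0.125, 0.18, 0.55)]
text = serialize_timestep(0, points, anchor)
```

Expressing coordinates relative to the anchor keeps the integers small, so the first query point at $t_0$ always serializes to `0 0 0`.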

**Flow-matching (FM) variant**: Uses a DiT decoder conditioned on Molmo2 features from all layers. It concatenates clean initial 3D query coordinates with a noised version of future coordinates, projects them into point tokens, and applies RoPE along both point and time axes. Training uses standard flow matching: sample Gaussian noise $\epsilon$ with the same shape as the future trajectory $\mathbf{x}$, linearly interpolate $\mathbf{x}_\tau = (1-\tau)\epsilon + \tau \mathbf{x}$, and train the decoder to predict the velocity field $\mathbf{v}_\tau = \dot{\mathbf{x}}_\tau = \mathbf{x} - \epsilon$. At inference, the model integrates from Gaussian noise ($\tau = 0$) to a clean trajectory ($\tau = 1$) with 10 Euler steps.
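The flow-matching objective and the Euler sampler can be sketched without any model. The code below treats a trajectory as a flat list of floats and takes the velocity predictor as a plain callable. The function names are illustrative, and `velocity_fn` stands in for the DiT decoder, whose conditioning is omitted.

```python
import random


def interpolate(noise, clean, tau):
    """x_tau = (1 - tau) * eps + tau * x."""
    return [(1.0 - tau) * e + tau * x for e, x in zip(noise, clean)]


def target_velocity(noise, clean):
    """Time derivative of the linear path: d x_tau / d tau = x - eps."""
    return [x - e for e, x in zip(noise, clean)]


def fm_loss(pred_velocity, noise, clean):
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(noise, clean)
    return sum((p - t) ** 2 for p, t in zip(pred_velocity, target)) / len(target)


def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dtau = v(x, tau) from tau = 0 to tau = 1 with Euler steps."""
    x_tau = list(noise)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x_tau, k * dt)
        x_tau = [xi + dt * vi for xi, vi in zip(x_tau, v)]
    return x_tau


# Sanity check with an oracle that returns the true velocity for one sample.
random.seed(0)
clean = [0.1, -0.2, 0.3]
noise = [random.gauss(0.0, 1.0) for _ in clean]
recovered = euler_sample(lambda x_tau, tau: target_velocity(noise, clean), noise)
```

With the oracle velocity the path is a straight line, so Euler integration recovers the clean trajectory up to floating-point error; a learned decoder only approximates this field.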

### Data: MolmoMotion-1M

An automatic five-stage annotation pipeline is applied to 1.16M public videos (EgoDex, HD-EPIC, Xperience-10M, YT-VIS, Stereo4D):
1. **Semantic object grounding**: LLM extracts moving entity from action description; MolmoPoint localizes it as a 2D point; SAM3 produces object mask; $N$ query points sampled via K-means inside mask.
2. **2D point tracking and metric 3D lifting**: AllTracker provides 2D tracks; ViPE estimates per-frame metric depth and camera geometry; back-projection yields metric 3D tracks $\{\tilde{p}_t^n\}$ in world frame anchored at first camera.
3. **Trajectory-level filtering and smoothing**: Removes outlier tracks using MAD-based criterion; smooths depth values along camera rays.
4. **Video-level clipping**: Computes per-frame motion score $s_t = \text{median}_n \| p_t^n - p_{t-1}^n \|_2$; extracts contiguous segments with non-trivial motion.
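The filtering and clipping stages can be illustrated with a short sketch. The motion score follows the formula above, but the summary does not give the statistic used for the MAD criterion, the cutoff `k`, the motion threshold, or the minimum segment length, so those are hypothetical choices made only for illustration.

```python
import math
from statistics import median


def mad_inliers(values, k=3.0):
    """Indices of values within k scaled MADs of the median.

    The 1.4826 factor makes the MAD comparable to a standard deviation
    under a Gaussian assumption. If the MAD is zero, only values equal
    to the median are kept.
    """
    med = median(values)
    scale = 1.4826 * median(abs(v - med) for v in values)
    return [i for i, v in enumerate(values) if abs(v - med) <= k * scale]


def motion_scores(tracks):
    """s_t = median over points of ||p_t^n - p_{t-1}^n||, for t = 1..T-1.

    tracks[t][n] is the (x, y, z) position of point n at frame t.
    """
    return [
        median(math.dist(p, q) for p, q in zip(tracks[t], tracks[t - 1]))
        for t in range(1, len(tracks))
    ]


def moving_segments(scores, threshold, min_steps=2):
    """Contiguous runs with score > threshold, as inclusive frame ranges.

    scores[i] measures motion between frame i and frame i + 1, so a run
    over scores[start .. end - 1] covers frames start .. end.
    """
    segments, start = [], None
    for i, s in enumerate(list(scores) + [float("-inf")]):  # sentinel closes a trailing run
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            if i - start >= min_steps:
                segments.append((start, i))
            start = None
    return segments
```

Using the median over points rather than the mean keeps a few noisy tracks from dominating the per-frame motion score.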

Corpus statistics: 736 unique action verbs and 5,692 unique objects. Across sources, median clip length ranges from 0.8–1.1 s (manipulation) to 1.7 s (Stereo4D), and median 3D displacement ranges from 7–9 cm to 51 cm. The median clip has 88 query points.

### Benchmark: PointMotionBench

PointMotionBench is a held-out benchmark of 742 clips drawn from HOT3D (ground-truth 3D mesh), WorldTrack (ground-truth 3D points), and DAVIS (human-verified pipeline annotations). It covers 111 object categories and 61 motion types, and all annotations are human-verified.

## Empirical Validation / Results

### 3D Point Motion Forecasting (Table 1)

Evaluation on PointMotionBench uses three metrics: **ADE** (mean displacement error over the predicted trajectory, in meters), **FDE** (displacement error at the final timestep, in meters), and **PWT** (average fraction of points within distance thresholds from 0.01 m to 0.20 m). All results use best-of-5 evaluation. MolmoMotion-AR with 3 input frames achieves the best overall performance against non-parametric baselines (Static, Extrapolate), pixel-space video prediction (Wan2.2, Cosmos Predict), parametric 3D models (ObjectForesight, EgoScaler, Robot4DGen), and 2D track methods (Track2Act). Two exceptions are visible in the table: EgoScaler and ObjectForesight reach lower HOT3D FDE, and on DAVIS the 1-frame MolmoMotion-AR variant scores best.

| Paradigm | Model | Inputs | Text | HOT3D ADE ↓ | HOT3D FDE ↓ | HOT3D PWT ↑ | WorldTrack ADE ↓ | WorldTrack FDE ↓ | WorldTrack PWT ↑ | DAVIS ADE ↓ | DAVIS FDE ↓ | DAVIS PWT ↑ |
|----------|-------|--------|------|--------------|--------------|--------------|-------------------|-------------------|---------------------|---------------|---------------|----------------|
| Non-parametric | Static | 1 | ✗ | 0.180 | 0.316 | 0.293 | 0.167 | 0.317 | 0.390 | 2.281 | 4.360 | 0.085 |
| | Extrapolate | 3 | ✗ | 0.159 | 0.309 | 0.351 | 0.184 | 0.432 | 0.436 | 2.683 | 5.741 | 0.104 |
| Pixel-space | Wan2.2-5B | 1 | ✓ | 0.200 | 0.308 | 0.253 | 0.852 | 1.046 | 0.090 | 3.074 | 5.192 | 0.051 |
| Pixel-space | Cosmos Predict | 5 | ✓ | 0.225 | 0.294 | 0.199 | 0.831 | 0.988 | 0.072 | 4.191 | 6.368 | 0.033 |
| 3D model | ObjectForesight | 3 | ✗ | 0.129 | 0.192 | 0.353 | – | – | – | – | – | – |
| 3D model | EgoScaler | 1 | ✓ | 0.170 | **0.179** | 0.218 | – | – | – | – | – | – |
| 3D model | Robot4DGen | 3 | ✗ | 0.212 | 0.271 | 0.112 | 0.548 | 0.704 | 0.121 | 2.120 | 3.382 | 0.081 |
| 2D track | Track2Act | 1 | ✗ | 0.294 | 0.413 | 0.202 | 1.230 | 1.567 | 0.053 | 4.853 | 8.110 | 0.018 |
| **3D track (Ours)** | **MolmoMotion-FM** | 1 | ✓ | 0.183 | 0.311 | 0.286 | 0.165 | 0.305 | 0.401 | 1.380 | 2.205 | 0.165 |
| **3D track (Ours)** | **MolmoMotion-FM** | 3 | ✓ | 0.135 | 0.255 | 0.382 | 0.158 | 0.295 | 0.438 | 1.480 | 2.520 | 0.130 |
| **3D track (Ours)** | **MolmoMotion-AR** | 1 | ✓ | 0.157 | 0.290 | 0.303 | 0.148 | 0.269 | 0.424 | **1.146** | **1.843** | **0.199** |
| **3D track (Ours)** | **MolmoMotion-AR** | 3 | ✓ | **0.109** | 0.217 | **0.444** | **0.143** | **0.261** | **0.445** | 1.227 | 2.108 | 0.153 |

*Table 1: 3D point trajectory prediction on PointMotionBench. MolmoMotion-AR with 3 input frames achieves the best overall results.*
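The three metrics in Table 1 and the best-of-5 protocol can be sketched as follows. The summary does not list the exact PWT thresholds, whether PWT is computed over all timesteps, or how the best sample is selected, so the threshold set, the all-timestep averaging, and the min-ADE selection rule below are assumptions made for illustration.

```python
import math

# Hypothetical threshold set spanning the stated 0.01-0.20 m range.
PWT_THRESHOLDS = (0.01, 0.02, 0.05, 0.10, 0.20)


def _errors(pred, gt):
    """Per-timestep lists of Euclidean errors; pred[t][n] and gt[t][n] are (x, y, z)."""
    return [[math.dist(p, g) for p, g in zip(pt, gtt)] for pt, gtt in zip(pred, gt)]


def ade(pred, gt):
    """Mean displacement error over all timesteps and points (meters)."""
    flat = [e for row in _errors(pred, gt) for e in row]
    return sum(flat) / len(flat)


def fde(pred, gt):
    """Mean displacement error over points at the final timestep (meters)."""
    last = _errors(pred, gt)[-1]
    return sum(last) / len(last)


def pwt(pred, gt, thresholds=PWT_THRESHOLDS):
    """Fraction of predicted points within each threshold, averaged over thresholds."""
    flat = [e for row in _errors(pred, gt) for e in row]
    fractions = [sum(e <= th for e in flat) / len(flat) for th in thresholds]
    return sum(fractions) / len(fractions)


def best_of_k(samples, gt):
    """Score every sampled trajectory and keep the one with the lowest ADE."""
    best = min(samples, key=lambda s: ade(s, gt))
    return {"ade": ade(best, gt), "fde": fde(best, gt), "pwt": pwt(best, gt)}
```

Best-of-k scoring rewards a model for covering the true future with at least one sample, which suits a task where several motions can be consistent with the same instruction.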

### Transfer to Robotics Planning

Two settings: (1) **MolmoSpaces Franka Pick-and-Place** – training a MolmoBot policy (flow-matching action head) on 20K episodes with either Molmo2 or MolmoMotion-AR initialization. MolmoMotion initialization yields 76.3% final success vs. 56.0% for Molmo2, and reaches 51% at 10K steps (vs. 19%). (2) **DROID** – finetuning MolmoMotion on single-camera real robot videos; MolmoMotion initialization achieves substantially lower trajectory L2 error and faster convergence compared to Molmo2 initialization.

### Transfer to Video Generation (Table 2)

Predicted trajectories from MolmoMotion are used to condition DaS, a 3D point-trajectory-guided image-to-video model built on CogVideoX-5B. Compared to the caption-conditioned baselines (CogVideoX-5B, Wan2.2-I2V-A14B), DaS+MolmoMotion improves temporal consistency, subject consistency, motion smoothness, and background consistency, and shows more physically plausible motion in qualitative examples. Dynamic degree is the one metric where it does not lead: it scores 0.876, above CogVideoX-5B (0.861) but below Wan2.2-I2V-A14B (0.908).

| Method | Temp-Cons | Subj-Cons | M-Smooth | Dyn-Deg | Bg-Cons |
|--------|-----------|-----------|----------|---------|---------|
| CogVideoX-5B | 0.964 | 0.939 | 0.988 | 0.861 | 0.941 |
| Wan2.2-I2V-A14B | 0.965 | 0.940 | 0.983 | **0.908** | 0.947 |
| DaS + MolmoMotion | **0.968** | **0.950** | **0.990** | 0.876 | **0.948** |

*Table 2: Video generation quality metrics (VBench) on PointMotionBench videos.*

## Theoretical and Practical Implications

- **Category-agnostic motion representation**: 3D world-coordinate point trajectories provide a general, physically grounded intermediate representation that decouples motion from embodiment, camera viewpoint, and object category. This enables learning from diverse human videos and transferring to robot domains.
- **Training efficiency for robotics**: Initializing robot policies from a model pretrained on motion in internet-scale human video speeds up policy learning: on MolmoSpaces, the MolmoMotion-initialized policy reaches 51% success at 10K training steps, versus 19% with Molmo2 initialization.
- **Controllable video generation**: The predicted trajectories serve as an explicit motion control signal, improving physical plausibility and action fidelity in generated videos, even when compared to much larger video generation models.
- **Scalable data pipeline**: The automatic annotation pipeline demonstrates that 3D point trajectories can be extracted at scale from unconstrained videos, opening the door to leveraging massive internet video corpora for 3D motion understanding.

## Conclusion

MolmoMotion introduces a full stack for language-conditioned 3D point motion forecasting: a large-scale dataset (MolmoMotion-1M), a human-verified benchmark (PointMotionBench), and a general model (MolmoMotion) with both autoregressive and flow-matching variants. The model significantly outperforms existing motion prediction methods and transfers effectively to robot manipulation and video generation. Limitations include sparse point predictions (only 8 points per object due to context length) and the need for more downstream evaluations (e.g., closed-loop real-robot experiments). Future directions include denser point representations, longer-range motion forecasting, and broader downstream applications.

