Summary (Overview)

  • New Task: Introduces goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object, and a language instruction, predict the future 3D trajectory of each point in metric world coordinates.
  • Large-Scale Dataset: Builds MolmoMotion-1M, the largest action-described, object-grounded 3D point trajectory corpus, automatically annotated from 1.16M unconstrained videos covering 736 action verbs and 5,692 objects.
  • Benchmark: Proposes PointMotionBench, a human-verified evaluation benchmark with 742 clips spanning 111 object categories and 61 motion types, using ground-truth 3D capture where available.
  • Model: Develops MolmoMotion, a general motion forecasting model with two variants, autoregressive (coordinate sequences as text) and flow-matching (continuous trajectory distribution), both built on the Molmo2 VLM backbone.
  • Transfer Results: MolmoMotion significantly outperforms all baselines on PointMotionBench; its learned motion prior transfers effectively to robot manipulation (76.3% success on MolmoSpaces vs. 56.0% baseline) and provides explicit motion control for video generation (improves over CogVideoX-5B and Wan2.2 on most metrics).

Introduction and Theoretical Foundation

Motion forecasting is central to visual intelligence: agents must anticipate object motion for planning, reasoning, and synthesis. Prior representations (pixels, parametric 3D models, 2D point tracks) suffer from category-specific templates, entanglement with camera motion, or difficulty in direct use by downstream systems. The paper argues that object-attached 3D points in world coordinates satisfy three key requirements:

  • Class-agnostic: No dependency on category-specific templates (human, hand, rigid, etc.).
  • View-stable: Same physical motion is represented consistently across cameras and viewpoints.
  • Physically grounded: Directly usable by downstream tasks (robotics, simulation, video generation).

The authors formalize the task as follows. Given a reference time $t_0$, the model receives $N$ user-specified 2D query points $\{q_{t_0}^n \in \mathbb{R}^2\}_{n=1}^N$ on an object and their corresponding initial 3D positions $\{p_{t_0}^n \in \mathbb{R}^3\}_{n=1}^N$ in the camera coordinate frame at $t_0$, a short history of RGB observations $I_{t_s:t_0} = \{I_{t_s}, \dots, I_{t_0}\}$, and a language description $a$ of the intended action. The goal is to predict future 3D positions $\{\{\hat{p}_t^n \in \mathbb{R}^3\}_{t=t_0+1}^{t_0+T}\}_{n=1}^N$ in a world coordinate frame anchored at the camera at time $t_0$.
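The task interface can be made concrete with a small container for the inputs and the shape of the expected output. This is a minimal sketch: the class name, field names, and the keep-still forecast are hypothetical and only mirror the symbols in the formalization above, not the authors' code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ForecastQuery:
    """Inputs of the forecasting task (all names here are hypothetical)."""

    frames: np.ndarray    # (num_frames, height, width, 3) RGB history I_{t_s:t_0}
    query_2d: np.ndarray  # (N, 2) pixel coordinates q_{t_0}^n
    query_3d: np.ndarray  # (N, 3) metric positions p_{t_0}^n, camera frame at t_0
    instruction: str      # language description a of the intended action

    def num_points(self) -> int:
        """Check array shapes and return N."""
        n = self.query_2d.shape[0]
        if self.frames.ndim != 4 or self.frames.shape[-1] != 3:
            raise ValueError("frames must be (num_frames, height, width, 3)")
        if self.query_2d.shape != (n, 2) or self.query_3d.shape != (n, 3):
            raise ValueError("query points must be (N, 2) and (N, 3)")
        return n


def prediction_shape(query: ForecastQuery, horizon: int) -> tuple:
    """A forecast holds T future steps for each of the N points in 3D."""
    return (horizon, query.num_points(), 3)


def keep_still_forecast(query: ForecastQuery, horizon: int) -> np.ndarray:
    """Trivial forecast: every point stays at its t_0 position."""
    shape = prediction_shape(query, horizon)
    return np.broadcast_to(query.query_3d, shape).copy()
```

Any forecaster for this task, learned or not, maps a `ForecastQuery` and a horizon $T$ to an array of shape $(T, N, 3)$ in meters.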

Methodology

Model Architecture (MolmoMotion)

All variants share a common input encoding using Molmo2 (4B) as the vision-language backbone. The vision encoder produces image tokens $T_{\text{img}}$ from the RGB history; the action description $a$ is tokenized into language tokens $T_{\text{text}}$; and for each 2D query point $q_{t_0}^n$, a point feature $e_{\text{pt}}^n$ is obtained by bilinearly sampling the anchor-frame feature map $F_{t_0}$. The concatenated tokens $C = [T_{\text{img}}, T_{\text{text}}, T_{\text{pt}}]$, where $T_{\text{pt}}$ denotes the point tokens, are processed by the language model.
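Bilinear sampling of a feature map at continuous query locations can be sketched in NumPy as below. The function name and the (H, W, C) layout are assumptions made for illustration; the query points are assumed to be already rescaled from image pixels to feature-map coordinates, and how Molmo2 performs that rescaling is not stated in the summary.

```python
import numpy as np


def bilinear_sample(feature_map: np.ndarray, points_xy: np.ndarray) -> np.ndarray:
    """Sample an (H, W, C) feature map at continuous (x, y) locations.

    points_xy has shape (N, 2) in feature-map units, x along width and
    y along height. Returns an (N, C) array of interpolated features.
    """
    h, w, _ = feature_map.shape
    # Clamp queries to the valid grid so that edge points stay defined.
    x = np.clip(points_xy[:, 0], 0.0, w - 1.0)
    y = np.clip(points_xy[:, 1], 0.0, h - 1.0)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[:, None]  # horizontal interpolation weight
    wy = (y - y0)[:, None]  # vertical interpolation weight
    top = (1.0 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bottom = (1.0 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1.0 - wy) * top + wy * bottom
```

Each returned row plays the role of one point feature $e_{\text{pt}}^n$; sampling at fractional locations avoids snapping a query point to the nearest feature cell.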

Coordinate Representation: All 3D coordinates are represented relative to the first query point at $t_0$, which serves as the anchor $p_{\text{anc}} = p_{t_0}^1$:

$$\delta_t^n = p_t^n - p_{\text{anc}}$$

Coordinates are expressed in meters.
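The anchor-relative representation amounts to a single subtraction and its inverse. The sketch below uses hypothetical function names and an assumed (T, N, 3) array layout; it also illustrates one property that plausibly motivates the choice, namely that translating the whole scene leaves the offsets unchanged.

```python
import numpy as np


def to_anchor_relative(points: np.ndarray, anchor: np.ndarray) -> np.ndarray:
    """Compute delta_t^n = p_t^n - p_anc.

    points: (T, N, 3) positions in meters; anchor: (3,) position p_{t_0}^1.
    """
    return points - anchor


def from_anchor_relative(deltas: np.ndarray, anchor: np.ndarray) -> np.ndarray:
    """Invert the mapping: p_t^n = delta_t^n + p_anc."""
    return deltas + anchor


# Toy trajectory: 3 time steps, 2 points; the first step stands in for t_0.
points = np.array([
    [[0.50, 0.10, 1.00], [0.55, 0.10, 1.00]],
    [[0.52, 0.10, 1.00], [0.57, 0.10, 1.00]],
    [[0.54, 0.10, 1.00], [0.59, 0.10, 1.00]],
])
anchor = points[0, 0]  # first query point at the reference time
deltas = to_anchor_relative(points, anchor)

# Translating the whole scene leaves the anchor-relative offsets unchanged.
shift = np.array([2.0, -1.0, 0.5])
shifted_deltas = to_anchor_relative(points + shift, anchor + shift)
```

Here `deltas[0, 0]` is the zero vector by construction, and `shifted_deltas` matches `deltas` up to floating-point error.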

Autoregressive (AR) variant: Discretizes anchor-relative coordinates into millimeter bins, $\bar{\delta}_t^n = \text{round}(1000 \cdot \delta_t^n)$, and serializes them as timestamped point-coordinate tuples (e.g., <track coord="t0 p1 x y z p2 x y z"></track>). The model generates future trajectory tokens $y_{1:L}$ in temporal order using a standard next-token objective. At inference, it decodes autoregressively, conditioning each future timestamp on all earlier coordinates.
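The millimeter discretization and a text layout modeled on the example tuple can be sketched as follows. The helper names are hypothetical, the exact token format of the model is not given beyond that one example, and `np.rint` (round half to even) is an assumed choice for the rounding rule.

```python
import numpy as np


def quantize_mm(delta_m) -> np.ndarray:
    """Map anchor-relative offsets in meters to integer millimeter bins."""
    return np.rint(1000.0 * np.asarray(delta_m, dtype=np.float64)).astype(np.int64)


def dequantize_mm(delta_mm) -> np.ndarray:
    """Map integer millimeter bins back to meters."""
    return np.asarray(delta_mm, dtype=np.float64) / 1000.0


def serialize_step(t_index: int, delta_mm) -> str:
    """Write one timestamp as 't<k> p1 x y z p2 x y z ...' (assumed layout)."""
    parts = [f"t{t_index}"]
    for n, (x, y, z) in enumerate(delta_mm, start=1):
        parts.append(f"p{n} {int(x)} {int(y)} {int(z)}")
    return " ".join(parts)


def serialize_track(deltas_m) -> str:
    """Serialize a (T, N, 3) array of offsets, one timestamp after another."""
    bins = quantize_mm(deltas_m)
    body = " ".join(serialize_step(t, step) for t, step in enumerate(bins))
    return f'<track coord="{body}"></track>'
```

Rounding to the nearest millimeter bounds the quantization error at 0.5 mm per axis, which is small next to the centimeter-scale displacements reported for the corpus.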

Flow-matching (FM) variant: Uses a DiT decoder conditioned on Molmo2 features from all layers. It concatenates clean initial 3D query coordinates with a noised version of future coordinates, projects them into point tokens, and applies RoPE along both point and time axes. Training uses standard flow-matching: sample Gaussian noise $\epsilon$ with the shape of the future trajectory $\mathbf{x}$, linearly interpolate $\mathbf{x}_\tau = (1-\tau)\epsilon + \tau \mathbf{x}$, and train the decoder to predict the velocity field $\mathbf{v}_\tau = \dot{\mathbf{x}}_\tau$. At inference, integrate from Gaussian noise to a clean trajectory with 10 Euler steps.
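For the linear path above, the velocity target is $\dot{\mathbf{x}}_\tau = \mathbf{x} - \epsilon$, which is constant in $\tau$. The NumPy sketch below shows the training target and a 10-step Euler sampler. It is an illustration under stated assumptions: `predict_velocity` is a hypothetical stand-in for the conditioned DiT decoder, the uniform sampling of $\tau$ is an assumed choice, and conditioning on Molmo2 features is omitted.

```python
import numpy as np


def interpolate(x: np.ndarray, eps: np.ndarray, tau: float) -> np.ndarray:
    """Linear path x_tau = (1 - tau) * eps + tau * x."""
    return (1.0 - tau) * eps + tau * x


def target_velocity(x: np.ndarray, eps: np.ndarray) -> np.ndarray:
    """Derivative of the linear path with respect to tau."""
    return x - eps


def flow_matching_loss(predict_velocity, x: np.ndarray, rng) -> float:
    """Mean squared error between predicted and target velocity."""
    eps = rng.standard_normal(x.shape)
    tau = rng.uniform()  # assumption: tau drawn uniformly from [0, 1)
    x_tau = interpolate(x, eps, tau)
    residual = predict_velocity(x_tau, tau) - target_velocity(x, eps)
    return float(np.mean(residual ** 2))


def euler_sample(predict_velocity, shape, rng, steps: int = 10) -> np.ndarray:
    """Integrate from Gaussian noise (tau = 0) to a trajectory (tau = 1)."""
    x_tau = rng.standard_normal(shape)
    dt = 1.0 / steps
    for k in range(steps):
        x_tau = x_tau + dt * predict_velocity(x_tau, k * dt)
    return x_tau
```

With an oracle velocity field that always points straight at a known trajectory, the Euler sampler recovers that trajectory, which is a convenient sanity check before plugging in a learned decoder.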

Data: MolmoMotion-1M

An automatic five-stage annotation pipeline is applied to 1.16M public videos (EgoDex, HD-EPIC, Xperience-10M, YT-VIS, Stereo4D):

  1. Semantic object grounding: LLM extracts moving entity from action description; MolmoPoint localizes it as a 2D point; SAM3 produces object mask; $N$ query points sampled via K-means inside mask.
  2. 2D point tracking and metric 3D lifting: AllTracker provides 2D tracks; ViPE estimates per-frame metric depth and camera geometry; back-projection yields metric 3D tracks $\{\tilde{p}_t^n\}$ in world frame anchored at first camera.
  3. Trajectory-level filtering and smoothing: Removes outlier tracks using MAD-based criterion; smooths depth values along camera rays.
  4. Video-level clipping: Computes per-frame motion score $s_t = \text{median}_n \| p_t^n - p_{t-1}^n \|_2$; extracts contiguous segments with non-trivial motion.
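The geometric parts of the steps above can be sketched in NumPy: pinhole back-projection, a MAD-based outlier test, and the motion score with segment extraction. All function names, the per-track statistic used for the MAD test, the factor `k`, and the motion threshold are assumptions made for illustration; the summary does not specify the authors' exact criteria, and the real pipeline relies on AllTracker and ViPE outputs rather than the toy inputs implied here.

```python
import numpy as np


def backproject(uv, depth, K, cam_to_world=None):
    """Lift pixels (N, 2) with metric depth (N,) to 3D points (N, 3)."""
    ones = np.ones((uv.shape[0], 1))
    rays = (np.linalg.inv(K) @ np.hstack([uv, ones]).T).T  # z-component is 1
    pts = rays * depth[:, None]
    if cam_to_world is not None:  # optional 4x4 camera-to-world pose
        pts = pts @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]
    return pts


def mad_inlier_mask(tracks, k=3.0):
    """Flag tracks (T, N, 3) whose largest step is not a MAD outlier."""
    steps = np.linalg.norm(np.diff(tracks, axis=0), axis=-1)  # (T-1, N)
    stat = steps.max(axis=0)  # assumed per-track statistic
    med = np.median(stat)
    mad = np.median(np.abs(stat - med))
    # 1.4826 scales the MAD to a standard deviation under a Gaussian model.
    return np.abs(stat - med) <= k * 1.4826 * mad + 1e-9


def motion_score(tracks):
    """s_t = median over points of ||p_t^n - p_{t-1}^n||_2, for t >= 1."""
    steps = np.linalg.norm(np.diff(tracks, axis=0), axis=-1)
    return np.median(steps, axis=1)


def moving_segments(scores, threshold, min_len=2):
    """Return [start, end) index ranges where the score exceeds threshold."""
    segments, start = [], None
    for i, moving in enumerate(list(scores > threshold) + [False]):
        if moving and start is None:
            start = i
        elif not moving and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments
```

Taking the median over points keeps the motion score insensitive to a few badly tracked points, so a clip is kept only when most of the object actually moves.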

Corpus statistics: 736 unique action verbs, 5,692 unique objects; median clip length 0.8–1.1 s (manipulation) to 1.7 s (Stereo4D); median 3D displacement 7–9 cm to 51 cm; median 88 query points per clip.

Benchmark: PointMotionBench

Held-out benchmark with 742 clips from HOT3D (ground-truth 3D mesh), WorldTrack (ground-truth 3D points), and DAVIS (human-verified pipeline annotations). Covers 111 object categories and 61 motion types. All annotations are human-verified.

Empirical Validation / Results

3D Point Motion Forecasting (Table 1)

Evaluation on PointMotionBench uses three metrics: ADE (mean displacement error, in meters), FDE (final displacement error, in meters), and PWT (average fraction of points within distance thresholds ranging from 0.01 m to 0.20 m). Results are reported best-of-5. MolmoMotion-AR with 3 input frames achieves the best overall performance, ahead of the non-parametric baselines, pixel-space video prediction (Wan2.2, Cosmos Predict), parametric 3D models (ObjectForesight, EgoScaler, Robot4DGen), and 2D track methods (Track2Act). Table 1 shows two exceptions: on HOT3D, EgoScaler and ObjectForesight reach lower FDE, and on DAVIS the 1-frame AR variant scores best.
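The three metrics and a best-of-K protocol can be sketched as below. Several details are assumptions rather than the benchmark's definition: the intermediate PWT thresholds (only the 0.01 m to 0.20 m range is stated), pooling PWT over all time steps, and selecting the best sample by ADE.

```python
import numpy as np


def point_errors(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Per-step, per-point Euclidean error in meters, shape (T, N)."""
    return np.linalg.norm(pred - gt, axis=-1)


def ade(pred: np.ndarray, gt: np.ndarray) -> float:
    """Average displacement error over all steps and points."""
    return float(point_errors(pred, gt).mean())


def fde(pred: np.ndarray, gt: np.ndarray) -> float:
    """Final displacement error: mean error at the last time step."""
    return float(point_errors(pred, gt)[-1].mean())


def pwt(pred, gt, thresholds=(0.01, 0.02, 0.05, 0.10, 0.20)) -> float:
    """Fraction of points within each threshold, averaged over thresholds."""
    err = point_errors(pred, gt)
    return float(np.mean([(err < th).mean() for th in thresholds]))


def best_of_k(samples, gt):
    """Pick the sampled trajectory with the lowest ADE (assumed criterion)."""
    return min(samples, key=lambda s: ade(s, gt))
```

For example, a prediction that is off by 3 cm everywhere has ADE and FDE of 0.03 m and, with the assumed thresholds, passes three of the five threshold checks.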

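| Paradigm | Model | Inputs (frames) | Text | HOT3D ADE ↓ | HOT3D FDE ↓ | HOT3D PWT ↑ | WorldTrack ADE ↓ | WorldTrack FDE ↓ | WorldTrack PWT ↑ | DAVIS ADE ↓ | DAVIS FDE ↓ | DAVIS PWT ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Non-parametric | Static | 1 | | 0.180 | 0.316 | 0.293 | 0.167 | 0.317 | 0.390 | 2.281 | 4.360 | 0.085 |
| Non-parametric | Extrapolate | 3 | | 0.159 | 0.309 | 0.351 | 0.184 | 0.432 | 0.436 | 2.683 | 5.741 | 0.104 |
| Pixel-space | Wan2.2-5B | 1 | | 0.200 | 0.308 | 0.253 | 0.852 | 1.046 | 0.090 | 3.074 | 5.192 | 0.051 |
| Pixel-space | Cosmos Predict | 5 | | 0.225 | 0.294 | 0.199 | 0.831 | 0.988 | 0.072 | 4.191 | 6.368 | 0.033 |
| 3D model | ObjectForesight | 3 | | 0.129 | 0.192 | 0.353 | - | - | - | - | - | - |
| 3D model | EgoScaler | 1 | | 0.170 | 0.179 | 0.218 | - | - | - | - | - | - |
| 3D model | Robot4DGen | 3 | | 0.212 | 0.271 | 0.112 | 0.548 | 0.704 | 0.121 | 2.120 | 3.382 | 0.081 |
| 2D track | Track2Act | 1 | | 0.294 | 0.413 | 0.202 | 1.230 | 1.567 | 0.053 | 4.853 | 8.110 | 0.018 |
| 3D track (Ours) | MolmoMotion-FM | 1 | | 0.183 | 0.311 | 0.286 | 0.165 | 0.305 | 0.401 | 1.380 | 2.205 | 0.165 |
| 3D track (Ours) | MolmoMotion-FM | 3 | | 0.135 | 0.255 | 0.382 | 0.158 | 0.295 | 0.438 | 1.480 | 2.520 | 0.130 |
| 3D track (Ours) | MolmoMotion-AR | 1 | | 0.157 | 0.290 | 0.303 | 0.148 | 0.269 | 0.424 | 1.146 | 1.843 | 0.199 |
| 3D track (Ours) | MolmoMotion-AR | 3 | | 0.109 | 0.217 | 0.444 | 0.143 | 0.261 | 0.445 | 1.227 | 2.108 | 0.153 |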

Table 1: 3D point trajectory prediction on PointMotionBench. MolmoMotion-AR with 3 input frames achieves the best overall results.

Transfer to Robotics Planning

Two settings are evaluated. (1) MolmoSpaces Franka Pick-and-Place: a MolmoBot policy (flow-matching action head) is trained on 20K episodes with either Molmo2 or MolmoMotion-AR initialization. MolmoMotion initialization yields 76.3% final success vs. 56.0% for Molmo2, and already reaches 51% at 10K steps (vs. 19%). (2) DROID: finetuning on single-camera real-robot videos, where MolmoMotion initialization achieves substantially lower trajectory L2 error and faster convergence than Molmo2 initialization.

Transfer to Video Generation (Table 2)

Trajectories predicted by MolmoMotion are used to condition DaS, a 3D point-trajectory-guided image-to-video (I2V) model built on CogVideoX-5B. Compared to caption-conditioned baselines (CogVideoX-5B, Wan2.2-I2V-A14B), DaS+MolmoMotion improves temporal consistency, subject consistency, motion smoothness, and background consistency, and shows more physically plausible motion in qualitative examples. Wan2.2-I2V-A14B retains the highest dynamic degree (0.908 vs. 0.876).

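| Method | Temporal consistency | Subject consistency | Motion smoothness | Dynamic degree | Background consistency |
|---|---|---|---|---|---|
| CogVideoX-5B | 0.964 | 0.939 | 0.988 | 0.861 | 0.941 |
| Wan-14B | 0.965 | 0.940 | 0.983 | 0.908 | 0.947 |
| DaS + MolmoMotion | 0.968 | 0.950 | 0.990 | 0.876 | 0.948 |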

Table 2: Video generation quality metrics (VBench) on PointMotionBench videos.

Theoretical and Practical Implications

  • Category-agnostic motion representation: 3D world-coordinate point trajectories provide a general, physically grounded intermediate representation that decouples motion from embodiment, camera viewpoint, and object category. This enables learning from diverse human videos and transferring to robot domains.
  • Training efficiency for robotics: Initializing robot policies from a model pretrained on motion in large-scale human video speeds up policy learning on MolmoSpaces (51% success at 10K training steps vs. 19% with Molmo2 initialization, and 76.3% vs. 56.0% final success).
  • Controllable video generation: The predicted trajectories serve as an explicit motion control signal, improving physical plausibility and action fidelity in generated videos, even when compared to much larger video generation models.
  • Scalable data pipeline: The automatic annotation pipeline demonstrates that 3D point trajectories can be extracted at scale from unconstrained videos, opening the door to leveraging massive internet video corpora for 3D motion understanding.

Conclusion

MolmoMotion introduces a full stack for language-conditioned 3D point motion forecasting: a large-scale dataset (MolmoMotion-1M), a human-verified benchmark (PointMotionBench), and a general model (MolmoMotion) with both autoregressive and flow-matching variants. The model significantly outperforms existing motion prediction methods and transfers effectively to robot manipulation and video generation. Limitations include sparse point predictions (only 8 points per object due to context length) and the need for more downstream evaluations (e.g., closed-loop real-robot experiments). Future directions include denser point representations, longer-range motion forecasting, and broader downstream applications.
