Summary (Overview)

  • PhysisForcing is a training-time framework that improves the physical plausibility of robotic manipulation videos generated by diffusion-based video models, without incurring extra inference cost.
  • It introduces hierarchical physics alignment on interaction-critical regions only: (1) pixel-level trajectory alignment using point tracking, and (2) semantic-level relational alignment using a frozen video understanding encoder.
  • Evaluated on R-Bench, PAI-Bench, and EZS-Bench, PhysisForcing consistently outperforms base models, vanilla fine-tuning, and strong commercial/robotics-specific baselines (e.g., Wan2.6, Veo 3.1, Abot-PhysWorld).
  • On the WorldArena action-planner protocol, it raises the closed-loop success rate from 16.0% to 24.0%, and it improves downstream policy success on six RoboTwin 2.0 tasks by 4.6 percentage points on average.
  • The framework is backbone-agnostic—tested on Wan2.2-I2V-A14B, Wan2.2-TI2V-5B, and Cosmos3-Nano—and the best variant (PF-Cosmos) achieves the highest overall scores on all three benchmarks.

Introduction and Theoretical Foundation

Video generation models have emerged as promising world simulators for embodied intelligence, providing scalable visual futures for data generation, world simulation, and policy learning. However, embodied world simulation demands more than photorealistic videos; the generated dynamics must be physically plausible, especially in contact-rich manipulation. Current models often produce physically implausible outputs such as discontinuous gripper trajectories, object penetration, or anti-gravity motion.

Existing approaches provide only partial solutions:

  • General video models lack sufficient exposure to embodied contact dynamics.
  • Robot-oriented models are trained with reconstruction objectives that treat physically critical regions and background pixels uniformly.
  • Physics-aware methods (geometry-based, preference-based, simulator-based) capture only local motion consistency or operate post-hoc, lacking a unified mechanism for aligning both local dynamics and global interaction outcomes.

Key insight of this work: Physical plausibility in manipulation videos is naturally hierarchical:

  • At the pixel level, local motion should satisfy trajectory continuity, depth consistency, and contact-compatible displacement.
  • At the semantic level, object relations should evolve according to interaction semantics (e.g., a pushed object should move away, a grasped object should remain coupled with the gripper).

Moreover, physical evidence is highly localized around manipulators, objects, contacts, and moving regions. Applying supervision uniformly over all pixels dilutes these signals. Based on this, the authors propose PhysisForcing—a region-focused hierarchical physics alignment framework.

Methodology

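PhysisForcing injects physics supervision into video generation through three components (region extraction, pixel-level alignment, and semantic-level alignment), which are combined in a single training objective: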

1. Physics-Informative Region Extraction

Given an input video $V \in \mathbb{R}^{T \times C \times H \times W}$, an off-the-shelf point tracker (CoTracker3) obtains dense temporal trajectories $P = \{p_{i}^{1:T}\}_{i=1}^{N}$, where $N = H \times W$ and $p_i^t \in \mathbb{R}^2$ is the 2D location of point $i$ at frame $t$.

Motion magnitude for point $i$ is computed as:

$$a_i = \sum_{t=1}^{T-1} \|p_i^{t+1} - p_i^t\|_2$$

To focus on foreground areas, a depth-aware foreground weight is computed from the first-frame depth map $D^0$:

$$r_i = \frac{1}{D^0(p_i^0) + \epsilon}, \quad q_i = a_i \cdot r_i$$

The physics-informative score $q_i$ is thresholded adaptively (by its mean) to create a binary mask $M^{\text{phy}}$, which localizes regions where robot-object interactions are likely to occur.
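The region-extraction step can be sketched in a few lines of NumPy, starting from tracks and a depth map that have already been computed. This is only an illustration under stated assumptions: the function name `physics_informative_mask`, the (T, N, 2) track layout with (x, y) ordering, nearest-pixel depth sampling, and the default `eps` are choices made for this example and are not taken from the source. The point tracker and depth estimator themselves are not called here.

```python
import numpy as np

def physics_informative_mask(tracks, depth0, eps=1e-6):
    """Hypothetical helper: mean-thresholded physics-informative mask.

    tracks: (T, N, 2) array of point locations as (x, y) per frame.
    depth0: (H, W) depth map of the first frame.
    Returns a boolean array of shape (N,).
    """
    # Motion magnitude: summed frame-to-frame displacement of each point.
    steps = np.diff(tracks, axis=0)                  # (T-1, N, 2)
    a = np.linalg.norm(steps, axis=-1).sum(axis=0)   # (N,)

    # Depth-aware weight: inverse depth at each point's first-frame location
    # (nearest-pixel lookup is an assumption of this sketch).
    h, w = depth0.shape
    xy0 = np.rint(tracks[0]).astype(int)
    x0 = np.clip(xy0[:, 0], 0, w - 1)
    y0 = np.clip(xy0[:, 1], 0, h - 1)
    r = 1.0 / (depth0[y0, x0] + eps)

    # Score and adaptive threshold at the mean score.
    q = a * r
    return q > q.mean()

# Toy usage: a 4x4 grid of static points in which one point moves right.
ys, xs = np.mgrid[0:4, 0:4]
grid = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
toy_tracks = np.repeat(grid[None], 3, axis=0)
toy_tracks[1, 5, 0] += 1.0
toy_tracks[2, 5, 0] += 2.0
toy_mask = physics_informative_mask(toy_tracks, np.ones((4, 4)))
print(np.flatnonzero(toy_mask))
```

Because the threshold is the mean score, the mask adapts to each clip: a nearly static scene with one moving gripper keeps only the points around that gripper, while near, fast-moving points are favored over distant background motion.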

2. Pixel-Level Physics Alignment

This loss directly enforces per-point trajectory continuity on the manipulator and manipulated object. From an intermediate DiT block (layer $l$), hidden features $H^l$ are refined by a lightweight MLP $\phi(\cdot)$ and reshaped into frame-wise feature maps $\hat{F} \in \mathbb{R}^{T \times C \times H \times W}$.

For each query point $p_i^0$ in the first frame, its query feature $Q(p_i^0)$ is compared with every spatial location in frame $t$ via dot-product similarity:

$$s_i^t(x) = \frac{Q(p_i^0)^\top K^t(x)}{\sqrt{C}}, \quad x \in \Omega$$

The predicted point location is computed by coordinate expectation:

$$\hat{p}_i^t = \sum_{x \in \Omega} \text{Softmax}_x \big(s_i^t(x)\big) \cdot x$$

The loss is a masked mean squared error over physics-informative regions only:

$$\mathcal{L}^{\text{phy}}_{\text{pix}} = \frac{1}{|M^{\text{phy}}|} \| M^{\text{phy}} \odot (P_{\text{pred}} - P_{\text{gt}}) \|_2^2$$
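A minimal NumPy sketch of the two pieces above, the softmax coordinate expectation and the masked squared error, may help make the shapes concrete. The function names `soft_argmax_track` and `masked_track_mse` are hypothetical, the query and key features are passed in as plain arrays rather than produced by a DiT block, and the normalization divides by the number of masked points only, which is one reading of the formula; the source does not say whether frames are averaged as well.

```python
import numpy as np

def soft_argmax_track(query, keys):
    """Hypothetical helper: expected (x, y) location of one query point per frame.

    query: (C,) feature of the query point in the first frame.
    keys:  (T, C, H, W) frame-wise feature maps.
    Returns an array of shape (T, 2).
    """
    T, C, H, W = keys.shape
    # Scaled dot-product similarity between the query and every location.
    scores = np.einsum("c,tchw->thw", query, keys) / np.sqrt(C)
    flat = scores.reshape(T, -1)
    flat = flat - flat.max(axis=1, keepdims=True)    # numerical stability
    prob = np.exp(flat)
    prob /= prob.sum(axis=1, keepdims=True)          # softmax over locations
    # Coordinate grid in (x, y) order, flattened to match `prob`.
    ys, xs = np.mgrid[0:H, 0:W]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
    return prob @ coords                             # expectation of coordinates

def masked_track_mse(pred, gt, mask):
    """Hypothetical helper: squared tracking error on masked points only.

    pred, gt: (T, N, 2) trajectories; mask: (N,) boolean.
    """
    m = mask.astype(float)[None, :, None]
    sq_err = ((m * (pred - gt)) ** 2).sum()
    return float(sq_err / max(int(mask.sum()), 1))   # divide by |M|
```

The expectation keeps the predicted location differentiable with respect to the features, which is what lets a tracking error propagate back into the generator; a hard argmax would not.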

3. Semantic-Level Physics Alignment

This loss aligns the pairwise token-similarity matrix of the DiT feature with that of a frozen video understanding encoder (e.g., DINO) on the same physics-informative tokens.

The DiT hidden feature $H^l$ is projected via another MLP $\psi(\cdot)$ and resized to match the encoder's token layout. The physics mask is also resized to select relevant tokens $\hat{F}_M, F_M \in \mathbb{R}^{K \times C}$. Pairwise cosine similarity matrices are computed:

$$\hat{R}(i,j) = \frac{\hat{F}_M^i \cdot \hat{F}_M^j}{\|\hat{F}_M^i\|_2 \|\hat{F}_M^j\|_2}, \quad R(i,j) = \frac{F_M^i \cdot F_M^j}{\|F_M^i\|_2 \|F_M^j\|_2}$$

The semantic-level loss is the mean absolute difference between the two matrices:

$$\mathcal{L}^{\text{phy}}_{\text{sem}} = \frac{1}{K^2} \sum_{i=1}^K \sum_{j=1}^K |\hat{R}(i,j) - R(i,j)|$$
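The relational loss can be illustrated with a short NumPy sketch. The names `cosine_similarity_matrix` and `semantic_alignment_loss` are hypothetical, the token arrays stand in for the projected DiT features and the frozen encoder features after resizing, and the small `eps` guarding against zero-norm tokens is an addition for this example.

```python
import numpy as np

def cosine_similarity_matrix(feats, eps=1e-8):
    """Pairwise cosine similarity of K tokens; feats has shape (K, C)."""
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)
    return unit @ unit.T                              # (K, K)

def semantic_alignment_loss(dit_tokens, enc_tokens, token_mask):
    """Hypothetical helper: mean absolute gap between the two similarity matrices.

    dit_tokens, enc_tokens: (num_tokens, C) arrays in the same token layout.
    token_mask: (num_tokens,) boolean, the resized physics-informative mask.
    """
    r_hat = cosine_similarity_matrix(dit_tokens[token_mask])
    r_ref = cosine_similarity_matrix(enc_tokens[token_mask])
    return float(np.abs(r_hat - r_ref).mean())        # average over K*K entries
```

Since only token-to-token similarities are compared, the loss is unchanged when the generator features are rescaled or rotated as a whole. It constrains how masked tokens relate to one another rather than forcing the features to equal the encoder's.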

4. Training Objective

PhysisForcing is applied during fine-tuning of a pre-trained DiT-based video generation backbone. The total loss is:

$$\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda_{\text{pix}} \mathcal{L}^{\text{phy}}_{\text{pix}} + \lambda_{\text{sem}} \mathcal{L}^{\text{phy}}_{\text{sem}}$$

where $\mathcal{L}_{\text{FM}}$ is the standard flow matching loss. All auxiliary models are discarded at inference, introducing no extra cost.
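The weighted sum can be written as a one-line helper. The function name `total_loss` and the default weights of 1.0 are placeholders for illustration only; the summary does not report the values used for the two weights.

```python
def total_loss(loss_fm, loss_pix, loss_sem, lambda_pix=1.0, lambda_sem=1.0):
    """Hypothetical helper: flow matching loss plus weighted physics terms.

    The default weights are placeholders, not values from the source.
    """
    return loss_fm + lambda_pix * loss_pix + lambda_sem * loss_sem
```

Setting both weights to zero reduces the objective to the flow matching loss alone, and zeroing only one of them leaves a single alignment term active.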

Empirical Validation / Results

Benchmarks

  • R-Bench: 650 image-text pairs across task-oriented and embodiment-specific dimensions.
  • PAI-Bench (robot domain): 174 real-world image-prompt pairs.
  • EZS-Bench: 196 unseen robot-task-scene combinations (zero-shot, training-independent).
  • WorldArena: Action-planner protocol with closed-loop success rate.
  • Policy learning: Fast-WAM on 6 RoboTwin 2.0 tasks (200 rollouts each).

Key Results

Table 1: R-Bench quantitative results (excerpt with top-performing methods)

| Model | Avg. | Tasks (Avg) | Embodiments (Avg) |
|---|---|---|---|
| Wan2.6 (Commercial) | 60.7 | 54.6 | 66.1 |
| Veo 3.1 (Commercial) | 59.9 | 54.1 | 66.6 |
| Cosmos3-Super (Robotics) | 58.1 | 48.7 | 64.9 |
| PF-Cosmos (Ours) | 63.8 | 58.9 | 69.0 |
| PF-Wan (Ours) | 62.0 | 56.4 | 68.7 |

PF-Cosmos achieves the best overall score (63.8), surpassing all baselines including the strongest commercial model Wan2.6 (60.7). PF-Wan ranks second (62.0), improving over its base by 22.3%.

Policy Learning (Table 2): Downstream success rate on RoboTwin 2.0

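| Task | Baseline | + PhysisForcing | Δ |
|---|---|---|---|
| place_empty_cup | 41.5 | 63.0 | +21.5 |
| press_stapler | 49.0 | 60.0 | +11.0 |
| grab_roller | 58.5 | 63.0 | +4.5 |
| shake_bottle | 97.5 | 94.5 | -3.0 |
| adjust_bottle | 93.0 | 93.0 | 0.0 |
| stack_bowls_two | 69.5 | 63.0 | -6.5 |
| Average | 68.2 | 72.8 | +4.6 |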
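The average row can be reproduced from the per-task numbers with a short check. The values below are copied from the table; the variable names are illustrative.

```python
# Success rates (%) per task, copied from the policy-learning table.
baseline = {
    "place_empty_cup": 41.5, "press_stapler": 49.0, "grab_roller": 58.5,
    "shake_bottle": 97.5, "adjust_bottle": 93.0, "stack_bowls_two": 69.5,
}
with_pf = {
    "place_empty_cup": 63.0, "press_stapler": 60.0, "grab_roller": 63.0,
    "shake_bottle": 94.5, "adjust_bottle": 93.0, "stack_bowls_two": 63.0,
}

# Per-task change and unweighted means over the six tasks.
deltas = {task: with_pf[task] - baseline[task] for task in baseline}
mean_baseline = sum(baseline.values()) / len(baseline)
mean_with_pf = sum(with_pf.values()) / len(with_pf)

improved = [task for task, d in deltas.items() if d > 0]
regressed = [task for task, d in deltas.items() if d < 0]

print(f"{mean_baseline:.2f} -> {mean_with_pf:.2f} ({mean_with_pf - mean_baseline:+.2f})")
print("improved:", improved)
print("regressed:", regressed)
```

The unrounded means are 68.17 and 72.75, a difference of 4.58 points that rounds to the reported +4.6. The gain is not uniform: three tasks improve, one is unchanged, and two regress, with most of the average coming from place_empty_cup and press_stapler.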

WorldArena (Table 3): Closed-loop success rate

| Model | Task 1 | Task 2 | Avg. |
|---|---|---|---|
| WoW | 20.0 | 21.0 | 20.5 |
| Wan2.2-TI2V-5B | 12.0 | 20.0 | 16.0 |
| + PhysisForcing | 22.0 | 26.0 | 24.0 |

Ablation Studies

Table 4: Per-component ablation on R-Bench (Wan2.2-I2V-A14B)

| Model | Emb. | Tasks | Avg. |
|---|---|---|---|
| ft | 64.7 | 52.5 | 57.9 |
| + $\mathcal{L}^{\text{phy}}_{\text{pix}}$ | 67.5 | 55.2 | 60.7 |
| + $\mathcal{L}^{\text{phy}}_{\text{sem}}$ | 66.8 | 54.6 | 60.0 |
| + Both | 69.0 | 56.3 | 62.0 |

The two losses are complementary: the pixel-level loss gives the larger single gain (Avg. 57.9 to 60.7, versus 60.0 for the semantic-level loss), but combining them is best (62.0).

Table 5: Physics region focus ablation (Wan2.2-TI2V-5B)

| Model | Emb. | Tasks | Avg. |
|---|---|---|---|
| ft | 56.5 | 35.4 | 44.8 |
| w/o Physics region focus | 57.0 | 37.2 | 46.0 |
| w/ Physics region focus | 58.2 | 38.9 | 47.5 |

Focusing supervision on interaction-critical regions provides an additional boost over unfocused alignment (Avg. 46.0 to 47.5), with the largest gain on the task-oriented dimensions (37.2 to 38.9).

Table 6: Alignment-block (layer index) ablation (PAI-Bench robot domain)

| Layer | 10 | 15 | 20 | 25 |
|---|---|---|---|---|
| Score | 83.9 | 85.2 | 84.1 | 83.2 |

A middle layer (layer 15) scores highest (85.2), consistent with a middle block balancing appearance features against noise-prediction specialization.

Theoretical and Practical Implications

  • Theoretical contribution: The paper formalizes physical plausibility in robotic video generation as a hierarchical and region-focused alignment problem, distinguishing between pixel-level motion constraints and semantic-level relational consistency. This provides a principled framework for injecting physics knowledge into generative models.
  • Practical benefits:
    • No extra inference cost—all auxiliary models are discarded after training.
    • Backbone-agnostic—works across different model families (Wan, Cosmos) and scales.
    • Improves downstream robotics tasks—physically aligned video representations yield better world models for action planning and policy learning, bridging the gap between video generation and embodied decision-making.
  • The results suggest that explicit physics supervision on interaction-critical regions is more effective than uniform pixel-level objectives, and that hierarchical alignment can correct both local trajectory errors and global relational failures.

Conclusion

PhysisForcing introduces a region-focused hierarchical physics alignment framework for training video generation models used as embodied world simulators. By jointly optimizing pixel-level trajectory consistency and semantic-level relational coherence on interaction-critical regions, it consistently surpasses base models, vanilla fine-tuning, and strong open-source, commercial, and robotics-specific baselines across multiple benchmarks (R-Bench, PAI-Bench, EZS-Bench). The best variant, PF-Cosmos, achieves the highest overall scores on all three benchmarks. Beyond generation, PhysisForcing raises the WorldArena closed-loop success rate from 16.0% to 24.0% and improves downstream policy success on contact-rich tasks, demonstrating that physically aligned video models yield concrete benefits for embodied intelligence. The framework introduces no extra inference cost and is backbone-agnostic, making it a practical solution for improving the reliability of video-based world simulators.
