Summary (Overview)
- PhysisForcing is a training-time framework that improves the physical plausibility of robotic manipulation videos generated by diffusion-based video models, without incurring extra inference cost.
- It introduces hierarchical physics alignment on interaction-critical regions only: (1) pixel-level trajectory alignment using point tracking, and (2) semantic-level relational alignment using a frozen video understanding encoder.
- Evaluated on R-Bench, PAI-Bench, and EZS-Bench, PhysisForcing consistently outperforms base models, vanilla fine-tuning, and strong commercial/robotics-specific baselines (e.g., Wan2.6, Veo 3.1, Abot-PhysWorld).
- On the WorldArena action-planner protocol, it raises the closed-loop success rate from 16.0% to 24.0%, and it improves downstream policy success on RoboTwin 2.0 tasks by 4.6 percentage points on average.
- The framework is backbone-agnostic (tested on Wan2.2-I2V-A14B, Wan2.2-TI2V-5B, and Cosmos3-Nano), and the best variant, PF-Cosmos, achieves the highest overall scores on all three benchmarks.
Introduction and Theoretical Foundation
Video generation models have emerged as promising world simulators for embodied intelligence, providing scalable visual futures for data generation, world simulation, and policy learning. However, embodied world simulation demands more than photorealistic videos; the generated dynamics must be physically plausible, especially in contact-rich manipulation. Current models often produce physically implausible outputs such as discontinuous gripper trajectories, object penetration, or anti-gravity motion.
Existing approaches provide only partial solutions:
- General video models lack sufficient exposure to embodied contact dynamics.
- Robot-oriented models are trained with reconstruction objectives that treat physically critical regions and background pixels uniformly.
- Physics-aware methods (geometry-based, preference-based, simulator-based) capture only local motion consistency or operate post-hoc, lacking a unified mechanism for aligning both local dynamics and global interaction outcomes.
Key insight of this work: Physical plausibility in manipulation videos is naturally hierarchical:
- At the pixel level, local motion should satisfy trajectory continuity, depth consistency, and contact-compatible displacement.
- At the semantic level, object relations should evolve according to interaction semantics (e.g., a pushed object should move away, a grasped object should remain coupled with the gripper).
Moreover, physical evidence is highly localized around manipulators, objects, contacts, and moving regions. Applying supervision uniformly over all pixels dilutes these signals. Based on this, the authors propose PhysisForcing—a region-focused hierarchical physics alignment framework.
Methodology
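PhysisForcing injects physics supervision into video generation through three components (region extraction, pixel-level alignment, and semantic-level alignment), which are combined in a single training objective: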
1. Physics-Informative Region Extraction
Given an input video $V$, an off-the-shelf point tracker (CoTracker3) obtains dense temporal trajectories $\{p_t^i\}$, where $p_t^i$ is the 2D location of point $i$ at frame $t$.
A motion magnitude is computed for each point $i$ from the displacement of its trajectory across frames.
To focus on foreground areas, a depth-aware foreground weight is computed from the first-frame depth map $D_1$.
The resulting physics-informative score is thresholded adaptively (by its mean) to create a binary mask $M$, which localizes regions where robot-object interactions are likely to occur.
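The region-extraction step can be illustrated with a minimal NumPy sketch. The summary does not give the exact formulas, so three choices here are assumptions: motion magnitude is taken as accumulated frame-to-frame displacement, the foreground weight as a min-max normalized inverse of first-frame depth, and the score as their product. Only the mean threshold comes from the text. The function name `physics_informative_mask` is hypothetical.

```python
import numpy as np

def physics_informative_mask(tracks, depth_first):
    """tracks: (T, N, 2) array of (x, y) point locations over T frames.
    depth_first: (H, W) depth map of the first frame.
    Returns per-point (score, mask)."""
    # Assumed form: motion magnitude = accumulated frame-to-frame displacement.
    steps = np.diff(tracks, axis=0)                       # (T-1, N, 2)
    motion = np.linalg.norm(steps, axis=-1).sum(axis=0)   # (N,)

    # Depth at each point's first-frame location (nearest pixel, clipped to the image).
    h, w = depth_first.shape
    x = np.clip(np.rint(tracks[0, :, 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(tracks[0, :, 1]).astype(int), 0, h - 1)
    depth = depth_first[y, x]

    # Assumed form: nearer points receive a larger foreground weight.
    span = depth.max() - depth.min()
    weight = 1.0 - (depth - depth.min()) / (span + 1e-8)

    # Assumed combination (product); the mean threshold follows the text.
    score = motion * weight
    mask = score > score.mean()
    return score, mask

# Toy scene: left half of the image is near (depth 1), right half is far (depth 5).
depth_first = np.full((4, 8), 5.0)
depth_first[:, :4] = 1.0
tracks = np.array([
    [[1, 1], [0, 2], [6, 0]],
    [[1, 1], [1, 2], [6, 1]],
    [[1, 1], [2, 2], [6, 2]],
    [[1, 1], [3, 2], [6, 3]],
], dtype=float)  # point 0: static and near, point 1: moving and near, point 2: moving and far
score, mask = physics_informative_mask(tracks, depth_first)
print(mask)
```

In this toy case only the moving, near point is selected; the static point and the distant moving point fall below the mean score.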
2. Pixel-Level Physics Alignment
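This loss directly enforces per-point trajectory continuity on the manipulator and manipulated object. From an intermediate DiT block (layer $l$), hidden features are refined by a lightweight MLP and reshaped into frame-wise feature maps $F_t$.
For each query point $i$ in the first frame, its query feature $q^i$ is compared with every spatial location $u$ in frame $t$ via dot-product similarity:
$$S_t^i(u) = \langle q^i, F_t(u) \rangle$$
The predicted point location $\hat{p}_t^i$ is computed by coordinate expectation, that is, the average of the spatial coordinates weighted by the similarities normalized over locations (written here with a softmax):
$$\hat{p}_t^i = \sum_{u} \frac{\exp S_t^i(u)}{\sum_{u'} \exp S_t^i(u')}\, u$$
The loss is a masked mean squared error over physics-informative regions only:
$$\mathcal{L}_{\text{pix}} = \frac{\sum_{i,t} M^i \,\lVert \hat{p}_t^i - p_t^i \rVert_2^2}{\sum_{i,t} M^i}$$
where $p_t^i$ is the tracked location of point $i$ at frame $t$ and $M^i \in \{0,1\}$ indicates whether the point lies in a physics-informative region.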
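The pixel-level pipeline (dot-product similarity, coordinate expectation, masked mean squared error) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the softmax normalization and the `temperature` parameter are assumptions, and the function names are hypothetical.

```python
import numpy as np

def predict_locations(query_feats, frame_feats, temperature=1.0):
    """query_feats: (N, C) query features; frame_feats: (H, W, C) feature map of one frame.
    Returns (N, 2) expected (x, y) locations."""
    h, w, c = frame_feats.shape
    flat = frame_feats.reshape(h * w, c)                   # row-major: index = y * W + x
    sim = query_feats @ flat.T / temperature               # (N, H*W) dot-product similarity
    sim = sim - sim.max(axis=1, keepdims=True)             # numerical stability
    prob = np.exp(sim)
    prob = prob / prob.sum(axis=1, keepdims=True)          # assumed softmax normalization
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (H*W, 2)
    return prob @ coords                                   # coordinate expectation

def masked_mse(pred, target, mask):
    """Squared error averaged over masked points only. pred, target: (N, 2); mask: (N,) bool."""
    err = ((pred - target) ** 2).sum(axis=-1)
    m = mask.astype(float)
    return (err * m).sum() / max(m.sum(), 1.0)

# Toy check: a single strong feature at (x=2, y=1) should attract the query.
frame_feats = np.zeros((3, 4, 2))
frame_feats[1, 2] = [10.0, 0.0]
query_feats = np.array([[10.0, 0.0]])
pred = predict_locations(query_feats, frame_feats)

# Only the first point is inside the mask, so only its error counts.
loss = masked_mse(
    np.array([[2.0, 1.0], [0.0, 0.0]]),
    np.array([[2.0, 2.0], [5.0, 5.0]]),
    np.array([True, False]),
)
print(pred, loss)
```

Because the expectation is differentiable in the features, the tracking error can be backpropagated into the backbone, which is the property the loss relies on.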
3. Semantic-Level Physics Alignment
This loss aligns the pairwise token-similarity matrix of the DiT feature with that of a frozen video understanding encoder (e.g., DINO) on the same physics-informative tokens.
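The DiT hidden feature is projected via another MLP and resized to match the encoder's token layout. The physics mask is also resized to select the relevant tokens: $z_j$ from the DiT branch and $e_j$ from the frozen encoder, for $j = 1, \dots, N$. Pairwise cosine similarity matrices are computed:
$$A^{\text{DiT}}_{jk} = \frac{z_j^\top z_k}{\lVert z_j \rVert\, \lVert z_k \rVert}, \qquad A^{\text{enc}}_{jk} = \frac{e_j^\top e_k}{\lVert e_j \rVert\, \lVert e_k \rVert}$$
The semantic-level loss is the mean absolute difference between the two matrices:
$$\mathcal{L}_{\text{sem}} = \frac{1}{N^2} \sum_{j,k} \left| A^{\text{DiT}}_{jk} - A^{\text{enc}}_{jk} \right|$$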
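A minimal NumPy sketch of the relational alignment may help make the idea concrete. It assumes both token sets have already been resized to the same layout and that the mask is a per-token boolean array; the function names are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(tokens):
    """tokens: (N, C) -> (N, N) matrix of pairwise cosine similarities."""
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    return unit @ unit.T

def relational_alignment_loss(dit_tokens, enc_tokens, mask):
    """Mean absolute difference between the two similarity matrices,
    restricted to tokens selected by the boolean mask."""
    a = cosine_similarity_matrix(dit_tokens[mask])
    b = cosine_similarity_matrix(enc_tokens[mask])
    return np.abs(a - b).mean()

dit_tokens = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# Same relational structure in a different feature space: loss is close to zero.
enc_same = np.array([[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [5.0, 5.0, 0.0]])
loss_same = relational_alignment_loss(dit_tokens, enc_same, np.array([True, True, True]))

# Encoder treats the first two tokens as identical, DiT treats them as orthogonal.
enc_diff = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [9.0, 9.0, 9.0]])
loss_diff = relational_alignment_loss(dit_tokens, enc_diff, np.array([True, True, False]))
print(loss_same, loss_diff)
```

Since only the similarity matrices are compared, the two feature spaces do not need the same channel width; what is matched is how tokens relate to one another, not the features themselves.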
4. Training Objective
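PhysisForcing is applied during fine-tuning of a pre-trained DiT-based video generation backbone. The total loss is:
$$\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda_{\text{pix}}\, \mathcal{L}_{\text{pix}} + \lambda_{\text{sem}}\, \mathcal{L}_{\text{sem}}$$
where $\mathcal{L}_{\text{FM}}$ is the standard flow matching loss, $\mathcal{L}_{\text{pix}}$ and $\mathcal{L}_{\text{sem}}$ are the pixel-level and semantic-level alignment losses, and $\lambda_{\text{pix}}$, $\lambda_{\text{sem}}$ are their weights. All auxiliary models are discarded at inference, introducing no extra cost.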
Empirical Validation / Results
Benchmarks
- R-Bench: 650 image-text pairs across task-oriented and embodiment-specific dimensions.
- PAI-Bench (robot domain): 174 real-world image-prompt pairs.
- EZS-Bench: 196 unseen robot-task-scene combinations (zero-shot, training-independent).
- WorldArena: Action-planner protocol with closed-loop success rate.
- Policy learning: Fast-WAM on 6 RoboTwin 2.0 tasks (200 rollouts each).
Key Results
Table 1: R-Bench quantitative results (excerpt with top-performing methods)
| Model | Avg. | Tasks (Avg) | Embodiments (Avg) |
|---|---|---|---|
| Wan2.6 (Commercial) | 60.7 | 54.6 | 66.1 |
| Veo 3.1 (Commercial) | 59.9 | 54.1 | 66.6 |
| Cosmos3-Super (Robotics) | 58.1 | 48.7 | 64.9 |
| PF-Cosmos (Ours) | 63.8 | 58.9 | 69.0 |
| PF-Wan (Ours) | 62.0 | 56.4 | 68.7 |
PF-Cosmos achieves the best overall score (63.8), surpassing all baselines including the strongest commercial model Wan2.6 (60.7). PF-Wan ranks second (62.0), improving over its base by 22.3%.
Policy Learning (Table 2): Downstream success rate on RoboTwin 2.0
| Task | Baseline | +PhysisForcing | Δ |
|---|---|---|---|
| place_empty_cup | 41.5 | 63.0 | +21.5 |
| press_stapler | 49.0 | 60.0 | +11.0 |
| grab_roller | 58.5 | 63.0 | +4.5 |
| shake_bottle | 97.5 | 94.5 | -3.0 |
| adjust_bottle | 93.0 | 93.0 | 0.0 |
| stack_bowls_two | 69.5 | 63.0 | -6.5 |
| Average | 68.2 | 72.8 | +4.6 |
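The per-task deltas and averages in Table 2 can be recomputed directly from the reported success rates. The short Python check below uses only the numbers in the table; the variable names are illustrative.

```python
baseline = {
    "place_empty_cup": 41.5,
    "press_stapler": 49.0,
    "grab_roller": 58.5,
    "shake_bottle": 97.5,
    "adjust_bottle": 93.0,
    "stack_bowls_two": 69.5,
}
with_physisforcing = {
    "place_empty_cup": 63.0,
    "press_stapler": 60.0,
    "grab_roller": 63.0,
    "shake_bottle": 94.5,
    "adjust_bottle": 93.0,
    "stack_bowls_two": 63.0,
}

# Per-task change in success rate, in percentage points.
deltas = {task: round(with_physisforcing[task] - baseline[task], 1) for task in baseline}
mean_baseline = sum(baseline.values()) / len(baseline)
mean_with = sum(with_physisforcing.values()) / len(with_physisforcing)

improved = [task for task, d in deltas.items() if d > 0]
regressed = [task for task, d in deltas.items() if d < 0]

print(deltas)
print(round(mean_baseline, 1), round(mean_with - mean_baseline, 1))
print(improved, regressed)
```

The average gain is carried by three tasks (place_empty_cup, press_stapler, grab_roller), while shake_bottle and stack_bowls_two regress and adjust_bottle is unchanged.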
WorldArena (Table 3): Closed-loop success rate
| Model | Task 1 | Task 2 | Avg. |
|---|---|---|---|
| WoW | 20.0 | 21.0 | 20.5 |
| Wan2.2-TI2V-5B | 12.0 | 20.0 | 16.0 |
| + PhysisForcing | 22.0 | 26.0 | 24.0 |
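The averages in Table 3, and the size of the gain over the Wan2.2-TI2V-5B base, follow from the per-task rates. A short Python check using only the tabulated values:

```python
rates = {
    "WoW": (20.0, 21.0),
    "Wan2.2-TI2V-5B": (12.0, 20.0),
    "+ PhysisForcing": (22.0, 26.0),
}

# Average closed-loop success rate per model across the two tasks.
averages = {model: sum(r) / len(r) for model, r in rates.items()}

base = averages["Wan2.2-TI2V-5B"]
absolute_gain = averages["+ PhysisForcing"] - base   # percentage points
relative_gain = absolute_gain / base                 # fraction of the base rate

print(averages, absolute_gain, relative_gain)
```

The move from 16.0% to 24.0% is 8 percentage points, or a 50% relative increase, and it takes the fine-tuned model past WoW's 20.5% average.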
Ablation Studies
Table 4: Per-component ablation on R-Bench (Wan2.2-I2V-A14B)
| Model | Emb. | Tasks | Avg. |
|---|---|---|---|
| ft | 64.7 | 52.5 | 57.9 |
| + pixel-level loss | 67.5 | 55.2 | 60.7 |
| + semantic-level loss | 66.8 | 54.6 | 60.0 |
| + Both | 69.0 | 56.3 | 62.0 |
The two losses are complementary: the pixel-level loss gives the larger single gain, but combining them performs best.
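The size of each gain over vanilla fine-tuning can be read off the Avg. column of Table 4. The check below assumes the 60.7 row is the pixel-level loss and the 60.0 row the semantic-level loss, as the statement about the larger single gain implies.

```python
avg = {"ft": 57.9, "pixel": 60.7, "semantic": 60.0, "both": 62.0}

# Gain of each variant over vanilla fine-tuning, in points.
gain = {name: round(value - avg["ft"], 1) for name, value in avg.items() if name != "ft"}
sum_of_single = round(gain["pixel"] + gain["semantic"], 1)

print(gain, sum_of_single)
```

The combined gain (4.1) exceeds either single gain (2.8 and 2.1) but is smaller than their sum (4.9), so the two losses overlap in part while each still contributes something the other does not.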
Table 5: Physics region focus ablation (Wan2.2-TI2V-5B)
| Model | Emb. | Tasks | Avg. |
|---|---|---|---|
| ft | 56.5 | 35.4 | 44.8 |
| w/o Physics region focus | 57.0 | 37.2 | 46.0 |
| w/ Physics region focus | 58.2 | 38.9 | 47.5 |
Focusing supervision on interaction-critical regions adds a further gain over unfocused alignment (Avg. 46.0 to 47.5), with the larger share on the task-oriented dimensions (+1.7 on Tasks versus +1.2 on Emb.).
Table 6: Alignment-block (layer index) ablation (PAI-Bench robot domain)
| Layer | 10 | 15 | 20 | 25 |
|---|---|---|---|---|
| Score | 83.9 | 85.2 | 84.1 | 83.2 |
A middle layer (layer 15) scores highest (85.2), balancing appearance features against noise-prediction specialization.
Theoretical and Practical Implications
- Theoretical contribution: The paper formalizes physical plausibility in robotic video generation as a hierarchical and region-focused alignment problem, distinguishing between pixel-level motion constraints and semantic-level relational consistency. This provides a principled framework for injecting physics knowledge into generative models.
- Practical benefits:
  - No extra inference cost: all auxiliary models are discarded after training.
  - Backbone-agnostic: works across different model families (Wan, Cosmos) and scales.
  - Improves downstream robotics tasks: physically aligned video representations yield better world models for action planning and policy learning, bridging the gap between video generation and embodied decision-making.
- The results suggest that explicit physics supervision on interaction-critical regions is more effective than uniform pixel-level objectives, and that hierarchical alignment can correct both local trajectory errors and global relational failures.
Conclusion
PhysisForcing introduces a region-focused hierarchical physics alignment framework for training video generation models used as embodied world simulators. By jointly optimizing pixel-level trajectory consistency and semantic-level relational coherence on interaction-critical regions, it consistently surpasses base models, vanilla fine-tuning, and strong open-source, commercial, and robotics-specific baselines across multiple benchmarks (R-Bench, PAI-Bench, EZS-Bench). The best variant, PF-Cosmos, achieves the highest overall scores on all three benchmarks. Beyond generation, PhysisForcing raises the WorldArena closed-loop success rate from 16.0% to 24.0% and improves downstream policy success on contact-rich tasks, demonstrating that physically aligned video models yield concrete benefits for embodied intelligence. The framework introduces no extra inference cost and is backbone-agnostic, making it a practical solution for improving the reliability of video-based world simulators.