# PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation

> PhysisForcing enhances physical plausibility of robotic videos via hierarchical alignment on interaction-critical regions, boosting downstream success without extra inference cost.

- **Source:** [arXiv](https://arxiv.org/abs/2606.28128)
- **Published:** 2026-06-30
- **Permalink:** https://picx.dev/p/1YP3e6
- **Whiteboard:** https://picx.dev/p/1YP3e6/image

## Summary (Overview)

- **PhysisForcing** is a training-time framework that improves the physical plausibility of robotic manipulation videos generated by diffusion-based video models, without incurring extra inference cost.
- It introduces **hierarchical physics alignment** on **interaction-critical regions** only: (1) pixel-level trajectory alignment using point tracking, and (2) semantic-level relational alignment using a frozen video understanding encoder.
- Evaluated on **R-Bench**, **PAI-Bench**, and **EZS-Bench**, PhysisForcing consistently outperforms base models, vanilla fine-tuning, and strong commercial/robotics-specific baselines (e.g., Wan2.6, Veo 3.1, Abot-PhysWorld).
- On the **WorldArena** action-planner protocol, it raises the closed-loop success rate from 16.0% to 24.0%, and it improves average downstream policy success on six RoboTwin 2.0 tasks by 4.6 percentage points (68.2% to 72.8%).
- The framework is backbone-agnostic—tested on Wan2.2-I2V-A14B, Wan2.2-TI2V-5B, and Cosmos3-Nano—and the best variant (PF-Cosmos) achieves the highest overall scores on all three benchmarks.

## Introduction and Theoretical Foundation

Video generation models have emerged as promising world simulators for embodied intelligence, providing scalable visual futures for data generation, world simulation, and policy learning. However, **embodied world simulation demands more than photorealistic videos**; the generated dynamics must be physically plausible, especially in contact-rich manipulation. Current models often produce physically implausible outputs such as discontinuous gripper trajectories, object penetration, or anti-gravity motion.

**Existing approaches provide only partial solutions:**
- **General video models** lack sufficient exposure to embodied contact dynamics.
- **Robot-oriented models** are trained with reconstruction objectives that treat physically critical regions and background pixels uniformly.
- **Physics-aware methods** (geometry-based, preference-based, simulator-based) capture only local motion consistency or operate post-hoc, lacking a unified mechanism for aligning both local dynamics and global interaction outcomes.

**Key insight of this work:** Physical plausibility in manipulation videos is naturally **hierarchical**:
- At the **pixel level**, local motion should satisfy trajectory continuity, depth consistency, and contact-compatible displacement.
- At the **semantic level**, object relations should evolve according to interaction semantics (e.g., a pushed object should move away, a grasped object should remain coupled with the gripper).

Moreover, physical evidence is **highly localized** around manipulators, objects, contacts, and moving regions. Applying supervision uniformly over all pixels dilutes these signals. Based on this, the authors propose PhysisForcing—a region-focused hierarchical physics alignment framework.

## Methodology

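PhysisForcing injects physics supervision into video generation through three components (region extraction, pixel-level alignment, and semantic-level alignment), which are combined in a single training objective: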

### 1. Physics-Informative Region Extraction

Given an input video $V \in \mathbb{R}^{T \times C \times H \times W}$, an off-the-shelf point tracker (CoTracker3) obtains dense temporal trajectories $P = \{p_{i}^{1:T}\}_{i=1}^{N}$ where $N = H \times W$ and $p_i^t \in \mathbb{R}^2$ is the 2D location of point $i$ at frame $t$.

Motion magnitude for point $i$ is computed as:
$$
a_i = \sum_{t=1}^{T-1} \|p_i^{t+1} - p_i^t\|_2
$$

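To focus on foreground areas, a depth-aware foreground weight $r_i$ is computed from the first-frame depth map $D^0$, sampled at each point's first-frame location $p_i^0$ (the superscript 0 refers to the first frame, i.e. $t=1$ in the trajectory indexing above). Points closer to the camera receive a larger weight, and the weight is multiplied by the motion magnitude to give the score $q_i$: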
$$
r_i = \frac{1}{D^0(p_i^0) + \epsilon}, \quad q_i = a_i \cdot r_i
$$

The physics-informative score $q_i$ is thresholded adaptively (by its mean) to create a binary mask $M^{\text{phy}}$, which localizes regions where robot-object interactions are likely to occur.
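
The region-extraction step can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `physics_region_mask`, the `(N, T, 2)` array layout, and the value of `eps` are hypothetical, and the point tracker and depth estimator are assumed to have been run beforehand.

```python
import numpy as np

def physics_region_mask(tracks, depth0, eps=1e-6):
    """Return (mask, score) for N tracked points.

    tracks: (N, T, 2) array of 2D point locations over T frames (assumed layout).
    depth0: (N,) first-frame depth sampled at each point's starting location.
    """
    # a_i: total path length, summing frame-to-frame displacement norms.
    steps = np.diff(tracks, axis=1)                        # (N, T-1, 2)
    motion = np.linalg.norm(steps, axis=-1).sum(axis=1)    # (N,)
    # r_i: inverse depth, so nearer points get a larger weight.
    foreground = 1.0 / (depth0 + eps)
    # q_i = a_i * r_i, thresholded by its own mean.
    score = motion * foreground
    return score > score.mean(), score

# Toy example: four points over three frames.
tracks = np.zeros((4, 3, 2))
tracks[1, :, 0] = [0.0, 3.0, 6.0]   # moving, near the camera
tracks[2, :, 0] = [0.0, 3.0, 6.0]   # same motion, far from the camera
depth0 = np.array([1.0, 1.0, 10.0, 10.0])

mask, score = physics_region_mask(tracks, depth0)
print(mask)  # only the near, moving point (index 1) is selected
```

In the toy example the far point moves just as much as the near one, but its inverse-depth weight pulls its score below the mean, so it is left out of the mask.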

### 2. Pixel-Level Physics Alignment

This loss directly enforces per-point trajectory continuity on the manipulator and manipulated object. From an intermediate DiT block (layer $l$), hidden features $H^l$ are refined by a lightweight MLP $\phi(\cdot)$ and reshaped into frame-wise feature maps $\hat{F} \in \mathbb{R}^{T \times C \times H \times W}$.

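For each query point $p_i^0$ in the first frame, its query feature $Q(p_i^0)$ is compared with the key feature $K^t(x)$ at every spatial location $x$ of frame $t$ via scaled dot-product similarity, where $\Omega$ denotes the set of spatial locations and $C$ the feature dimension: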
$$
s_i^t(x) = \frac{Q(p_i^0)^\top K^t(x)}{\sqrt{C}}, \quad x \in \Omega
$$

The predicted point location is computed by coordinate expectation:
$$
\hat{p}_i^t = \sum_{x \in \Omega} \text{Softmax}_x \big(s_i^t(x)\big) \cdot x
$$

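The loss is a **masked mean squared error** between the predicted locations $P_{\text{pred}} = \{\hat{p}_i^t\}$ and the tracker trajectories $P_{\text{gt}} = P$, computed over physics-informative regions only: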
$$
\mathcal{L}^{\text{phy}}_{\text{pix}} = \frac{1}{|M^{\text{phy}}|} \| M^{\text{phy}} \odot (P_{\text{pred}} - P_{\text{gt}}) \|_2^2
$$
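
The similarity, coordinate-expectation, and masked-loss formulas above can be illustrated with a small NumPy sketch. It rests on several assumptions: the function names `soft_argmax_track` and `masked_mse` are hypothetical, coordinates are taken as `(x, y)` pixel indices, the query and key features are synthetic arrays rather than DiT features, and $|M^{\text{phy}}|$ is read as the number of selected points.

```python
import numpy as np

def soft_argmax_track(query, keys):
    """Expected (x, y) location of one query point in one frame.

    query: (C,) feature of the query point.
    keys:  (H, W, C) feature map of the target frame.
    """
    H, W, C = keys.shape
    # Scaled dot-product similarity against every spatial location.
    sim = keys.reshape(-1, C) @ query / np.sqrt(C)         # (H*W,)
    # Numerically stable softmax over locations.
    weights = np.exp(sim - sim.max())
    weights /= weights.sum()
    # Coordinate expectation: weighted average of the pixel grid.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel()], axis=-1).astype(float)
    return weights @ coords                                 # (2,)

def masked_mse(pred, gt, mask):
    """Squared trajectory error summed over masked points, divided by |mask|.

    pred, gt: (N, T, 2) trajectories; mask: (N,) boolean selection.
    """
    m = mask.astype(float)[:, None, None]
    return float((m * (pred - gt) ** 2).sum() / max(int(mask.sum()), 1))

# Synthetic frame whose only matching key sits at x=1, y=2.
query = np.array([1.0, 0.0, 0.0, 0.0])
keys = np.zeros((4, 4, 4))
keys[2, 1] = 20.0 * query
location = soft_argmax_track(query, keys)
print(location)  # close to (1, 2), pulled slightly toward the grid center

# Two points; only the first one is inside the physics mask.
pred = np.zeros((2, 3, 2))
gt = pred.copy()
gt[0] += 1.0
gt[1] += 5.0
loss = masked_mse(pred, gt, np.array([True, False]))
```

Because the prediction is an expectation rather than a hard argmax, it stays differentiable with respect to the features, which is what allows a trajectory error to act as a training signal. The large error on the unmasked second point does not contribute to `loss`.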

### 3. Semantic-Level Physics Alignment

This loss aligns the pairwise token-similarity matrix of the DiT feature with that of a frozen video understanding encoder (e.g., DINO) on the same physics-informative tokens.

The DiT hidden feature $H^l$ is projected via another MLP $\psi(\cdot)$ and resized to match the encoder's token layout. The physics mask is also resized to select relevant tokens $\hat{F}_M, F_M \in \mathbb{R}^{K \times C}$. Pairwise cosine similarity matrices are computed:

$$
\hat{R}(i,j) = \frac{\hat{F}_M^i \cdot \hat{F}_M^j}{\|\hat{F}_M^i\|_2 \|\hat{F}_M^j\|_2}, \quad R(i,j) = \frac{F_M^i \cdot F_M^j}{\|F_M^i\|_2 \|F_M^j\|_2}
$$

The semantic-level loss is the mean absolute difference between the two matrices:
$$
\mathcal{L}^{\text{phy}}_{\text{sem}} = \frac{1}{K^2} \sum_{i=1}^K \sum_{j=1}^K |\hat{R}(i,j) - R(i,j)|
$$
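
A minimal NumPy sketch of the relational loss follows. The function names `cosine_sim_matrix` and `relational_loss` and the `eps` stabilizer are hypothetical, and the feature matrices are toy values rather than outputs of a DiT block or a video encoder.

```python
import numpy as np

def cosine_sim_matrix(features, eps=1e-8):
    """Pairwise cosine similarity between K token features of shape (K, C)."""
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + eps)
    return normed @ normed.T                                # (K, K)

def relational_loss(f_hat, f_ref):
    """Mean absolute difference between the two K x K similarity matrices."""
    diff = cosine_sim_matrix(f_hat) - cosine_sim_matrix(f_ref)
    return float(np.abs(diff).mean())

# Reference tokens are orthogonal; predicted tokens have collapsed together.
f_ref = np.array([[1.0, 0.0], [0.0, 1.0]])
f_collapsed = np.array([[1.0, 0.0], [1.0, 0.0]])
loss_collapsed = relational_loss(f_collapsed, f_ref)

# Rescaling the reference tokens leaves every pairwise relation unchanged.
loss_rescaled = relational_loss(3.0 * f_ref, f_ref)
print(loss_collapsed, loss_rescaled)
```

The loss compares relations between tokens rather than the token features themselves, so in this sketch a uniform rescaling of the features costs nothing while collapsing two distinct tokens onto each other is penalized.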

### 4. Training Objective

PhysisForcing is applied during fine-tuning of a pre-trained DiT-based video generation backbone. The total loss is:
$$
\mathcal{L} = \mathcal{L}_{\text{FM}} + \lambda_{\text{pix}} \mathcal{L}^{\text{phy}}_{\text{pix}} + \lambda_{\text{sem}} \mathcal{L}^{\text{phy}}_{\text{sem}}
$$
where $\mathcal{L}_{\text{FM}}$ is the standard flow matching loss. All auxiliary models are discarded at inference, introducing no extra cost.

## Empirical Validation / Results

### Benchmarks
- **R-Bench**: 650 image-text pairs across task-oriented and embodiment-specific dimensions.
- **PAI-Bench** (robot domain): 174 real-world image-prompt pairs.
- **EZS-Bench**: 196 unseen robot-task-scene combinations (zero-shot, training-independent).
- **WorldArena**: Action-planner protocol with closed-loop success rate.
- **Policy learning**: Fast-WAM on 6 RoboTwin 2.0 tasks (200 rollouts each).

### Key Results

**Table 1: R-Bench quantitative results (excerpt with top-performing methods)**

| Model | Avg. | Tasks (Avg) | Embodiments (Avg) |
|-------|------|-------------|-------------------|
| Wan2.6 (Commercial) | 60.7 | 54.6 | 66.1 |
| Veo 3.1 (Commercial) | 59.9 | 54.1 | 66.6 |
| Cosmos3-Super (Robotics) | 58.1 | 48.7 | 64.9 |
| **PF-Cosmos (Ours)** | **63.8** | **58.9** | **69.0** |
| PF-Wan (Ours) | 62.0 | 56.4 | 68.7 |

PF-Cosmos achieves the **best overall score** (63.8), surpassing all baselines including the strongest commercial model Wan2.6 (60.7). PF-Wan ranks second (62.0), improving over its base by 22.3%.

**Policy Learning (Table 2): Downstream success rate on RoboTwin 2.0**

| Task | Baseline | +PhysisForcing | Δ |
|------|----------|----------------|---|
| place_empty_cup | 41.5 | 63.0 | +21.5 |
| press_stapler | 49.0 | 60.0 | +11.0 |
| grab_roller | 58.5 | 63.0 | +4.5 |
| shake_bottle | 97.5 | 94.5 | -3.0 |
| adjust_bottle | 93.0 | 93.0 | 0.0 |
| stack_bowls_two | 69.5 | 63.0 | -6.5 |
| **Average** | **68.2** | **72.8** | **+4.6** |
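
The averages in the table can be re-derived from the per-task rows. The short script below is only an arithmetic check on the numbers reported above; the variable names are illustrative.

```python
# Per-task success rates (%) copied from the table above.
baseline = {
    "place_empty_cup": 41.5, "press_stapler": 49.0, "grab_roller": 58.5,
    "shake_bottle": 97.5, "adjust_bottle": 93.0, "stack_bowls_two": 69.5,
}
with_pf = {
    "place_empty_cup": 63.0, "press_stapler": 60.0, "grab_roller": 63.0,
    "shake_bottle": 94.5, "adjust_bottle": 93.0, "stack_bowls_two": 63.0,
}

deltas = {task: with_pf[task] - baseline[task] for task in baseline}
mean_base = sum(baseline.values()) / len(baseline)
mean_pf = sum(with_pf.values()) / len(with_pf)

improved = [task for task, d in deltas.items() if d > 0]
regressed = [task for task, d in deltas.items() if d < 0]

print(f"baseline mean: {mean_base:.2f}")
print(f"with PhysisForcing mean: {mean_pf:.2f}")
print(f"mean gain: {mean_pf - mean_base:.2f}")
print("improved:", improved)
print("regressed:", regressed)
```

The computed means agree with the reported averages to one decimal place. The per-task picture is mixed: three tasks improve, one is unchanged, and two regress, so the average gain is driven mainly by `place_empty_cup` and `press_stapler`.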

**WorldArena (Table 3): Closed-loop success rate**

| Model | Task 1 | Task 2 | Avg. |
|-------|--------|--------|------|
| WoW | 20.0 | 21.0 | 20.5 |
| Wan2.2-TI2V-5B | 12.0 | 20.0 | 16.0 |
| **+ PhysisForcing** | **22.0** | **26.0** | **24.0** |

### Ablation Studies

**Table 4: Per-component ablation on R-Bench (Wan2.2-I2V-A14B)**

| Model | Emb. | Tasks | Avg. |
|-------|------|-------|------|
| ft | 64.7 | 52.5 | 57.9 |
| + $\mathcal{L}^{\text{phy}}_{\text{pix}}$ | 67.5 | 55.2 | 60.7 |
| + $\mathcal{L}^{\text{phy}}_{\text{sem}}$ | 66.8 | 54.6 | 60.0 |
| + **Both** | **69.0** | **56.3** | **62.0** |

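The two losses are complementary: the pixel-level loss gives the larger single-component gain over vanilla fine-tuning (+2.8 Avg. versus +2.1 for the semantic-level loss), and combining them performs best (+4.1 Avg.).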

**Table 5: Physics region focus ablation (Wan2.2-TI2V-5B)**

| Model | Emb. | Tasks | Avg. |
|-------|------|-------|------|
| ft | 56.5 | 35.4 | 44.8 |
| w/o Physics region focus | 57.0 | 37.2 | 46.0 |
| **w/ Physics region focus** | **58.2** | **38.9** | **47.5** |

Focusing supervision on interaction-critical regions adds 1.5 Avg. points over the variant without region focus (46.0 to 47.5), with the larger share of the gain on task-oriented dimensions (+1.7 Tasks versus +1.2 Emb.).

**Table 6: Alignment-block (layer index) ablation (PAI-Bench robot domain)**

| Layer | 10 | 15 | 20 | 25 |
|-------|----|----|----|----|
| Score | 83.9 | **85.2** | 84.1 | 83.2 |

A middle layer (layer 15) is optimal, scoring 1.1 to 2.0 points above the other tested layers; it balances appearance features and noise-prediction specialization.

## Theoretical and Practical Implications

- **Theoretical contribution:** The paper formalizes physical plausibility in robotic video generation as a **hierarchical and region-focused alignment problem**, distinguishing between pixel-level motion constraints and semantic-level relational consistency. This provides a principled framework for injecting physics knowledge into generative models.
- **Practical benefits:**
  - **No extra inference cost**—all auxiliary models are discarded after training.
  - **Backbone-agnostic**—works across different model families (Wan, Cosmos) and scales.
  - **Improves downstream robotics tasks**—physically aligned video representations yield better world models for action planning and policy learning, bridging the gap between video generation and embodied decision-making.
- The results suggest that explicit physics supervision on interaction-critical regions is more effective than uniform pixel-level objectives, and that hierarchical alignment can correct both local trajectory errors and global relational failures.

## Conclusion

PhysisForcing introduces a region-focused hierarchical physics alignment framework for training video generation models used as embodied world simulators. By jointly optimizing pixel-level trajectory consistency and semantic-level relational coherence on interaction-critical regions, it consistently surpasses base models, vanilla fine-tuning, and strong open-source, commercial, and robotics-specific baselines across multiple benchmarks (R-Bench, PAI-Bench, EZS-Bench). The best variant, PF-Cosmos, achieves the highest overall scores on all three benchmarks. Beyond generation, PhysisForcing raises the WorldArena closed-loop success rate from 16.0% to 24.0% and improves average downstream policy success on six RoboTwin 2.0 tasks by 4.6 percentage points, demonstrating that physically aligned video models yield concrete benefits for embodied intelligence. The framework introduces no extra inference cost and is backbone-agnostic, making it a practical solution for improving the reliability of video-based world simulators.

