CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
Summary (Overview)
- Core Problem: Existing diffusion models for Human-Object Interaction (HOI) video synthesis often fail on structural stability (hands/faces) and physically plausible contact (avoiding interpenetration).
- Proposed Solution: CoInteract, an end-to-end framework with two key innovations embedded into a Diffusion Transformer (DiT) backbone:
- Human-Aware Mixture-of-Experts (MoE): A lightweight router dispatches tokens to region-specialized experts (head, hand, base) using spatial supervision from bounding boxes, improving fine-grained structural fidelity.
- Spatially-Structured Co-Generation: A dual-stream training paradigm jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. The HOI branch is removed at inference for zero-overhead RGB generation.
- Key Results: CoInteract significantly outperforms existing methods (AnchorCrafter, Phantom, Humo, etc.) in interaction plausibility (VLM-QA: 0.72), hand structural quality (HQ: 0.724), and temporal coherence (Smooth: 0.9951), while maintaining high identity preservation.
- Efficiency: The asymmetric co-attention design allows the HOI branch to be discarded at inference, reducing inference cost to 1.04× (versus 4.13× when the branch is retained).
- Validation: Extensive quantitative metrics, qualitative comparisons, and a user study (24 evaluators) confirm CoInteract's superiority in object consistency, human/background consistency, and interaction plausibility.
Introduction and Theoretical Foundation
Human-Object Interaction (HOI) video synthesis is a critical frontier for applications in e-commerce, digital advertising, and virtual marketing, requiring coordinated hand movements, precise object manipulation, and strict physical plausibility beyond what existing talking-avatar generation methods provide.
Limitations of Prior Work: Current approaches fall into two paradigms, both with significant drawbacks:
- Multi-condition generation: Methods extract per-frame human poses and object conditions to guide generation, but require heavy preprocessing and lack robustness/generalization.
- Multi-reference generation: Methods condition on person/product references but typically lack explicit mechanisms to enforce interaction structure, leading to implausible human-object interactions (e.g., hand-object interpenetration).
Root Cause: The RGB-centric nature of current diffusion backbones. Models trained purely on pixel-level supervision have no built-in notion of 3D hand-object spatial relationships or body structure, leading to:
- Structural collapse in hands and faces (fingers merge, facial features blur)
- Physical violations (human-object interpenetration)
Core Philosophy: A model must not only "see" pixels but also "understand" the underlying structural and interaction relationships. CoInteract embeds human structural priors and HOI physical constraints directly into the DiT backbone, transforming it from a pure appearance generator into a structure-aware interaction engine.
Methodology
CoInteract is an end-to-end framework for speech-driven HOI video synthesis. Given dual reference images (character identity and product) together with motion frames that preserve temporal continuity, the goal is to synthesize HOI videos that are structurally stable and physically plausible.
1. Unified RGB–HOI Co-Generation
The framework introduces a unified co-generation paradigm where an RGB appearance stream and an auxiliary HOI structure stream are jointly trained within a single DiT backbone.
HOI Structure Stream Construction: An auxiliary HOI structure stream is constructed as a silhouette-like 3-channel rendering obtained by:
- Projecting the recovered human mesh (from SAM3D-body) to the image plane.
- Fusing the projected object mask (from SAM3). This produces a pixel-aligned structural target that highlights interaction boundaries while discarding RGB texture.
Joint Flow-Matching Objective: The model is optimized with a joint flow-matching objective supervising both streams:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}\left[\,\|v_\theta(x_t^{\text{RGB}}, t, c) - v^{\text{RGB}}\|_2^2 + \lambda\,\|v_\theta(x_t^{\text{HOI}}, t, c) - v^{\text{HOI}}\|_2^2\,\right]$$

where $v$ denotes the target velocity field, $t$ is the diffusion timestep, and $c$ denotes conditioning (text, audio, dual reference images, and motion latents). $\lambda = 1$ unless otherwise stated.
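The joint objective above can be sketched numerically. The following is a minimal numpy illustration, not the paper's implementation — the function name and tensor shapes are assumptions for illustration.

```python
import numpy as np

def joint_flow_matching_loss(v_pred_rgb, v_tgt_rgb, v_pred_hoi, v_tgt_hoi, lam=1.0):
    """Joint flow-matching loss over the RGB and auxiliary HOI streams.

    Each argument is an array of predicted / target velocities; `lam`
    weights the auxiliary HOI term (lambda = 1 unless otherwise stated).
    """
    loss_rgb = np.mean((v_pred_rgb - v_tgt_rgb) ** 2)
    loss_hoi = np.mean((v_pred_hoi - v_tgt_hoi) ** 2)
    return loss_rgb + lam * loss_hoi

# Toy check: identical predictions give zero loss.
v = np.ones((2, 4, 8))
print(joint_flow_matching_loss(v, v, v, v))  # 0.0
```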
2. Multi-Modal Coordinate Assignment via 3D RoPE
To seamlessly integrate heterogeneous modalities, each token is assigned a 3D coordinate $(t, h, w)$ encoded by 3D Rotary Positional Encoding (3D RoPE):

$$p = (t,\; h,\; w + \Delta w)$$

where $\Delta w$ accounts for the virtual width shift in the HOI stream ($\Delta w = 0$ for RGB tokens, $\Delta w = W$ for HOI tokens). Key inductive biases:
- Spatial coordinates for dual streams: RGB and HOI streams are concatenated along the width dimension with distinct horizontal coordinates (e.g., $w \in [0, W)$ for RGB, $w \in [W, 2W)$ for HOI) while sharing identical height and time indices.
- Temporal causality and reference anchoring:
- Historical motion frames: Assigned negative temporal indices ($t < 0$).
- Static reference images: Mapped to a far-field temporal location (a temporal index far outside the generated clip) to treat them as global identity anchors.
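The coordinate scheme above can be sketched in plain Python. The helper below is a hypothetical illustration — the reference-frame index and the number of motion frames are assumptions, not values from the paper.

```python
def assign_coords(T, H, W, n_motion=2, ref_t=-1000):
    """Assign a (t, h, w) coordinate to every token.

    RGB tokens use w in [0, W); HOI tokens are shifted to [W, 2W) so the
    two streams occupy distinct horizontal ranges while sharing (t, h).
    Motion frames get negative t; static references a far-field t
    (ref_t is an illustrative placeholder value).
    """
    coords = {}
    coords["rgb"] = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
    coords["hoi"] = [(t, h, w + W) for t in range(T) for h in range(H) for w in range(W)]
    coords["motion"] = [(-(k + 1), h, w) for k in range(n_motion)
                        for h in range(H) for w in range(W)]
    coords["ref"] = [(ref_t, h, w) for h in range(H) for w in range(W)]
    return coords

c = assign_coords(T=2, H=2, W=3)
print(c["hoi"][0])  # (0, 0, 3): same (t, h) as the first RGB token, width shifted by W
```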
3. Two-Stage Asymmetric Co-Attention
To inject interaction-structure supervision while maintaining inference efficiency, a two-stage training strategy with an Asymmetric Co-Attention mechanism is used.
Stage 1: Standard bidirectional attention across both streams for rapid convergence.
Stage 2: Enforce an asymmetric attention mask. Let $\mathcal{T}_{\text{RGB}}$ and $\mathcal{T}_{\text{HOI}}$ denote the token sets of the RGB and HOI streams. The mask is defined as:

$$M_{ij} = \begin{cases} 1, & i \in \mathcal{T}_{\text{HOI}} \\ 1, & i \in \mathcal{T}_{\text{RGB}} \;\text{and}\; j \in \mathcal{T}_{\text{RGB}} \\ 0, & \text{otherwise} \end{cases}$$
Under this mask:
- RGB queries attend only to RGB tokens (making RGB pathway independent of HOI branch at inference)
- HOI queries attend to both streams, leveraging cleaner RGB features to predict interaction structure
Crucially, the HOI-stream loss backpropagates through the HOI ← RGB cross-attention into the shared DiT parameters, transferring interaction-structure supervision to the RGB generator even when the HOI branch is removed.
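A minimal numpy sketch of such an asymmetric mask, assuming tokens are ordered [RGB | HOI]; the function name is illustrative.

```python
import numpy as np

def asymmetric_mask(n_rgb, n_hoi):
    """Boolean attention mask: entry (i, j) is True if query i may attend to key j.

    Queries/keys are ordered [RGB tokens | HOI tokens]. RGB queries see
    only RGB keys, so the RGB pathway is self-contained at inference;
    HOI queries see both streams.
    """
    n = n_rgb + n_hoi
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_rgb, :n_rgb] = True   # RGB -> RGB only
    mask[n_rgb:, :] = True        # HOI -> both RGB and HOI
    return mask

m = asymmetric_mask(2, 2)
print(m.astype(int))
# [[1 1 0 0]
#  [1 1 0 0]
#  [1 1 1 1]
#  [1 1 1 1]]
```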
4. Human-Aware Mixture-of-Experts (MoE)
A Human-Aware MoE module routes tokens to region-specialized experts via a spatially supervised router $\mathcal{R}$. It includes:
- A shared expert that reuses the original DiT FFN as a shortcut path
- Three lightweight experts (Head, Hand, Base) implemented as small FFNs
Spatially Supervised Routing: To prevent router optimization from interfering with DiT representation learning, a stop-gradient operation is applied to the router input:

$$p_i = \mathcal{R}(\operatorname{sg}(h_i))$$

where $\operatorname{sg}(\cdot)$ denotes stop-gradient and $h_i$ the feature of token $i$. Using face and hand bounding boxes, the router assigns tokens inside the corresponding regions to the Head or Hand expert, while remaining tokens go to the Base expert. Specialization is enforced via a cross-entropy routing loss:

$$\mathcal{L}_{\text{route}} = -\sum_{i}\sum_{r} \mathbb{1}[y_i = r]\,\log p_{i,r}$$

where $y_i$ is the ground-truth region label of token $i$ and $\mathbb{1}[\cdot]$ is the indicator function.
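A toy numpy sketch of spatially supervised routing with a cross-entropy loss. Numpy has no autograd, so the stop-gradient is noted in a comment rather than enforced; all names, shapes, and values are illustrative.

```python
import numpy as np

def route_and_loss(features, labels, W_router):
    """Routing sketch for a 3-expert (head / hand / base) MoE.

    `features` is (n_tokens, d); `labels` holds ground-truth region ids
    (0=head, 1=hand, 2=base) derived from bounding boxes. In a real
    autograd framework the router would consume stop_gradient(features)
    so routing does not perturb the DiT representation.
    """
    logits = features @ W_router                         # (n_tokens, 3)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    assignments = probs.argmax(axis=1)                   # expert chosen per token
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return assignments, ce

features = np.eye(3)          # 3 toy tokens, d = 3
labels = np.array([0, 1, 2])  # head, hand, base
assignments, loss = route_and_loss(features, labels, np.eye(3) * 5.0)
print(assignments)  # [0 1 2]
```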
Total Training Objective:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{FM}} + \lambda_{\text{route}}\,\mathcal{L}_{\text{route}}$$
5. Data Curation and Representation
Training data is transformed into paired RGB and HOI-structure representations:
- Decouple entities using Qwen-Edit to create independent person and product references.
- Validate triplets (source image, person, object) to filter mismatches.
- Use SAM3 for object masks and SAM3D-body for human mesh recovery.
- Fuse the projected human rendering with the object mask to form the texture-stripped HOI structure stream.
- Encode both RGB video and HOI stream into shared latent space via pre-trained VAE.
- Use off-the-shelf detectors to obtain face and hand bounding boxes for MoE router supervision.
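The mask-fusion step can be sketched as follows. The 3-channel layout used here (human silhouette, object mask, union) is an assumption for illustration — the source only specifies a silhouette-like 3-channel rendering.

```python
import numpy as np

def fuse_structure_stream(human_render, object_mask):
    """Fuse a projected human rendering with an object mask into a
    3-channel, texture-free structure frame.

    Channel layout is assumed: [human silhouette, object mask, union].
    """
    human = (human_render > 0).astype(np.float32)
    obj = (object_mask > 0).astype(np.float32)
    union = np.clip(human + obj, 0, 1)
    return np.stack([human, obj, union], axis=-1)  # (H, W, 3)

h = np.zeros((4, 4)); h[1:3, 1:3] = 1  # toy human silhouette
o = np.zeros((4, 4)); o[2:4, 2:4] = 1  # toy object mask (overlaps at (2, 2))
s = fuse_structure_stream(h, o)
print(s.shape, s[2, 2].tolist())  # (4, 4, 3) [1.0, 1.0, 1.0]
```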
Empirical Validation / Results
4.1 Training Details
- Dataset: 40 hours of product demonstration/live-streaming videos → 12K high-quality clips with paired RGB-HOI representations.
- Implementation: Initialized from WanS2V. The Human-Aware MoE has four experts (Shared, Head, Hand, Base), optimized with AdamW. Two-stage training: 5K iterations with full bidirectional attention, then 2K iterations with the asymmetric co-attention mask.
4.2 Quantitative Comparison
Baselines: AnchorCrafter, Phantom, Humo, InteractAvatar, SkyReels-V3, VACE.
Evaluation Metrics:
- Video Quality: AES↑ (aesthetics), IQ↑ (perceptual quality), Smooth↑ (temporal coherence)
- Human-Object Interaction: VLM-QA↑ (Gemini-3-Pro assessment), HQ↑ (hand keypoint confidence)
- Reference Consistency: DINO_id↑, DINO_obj↑, FaceSim↑
- Audio-Visual Alignment: Sync_conf↑
Key Results Table:
| Method | AES↑ | IQ↑ | Smooth↑ | VLM-QA↑ |
|---|---|---|---|---|
| AnchorCrafter | 0.448 | 0.643 | 0.9743 | 0.22 |
| Phantom | 0.579 | 0.724 | 0.9916 | 0.50 |
| Humo | 0.565 | 0.741 | 0.9919 | 0.56 |
| VACE | 0.530 | 0.733 | 0.9904 | 0.46 |
| InteractAvatar | 0.528 | 0.722 | 0.9938 | 0.62 |
| SkyReels-V3 | 0.563 | 0.720 | 0.9861 | 0.44 |
| CoInteract | 0.554 | 0.749 | 0.9951 | 0.72 |
CoInteract achieves best or competitive results across most metrics, with highest VLM-QA (0.72) and HQ (0.724) for interaction plausibility and hand stability.
4.3 Qualitative Results
CoInteract consistently produces videos with coherent hand articulation, natural product grasping, and faithful prompt adherence. Other baselines exhibit:
- Hand-object interpenetration
- Inconsistent product appearance
- Background deviation from reference
- Identity drift on unseen objects
4.4 User Study
24 evaluators ranked methods (lower is better) on three criteria:
| Criterion | AnchorCrafter | Phantom | Humo | VACE | InteractAvatar | SkyReels-V3 | CoInteract |
|---|---|---|---|---|---|---|---|
| Obj. Consist.↓ | 6.08 | 4.13 | 4.42 | 3.54 | 3.08 | 4.58 | 2.17 |
| Hum/BG ↓ | 6.28 | 4.38 | 4.21 | 3.46 | 2.92 | 4.83 | 1.92 |
| Interact. ↓ | 6.55 | 4.29 | 3.92 | 3.58 | 3.33 | 4.54 | 1.79 |
CoInteract achieves the best (lowest) mean rank across all criteria, with largest advantage on Interaction Plausibility.
4.5 Ablation Study
Four model variants compared:
| Variant | AES↑ | IQ↑ | Smooth↑ | VLM-QA↑ | HQ↑ |
|---|---|---|---|---|---|
| w/o MoE | 0.541 | 0.736 | 0.993 | 0.66 | 0.658 |
| w/o Co-Gen | 0.536 | 0.753 | 0.991 | 0.48 | 0.706 |
| w/o Asym. Mask | 0.548 | 0.742 | 0.994 | 0.76 | 0.738 |
| Full Model | 0.554 | 0.749 | 0.995 | 0.72 | 0.724 |
Key Findings:
- Removing MoE degrades HQ (0.724→0.658) and FaceSim (0.696→0.662)
- Removing HOI stream causes largest drop in VLM-QA (0.72→0.48, -33.3%)
- Retaining HOI branch at inference gives slightly better VLM-QA (0.76) but at 4.13× cost
- Asymmetric strategy trades marginal gains for dramatic efficiency improvement
Theoretical and Practical Implications
Theoretical Implications:
- Structure-Aware Generation: Demonstrates that embedding explicit structural priors and interaction geometry constraints directly into diffusion backbones is more effective than post-hoc processing or external conditioning.
- Dual-Stream Learning: Shows that jointly training appearance and structure streams with asymmetric attention can transfer structural knowledge to the appearance generator while maintaining inference efficiency.
- Region-Specialized Processing: Validates that spatially-guided MoE routing can effectively address the challenge of high-frequency detail in anatomically sensitive regions (hands, faces) with minimal parameter overhead.
Practical Implications:
- E-Commerce and Marketing: Enables high-quality product demonstration videos with physically plausible interactions, reducing the need for expensive human filming.
- Virtual Assistants: Improves the realism of virtual agents performing object manipulation tasks.
- Content Creation: Provides an easy-to-use end-to-end framework for generating HOI videos without heavy preprocessing or post-processing.
- Efficiency: The zero-overhead inference design makes the approach practical for real-world deployment despite the dual-stream training complexity.
Conclusion
CoInteract presents a structure-aware framework for speech-driven HOI video synthesis that prioritizes structural integrity and physical consistency. The framework's key innovations are:
- Human-Aware MoE: Enhances fidelity of hands and faces through spatially-supervised routing to region-specialized experts.
- Spatially-Structured Co-Generation: Uses an asymmetric co-attention mask to learn physical interaction priors during training while maintaining zero-overhead inference.
Extensive experiments demonstrate that CoInteract consistently outperforms existing methods in interaction plausibility and structural stability, advancing the quality of HOI video generation. The approach effectively reduces hand-object interpenetration and geometric misalignment while maintaining high identity preservation and temporal coherence.
Future Directions:
- Extension to more complex multi-object interactions
- Integration with 3D-aware generation for improved geometric consistency
- Application to broader domains beyond product demonstration
- Exploration of more efficient expert routing mechanisms