CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
Summary (Overview)
- Core Problem: Existing diffusion models for Human-Object Interaction (HOI) video synthesis often fail on structural stability (hands/faces) and physically plausible contact (avoiding interpenetration).
- Proposed Solution: CoInteract, an end-to-end framework with two key innovations embedded into a Diffusion Transformer (DiT) backbone:
- Human-Aware Mixture-of-Experts (MoE): A lightweight router dispatches tokens to region-specialized experts (head, hand, base) using spatial supervision from bounding boxes, improving fine-grained structural fidelity.
- Spatially-Structured Co-Generation: A dual-stream training paradigm jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. The HOI branch is removed at inference for zero-overhead RGB generation.
- Key Results: CoInteract significantly outperforms existing methods (AnchorCrafter, Phantom, Humo, etc.) in interaction plausibility (VLM-QA: 0.72), hand structural quality (HQ: 0.724), and temporal coherence (Smooth: 0.9951), while maintaining high identity preservation.
- Efficiency: The asymmetric co-attention design allows the HOI branch to be discarded at inference, reducing inference cost to 1.04× (versus 4.13× when the branch is retained).
- Validation: Extensive quantitative metrics, qualitative comparisons, and a user study (24 evaluators) confirm CoInteract's superiority in object consistency, human/background consistency, and interaction plausibility.
Introduction and Theoretical Foundation
Human-Object Interaction (HOI) video synthesis is a critical frontier for applications in e-commerce, digital advertising, and virtual marketing, requiring coordinated hand movements, precise object manipulation, and strict physical plausibility beyond what existing talking-avatar generation methods provide.
Limitations of Prior Work: Current approaches fall into two paradigms, both with significant drawbacks:
- Multi-condition generation: Methods extract per-frame human poses and object conditions to guide generation, but require heavy preprocessing and lack robustness/generalization.
- Multi-reference generation: Methods condition on person/product references but typically lack explicit mechanisms to enforce interaction structure, leading to implausible human-object interactions (e.g., hand-object interpenetration).
Root Cause: The RGB-centric nature of current diffusion backbones. Models trained purely on pixel-level supervision have no built-in notion of 3D hand-object spatial relationships or body structure, leading to:
- Structural collapse in hands and faces (fingers merge, facial features blur)
- Physical violations (human-object interpenetration)
Core Philosophy: A model must not only "see" pixels but also "understand" the underlying structural and interaction relationships. CoInteract embeds human structural priors and HOI physical constraints directly into the DiT backbone, transforming it from a pure appearance generator into a structure-aware interaction engine.
Methodology
CoInteract is an end-to-end framework for speech-driven HOI video synthesis. Given dual reference images (character identity and product) together with motion frames that preserve temporal continuity, the goal is to synthesize HOI videos that are structurally stable and physically plausible.
1. Unified RGB–HOI Co-Generation
The framework introduces a unified co-generation paradigm where an RGB appearance stream and an auxiliary HOI structure stream are jointly trained within a single DiT backbone.
HOI Structure Stream Construction: An auxiliary HOI structure stream is constructed as a silhouette-like 3-channel rendering obtained by:
- Projecting the recovered human mesh (from SAM3D-body) to the image plane.
- Fusing the projected object mask (from SAM3). This produces a pixel-aligned structural target that highlights interaction boundaries while discarding RGB texture.
Joint Flow-Matching Objective: The model is optimized with a joint flow-matching objective supervising both streams:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}\left[\,\|v_\theta(x_t^{\text{RGB}}, t, c) - v^{\text{RGB}}\|_2^2 + \lambda\,\|v_\theta(x_t^{\text{HOI}}, t, c) - v^{\text{HOI}}\|_2^2\,\right]$$

where $v$ denotes the target velocity field, $t$ is the diffusion timestep, and $c$ denotes conditioning (text, audio, dual reference images, and motion latents). $\lambda = 1$ unless otherwise stated.
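The joint objective above can be sketched numerically. The following is a minimal numpy illustration, not the paper's implementation — the function name and tensor shapes are assumptions for illustration.

```python
import numpy as np

def joint_flow_matching_loss(v_pred_rgb, v_tgt_rgb, v_pred_hoi, v_tgt_hoi, lam=1.0):
    """Joint flow-matching loss over the RGB and auxiliary HOI streams.

    Each argument is an array of predicted / target velocities; `lam`
    weights the auxiliary HOI term (lambda = 1 unless otherwise stated).
    """
    loss_rgb = np.mean((v_pred_rgb - v_tgt_rgb) ** 2)
    loss_hoi = np.mean((v_pred_hoi - v_tgt_hoi) ** 2)
    return loss_rgb + lam * loss_hoi

# Toy check: identical predictions give zero loss.
v = np.ones((2, 4, 8))
print(joint_flow_matching_loss(v, v, v, v))  # 0.0
```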
2. Multi-Modal Coordinate Assignment via 3D RoPE
To seamlessly integrate heterogeneous modalities, each token is assigned a 3D coordinate $(t, h, w)$ encoded by 3D Rotary Positional Encoding (3D RoPE):

$$p = (t,\; h,\; w + \Delta w)$$

where $\Delta w$ accounts for the virtual width shift in the HOI stream ($\Delta w = 0$ for RGB tokens, $\Delta w = W$ for HOI tokens). Key inductive biases:
- Spatial coordinates for dual streams: RGB and HOI streams are concatenated along the width dimension with distinct horizontal coordinates (e.g., $w \in [0, W)$ for RGB, $w \in [W, 2W)$ for HOI) while sharing identical height and time indices.
- Temporal causality and reference anchoring:
- Historical motion frames: Assigned negative temporal indices ($t < 0$).
- Static reference images: Mapped to a far-field temporal location (a temporal index far outside the generated clip) to treat them as global identity anchors.
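The coordinate scheme above can be sketched in plain Python. The helper below is a hypothetical illustration — the reference-frame index and the number of motion frames are assumptions, not values from the paper.

```python
def assign_coords(T, H, W, n_motion=2, ref_t=-1000):
    """Assign a (t, h, w) coordinate to every token.

    RGB tokens use w in [0, W); HOI tokens are shifted to [W, 2W) so the
    two streams occupy distinct horizontal ranges while sharing (t, h).
    Motion frames get negative t; static references a far-field t
    (ref_t is an illustrative placeholder value).
    """
    coords = {}
    coords["rgb"] = [(t, h, w) for t in range(T) for h in range(H) for w in range(W)]
    coords["hoi"] = [(t, h, w + W) for t in range(T) for h in range(H) for w in range(W)]
    coords["motion"] = [(-(k + 1), h, w) for k in range(n_motion)
                        for h in range(H) for w in range(W)]
    coords["ref"] = [(ref_t, h, w) for h in range(H) for w in range(W)]
    return coords

c = assign_coords(T=2, H=2, W=3)
print(c["hoi"][0])  # (0, 0, 3): same (t, h) as the first RGB token, width shifted by W
```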
3. Two-Stage Asymmetric Co-Attention
To inject interaction-structure supervision while maintaining inference efficiency, a two-stage training strategy with an Asymmetric Co-Attention mechanism is used.
Stage 1: Standard bidirectional attention across both streams for rapid convergence.
Stage 2: Enforce an asymmetric attention mask. Let $\mathcal{T}_{\text{RGB}}$ and $\mathcal{T}_{\text{HOI}}$ denote the token sets of the RGB and HOI streams. The mask is defined as:

$$M_{ij} = \begin{cases} 1, & i \in \mathcal{T}_{\text{HOI}} \\ 1, & i \in \mathcal{T}_{\text{RGB}} \;\text{and}\; j \in \mathcal{T}_{\text{RGB}} \\ 0, & \text{otherwise} \end{cases}$$
Under this mask:
- RGB queries attend only to RGB tokens (making RGB pathway independent of HOI branch at inference)
- HOI queries attend to both streams, leveraging cleaner RGB features to predict interaction structure
Crucially, the HOI-stream loss backpropagates through the HOI ← RGB cross-attention into the shared DiT parameters, transferring interaction-structure supervision to the RGB generator even when the HOI branch is removed.
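A minimal numpy sketch of such an asymmetric mask, assuming tokens are ordered [RGB | HOI]; the function name is illustrative.

```python
import numpy as np

def asymmetric_mask(n_rgb, n_hoi):
    """Boolean attention mask: entry (i, j) is True if query i may attend to key j.

    Queries/keys are ordered [RGB tokens | HOI tokens]. RGB queries see
    only RGB keys, so the RGB pathway is self-contained at inference;
    HOI queries see both streams.
    """
    n = n_rgb + n_hoi
    mask = np.zeros((n, n), dtype=bool)
    mask[:n_rgb, :n_rgb] = True   # RGB -> RGB only
    mask[n_rgb:, :] = True        # HOI -> both RGB and HOI
    return mask

m = asymmetric_mask(2, 2)
print(m.astype(int))
# [[1 1 0 0]
#  [1 1 0 0]
#  [1 1 1 1]
#  [1 1 1 1]]
```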
4. Human-Aware Mixture-of-Experts (MoE)
A Human-Aware MoE module routes tokens to region-specialized experts via a spatially supervised router $\mathcal{R}$. It includes:
- A shared expert that reuses the original DiT FFN as a shortcut path
- Three lightweight experts (Head, Hand, Base) implemented as small FFNs
Spatially Supervised Routing: To prevent router optimization from interfering with DiT representation learning, a stop-gradient operation is applied to the router input:

$$p_i = \mathcal{R}(\operatorname{sg}(h_i))$$

where $\operatorname{sg}(\cdot)$ denotes stop-gradient and $h_i$ the feature of token $i$. Using face and hand bounding boxes, the router assigns tokens inside the corresponding regions to the Head or Hand expert, while remaining tokens go to the Base expert. Specialization is enforced via a cross-entropy routing loss:

$$\mathcal{L}_{\text{route}} = -\sum_{i}\sum_{r} \mathbb{1}[y_i = r]\,\log p_{i,r}$$

where $y_i$ is the ground-truth region label of token $i$ and $\mathbb{1}[\cdot]$ is the indicator function.
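A toy numpy sketch of spatially supervised routing with a cross-entropy loss. Numpy has no autograd, so the stop-gradient is noted in a comment rather than enforced; all names, shapes, and values are illustrative.

```python
import numpy as np

def route_and_loss(features, labels, W_router):
    """Routing sketch for a 3-expert (head / hand / base) MoE.

    `features` is (n_tokens, d); `labels` holds ground-truth region ids
    (0=head, 1=hand, 2=base) derived from bounding boxes. In a real
    autograd framework the router would consume stop_gradient(features)
    so routing does not perturb the DiT representation.
    """
    logits = features @ W_router                         # (n_tokens, 3)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    assignments = probs.argmax(axis=1)                   # expert chosen per token
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()
    return assignments, ce

features = np.eye(3)          # 3 toy tokens, d = 3
labels = np.array([0, 1, 2])  # head, hand, base
assignments, loss = route_and_loss(features, labels, np.eye(3) * 5.0)
print(assignments)  # [0 1 2]
```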
Total Training Objective:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{FM}} + \lambda_{\text{route}}\,\mathcal{L}_{\text{route}}$$
5. Data Curation and Representation
Training data is transformed into paired RGB and HOI-structure representations:
- Decouple entities using Qwen-Edit to create independent person and product references.
- Validate triplets (source image, person, object) to filter mismatches.
- Use SAM3 for object masks and SAM3D-body for human mesh recovery.
- Fuse the projected human rendering with the object mask to form the texture-stripped HOI structure stream.
- Encode both RGB video and HOI stream into shared latent space via pre-trained VAE.
- Use off-the-shelf detectors to obtain face and hand bounding boxes for MoE router supervision.
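The mask-fusion step can be sketched as follows. The 3-channel layout used here (human silhouette, object mask, union) is an assumption for illustration — the source only specifies a silhouette-like 3-channel rendering.

```python
import numpy as np

def fuse_structure_stream(human_render, object_mask):
    """Fuse a projected human rendering with an object mask into a
    3-channel, texture-free structure frame.

    Channel layout is assumed: [human silhouette, object mask, union].
    """
    human = (human_render > 0).astype(np.float32)
    obj = (object_mask > 0).astype(np.float32)
    union = np.clip(human + obj, 0, 1)
    return np.stack([human, obj, union], axis=-1)  # (H, W, 3)

h = np.zeros((4, 4)); h[1:3, 1:3] = 1  # toy human silhouette
o = np.zeros((4, 4)); o[2:4, 2:4] = 1  # toy object mask (overlaps at (2, 2))
s = fuse_structure_stream(h, o)
print(s.shape, s[2, 2].tolist())  # (4, 4, 3) [1.0, 1.0, 1.0]
```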
Empirical Validation / Results
4.1 Training Details
- Dataset: 40 hours of product demonstration/live-streaming videos → 12K high-quality clips with paired RGB-HOI representations.
- Implementation: Initialized from WanS2V. The Human-Aware MoE has four experts (Shared, Head, Hand, Base), optimized with AdamW. Two-stage training: 5K iterations with full bidirectional attention, then 2K iterations with the asymmetric co-attention mask.
4.2 Quantitative Comparison
Baselines: AnchorCrafter, Phantom, Humo, InteractAvatar, SkyReels-V3, VACE.
Evaluation Metrics:
- Video Quality: AES↑ (aesthetics), IQ↑ (perceptual quality), Smooth↑ (temporal coherence)
- Human-Object Interaction: VLM-QA↑ (Gemini-3-Pro assessment), HQ↑ (hand keypoint confidence)
- Reference Consistency: DINO_id↑, DINO_obj↑, FaceSim↑
- Audio-Visual Alignment: Sync_conf↑
Key Results Table:
| Method | AES↑ | IQ↑ | Smooth↑ | VLM-QA↑ |
|---|---|---|---|---|
| AnchorCrafter | 0.448 | 0.643 | 0.9743 | 0.22 |
| Phantom | 0.579 | 0.724 | 0.9916 | 0.50 |
| Humo | 0.565 | 0.741 | 0.9919 | 0.56 |
| VACE | 0.530 | 0.733 | 0.9904 | 0.46 |
| InteractAvatar | 0.528 | 0.722 | 0.9938 | 0.62 |
| SkyReels-V3 | 0.563 | 0.720 | 0.9861 | 0.44 |
| CoInteract | 0.554 | 0.749 | 0.9951 | 0.72 |
CoInteract achieves best or competitive results across most metrics, with highest VLM-QA (0.72) and HQ (0.724) for interaction plausibility and hand stability.
4.3 Qualitative Results
CoInteract consistently produces videos with coherent hand articulation, natural product grasping, and faithful prompt adherence. Other baselines exhibit:
- Hand-object interpenetration
- Inconsistent product appearance
- Background deviation from reference
- Identity drift on unseen objects
4.4 User Study
24 evaluators ranked methods (lower is better) on three criteria:
| Criterion | AnchorCrafter | Phantom | Humo | VACE | InteractAvatar | SkyReels-V3 | CoInteract |
|---|---|---|---|---|---|---|---|
| Obj. Consist.↓ | 6.08 | 4.13 | 4.42 | 3.54 | 3.08 | 4.58 | 2.17 |
| Hum/BG ↓ | 6.28 | 4.38 | 4.21 | 3.46 | 2.92 | 4.83 | 1.92 |
| Interact. ↓ | 6.55 | 4.29 | 3.92 | 3.58 | 3.33 | 4.54 | 1.79 |
CoInteract achieves the best (lowest) mean rank across all criteria, with largest advantage on Interaction Plausibility.
4.5 Ablation Study
Four model variants compared:
| Variant | AES↑ | IQ↑ | Smooth↑ | VLM-QA↑ | HQ↑ |
|---|---|---|---|---|---|
| w/o MoE | 0.541 | 0.736 | 0.993 | 0.66 | 0.658 |
| w/o Co-Gen | 0.536 | 0.753 | 0.991 | 0.48 | 0.706 |
| w/o Asym. Mask | 0.548 | 0.742 | 0.994 | 0.76 | 0.738 |
| Full Model | 0.554 | 0.749 | 0.995 | 0.72 | 0.724 |
Key Findings:
- Removing MoE degrades HQ (0.724→0.658) and FaceSim (0.696→0.662)
- Removing HOI stream causes largest drop in VLM-QA (0.72→0.48, -33.3%)
- Retaining HOI branch at inference gives slightly better VLM-QA (0.76) but at 4.13× cost
- Asymmetric strategy trades marginal gains for dramatic efficiency improvement
Theoretical and Practical Implications
Theoretical Implications:
- Structure-Aware Generation: Demonstrates that embedding explicit structural priors and interaction geometry constraints directly into diffusion backbones is more effective than post-hoc processing or external conditioning.
- Dual-Stream Learning: Shows that jointly training appearance and structure streams with asymmetric attention can transfer structural knowledge to the appearance generator while maintaining inference efficiency.
- Region-Specialized Processing: Validates that spatially-guided MoE routing can effectively address the challenge of high-frequency detail in anatomically sensitive regions (hands, faces) with minimal parameter overhead.
Practical Implications:
- E-Commerce and Marketing: Enables high-quality product demonstration videos with physically plausible interactions, reducing the need for expensive human filming.
- Virtual Assistants: Improves the realism of virtual agents performing object manipulation tasks.
- Content Creation: Provides an easy-to-use end-to-end framework for generating HOI videos without heavy preprocessing or post-processing.
- Efficiency: The zero-overhead inference design makes the approach practical for real-world deployment despite the dual-stream training complexity.
Conclusion
CoInteract presents a structure-aware framework for speech-driven HOI video synthesis that prioritizes structural integrity and physical consistency. The framework's key innovations are:
- Human-Aware MoE: Enhances fidelity of hands and faces through spatially-supervised routing to region-specialized experts.
- Spatially-Structured Co-Generation: Uses an asymmetric co-attention mask to learn physical interaction priors during training while maintaining zero-overhead inference.
Extensive experiments demonstrate that CoInteract consistently outperforms existing methods in interaction plausibility and structural stability, advancing the quality of HOI video generation. The approach effectively reduces hand-object interpenetration and geometric misalignment while maintaining high identity preservation and temporal coherence.
Future Directions:
- Extension to more complex multi-object interactions
- Integration with 3D-aware generation for improved geometric consistency
- Application to broader domains beyond product demonstration
- Exploration of more efficient expert routing mechanisms