UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Summary (Overview)

  • Unified Framework: UniVidX is a unified generative framework that enables flexible video generation across multiple aligned visual modalities (e.g., RGB, albedo, normal maps, alpha layers), supporting three paradigms: Text → X, X → X, and Text&X → X.
  • Core Designs: The framework introduces three key components: Stochastic Condition Masking (SCM) for dynamic input-output partitioning, Decoupled Gated LoRA (DGL) for modality-specific adaptation without disrupting backbone priors, and Cross-Modal Self-Attention (CMSA) for ensuring cross-modal consistency.
  • Effective Instantiations: The framework is validated through two models: UniVid-Intrinsic (for RGB and intrinsic maps: albedo, irradiance, normal) and UniVid-Alpha (for blended RGB and RGBA layers). Both achieve state-of-the-art performance across diverse tasks like inverse rendering, video matting, and text-to-intrinsic/RGBA generation.
  • Data Efficiency and Generalization: By leveraging the strong priors of a pre-trained Video Diffusion Model (VDM), both models demonstrate remarkable data efficiency, showing robust generalization to in-the-wild scenarios despite being trained on small-scale datasets (< 1k videos).

Introduction and Theoretical Foundation

Recent progress has shown that Video Diffusion Models (VDMs) can be repurposed for various multimodal graphics tasks. However, existing approaches typically train separate models for each specific problem (e.g., RGB → alpha, intrinsic → X), which locks models into fixed input-output mappings and ignores the joint correlations across modalities. This practice limits flexibility and can lead to cross-modal inconsistencies.

The paper poses a fundamental question: Can we design a unified generative framework in which different subsets of aligned modalities act as conditions or targets within a single video model, enabling flexible generation across visual modalities? Achieving this is non-trivial and presents three challenges:

  1. Mastering diverse task categories within a single conditional generation framework.
  2. Adapting to distinct modality distributions while preserving the backbone's generative priors.
  3. Guaranteeing alignment across diverse interacting modalities during joint generation.

To address these challenges, the authors propose UniVidX, a unified multimodal framework designed to leverage VDM priors for versatile video generation.

Methodology

The overall architecture is built upon a T2V backbone (Wan2.1-T2V-14B). Multimodal inputs are encoded via their respective VAE encoders and concatenated along the batch dimension. The framework incorporates three core designs:

1. Stochastic Condition Masking (SCM)

SCM dynamically partitions the set of all visual modality latents $Z$ into two subsets during training:

  • Target Subset $Z_{tgt}$: Latents selected for generation. These are corrupted with noise.
  • Condition Subset $Z_{cond}$: The complementary subset. These remain clean to serve as visual conditions. $Z_{cond}$ can be empty (for Text → X tasks).

This is implemented via timestep manipulation. For target latents $x_T$, the noisy state $z^T_t$ is obtained via linear interpolation with Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ at timestep $t \in [0, 1]$. Condition latents are fixed at $t = 1$ as $z^C_1$. The flow matching objective $\mathcal{L}_{uni}$ is formulated to predict the velocity field for the target subset:

$$\mathcal{L}_{uni} = \mathbb{E}_{t, x_T, \epsilon} \left\| v_{\theta}(z^T_t \mid z^C_1, c_{txt}) - v \right\|^2_2$$

where $\theta$ denotes the model parameters, $v_{\theta}$ is the predicted velocity, and $v = x_T - \epsilon$ is the ground-truth vector field. This enables the model to learn omni-directional generation.
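
For concreteness, here is a minimal PyTorch-style sketch of an SCM training step under the paper's convention (conditions kept clean at $t = 1$, pure noise at $t = 0$). The model interface, the leading modality axis (from batch concatenation), and the 0.5 masking probability are illustrative assumptions, not the released implementation:

```python
import torch

def scm_training_step(model, latents, text_emb):
    """
    Stochastic Condition Masking sketch. latents: [n_mod, B, C, T, H, W]
    stack of aligned, clean modality latents (illustrative layout).
    """
    n_mod = latents.shape[0]

    # Sample a random, non-empty target subset; the rest act as conditions.
    is_target = torch.zeros(n_mod, dtype=torch.bool)
    while not is_target.any():
        is_target = torch.rand(n_mod) < 0.5

    t = torch.rand(())                       # shared timestep in [0, 1]
    noise = torch.randn_like(latents)

    z = latents.clone()
    # Targets: linear interpolation between noise (t = 0) and data (t = 1).
    z[is_target] = t * latents[is_target] + (1 - t) * noise[is_target]
    # Conditions stay clean, i.e. they are kept at t = 1.

    v_pred = model(z, t, text_emb)           # predicted velocity field
    v_true = latents - noise                 # ground-truth v = x_T - eps
    # The loss is computed only over the target subset.
    return ((v_pred[is_target] - v_true[is_target]) ** 2).mean()

# Toy usage with a dummy velocity predictor:
dummy_model = lambda z, t, c: torch.zeros_like(z)
loss = scm_training_step(dummy_model, torch.randn(3, 1, 4, 5, 8, 8), text_emb=None)
```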

2. Decoupled Gated LoRA (DGL)

To adapt the frozen pre-trained weights $W \in \mathbb{R}^{d \times d}$ to different modalities without interference, DGL assigns an independent LoRA to each modality $k$:

$$\Delta W_k = B_k A_k$$

where $B_k \in \mathbb{R}^{d \times r}$ and $A_k \in \mathbb{R}^{r \times d}$ are learnable low-rank matrices ($r \ll d$). Crucially, these LoRAs are gated. The effective weights $W'_k$ for modality $k$ are:

$$W'_k = W + m_k \cdot \Delta W_k$$

The gate $m_k$ is activated ($m_k = 1$) only when modality $k$ serves as a generation target (noisy input). It is deactivated ($m_k = 0$) when the modality is a condition (clean input), allowing the backbone's native encoding capability to extract features without domain-shift interference.
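
A minimal sketch of how a single linear layer could implement DGL, assuming batch-concatenated tokens carry a per-sample modality index and a target/condition gate flag; the class name, tensor layout, and initialization are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DecoupledGatedLoRALinear(nn.Module):
    """Frozen base weight W plus an independent, gated low-rank update per modality."""
    def __init__(self, dim, n_modalities, rank=32):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)          # frozen pretrained W
        self.base.weight.requires_grad_(False)
        # Independent low-rank pairs (A_k, B_k); B is zero-initialized so that
        # Delta W_k = B_k A_k starts at zero, as in standard LoRA.
        self.A = nn.Parameter(torch.randn(n_modalities, rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(n_modalities, dim, rank))

    def forward(self, x, modality_ids, is_target):
        """
        x:            [N, L, dim] batch-concatenated tokens (N = n_mod * B)
        modality_ids: [N] long tensor, the modality of each sample
        is_target:    [N] bool tensor, the gate m_k (True = generation target)
        """
        out = self.base(x)                                   # W x
        A = self.A[modality_ids]                             # [N, rank, dim]
        B = self.B[modality_ids]                             # [N, dim, rank]
        delta = torch.einsum('ndr,nrk,nlk->nld', B, A, x)    # (B_k A_k) x
        gate = is_target.to(x.dtype).view(-1, 1, 1)          # m_k in {0, 1}
        return out + gate * delta                            # (W + m_k * Delta W_k) x

# Toy usage: modality 0 as clean condition (gate off), modality 1 as target (gate on).
layer = DecoupledGatedLoRALinear(dim=64, n_modalities=2, rank=8)
x = torch.randn(2, 10, 64)
y = layer(x, modality_ids=torch.tensor([0, 1]), is_target=torch.tensor([False, True]))
```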

3. Cross-Modal Self-Attention (CMSA)

Standard self-attention processes each modality in isolation. CMSA facilitates interaction by sharing keys and values across modalities while keeping queries modality-specific. Let $q_i, k_i, v_i$ denote the query, key, and value for the $i$-th modality. A shared context is constructed:

$$k_{shared} = [k_1, k_2, \dots, k_n], \quad v_{shared} = [v_1, v_2, \dots, v_n]$$

The attention for modality $i$ becomes:

$$\text{Attention}(q_i, k_{shared}, v_{shared}) = \text{Softmax}\!\left( \frac{q_i k_{shared}^{T}}{\sqrt{d_k}} \right) v_{shared}$$

This design ensures each modality is aware of the multimodal context, promoting cross-modal consistency.
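
A minimal sketch of CMSA for one attention layer, assuming per-modality query/key/value tensors in an illustrative [n_mod, B, heads, L, d_head] layout (not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def cross_modal_self_attention(q, k, v):
    """Queries stay modality-specific; keys/values of all modalities are shared."""
    n_mod, B, H, L, d = k.shape
    # k_shared = [k_1, ..., k_n], v_shared = [v_1, ..., v_n] along the token axis.
    k_shared = k.permute(1, 2, 0, 3, 4).reshape(B, H, n_mod * L, d)
    v_shared = v.permute(1, 2, 0, 3, 4).reshape(B, H, n_mod * L, d)

    outputs = []
    for i in range(n_mod):
        # Each modality's queries attend to the shared multimodal context.
        outputs.append(F.scaled_dot_product_attention(q[i], k_shared, v_shared))
    return torch.stack(outputs, dim=0)           # [n_mod, B, heads, L, d_head]

# Toy usage with three modalities (e.g., RGB, albedo, normal):
q = k = v = torch.randn(3, 1, 4, 16, 32)
out = cross_modal_self_attention(q, k, v)        # torch.Size([3, 1, 4, 16, 32])
```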

Model Instantiations & Training

The framework is instantiated in two domains:

  • UniVid-Intrinsic: Models RGB videos and intrinsic maps: Albedo AA, Irradiance II, and Normal NN.
  • UniVid-Alpha: Models blended RGB (BL), alpha matte (Alpha), foreground (FG), and background (BG) layers.

Training Details: Both models are built on the Wan2.1-T2V-14B backbone. LoRA rank is 32. They are trained with AdamW (lr=1e-4) on 4×H100 GPUs, processing 21-frame clips.

  • UniVid-Intrinsic Dataset: InteriorVid (924 synthetic indoor videos with ground-truth albedo, irradiance, normal).
  • UniVid-Alpha Dataset: VideoMatte240K (484 human-centric videos with alpha mattes).

Empirical Validation / Results

The models are evaluated on representative tasks. Quantitative results demonstrate state-of-the-art or competitive performance.

1. Text → X Generation

Text-to-Intrinsic (vs. IntrinsiX) & Text-to-RGBA (vs. LayerDiffuse): User studies (scale 1-10) and temporal flickering metrics show UniVidX models outperform image-based baselines.

Table 1: Quantitative comparison for text-to-intrinsic and text-to-RGBA generation.

| Task | Method | Temporal Flickering | User Study: Visual Quality | User Study: Text Alignment | User Study: Modality Consistency |
|---|---|---|---|---|---|
| Text-to-Intrinsic | IntrinsiX | - | RGB: 7.82, Alb: 8.44, Norm: 8.12 | 8.65 | 7.02 |
| Text-to-Intrinsic | Our UniVid-Intrinsic | 0.9876 | RGB: 9.34, Alb: 9.23, Norm: 9.17 | 9.04 | 9.29 |
| Text-to-RGBA | LayerDiffuse | - | BL: 9.12, FG: 8.91, BG: 8.41 | 8.89 | 8.61 |
| Text-to-RGBA | Our UniVid-Alpha | 0.9912 | BL: 9.30, FG: 9.12, BG: 9.25 | 9.04 | 9.35 |

Qualitatively, UniVid-Intrinsic produces temporally coherent videos with precise cross-modal alignment, while UniVid-Alpha generates high-quality dynamic RGBA videos from a single shared prompt, unlike LayerDiffuse, which requires a distinct prompt per layer.

2. Inverse & Forward Rendering (X → X)

UniVid-Intrinsic is compared against baselines like Diffusion Renderer and Ouroboros on the InteriorVid-Test benchmark.

Table 2: Quantitative comparison of inverse rendering and forward rendering.

| Methods | Albedo (PSNR↑ / LPIPS↓ / SSIM↑) | Irradiance (PSNR↑ / LPIPS↓ / SSIM↑) | Normal (MAE↓ / 11.25°↑) | Forward Rendering (PSNR↑ / LPIPS↓ / SSIM↑) |
|---|---|---|---|---|
| RGB↔X | 11.64 / 0.3324 / 0.6462 | 11.29 / 0.3734 / 0.7182 | 18.48 / 50.88 | 13.48 / 0.2728 / 0.6842 |
| Diffusion Renderer | 13.59 / 0.2624 / 0.6817 | - / - / - | 15.76 / 54.42 | 9.87 / 0.2920 / 0.6142 |
| Ouroboros | 14.21 / 0.2639 / 0.7063 | 9.73 / 0.4560 / 0.6460 | 14.52 / 57.58 | 13.15 / 0.2701 / 0.6700 |
| Our UniVid-Intrinsic | 16.89 / 0.2248 / 0.7812 | 13.46 / 0.3674 / 0.7895 | 11.09 / 70.52 | 15.31 / 0.2567 / 0.7031 |

UniVid-Intrinsic achieves the best performance across all metrics.

3. Albedo & Normal Estimation

Albedo on MAW Benchmark: UniVid-Intrinsic achieves the best intensity error (0.44) and competitive chromaticity error (3.60), demonstrating strong generalization from synthetic training data to real scenes.

Table 3: Albedo estimation results on MAW benchmark (lower is better).

| Methods | Intensity (×100) ↓ | Chromaticity ↓ |
|---|---|---|
| ... (various baselines) | ... | ... |
| Liang et al. | 0.46 | 3.53 |
| Sun et al. | 0.48 | 5.47 |
| Our UniVid-Intrinsic | 0.44 | 3.60 |

Normal on Sintel Benchmark: UniVid-Intrinsic achieves performance comparable to specialized models (e.g., NormalCrafter, Lotus) while using significantly less training data (19K vs. 860K frames).

Table 4: Normal estimation results on Sintel benchmark.

| Methods | Training Frames | Mean ↓ | 11.25° ↑ |
|---|---|---|---|
| DSINE | 160K | 34.9 | 21.5 |
| NormalCrafter | 860K | 30.7 | 23.5 |
| Lotus | 59K | 32.3 | 22.4 |
| Ours | 19K | 33.5 | 21.6 |
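
For reference, the Mean and 11.25° columns correspond to the standard surface-normal metrics: mean angular error in degrees and the percentage of pixels with error below 11.25°. A generic sketch of how these are typically computed (not the paper's evaluation code):

```python
import torch
import torch.nn.functional as F

def normal_metrics(pred, gt):
    """pred, gt: [..., 3] normal maps; returns (mean angular error in degrees,
    fraction of pixels within 11.25 degrees)."""
    cos = (F.normalize(pred, dim=-1) * F.normalize(gt, dim=-1)).sum(dim=-1)
    ang = torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))
    return ang.mean().item(), (ang < 11.25).float().mean().item()
```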

4. Video Matting (X → X)

UniVid-Alpha, as an auxiliary-free method, is compared against both auxiliary-free (AF) and mask-guided (MG) video matting methods.

Table 5: Quantitative comparison of video matting (lower is better for all metrics).

| Methods | MAD ↓ | MSE ↓ | Grad ↓ | dtSSD ↓ | Conn ↓ |
|---|---|---|---|---|---|
| AdaM (MG) | 4.80 | 0.76 | 2.15 | 1.45 | 0.30 |
| MatAnyone (MG) | 4.37 | 0.74 | 2.57 | 1.42 | 0.26 |
| RVM (AF) | 5.47 | 0.78 | 2.64 | 1.61 | 0.30 |
| MODNet (AF) | 10.11 | 4.80 | 5.53 | 2.44 | 0.81 |
| Our UniVid-Alpha (AF) | 4.24 | 0.69 | 1.86 | 1.39 | 0.52 |

UniVid-Alpha achieves state-of-the-art results on most metrics, outperforming even mask-guided methods despite using no auxiliary inputs.

Theoretical and Practical Implications

Ablation Studies validate the core designs:

  • Channel-Concatenation vs. Batch-Concatenation: Using channel-concatenation (common in other works) disrupts VDM priors and leads to structural collapse under limited data, while batch-concatenation preserves priors and enables data efficiency.
  • Decoupling in DGL: A shared-LoRA variant ('w/o Dec.') produces chaotic attention maps and fails at layer separation, proving the necessity of parameter decoupling for different modalities.
  • Gating in DGL: A 'w/o Gating' variant suffers from performance degradation (e.g., albedo PSNR drops by 1.87 dB), confirming that gating is essential to prevent LoRAs from interfering when a modality is a condition.
  • CMSA vs. Vanilla Attention: A 'w/ Van.' variant with standard self-attention suffers from cross-modal misalignment, while CMSA ensures consistency.

The Value of Multi-Condition: The framework's flexibility allows using auxiliary modalities (e.g., RGB + Albedo → Normal) as structural constraints to resolve perceptual ambiguity in ill-posed tasks like inverse rendering.
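
Under the SCM formulation, such task compositions reduce to choosing which latents stay clean (conditions) and which are sampled (targets) at inference. A minimal Euler-integration sketch, reusing the convention from the training sketch above; the model interface and modality names are placeholder assumptions:

```python
import torch

@torch.no_grad()
def sample_target(model, cond_latents, target_shape, text_emb, steps=30):
    """
    Multi-condition inference, e.g. RGB + Albedo -> Normal: condition latents
    stay clean while the target latent is integrated from noise (t = 0) to
    data (t = 1) along the predicted velocity field.
    """
    z_tgt = torch.randn(target_shape)                 # target starts as pure noise
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t, dt = ts[i], ts[i + 1] - ts[i]
        # Conditions (e.g. {"rgb": ..., "albedo": ...}) are passed in clean
        # at every step; only the target latent is updated.
        v = model(z_tgt, cond_latents, t, text_emb)   # predicted velocity
        z_tgt = z_tgt + dt * v                        # Euler step along v = x_T - eps
    return z_tgt

# Toy usage with a dummy velocity predictor:
dummy = lambda z, cond, t, c: torch.zeros_like(z)
normal_latent = sample_target(dummy, cond_latents={"rgb": None, "albedo": None},
                              target_shape=(1, 4, 5, 8, 8), text_emb=None)
```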

Downstream Applications: The versatile generation paradigm enables creative task compositions for practical graphics applications:

  • UniVid-Intrinsic: Video Relighting, Text-driven Video Retexturing, Material Editing.
  • UniVid-Alpha: Video Inpainting, Background Replacement, Foreground Replacement.

Conclusion

UniVidX presents a unified framework for versatile multimodal video generation by effectively leveraging VDM priors. Its core designs—Stochastic Condition Masking, Decoupled Gated LoRA, and Cross-Modal Self-Attention—enable flexible task support, preserve generative quality, and ensure cross-modal consistency. Instantiations as UniVid-Intrinsic and UniVid-Alpha demonstrate state-of-the-art performance across diverse tasks, exceptional data efficiency, and robust in-the-wild generalization. The work breaks the boundaries of isolated task-specific paradigms and provides a common recipe for aligned multimodal video modeling.

Limitations: 1) Intrinsic and alpha capabilities are currently separate due to the lack of jointly annotated data. 2) Computational constraints limit processing to ~4 modalities, 21 frames, at 480p. 3) Performance on specific physical corner cases (e.g., transparent surfaces) can be affected by biases in the small training datasets, although the VDM backbone's inherent priors partially mitigate them.