CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Summary (Overview)

Core Contribution: Presents CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into two stages: creative intent cognition via a specialized Vision-Language Model (CogVLM) and intent-to-video synthesis via a unified diffusion transformer (CogOmniDiT).
Key Innovation 1: Specialized Reasoning Model (CogVLM). A VLM fine-tuned on professional animation production data to accurately interpret sparse, abstract, or conflicting multimodal conditions (e.g., storyboards, clay renders) and output dense, professional reasoning about the creative intent.
Key Innovation 2: Unified Generation with RL Alignment (CogOmniDiT). A video DiT that unifies diverse control signals and high-level semantic features from CogVLM via in-context learning, and is further aligned to the reasoning outputs through Reinforcement Fine-Tuning (RFT).
Closed-Loop System: Extends the framework into a "harness-like" architecture where CogVLM not only guides generation but also plans specific evaluators for adaptive Best-of-N selection, forming a Reasoning-Generation-Verification loop.
New Benchmarks: Introduces CogReasonBench (for VLM reasoning evaluation) and CogControlBench (for video generation evaluation), built from real-world professional workflow data carrying genuine creative intent, not simulated conditions.

Introduction and Theoretical Foundation

Recent diffusion-based video generative models have achieved strong photorealism and motion fluency. The field is moving towards omni-level controllable generation, aiming for a single system that supports multimodal inputs, professional intent conditions, and abstract constraints. Existing approaches either inject conditions through adapters or couple a generic Vision-Language Model (VLM) with a diffusion backbone. However, these methods face significant challenges:

Cognitive Gap: Generic VLMs struggle to fully comprehend the underlying creative intent from complex, abstract, or conflicting multimodal control signals common in professional workflows (e.g., storyboard sketches, clay renders). They fail to formulate generation plans grounded in domain-specific creative knowledge.
Alignment Gap: It is unclear whether VLM outputs under abstract conditions are properly aligned with the generated videos. Using reasoning from a generic VLM can introduce noise into the generation process.

The paper posits that the core challenge is bridging the gap between abstract pixel-level conditions and high-level creative intent. The proposed solution, CogOmniControl, explicitly factorizes the generation process into reasoning and generation components, formalized as:

P(V | C) = \underbrace{P(V | R, C)}_{\text{Generation}} \cdot \underbrace{P(R | V_{ctrl}, I_{ref}, T_{desc})}_{\text{Reasoning}}

where $V$ is the generated video, $C = \{ V_{ctrl}, I_{ref}, T_{desc} \}$ is the multimodal condition set (control video, reference image, text description), and $R$ is the reasoning output from CogVLM.
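
Read literally, the product on the right-hand side is the joint distribution over $V$ and $R$. The following spells out one way to reconcile it with $P(V | C)$, assuming (an interpretation, not something stated above) that a single reasoning trace $\hat{R}$ is sampled from CogVLM and then treated as fixed during generation:

```latex
\begin{aligned}
P(V, R \mid C) &= P(V \mid R, C)\, P(R \mid C), \qquad P(R \mid C) = P(R \mid V_{ctrl}, I_{ref}, T_{desc}) \\
P(V \mid C) &= \sum_{R} P(V \mid R, C)\, P(R \mid C) \;\approx\; P(V \mid \hat{R}, C), \qquad \hat{R} \sim P(R \mid C)
\end{aligned}
```

Under this reading, the quality of the sampled $\hat{R}$ directly bounds what the generation stage can recover, which is consistent with the emphasis on a specialized reasoning model.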

Methodology

The CogOmniControl framework consists of two core modules: CogVLM for intent cognition and CogOmniDiT for video synthesis.

1. CogVLM: Cognizing Creative Intent from Multimodal Conditions

CogVLM acts as a "professional director," interpreting multimodal drafts to formulate explicit production schemes. It is trained via a two-stage process:

Supervised Fine-Tuning (SFT): The base VLM is fine-tuned with LoRA on professional workflow data (storyboards and clay renders paired with the final videos) to learn domain-specific reasoning.
Reinforcement Fine-Tuning (RFT): Optimized using a reward function that combines:
- Holistic Reward ( $R_{holistic}$ ): Assesses alignment across four critical dimensions (Creative Intent, Physical Plausibility, Information Integrity, Motion Description) using a judge VLM.
- Accuracy Reward ( $R_{acc}$ ): Ensures factual grounding by verifying the reasoning output against binary questions derived from the input conditions.

The reward functions are defined as:

R_{holistic} = \sum_{k \in \mathcal{K}} w_k \cdot \text{VLM}_k(R, C), \quad \mathcal{K} = \{\text{intent, phys, info, dyn}\}

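R_{acc} = \frac{1}{N} \sum_{i=1}^{N} \text{VLM}(R, q_i)

where $w_k$ is the weight of dimension $k$, $\text{VLM}_k(R, C)$ is the judge VLM's score for that dimension, $q_i$ is the $i$-th of $N$ binary questions derived from the input conditions, and $\text{VLM}(R, q_i)$ is the judge's verdict on whether $R$ answers $q_i$ correctly.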
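
The two reward formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the judge and verifier callables, the stub scores, the equal weights, and the keyword-matching verifier are all hypothetical stand-ins for real judge-VLM calls, and how the two rewards are combined is left out because it is not specified here.

```python
from typing import Callable, Dict, List

# Hypothetical judge interface: (reasoning, conditions) -> score for one dimension.
Judge = Callable[[str, dict], float]


def holistic_reward(reasoning: str, conditions: dict,
                    judges: Dict[str, Judge], weights: Dict[str, float]) -> float:
    """R_holistic: weighted sum of per-dimension judge scores."""
    return sum(weights[k] * judges[k](reasoning, conditions) for k in judges)


def accuracy_reward(reasoning: str, questions: List[str],
                    verifier: Callable[[str, str], bool]) -> float:
    """R_acc: fraction of binary questions the reasoning passes."""
    if not questions:
        return 0.0
    passed = sum(1 for q in questions if verifier(reasoning, q))
    return passed / len(questions)


# Stub judges returning fixed scores (assumed values, for illustration only).
stub_scores = {"intent": 4.0, "phys": 5.0, "info": 3.0, "dyn": 4.0}
judges = {k: (lambda r, c, s=s: s) for k, s in stub_scores.items()}
weights = {k: 0.25 for k in stub_scores}  # equal weights are an assumption

reasoning = "camera pans left; character waves"
r_hol = holistic_reward(reasoning, {}, judges, weights)

# Toy verifier: a question "passes" if its key phrase appears in the reasoning.
questions = ["pans left", "waves", "zooms in"]
r_acc = accuracy_reward(reasoning, questions, lambda r, q: q in r)
print(r_hol, r_acc)
```

In a real setup each judge and the verifier would wrap a VLM call; the sketch only fixes the aggregation arithmetic.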

2. CogOmniDiT: Unified Video Diffusion Transformer

CogOmniDiT is based on a DiT backbone and unifies heterogeneous conditions and noisy latents into a single sequence for in-context learning. The input sequence is constructed as:

\text{Input Sequence} = \text{Concat}(Z_t, Z_{ref}, Z_{ctrl}, \text{Emb}_{VLM})

where $Z_t$ , $Z_{ref}$ , and $Z_{ctrl}$ are the noisy latent, reference image latent, and control video latent, respectively. $\text{Emb}_{VLM}$ is the embedding from CogVLM. CogOmniDiT is also trained with RFT using a visual reward ( $R_{visual}$ ) to enforce adherence to both pixel-level conditions and high-level reasoning:

R_{visual} = \sum_{m \in \mathcal{M}} w_m \cdot \text{VLM}_m(V, R, C), \quad \mathcal{M} = \{\text{condition following, video quality}\}
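
The input-sequence construction described in this subsection can be illustrated with plain Python lists. The sketch assumes that every segment has already been projected to a shared hidden size and that the concatenation runs along the token axis; neither detail is stated above, and the function name, token counts, and fill values are hypothetical.

```python
from typing import List, Sequence

Token = List[float]  # one token embedding of length `hidden_size`


def build_input_sequence(z_t: Sequence[Token], z_ref: Sequence[Token],
                         z_ctrl: Sequence[Token], emb_vlm: Sequence[Token]) -> List[Token]:
    """Concatenate the four segments along the token axis, in formula order."""
    segments = [z_t, z_ref, z_ctrl, emb_vlm]
    sizes = {len(tok) for seg in segments for tok in seg}
    if len(sizes) > 1:
        raise ValueError(f"segments disagree on hidden size: {sorted(sizes)}")
    return [list(tok) for seg in segments for tok in seg]


hidden_size = 4
z_t = [[0.0] * hidden_size for _ in range(6)]      # noisy video latent tokens
z_ref = [[1.0] * hidden_size for _ in range(2)]    # reference image latent tokens
z_ctrl = [[2.0] * hidden_size for _ in range(6)]   # control video latent tokens
emb_vlm = [[3.0] * hidden_size for _ in range(3)]  # CogVLM embedding tokens

sequence = build_input_sequence(z_t, z_ref, z_ctrl, emb_vlm)
print(len(sequence), len(sequence[0]))  # 17 tokens, each of size 4
```

Because all conditions share one sequence, the transformer's self-attention can relate noisy latents to any condition token without a separate adapter branch, which is the point of the in-context formulation.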

3. Closed-Loop Verification with Evaluator Harness

Moving beyond static Best-of-N selection, CogOmniControl enables adaptive evaluator selection. In a single forward pass, CogVLM outputs both the reasoning $R$ and a harness $H$ specifying which evaluators to use:

(R, H) \sim \pi_{\text{CogVLM}}(\cdot|C)

Multiple videos $\{V_1, V_2, ..., V_n\}$ are generated, and the best one is selected by maximizing the score from the adaptively chosen evaluators:

V^* = \arg \max_{V_i \in \{V_1, V_2,..., V_n\}} S(V_i; H)

The evaluators are selected from a predefined library (e.g., Artifact Detector, Storyboard Annotation Evaluator, Cross-modal Causality Evaluator) based on the input conditions.
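
The selection rule $V^* = \arg\max S(V_i; H)$ can be sketched as follows. This is an illustration under stated assumptions: the evaluator names, the stub scores, and the choice to define $S$ as the mean of the selected evaluators' scores are hypothetical, since the aggregation used for $S$ is not given above.

```python
from typing import Callable, Dict, Sequence

Evaluator = Callable[[str], float]  # maps a candidate video (here just an ID) to a score


def harness_score(video: str, harness: Sequence[str],
                  library: Dict[str, Evaluator]) -> float:
    """S(V; H): mean score over the evaluators named in the harness (assumed aggregation)."""
    if not harness:
        raise ValueError("harness must name at least one evaluator")
    return sum(library[name](video) for name in harness) / len(harness)


def select_best(videos: Sequence[str], harness: Sequence[str],
                library: Dict[str, Evaluator]) -> str:
    """Return the candidate with the highest harness score."""
    return max(videos, key=lambda v: harness_score(v, harness, library))


# Stub evaluator library with fixed scores per candidate (illustrative values only).
artifact = {"v1": 0.9, "v2": 0.6, "v3": 0.8}
storyboard = {"v1": 0.5, "v2": 0.9, "v3": 0.8}
causality = {"v1": 0.2, "v2": 0.3, "v3": 0.1}
library: Dict[str, Evaluator] = {
    "artifact_detector": artifact.__getitem__,
    "storyboard_annotation": storyboard.__getitem__,
    "cross_modal_causality": causality.__getitem__,
}

videos = ["v1", "v2", "v3"]
harness = ["artifact_detector", "storyboard_annotation"]  # would come from CogVLM
best = select_best(videos, harness, library)
print(best)
```

With the stub scores, the two-evaluator harness picks `v3`, while a harness containing only `artifact_detector` would pick `v1`, which shows how the chosen evaluators change the winner for the same candidate set.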

Empirical Validation / Results

Experiments were conducted using Qwen3-VL-8B-Thinking as the base VLM and Wan2.2-T2V-14B as the base DiT. Evaluation was performed on the newly introduced benchmarks.

1. CogVLM Performance on CogReasonBench

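CogVLM (RFT) outperforms both generic Qwen3-VL-8B baselines on every dimension. The largest gain is on MM Intent (3.985 versus 2.670 for Qwen3-VL-8B-Thinking), and the average rises from 3.752 to 4.473.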

Table 2: The results of CogVLM on CogReasonBench.

Models	MM Intent	Physics	Integrity	Motion	Avg
Qwen3-VL-8B-Instruct	2.480	4.045	3.905	4.420	3.712
Qwen3-VL-8B-Thinking	2.670	3.824	3.829	4.727	3.752
CogVLM (SFT)	3.725	4.445	4.266	4.955	4.343
CogVLM (RFT)	3.985	4.449	4.599	4.959	4.473

2. Video Generation Performance on CogControlBench

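CogOmniControl achieves the highest overall average among open-source models (0.727, versus 0.686 for the next best, VINO) and exceeds the proprietary Kling-3 O (0.704). Adaptive Best-of-N selection (Harness BoN) raises the score to 0.742, compared with 0.733 for standard BoN, narrowing the gap to Seedance2.0 (0.750).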

Table 3: The comparison on CogControlBench. (Abbreviated; AQ=Aesthetic Quality, IQ=Image Quality, TF=Temporal Flickering, MS=Motion Smoothness, DD=Dynamic Degree, MI=Multimodal Intent, AF=Appearance Follow, SF=Style Follow, CF=Content Follow, DF=Dynamic Follow, MN=Motion Naturalness, IC=Identity Consistency, DP=Dynamic Plausibility)

Models	Specialist Metrics Avg	VLM-as-a-Judge Metrics Avg	Overall Avg
Proprietary Models
Seedance2.0	0.746	0.753	0.750
Kling-3 O	0.738	0.669	0.704
Open-Source Models
VINO	0.680	0.692	0.686
VACE-Wan2.1	0.735	0.595	0.665
OmniWeaving	0.683	0.531	0.607
CogOmniControl	0.738	0.716	0.727
CogOmniControl (BoN)	0.742	0.724	0.733
CogOmniControl (Harness BoN)	0.743	0.741	0.742
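
As a quick arithmetic check on Table 3, the Overall Avg column appears to be the unweighted mean of the two metric-group averages. The short script below verifies this to within rounding and computes the remaining gap to the best proprietary model. The numbers are copied from the table; the variable and function names are illustrative.

```python
# (first metric-group avg, VLM-as-a-Judge avg, reported overall avg), from Table 3
rows = {
    "Seedance2.0": (0.746, 0.753, 0.750),
    "Kling-3 O": (0.738, 0.669, 0.704),
    "VINO": (0.680, 0.692, 0.686),
    "VACE-Wan2.1": (0.735, 0.595, 0.665),
    "OmniWeaving": (0.683, 0.531, 0.607),
    "CogOmniControl": (0.738, 0.716, 0.727),
    "CogOmniControl (BoN)": (0.742, 0.724, 0.733),
    "CogOmniControl (Harness BoN)": (0.743, 0.741, 0.742),
}


def max_deviation(table: dict) -> float:
    """Largest absolute difference between the mean of the two averages and the reported overall."""
    return max(abs((a + b) / 2 - overall) for a, b, overall in table.values())


deviation = max_deviation(rows)
gap = rows["Seedance2.0"][2] - rows["CogOmniControl (Harness BoN)"][2]
print(f"max deviation: {deviation:.4f}, gap to Seedance2.0: {gap:.3f}")
```

Every row agrees with the simple mean to within three-decimal rounding, and the Harness BoN variant trails Seedance2.0 by 0.008 overall.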

3. Ablation Studies

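The ablation supports the importance of specialized VLM reasoning: on Multimodal Intent, replacing Qwen3-VL-8B-Thinking with CogVLM (SFT) raises the score from 3.142 to 3.397, and CogVLM (RFT) raises it further to 3.586. Adding RFT to CogOmniDiT changes MI only marginally (3.588); its effect on other metrics is not shown in this table.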

Table 4: Ablation studies of CogOmniControl on CogControlBench.

Models	Multimodal Intent (MI)
Qwen3-VL-8B-Thinking + CogOmniDiT(SFT)	3.142
CogVLM(SFT) + CogOmniDiT(SFT)	3.397
CogVLM(RFT) + CogOmniDiT(SFT)	3.586
CogVLM(RFT) + CogOmniDiT(RFT)	3.588

4. Qualitative Results

Visual comparisons (Figs. 4 & 5 in the paper) show that:

Adapter-based methods (e.g., VACE) produce significant artifacts and semantic misalignment when given sparse controls like clay renders.
Models with generic VLMs (e.g., VINO, OmniWeaving) often misunderstand intent, produce wrong visual effects, or generate nearly static outputs.
CogOmniControl successfully generates high-quality videos that correctly follow the creative intent from abstract conditions while maintaining visual consistency and smooth motion.

Theoretical and Practical Implications

Theoretical: Proposes a novel factorization of the controllable video generation problem into explicit reasoning and generation stages. It demonstrates the necessity of domain-specific reasoning models over generic VLMs for professional creative tasks and shows how reinforcement learning can effectively align diffusion models with high-level semantic guidance.
Practical: Provides a robust open-source framework for professional video production workflows (e.g., animation, storyboard-to-video). The closed-loop evaluator harness introduces an efficient, adaptive method for quality control. The release of CogReasonBench and CogControlBench fills a gap by providing evaluation data rooted in real creative intent, not synthetic simulations.

Conclusion

CogOmniControl bridges the gap between abstract conditions and faithful video generation by introducing a reasoning-driven framework. It employs a specialized CogVLM to cognize creative intent from multimodal inputs and a unified CogOmniDiT, aligned via RL, to synthesize intent-aligned videos. The extension into a closed-loop system with adaptive evaluator harnesses further enhances performance. Evaluated on new benchmarks built from professional data, CogOmniControl outperforms existing open-source models and narrows the performance gap with leading proprietary systems. This work paves the way for more intelligent, intent-aware video generation tools for professional creators.