Video-CoE: Reinforcing Video Event Prediction via Chain of Events - Summary

Summary (Overview)

  • Proposes Chain of Events (CoE) Paradigm: A novel method that enhances Multimodal Large Language Models (MLLMs) for Video Event Prediction (VEP) by constructing fine-grained temporal event chains from videos to enforce visual grounding and logical reasoning.
  • Identifies Key MLLM Limitations: Through systematic evaluation, reveals that current MLLMs struggle with VEP due to 1) a lack of logical reasoning to connect observed video content to future events, and 2) insufficient utilization of visual information, over-relying on textual cues.
  • Introduces Two-Stage Training (CoE-SFT & CoE-GRPO): An efficient training approach that first teaches logical connection via supervised fine-tuning (CoE-SFT) and then reinforces fine-grained temporal modeling via a novel Group Relative Policy Optimization (CoE-GRPO) with tailored rewards.
  • Achieves State-of-the-Art Performance: The proposed method, built on Qwen2.5-VL, significantly outperforms leading open-source and commercial MLLMs (e.g., GPT-4o, Qwen3-VL) on public VEP benchmarks (FutureBench, AVEP), establishing new SOTA results.
  • Validates Effectiveness via Comprehensive Analysis: Demonstrates through attention analysis, ablation studies, and a novel open-set judge model evaluation that CoE successfully increases visual attention and enables logical reasoning for future event prediction.

Introduction and Theoretical Foundation

Video Event Prediction (VEP) is a challenging task that requires a model to anticipate plausible future events based on an observed video, which is crucial for applications like crisis early warning. While Multimodal Large Language Models (MLLMs) excel at many vision tasks, their performance on VEP remains underexplored and suboptimal. This paper first conducts a systematic evaluation of state-of-the-art MLLMs, identifying two core limitations:

  1. Lack of Logical Reasoning for Future Events: Models tend to analyze textual answer options superficially rather than establishing a causal-temporal link from the video evidence to a future outcome.
  2. Insufficient Utilization of Visual Information: Attention analysis reveals a strong bias towards textual tokens, with models allocating minimal attention to visual content, hindering fine-grained temporal modeling essential for forecasting.

Theoretical work in event modeling suggests that constructing Event Chains—temporal sequences of events—is effective for prediction. Building on this, the authors propose the Chain of Events (CoE) paradigm to address MLLMs' shortcomings. The paradigm formalizes the VEP process to first construct an event chain from the video and then reason jointly over the video and this chain to predict the future.

Methodology

The core methodology involves the CoE paradigm and a two-stage training protocol to implement it.

  1. Chain of Events (CoE) Paradigm: An event E is defined as a pair E = (T, D), where T denotes timestamps and D a textual description. A temporal event chain EC is a sequence EC = [E_1, E_2, ..., E_n]. The paradigm modifies the standard prediction process:

  • Vanilla: P = P(Ê | V, Q, R), where R = MLLM_reason(V, Q).
  • CoE: The model first constructs an event chain EC = MLLM_CoE(V), then reasons R' = MLLM_reason(V, Q, EC). The final prediction is P = P(Ê | V, Q, R', EC).
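The two paradigms above can be contrasted in a short sketch. The helper functions (mllm_coe, mllm_reason, mllm_predict) are hypothetical wrappers around MLLM calls, not the paper's actual API; they only illustrate the change in information flow.

```python
# Minimal sketch of vanilla vs. CoE inference, assuming hypothetical
# wrappers mllm_coe / mllm_reason / mllm_predict around MLLM calls.

def vanilla_predict(video, question, options, mllm_reason, mllm_predict):
    """Vanilla paradigm: reason directly from the video and question."""
    reasoning = mllm_reason(video, question)               # R = MLLM_reason(V, Q)
    return mllm_predict(video, question, reasoning, options)

def coe_predict(video, question, options, mllm_coe, mllm_reason, mllm_predict):
    """CoE paradigm: construct an event chain first, then reason over it."""
    event_chain = mllm_coe(video)                          # EC = MLLM_CoE(V)
    reasoning = mllm_reason(video, question, event_chain)  # R' = MLLM_reason(V, Q, EC)
    return mllm_predict(video, question, reasoning, options, event_chain)
```

The key design choice is that the event chain is both an extra input to reasoning and an extra conditioning signal for the final prediction, forcing the model to ground its answer in what it actually observed.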

2. Two-Stage Training:

  • CoE-SFT (Supervised Fine-Tuning): Focuses on building logical reasoning. A powerful MLLM (Qwen2.5-VL-72B) is prompted to generate the reasoning process that connects a given video and its correct future event, avoiding option analysis. This small, high-quality dataset is used for fine-tuning to instill logical reasoning capabilities.
  • CoE-GRPO (Group Relative Policy Optimization): An enhanced GRPO framework designed to unlock temporal localization and enforce event chain construction. The model learns to output each event within special tags: E = <event>Time: t_start - t_end, Des: D</event>. It uses a composite reward function for policy optimization: r_i = α r_a^(i) + β r_e^(i) + (1 - α - β) r_s^(i)
    • Accuracy Reward (r_a): 1 if the final answer is correct, else 0.
    • CoE Reward (r_e): Encourages proper event tag structure and controls chain length: r_e^(i) = λ I(o_i) + (1 - λ)[L - |len(o_i) - L| + b]
    • Similarity Reward (r_s): Ensures alignment between event descriptions and video clips by computing cross-modal similarity: r_s = (1/n) Σ_{j=1}^{n} s_j, where s_j = cos(v_j, t_j)
    The policy is updated using the advantage A_i computed from group-normalized rewards, following the GRPO objective.
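The reward machinery above can be made concrete in a short sketch. The tag format and the shape of each reward term follow the summary, but the helper names, the default coefficients (α, β, λ, L, b), and the fallback behavior for a zero-variance group are assumptions; the paper's implementation may differ (e.g., it uses VideoCLIP-XL embeddings for the similarity term rather than toy vectors).

```python
import math
import re

# Event tag format assumed from the summary: <event>Time: t_start - t_end, Des: D </event>
EVENT_RE = re.compile(r"<event>Time:\s*([\d.]+)\s*-\s*([\d.]+),\s*Des:\s*(.+?)\s*</event>")

def parse_events(output: str):
    """Extract (t_start, t_end, description) triples from a model rollout."""
    return [(float(a), float(b), d) for a, b, d in EVENT_RE.findall(output)]

def coe_reward(output: str, L: int = 3, lam: float = 0.5, b: float = 0.0):
    """r_e = lam * I(o) + (1 - lam) * [L - |len(o) - L| + b].

    I(o) checks that event tags parse correctly; the length term peaks
    when the chain has exactly L events (coefficients are assumptions).
    """
    events = parse_events(output)
    well_formed = 1.0 if events else 0.0
    length_term = L - abs(len(events) - L) + b
    return lam * well_formed + (1 - lam) * length_term

def similarity_reward(clip_embs, text_embs):
    """r_s = mean cosine similarity between clip and description embeddings."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))
    sims = [cos(v, t) for v, t in zip(clip_embs, text_embs)]
    return sum(sims) / len(sims)

def composite_reward(r_a, r_e, r_s, alpha=0.5, beta=0.3):
    """r = alpha * r_a + beta * r_e + (1 - alpha - beta) * r_s."""
    return alpha * r_a + beta * r_e + (1 - alpha - beta) * r_s

def group_advantages(rewards):
    """GRPO-style group normalization: A_i = (r_i - mean) / std."""
    mu = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mu) / std for r in rewards]
```

Normalizing rewards within each sampled group (rather than against a learned value baseline) is what makes the composite reward cheap to optimize: only relative quality within a group matters.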

Empirical Validation / Results

The method is evaluated on Qwen2.5-VL-3B/7B models using the FutureBench and AVEP benchmarks.

Main Results:

  • FutureBench (Table 1): CoE-GRPO-7B achieves an overall average (AVG) accuracy of 75.00%, significantly outperforming the base Instruct model (52.94%), vanilla SFT (64.39%), and vanilla GRPO (67.28%). It also surpasses all other open-source and commercial MLLMs, including Qwen3-VL-30B (66.86%) and GPT-4o (59.04%).
  • AVEP (Table 2): CoE-GRPO-7B achieves the highest F1-Score for Action prediction (8.29 Test / 9.88 Val) and competitive scores for Noun and Verb components, demonstrating comprehensive improvement.

Key Findings:

  • Enhanced Visual Attention (Table 3, Figure 4): CoE-SFT and CoE-GRPO show a substantial increase in attention to visual tokens (Improvement Rate of +15.11% and +9.20%, respectively), whereas standard SFT reduces visual attention.
  • Superior Reasoning in Open-Set Evaluation (Table 4): In a judge model evaluation simulating open-set scenarios, CoE-SFT achieved the highest win rate (38.13%), indicating its reasoning is more logical and accurate.
  • Ablation Studies (Table 5):
    • Common visual attention enhancement methods (Prompt-guided, Constant-Bias) degraded performance.
    • The optimal group size G for CoE-GRPO is 4, balancing cost and performance.
    • An intermediate event chain length L = 3 works best; chains that are too short or too long harm performance.
    • The similarity reward r_s is crucial, as removing it leads to a performance drop (~3% on AVG).
    • Video-text similarity models (VideoCLIP-XL) work best for computing r_s, but CLIP is also effective.

Theoretical and Practical Implications

Theoretical Implications:

  • Provides a formal, structured paradigm (CoE) for integrating temporal event modeling into MLLM reasoning, moving beyond frame-level or option-based analysis.
  • Demonstrates that targeted reinforcement learning with dense, multi-component rewards (accuracy, structure, cross-modal alignment) can effectively teach MLLMs complex, structured output behaviors like event chain construction.

Practical Implications:

  • Offers an efficient pathway to significantly boost MLLM performance on the practically valuable VEP task without requiring large-scale annotated datasets or full model retraining.
  • The proposed two-stage training (SFT+GRPO) and reward design can serve as a blueprint for enhancing MLLMs on other tasks requiring temporal reasoning and fine-grained visual grounding.
  • Establishes comprehensive baselines and a rigorous evaluation framework (including judge-based open-set assessment) to propel future VEP research.

Conclusion

This work addresses the underexplored challenge of Video Event Prediction with MLLMs. By diagnosing key failure modes—poor logical reasoning and visual under-utilization—the authors introduce the Chain of Events (CoE) paradigm. The accompanying two-stage CoE-SFT and CoE-GRPO training protocol efficiently teaches models to perform fine-grained temporal modeling and reason from visual evidence to future events. Extensive experiments confirm the method's effectiveness, achieving state-of-the-art results across benchmarks and providing a solid foundation for future research in video-based future reasoning.