DA-Flow: Degradation-Aware Optical Flow Estimation with Diffusion Models
Summary (Overview)
- Introduces the new task of Degradation-Aware Optical Flow, which aims to estimate accurate dense motion fields from severely corrupted videos (e.g., with blur, noise, compression).
- Proposes lifting a pretrained image restoration diffusion model (DiT4SR) by injecting full spatio-temporal attention across frames, enabling it to produce degradation-aware and temporally-aware features that exhibit strong zero-shot correspondence capabilities.
- Presents DA-Flow, a hybrid optical flow network built on RAFT that fuses upsampled diffusion features from the lifted model with conventional CNN encoder features within an iterative refinement framework.
- Demonstrates that DA-Flow substantially outperforms existing optical flow methods (RAFT, SEA-RAFT, FlowSeek) on degraded versions of Sintel, Spring, and TartanAir benchmarks.
Introduction and Theoretical Foundation
Optical flow estimation is a fundamental dense correspondence problem, but real-world videos are often corrupted by motion blur, sensor noise, and compression artifacts. While recent studies like RobustSpring have benchmarked the robustness of flow models, a fundamental question remains: is accurate flow estimation from corrupted inputs possible? This task is ill-posed because degradations destroy fine textures and motion boundaries.
The authors shift focus from robustness to accuracy by formulating the new task of Degradation-Aware Optical Flow. The key insight is that intermediate features of diffusion models trained for image restoration are inherently corruption-aware, as they must learn to recover clean structures from degraded inputs. These features encode degradation patterns while preserving underlying scene geometry, offering a generative prior to reason beyond corrupted observations. However, they lack temporal awareness. Video restoration diffusion models are not suitable as they compress frames into a shared latent, losing the per-frame spatial structure needed for dense matching.
Therefore, the proposed solution is to start from a pretrained image restoration diffusion model and lift it to handle multiple frames via cross-frame attention, maintaining independent spatial latents for each frame while enabling temporal interaction.
Methodology
1. Problem Formulation
Given a low-quality (LQ) video $V^{LQ}$ and its corresponding high-quality (HQ) version $V^{HQ}$, the goal is to learn a model $\mathcal{M}$ that estimates flow from consecutive LQ frames:
$$\hat{f}_t = \mathcal{M}(I_t^{LQ}, I_{t+1}^{LQ}) \approx f_t,$$
where $f_t$ is the ground-truth flow. The focus is on building a degradation-aware feature encoder.
2. Lifting the Image Restoration Diffusion Model
The backbone is a Multi-Modal Diffusion Transformer (MM-DiT) for image restoration. To enable temporal reasoning, the model is extended with full spatio-temporal attention.
- Original Processing: Frames are folded into the batch axis, and MM-Attention is applied independently per frame. Modality-specific (HQ, LQ, Text) projections produce per-frame queries, keys, and values for each modality stream.
- Lifting: Each modality stream is reshaped from $(B \cdot T, N, C)$ to $(B, T \cdot N, C)$, concatenating all spatial tokens across the $T$ frames. This yields spatio-temporal queries, keys, and values $\tilde{Q}, \tilde{K}, \tilde{V}$ spanning the whole clip.
- Full Spatio-Temporal MM-Attention: Each token now attends to all spatial tokens across all frames and modalities. The lifted model is then fine-tuned on the YouHQ dataset.
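The lifting step above can be sketched in a few lines. This is an illustrative reconstruction with assumed tensor shapes, and a plain scaled dot-product attention stands in for the full MM-Attention; the function names are mine, not the authors'.

```python
import torch

# Illustrative sketch of the lifting step (assumed shapes). Per-frame tokens
# of one modality stream arrive folded into the batch axis as (B*T, N, C);
# lifting concatenates all spatial tokens across the T frames so that
# attention spans the whole clip.

def lift_modality_stream(x: torch.Tensor, num_frames: int) -> torch.Tensor:
    """Reshape (B*T, N, C) -> (B, T*N, C)."""
    bt, n, c = x.shape
    b = bt // num_frames
    return x.view(b, num_frames, n, c).reshape(b, num_frames * n, c)

def full_spatiotemporal_attention(q, k, v):
    """Plain scaled dot-product attention over the concatenated tokens."""
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v

tokens = torch.randn(2, 64, 32)              # (B*T, N, C) with B=1, T=2
lifted = lift_modality_stream(tokens, num_frames=2)
out = full_spatiotemporal_attention(lifted, lifted, lifted)
print(lifted.shape, out.shape)               # both torch.Size([1, 128, 32])
```

Because every token attends to every spatial token of every frame, each position can aggregate evidence from the other frame, which is what gives the features their cross-frame correspondence signal.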
3. Diffusion Feature Analysis
To select the best layers for flow estimation, the zero-shot geometric correspondence of the lifted model's features is analyzed. Features are extracted from the full spatio-temporal attention layers of the HQ diffusion branch: the query feature $q^t$ from frame $t$ and the key feature $k^{t+1}$ from frame $t+1$.
A cost volume is constructed via pairwise dot-product similarity:
$$C(i, j) = \langle q^t_i, k^{t+1}_j \rangle.$$
Flow is obtained via an argmax over the cost volume and upsampled to the input resolution. Pseudo ground-truth flow, generated by applying a pretrained flow model to the HQ frames, is used for evaluation.
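The correspondence probe can be sketched as follows. The token layout and argmax matching are my assumptions (argmax matching is the standard zero-shot probe), not necessarily the authors' exact procedure:

```python
import torch

# Illustrative sketch: query tokens q from frame t and key tokens k from
# frame t+1 are compared by dot product; the best match per query yields a
# coarse flow in token units.

def argmax_flow(q: torch.Tensor, k: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """q, k: (H*W, C) token features. Returns flow of shape (H, W, 2)."""
    cost = q @ k.t()                               # (H*W, H*W) cost volume
    match = cost.argmax(dim=1)                     # best key index per query
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    src = torch.stack([xs, ys], dim=-1).reshape(-1, 2)      # query coordinates
    dst = torch.stack([match % w, match // w], dim=-1)      # matched coordinates
    return (dst - src).reshape(h, w, 2).float()

torch.manual_seed(0)
h = w = 8
q = torch.nn.functional.normalize(torch.randn(h * w, 16), dim=1)
flow = argmax_flow(q, q, h, w)   # unit-norm features best match themselves
print(flow.abs().max())          # tensor(0.)
```

Matching a feature map against itself gives zero flow, a quick sanity check; real use compares features of two consecutive frames.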
Results (Fig. 3): The lifted model's features achieve consistently lower End-Point Error (EPE) than a baseline (unfine-tuned) version across layers and remain stable across denoising steps, whereas baseline features are highly sensitive to the extraction timestep.
4. DA-Flow Architecture
DA-Flow is built on RAFT, retaining its correlation and iterative update operators, but incorporating the lifted diffusion model $\mathcal{D}$ alongside a conventional CNN encoder. The pipeline extracts and upsamples diffusion features from both input frames:
$$F^{\text{diff}} = \mathcal{U}(\mathcal{D}(I^{LQ}_t, I^{LQ}_{t+1})),$$
where $\mathcal{U}$ denotes a learnable upsampling stage.
- Feature Upsampling: Diffusion features from the top-$k$ layers (selected via the correspondence analysis) are aggregated and passed through separate DPT-based upsampling heads to recover higher-resolution features matching the CNN encoder's output scale: $F^{\text{diff}} = \mathcal{U}(\{F_l\}_{l=1}^{k})$.
- Hybrid Feature Encoding: To compensate for the diffusion features' lack of fine-grained spatial detail, CNN features $F^{\text{cnn}}$ from the RAFT encoder are concatenated with the upsampled diffusion features: $F = [\,F^{\text{cnn}};\, F^{\text{diff}}\,]$.
- Loss Function: Since ground truth for real-world degraded videos is unavailable, pseudo ground-truth flow $f_{gt}$ is generated from the HQ frames. The model is trained with a multi-scale flow loss over the refinement iterations: $\mathcal{L} = \sum_{i=1}^{N} \gamma^{N-i} \, \| f_{gt} - f_i \|_1$, where $\gamma$ is a weight decay factor, $f_i$ the prediction at iteration $i$, and $N$ the number of refinement iterations.
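The hybrid encoding and the iteration-weighted loss can be sketched under assumed shapes. A plain bilinear resize stands in for the DPT upsampling heads here, and the loss follows the standard RAFT-style sequence loss:

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: diffusion features are resized (standing in for the
# DPT heads) and concatenated with CNN features; training weights later
# refinement iterations more heavily via an exponential schedule.

def hybrid_features(f_cnn: torch.Tensor, f_diff: torch.Tensor) -> torch.Tensor:
    """Concatenate (B, C1, H, W) CNN features with (B, C2, h, w) diffusion
    features after resizing the latter to the CNN resolution."""
    f_diff = F.interpolate(f_diff, size=f_cnn.shape[-2:],
                           mode="bilinear", align_corners=False)
    return torch.cat([f_cnn, f_diff], dim=1)

def sequence_loss(flow_preds, flow_gt, gamma: float = 0.8):
    """L = sum_i gamma^(N-i) * |f_gt - f_i|_1 over N refinement iterations."""
    n = len(flow_preds)
    return sum(gamma ** (n - 1 - i) * (pred - flow_gt).abs().mean()
               for i, pred in enumerate(flow_preds))

feat = hybrid_features(torch.randn(1, 128, 32, 32), torch.randn(1, 64, 8, 8))
print(feat.shape)                 # torch.Size([1, 192, 32, 32])

gt = torch.zeros(1, 2, 32, 32)
preds = [gt + 1.0, gt + 0.5, gt]  # errors shrink over iterations
print(round(sequence_loss(preds, gt).item(), 2))   # 1.04
```

The exponential weighting (0.64, 0.8, 1.0 for three iterations at $\gamma = 0.8$) keeps early, coarse predictions supervised while letting the final refinement dominate the gradient.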
Empirical Validation / Results
Experimental Setup
- Training: Two-stage training on the YouHQ dataset. First, the lifted diffusion model is fine-tuned. Second, with the diffusion model frozen, the flow network is trained. Pseudo ground-truth is generated by running SEA-RAFT on the HQ frames; LQ inputs are generated with the RealBasicVSR/Real-ESRGAN degradation pipeline.
- Evaluation: On degraded versions of Sintel, Spring, and TartanAir benchmarks. Metrics: End-Point Error (EPE) and outlier rates at 1px, 3px, 5px thresholds.
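Both metrics follow standard definitions and can be computed as below. This is an illustrative sketch, not the benchmarks' official evaluation code:

```python
import numpy as np

# EPE is the mean Euclidean distance between predicted and ground-truth
# flow vectors; the Kpx outlier rate is the percentage of pixels whose
# endpoint error exceeds K pixels.

def epe_and_outliers(pred: np.ndarray, gt: np.ndarray, thresholds=(1, 3, 5)):
    """pred, gt: (H, W, 2) flow fields."""
    err = np.linalg.norm(pred - gt, axis=-1)          # per-pixel endpoint error
    outliers = {f"{t}px": 100.0 * (err > t).mean() for t in thresholds}
    return err.mean(), outliers

gt = np.zeros((4, 4, 2))
pred = gt.copy()
pred[0, 0] = [3.0, 4.0]           # one pixel with a 5px endpoint error
epe, out = epe_and_outliers(pred, gt)
print(round(float(epe), 4))       # 0.3125  (5 / 16 pixels)
print(out["1px"], out["3px"], out["5px"])   # 6.25 6.25 0.0
```

Note the asymmetry the paper's TartanAir result exploits: a single large-error pixel inflates EPE substantially but moves each outlier rate by only one pixel's worth of mass.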
Quantitative Results
Table 1: Quantitative comparison on Sintel, Spring, and TartanAir. (EPE ↓, outlier rates ↓)
| Model | Sintel EPE ↓ | 1px ↓ | 3px ↓ | 5px ↓ | Spring EPE ↓ | 1px ↓ | 3px ↓ | 5px ↓ | TartanAir EPE ↓ | 1px ↓ | 3px ↓ | 5px ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| RAFT [31] | 10.693 | 62.91 | 37.24 | 28.63 | 3.944 | 39.82 | 18.65 | 11.98 | 9.487 | 75.17 | 42.96 | 30.04 |
| SEA-RAFT [37] | 10.185 | 59.56 | 34.46 | 26.15 | 2.703 | 41.51 | 19.31 | 12.11 | 8.316 | 77.85 | 45.76 | 32.15 |
| FlowSeek [25] | 10.241 | 64.08 | 40.71 | 31.83 | 2.861 | 41.53 | 19.16 | 12.18 | 7.694 | 76.96 | 45.20 | 32.00 |
| DA-Flow | 6.912 | 55.80 | 28.10 | 20.91 | 2.207 | 30.95 | 13.87 | 8.91 | 8.866 | 72.35 | 37.61 | 25.40 |
- DA-Flow achieves the best performance on Sintel and Spring across all metrics.
- On TartanAir, DA-Flow achieves the best outlier rates at all thresholds (1px, 3px, 5px), indicating more accurate estimates for the majority of pixels, despite a slightly higher average EPE due to a small number of large-displacement outliers.
Qualitative Results (Figs. 4, 5, 6)
Under severe degradations, baseline methods (RAFT, SEA-RAFT, FlowSeek) produce noisy and inconsistent flow fields with artifacts around motion boundaries. DA-Flow consistently recovers sharp, coherent flow fields that closely match the ground truth, successfully localizing motion boundaries and maintaining structural coherence.
Ablation Studies
Table 2: Ablation on feature source across denoising steps. Compares DA-Flow (using lifted features) with a baseline variant (Baseline* using unfine-tuned features).
| Dataset | Metric | Method | Step 0 | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 | Step 6 | Step 7 | Step 8 | Step 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sintel | EPE ↓ | Baseline* | 7.4145 | 7.4542 | 7.5076 | 7.4703 | 7.4124 | 7.4897 | 7.5243 | 7.5752 | 7.5709 | 7.6883 |
| | | DA-Flow | 7.0210 | 6.7809 | 6.7196 | 6.7160 | 6.7433 | 6.7605 | 6.8029 | 6.8736 | 7.0641 | 7.6397 |
| Spring | EPE ↓ | Baseline* | 2.3269 | 2.3457 | 2.2535 | 2.3073 | 2.2158 | 2.2030 | 2.2036 | 2.2148 | 2.2066 | 2.2343 |
| | | DA-Flow | 2.2902 | 2.2119 | 2.2069 | 2.2026 | 2.2011 | 2.2043 | 2.2008 | 2.1928 | 2.1917 | 2.1720 |
- DA-Flow consistently outperforms the baseline variant, confirming the contribution of the lifted, fine-tuned diffusion features.
Additional Ablations (Appendix):
- Feature Type: Query/Key features outperform post-AdaNorm features for geometric correspondence.
- Model Type: Features from image diffusion models (lifted) are superior to those from video diffusion models (FlashVSR) for this task.
- Architecture: The combination of DPT upsampling and the CNN encoder is essential for optimal performance.
- Fine-tuning Baseline: DA-Flow outperforms a straightforward fine-tuned version of RAFT on the same data.
Theoretical and Practical Implications
- Theoretical: Introduces a new, challenging task (Degradation-Aware Optical Flow) and demonstrates that accurate estimation is possible by leveraging the inherent degradation-awareness and structural priors of restoration diffusion models. The lifting strategy provides a novel way to equip image-level generative models with temporal reasoning without collapsing the spatial structure needed for dense matching.
- Practical: DA-Flow enables reliable optical flow estimation in real-world scenarios where video quality is poor, which is critical for applications like video restoration, stabilization, and analysis. The method shows strong potential for improving temporal consistency in video restoration pipelines when used for frame alignment.
Conclusion
DA-Flow addresses the problem of estimating accurate optical flow from severely corrupted videos by introducing a degradation-aware approach. The core innovation is lifting a pretrained image restoration diffusion model with spatio-temporal attention to obtain features that are both corruption-aware and geometrically informative for correspondence. By fusing these features with a conventional CNN encoder within the RAFT framework, DA-Flow achieves state-of-the-art performance on degraded optical flow benchmarks. A key limitation is the inference cost due to multiple denoising steps; future work could explore one-step distillation to improve efficiency.