Summary of "Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?"

Summary (Overview)

Novel self-recovery paradigm: Proposes Robust-U1, the first framework that equips Multimodal Large Language Models (MLLMs) with explicit visual self-recovery capability, directly reconstructing clean images from corrupted inputs rather than relying on implicit feature alignment or text-only reasoning.
Three-stage training pipeline: (i) Supervised fine-tuning (SFT) on ImageNet-C for foundational reconstruction, (ii) Reinforcement learning (Flow-GRPO) with dual rewards — pixel-level SSIM and semantic-level CLIP similarity — to align structural and semantic fidelity, (iii) Multimodal reasoning that jointly considers both corrupted and recovered images for robust understanding.
State-of-the-art performance: Raises the overall R-Bench score to 0.7398 (vs. 0.5770 for the base BAGEL) and shows only small performance drops under adversarial corruptions on MMMB, MMStar, and RealWorldQA.
Self-recovery enhances reasoning: Quantitative analysis confirms that high-quality visual recovery directly improves downstream reasoning performance, validating self-recovery as a critical mechanism for robust visual understanding.
Code released: Source code available at github.com/jqtangust/Robust-U1.

Introduction and Theoretical Foundation

Background and Motivation

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding through large-scale pretraining. However, their real-world deployment is critically hindered by vulnerability to visual corruptions — such as system noise, compression artifacts, and adverse weather — that severely disrupt visual features and cause dramatic performance degradation.

Limitations of Existing Approaches

Existing robustness enhancement methods fall into two categories:

Black-box feature alignment (e.g., TeCoA, Robust CLIP, Robust LLaVA): Align features of corrupted and clean images within the visual encoder via adversarial training. Limitation: lacks interpretability and fails to explicitly model the corruption process.
White-box text-based reasoning (e.g., Robust-R1): Uses explicit textual chains to describe corruption types and semantic impacts. Limitation: cannot represent pixel-level details; textual descriptions cannot restore lost visual information.

Research Question

Can MLLMs recover corrupted visual content by themselves?

The paper proposes that achieving self-recovery — where the model actively restores lost pixel-level information rather than merely offering text-based compensation — would establish a more intrinsic and complete form of robustness.

Theoretical Formulation

Consider a standard multimodal pipeline with clean image $I_o \in \mathbb{R}^{H \times W \times 3}$ and query $Q$:

A_o = \mathcal{F}_{\text{MLLM}}(I_o, Q; \Theta)

where $\Theta$ denotes the model parameters. Under real-world corruption $I_c = \mathcal{D}(I_o)$, where $\mathcal{D}$ is the corruption function, performance degrades significantly.

The proposed robust formulation explicitly incorporates a visual self-recovery process:

A = \mathcal{F}_{\text{MLLM}}^{\text{(Robust)}}(I_c, Q; \Theta) = \mathcal{F}_{\text{MLLM}}\left(\underbrace{\mathcal{D}^{-1}(I_c)}_{I_r}, I_c, Q; \Theta\right)

where $\mathcal{D}^{-1}: I_c \mapsto I_r$ represents the self-recovery module approximating the inverse corruption process, and the model synthesizes information from both $I_c$ and $I_r$ for robust reasoning.
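The composition in the robust formulation can be illustrated with a minimal sketch. The names `robust_answer`, `toy_recover`, and `toy_mllm` are hypothetical stand-ins introduced only for illustration; in the paper, recovery and reasoning are performed by a single model with shared parameters $\Theta$, which this sketch does not attempt to reproduce.

```python
from typing import Any, Callable

Image = Any  # placeholder type; a real system would use an array or tensor


def robust_answer(
    corrupted: Image,
    query: str,
    recover: Callable[[Image], Image],
    mllm: Callable[[Image, Image, str], str],
) -> str:
    """Compose self-recovery with reasoning: A = F(D^{-1}(I_c), I_c, Q)."""
    recovered = recover(corrupted)            # I_r = D^{-1}(I_c)
    return mllm(recovered, corrupted, query)  # the model sees both images


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_recover(img):
    # Clamp each pixel to [0, 1] as a trivial "recovery".
    return [max(0.0, min(1.0, p)) for p in img]


def toy_mllm(recovered, corrupted, query):
    return f"{query} -> {len(recovered)} recovered px, {len(corrupted)} corrupted px"


answer = robust_answer([1.2, -0.1, 0.5], "What is shown?", toy_recover, toy_mllm)
```

The point of the sketch is only the data flow: the corrupted image is passed to the reasoning step alongside the recovered one, rather than being replaced by it.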

Methodology

Overview of Three-Stage Framework

The framework is built on BAGEL, a pre-trained unified MLLM supporting both multimodal understanding and generation.

Stage I: Supervised Fine-Tuning for Visual Self-Recovery (SFT)

Goal: Transform the model's general generative capability into a dedicated visual self-recovery module.
The recovery process is conditioned on a specific recovery prompt $P_{\text{rec}}$.
Uses the rectified flow formulation (Liu et al., 2023) in latent space.
The corrupted image is encoded into a latent representation $Z_c$; the model denoises a noisy version of the clean latent $Z_o$ conditioned on $Z_c$ and $P_{\text{rec}}$.
Objective function:

L_{\text{SFT}} = \mathbb{E}_{t \sim \mathcal{U}(0,1), \epsilon \sim \mathcal{N}(0, I)} \left[ \| \epsilon - \epsilon_\Theta(Z_c, Z_o(t), t, P_{\text{rec}}) \|^2 \right]

where $Z_o(t) = (1-t)Z_o + t\epsilon$ is the noisy latent at timestep $t$, and $\epsilon_\Theta$ is the noise prediction network.
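As a rough illustration of the SFT objective, the following sketch draws one Monte Carlo sample of the loss on toy latents represented as flat lists. The function names and the `zero_predictor` are hypothetical; the real $\epsilon_\Theta$ is a neural network that is also conditioned on $P_{\text{rec}}$, which is assumed here to be folded into the `predict_noise` callable.

```python
import random


def noisy_latent(z_o, eps, t):
    """Z_o(t) = (1 - t) * Z_o + t * eps, applied elementwise."""
    return [(1.0 - t) * z + t * e for z, e in zip(z_o, eps)]


def sft_loss_sample(z_c, z_o, predict_noise, rng):
    """One sample of ||eps - eps_theta(Z_c, Z_o(t), t)||^2."""
    t = rng.random()                          # t ~ U(0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in z_o]  # eps ~ N(0, I)
    z_t = noisy_latent(z_o, eps, t)
    pred = predict_noise(z_c, z_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, pred))


# Toy latents and a trivial predictor (both hypothetical).
rng = random.Random(0)
z_o = [0.2, -0.4, 1.0]   # clean latent
z_c = [0.3, -0.1, 0.7]   # corrupted latent used as conditioning


def zero_predictor(z_c, z_t, t):
    return [0.0] * len(z_t)


loss = sft_loss_sample(z_c, z_o, zero_predictor, rng)
```

Averaging `sft_loss_sample` over many draws of `t` and `eps` approximates the expectation in $L_{\text{SFT}}$.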

Stage II: Aligning Higher Visual Quality through Reinforcement Learning

Uses Flow-GRPO (Liu et al., 2025b) to optimize the self-recovery module with a dual-reward objective:

Pixel-Level Structural Reward — SSIM (Structural Similarity Index Measure):

R_{\text{pix}}(I_r, I_o) = \text{SSIM}(I_r, I_o) = \frac{1}{N} \sum_{i=1}^N \left[ l(p_r^i, p_o^i) \cdot c(p_r^i, p_o^i) \cdot s(p_r^i, p_o^i) \right]

where the three components for luminance, contrast, and structure are:

l(p_r, p_o) = \frac{2\mu_r \mu_o + C_1}{\mu_r^2 + \mu_o^2 + C_1}, \quad c(p_r, p_o) = \frac{2\sigma_r \sigma_o + C_2}{\sigma_r^2 + \sigma_o^2 + C_2}, \quad s(p_r, p_o) = \frac{\sigma_{ro} + C_3}{\sigma_r \sigma_o + C_3}

with $\mu_r, \mu_o$ the patch means, $\sigma_r, \sigma_o$ the patch standard deviations, $\sigma_{ro}$ the covariance, $N$ the number of patches, and $C_1, C_2, C_3$ small constants for numerical stability.
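The pixel-level reward can be sketched directly from the three components above. This is a simplified version that assumes patches are given as flat lists of intensities in [0, 1]. The constants follow the conventional SSIM defaults for a unit dynamic range ($C_1 = 0.01^2$, $C_2 = 0.03^2$, $C_3 = C_2/2$); the source does not state which values or windowing scheme the paper uses, so these are assumptions.

```python
import math


def patch_stats(p):
    """Return the mean and (population) variance of one patch."""
    mu = sum(p) / len(p)
    var = sum((x - mu) ** 2 for x in p) / len(p)
    return mu, var


def ssim_patch(p_r, p_o, c1=1e-4, c2=9e-4):
    """l * c * s for a single pair of patches."""
    c3 = c2 / 2  # conventional choice, assumed here
    mu_r, var_r = patch_stats(p_r)
    mu_o, var_o = patch_stats(p_o)
    sd_r, sd_o = math.sqrt(var_r), math.sqrt(var_o)
    cov = sum((a - mu_r) * (b - mu_o) for a, b in zip(p_r, p_o)) / len(p_r)
    lum = (2 * mu_r * mu_o + c1) / (mu_r ** 2 + mu_o ** 2 + c1)
    con = (2 * sd_r * sd_o + c2) / (var_r + var_o + c2)
    struct = (cov + c3) / (sd_r * sd_o + c3)
    return lum * con * struct


def pixel_reward(patches_r, patches_o):
    """R_pix: mean SSIM over N corresponding patches."""
    scores = [ssim_patch(a, b) for a, b in zip(patches_r, patches_o)]
    return sum(scores) / len(scores)


clean = [[0.1, 0.5, 0.9, 0.3], [0.2, 0.2, 0.8, 0.6]]
inverted = [[1.0 - x for x in patch] for patch in clean]
same_score = pixel_reward(clean, clean)
inverted_score = pixel_reward(inverted, clean)
```

Identical images score (numerically) 1, while the inverted image scores lower because its structure term is negative.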

Semantic Consistency Reward — uses frozen TinyCLIP to compute cosine similarity:

\text{Sim}(M_{\text{CLIP}}(I_r), M_{\text{CLIP}}(I_o)) = \frac{M_{\text{CLIP}}(I_r) \cdot M_{\text{CLIP}}(I_o)}{\|M_{\text{CLIP}}(I_r)\| \|M_{\text{CLIP}}(I_o)\|}

R_{\text{sem}}(I_r, I_o) = \exp\left(-\alpha \cdot (1 - \text{Sim}(M_{\text{CLIP}}(I_r), M_{\text{CLIP}}(I_o)))\right)

where $\alpha > 0$ controls sensitivity to semantic deviations. The reward attains its maximum of 1 when the similarity is 1 and decays exponentially as the similarity decreases.
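The semantic reward reduces to a cosine similarity followed by an exponential. The sketch below operates on plain embedding vectors and does not call any CLIP library; in the paper the embeddings come from a frozen TinyCLIP image encoder. The default `alpha=5.0` is a hypothetical value chosen for illustration, since the source does not give one.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_reward(emb_r, emb_o, alpha=5.0):
    """R_sem = exp(-alpha * (1 - Sim(emb_r, emb_o)))."""
    return math.exp(-alpha * (1.0 - cosine_similarity(emb_r, emb_o)))


identical = semantic_reward([3.0, 4.0], [3.0, 4.0])
orthogonal = semantic_reward([1.0, 0.0], [0.0, 1.0])
```

With identical embeddings the reward is 1; with orthogonal embeddings it falls to $e^{-\alpha}$, so a larger $\alpha$ penalizes semantic drift more sharply.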

Optimization: For each $I_c$, sample $G$ trajectories (via stochastic ODE-to-SDE conversion) to obtain $\{I_r^i\}_{i=1}^G$. Advantages are computed by group normalization of the composite rewards, and the policy is updated with a KL divergence penalty.
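The group-normalization step can be sketched as follows. Two things here are assumptions, not taken from the source: the composite reward is formed as a weighted sum of $R_{\text{pix}}$ and $R_{\text{sem}}$ (the weights `w_pix` and `w_sem` are hypothetical), and the population standard deviation is used for normalization. The policy-gradient update and KL penalty are omitted.

```python
import statistics


def composite_reward(r_pix, r_sem, w_pix=1.0, w_sem=1.0):
    """Assumed combination: weighted sum of the two rewards."""
    return w_pix * r_pix + w_sem * r_sem


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of G samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for G = 4 recovered candidates of the same corrupted image.
pix = [0.60, 0.65, 0.55, 0.70]
sem = [0.80, 0.90, 0.70, 0.95]
rewards = [composite_reward(p, s) for p, s in zip(pix, sem)]
advantages = group_advantages(rewards)
```

Because advantages are relative to the group mean, a candidate is reinforced only if it is better than its siblings sampled from the same input, and a group with identical rewards contributes no learning signal.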

Stage III: Multimodal Reasoning for Robust Understanding

Structure the input as an interleaved sequence of the corrupted image $I_c$ and the recovered image $I_r$, followed by the query $Q$.
Train the model to generate the answer $A$ (with a reasoning chain) conditioned on this input.
Training objective (next-token prediction):

L_{\text{MLLM}} = -\mathbb{E}_{(I_c, I_r, Q, A^*)} \sum_{t=1}^L \log P_\Theta(a_t^* | a_{<t}^*, I_c, I_r, Q)

where $a_t^*$ is the $t$-th token of the reference answer $A^*$ and $L$ is its length.
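For a single training example, the next-token objective is the negative log-likelihood of the reference answer tokens. The sketch below assumes the model has already produced one vector of logits per answer position, conditioned on the two images, the query, and the preceding reference tokens; `answer_nll` and its inputs are hypothetical names used only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def answer_nll(step_logits, target_ids):
    """-sum_t log P(a_t* | a_<t*, I_c, I_r, Q) for one example."""
    return -sum(
        log_softmax(logits)[tid] for logits, tid in zip(step_logits, target_ids)
    )


# Toy case: vocabulary of 4 tokens, answer of length 2, uniform predictions.
uniform_nll = answer_nll([[0.0] * 4, [0.0] * 4], [1, 3])
# A confident, correct prediction gives a smaller loss.
confident_nll = answer_nll([[0.0, 8.0, 0.0, 0.0], [0.0, 0.0, 0.0, 8.0]], [1, 3])
```

Averaging this quantity over the training set gives an estimate of $L_{\text{MLLM}}$.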

Empirical Validation / Results

Experimental Setup

Base model: BAGEL (unified MLLM)
Training data: SFT on ImageNet-C; RL and multimodal reasoning on Robust-R1 training data
Benchmarks:
- Real-world corruption: R-Bench (MCQ, VQA, CAP tasks across low/mid/high degradation)
- Adversarial corruption: MMMB, MMStar, RealWorldQA (with synthetic degradations at 25%, 50%, 100% intensity)
Baselines: General MLLMs (Qwen2.5-VL-3B, Gemma3-4B, InternVL-4B, BAGEL) and Robust MLLMs (TeCoA, Robust CLIP, Robust LLaVA, Robust-R1)

Main Results – Real-World Corruptions

Table 1. Quantitative evaluation on R-Bench.

Category	Method	MCQ low/mid/high	VQA low/mid/high	CAP low/mid/high	Overall
General MLLM	Qwen2.5-VL-3B	0.6411/0.6022/0.5732	0.4872/0.4854/0.4904	0.3778/0.3704/0.3330	0.4845
	Gemma3-4B	0.5823/0.5776/0.5060	0.4865/0.4630/0.4419	0.4048/0.3746/0.3480	0.4649
	InternVL-4B	0.6235/0.6024/0.5914	0.4982/0.4539/0.5108	0.3667/0.3041/0.2851	0.4706
	BAGEL	0.7176/0.6584/0.5793	0.6497/0.6127/0.6150	0.4685/0.4633/0.4288	0.5770
Robust MLLM	TeCoA	0.4647/0.4223/0.4024	0.4687/0.3994/0.4461	0.2111/0.2195/0.1937	0.3586
	Robust CLIP	0.4705/0.4658/0.4024	0.4503/0.4339/0.4743	0.2290/0.2219/0.1983	0.3718
	Robust LLaVA	0.3352/0.2608/0.3048	0.2607/0.2212/0.2443	0.0068/0.0065/0.0067	0.1830
	Robust-R1	0.6529/0.6391/0.6097	0.4914/0.4909/0.4980	0.4068/0.3781/0.3484	0.5017
Ours	Robust-U1	0.7353/0.7329/0.6768	0.7067/0.7164/0.6934	0.8272/0.8059/0.7640	0.7398

Robust-U1 outperforms all baselines across all tasks and intensities. On MCQ, the advantage becomes more pronounced as corruption severity increases: the margin over BAGEL widens from 0.0177 at low severity to 0.0975 at high severity.
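As a quick consistency check on Table 1, the Overall column appears to equal the unweighted mean of the nine per-task, per-severity scores. This is an observation from the reported numbers, not something the source states, and the sketch below verifies it for three rows only.

```python
# Nine scores per method: MCQ, VQA, CAP at low/mid/high (copied from Table 1).
scores = {
    "BAGEL": [0.7176, 0.6584, 0.5793, 0.6497, 0.6127, 0.6150, 0.4685, 0.4633, 0.4288],
    "Robust-R1": [0.6529, 0.6391, 0.6097, 0.4914, 0.4909, 0.4980, 0.4068, 0.3781, 0.3484],
    "Robust-U1": [0.7353, 0.7329, 0.6768, 0.7067, 0.7164, 0.6934, 0.8272, 0.8059, 0.7640],
}
reported_overall = {"BAGEL": 0.5770, "Robust-R1": 0.5017, "Robust-U1": 0.7398}

# Unweighted mean of the nine cells, rounded to the table's precision.
computed_overall = {
    name: round(sum(vals) / len(vals), 4) for name, vals in scores.items()
}

# Gain of Robust-U1 over its BAGEL base on the Overall score.
overall_gain = round(reported_overall["Robust-U1"] - reported_overall["BAGEL"], 4)
```

Under this reading, each task and severity level contributes equally to Overall, so the large captioning (CAP) gains account for much of the overall improvement.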

Main Results – Adversarial Corruptions

Table 3. Quantitative evaluation on anti-degradation for MMMB, MMStar, RealWorldQA.

Category	Method	MMMB clean/25%/50%/100%	MMStar clean/25%/50%/100%	RealWorldQA clean/25%/50%/100%
General MLLM	Qwen2.5-VL-3B	80.60/79.19/78.68/74.50	54.73/52.90/51.86/48.66	65.22/64.96/63.39/60.65
	Gemma3-4B	71.01/70.30/70.20/69.14	43.93/43.20/42.60/41.33	55.42/54.77/53.72/52.81
	InternVL-4B	77.97/77.47/76.66/74.59	51.53/50.26/49.60/46.93	57.38/58.16/57.64/54.90
	BAGEL	81.92/81.16/80.56/78.48	66.13/64.67/61.33/59.60	68.76/65.75/67.84/63.14
Robust MLLM	TeCoA	57.17/65.71/56.11/51.76	30.46/30.60/30.73/28.06	40.00/39.73/39.47/38.69
	Robust CLIP	58.83/58.28/57.97/53.33	33.00/32.26/31.80/29.46	43.26/42.48/42.61/41.43
	Robust-R1	81.41/79.49/79.04/75.35	56.86/54.40/53.60/49.53	67.71/66.40/67.05/63.26
Ours	Robust-U1	84.75/84.14/83.54/83.18	67.20/65.80/64.87/63.87	72.81/72.81/71.50/67.46

Robust-U1 shows the smallest performance drop on MMMB: only 1.57 points from clean to 100% corruption (vs. 3.44 for BAGEL and 6.06 for Robust-R1).
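The quoted drops follow directly from the MMMB columns of Table 3 and can be reproduced with a few lines of arithmetic (values copied from the table; the variable names are illustrative):

```python
# (clean accuracy, accuracy at 100% corruption intensity) on MMMB, from Table 3.
mmmb = {
    "BAGEL": (81.92, 78.48),
    "Robust-R1": (81.41, 75.35),
    "Robust-U1": (84.75, 83.18),
}

# Absolute drop in points from clean input to full-intensity corruption.
drops = {name: round(clean - full, 2) for name, (clean, full) in mmmb.items()}

# Method with the smallest degradation among the three.
most_robust = min(drops, key=drops.get)
```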

Quality of Visual Recovery

Table 5. Quantitative evaluation of visual recovery quality.

Method	PSNR ↑	SSIM ↑	LPIPS ↓
BAGEL	14.37	0.4722	0.5092
+ SFT on ImageNet-C	20.88	0.6135	0.3444
+ RL w. $R_{\text{pix}}$	21.45	0.6311	0.3299
+ RL w. $R_{\text{sem}}$	21.33	0.6285	0.3233
Ours (Robust-U1)	21.49	0.6314	0.3223

Each stage contributes to improved recovery. The full model with both rewards achieves the best score on all three metrics (PSNR, SSIM, and LPIPS).
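For readers unfamiliar with the first metric in Table 5, PSNR is defined as $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below computes it for images given as flat lists with intensities in [0, 1]; the paper's exact evaluation settings (color space, value range) are not given in the source, so `max_value=1.0` is an assumption.

```python
import math


def psnr(recovered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = sum((a - b) ** 2 for a, b in zip(recovered, reference)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 gives MSE = 0.01, i.e. about 20 dB.
example_psnr = psnr([0.1, 0.1], [0.0, 0.0])
```

Because the scale is logarithmic, the jump from 14.37 dB (BAGEL) to 20.88 dB after SFT corresponds to a much larger reduction in mean squared error than the later gain from 20.88 dB to 21.49 dB.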

Ablation Study

Table 4. Ablation on R-Bench.

Method	MCQ low/mid/high	VQA low/mid/high	CAP low/mid/high	Overall
Baseline (BAGEL)	0.7176/0.6584/0.5793	0.6497/0.6127/0.6150	0.4685/0.4633/0.4288	0.5770
Ours (Robust-U1)	0.7353/0.7329/0.6768	0.7067/0.7164/0.6934	0.8272/0.8059/0.7640	0.7398
w/o Multimodal Reasoning	0.7294/0.6957/0.