Summary of "Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?"

Summary (Overview)

  • Novel self-recovery paradigm: Proposes Robust-U1, the first framework that equips Multimodal Large Language Models (MLLMs) with explicit visual self-recovery capability, directly reconstructing clean images from corrupted inputs rather than relying on implicit feature alignment or text-only reasoning.
  • Three-stage training pipeline: (i) Supervised fine-tuning (SFT) on ImageNet-C for foundational reconstruction, (ii) Reinforcement learning (Flow-GRPO) with dual rewards — pixel-level SSIM and semantic-level CLIP similarity — to align structural and semantic fidelity, (iii) Multimodal reasoning that jointly considers both corrupted and recovered images for robust understanding.
  • State-of-the-art performance: Achieves significant improvements on R-Bench (overall 0.7398 vs. 0.5770 for base BAGEL) and maintains minimal performance drops under adversarial corruptions on MMMB, MMStar, and RealWorldQA.
  • Self-recovery enhances reasoning: Quantitative analysis confirms that high-quality visual recovery directly improves downstream reasoning performance, validating self-recovery as a critical mechanism for robust visual understanding.
  • Code released: Source code available at github.com/jqtangust/Robust-U1.

Introduction and Theoretical Foundation

Background and Motivation

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding through large-scale pretraining. However, their real-world deployment is critically hindered by vulnerability to visual corruptions — such as system noise, compression artifacts, and adverse weather — that severely disrupt visual features and cause dramatic performance degradation.

Limitations of Existing Approaches

Existing robustness enhancement methods fall into two categories:

  1. Black-box feature alignment (e.g., TeCoA, Robust CLIP, Robust LLaVA): Align features of corrupted and clean images within the visual encoder via adversarial training. Limitation: lacks interpretability and fails to explicitly model the corruption process.

  2. White-box text-based reasoning (e.g., Robust-R1): Uses explicit textual chains to describe corruption types and semantic impacts. Limitation: cannot represent pixel-level details; textual descriptions cannot restore lost visual information.

Research Question

Can MLLMs recover corrupted visual content by themselves?

The paper proposes that achieving self-recovery — where the model actively restores lost pixel-level information rather than merely offering text-based compensation — would establish a more intrinsic and complete form of robustness.

Theoretical Formulation

Consider a standard multimodal pipeline with clean image $I_o \in \mathbb{R}^{H \times W \times 3}$ and query $Q$:

$$A_o = \mathcal{F}_{\text{MLLM}}(I_o, Q; \Theta)$$

where $\Theta$ denotes the model parameters. Under real-world corruption $I_c = \mathcal{D}(I_o)$, where $\mathcal{D}$ is the corruption function, performance degrades significantly.

The proposed robust formulation explicitly incorporates a visual self-recovery process:

$$A = \mathcal{F}_{\text{MLLM}}^{\text{(Robust)}}(I_c, Q; \Theta) = \mathcal{F}_{\text{MLLM}}\left(\underbrace{\mathcal{D}^{-1}(I_c)}_{I_r}, I_c, Q; \Theta\right)$$

where $\mathcal{D}^{-1}: I_c \mapsto I_r$ represents the self-recovery module approximating the inverse corruption process, and the model synthesizes information from both $I_c$ and $I_r$ for robust reasoning.
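The robust formulation amounts to a two-step inference: recover first, then reason over both images. The following is a minimal sketch of that control flow only. The names `robust_answer`, `toy_recover`, and `toy_reason` are hypothetical stand-ins, not part of the paper or its released code, and the toy functions merely clamp pixel values and report mean brightness.

```python
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image; a hypothetical stand-in type


def robust_answer(
    corrupted: Image,
    query: str,
    recover: Callable[[Image], Image],
    reason: Callable[[Image, Image, str], str],
) -> str:
    """Recover I_r from I_c, then answer from (I_r, I_c, Q)."""
    recovered = recover(corrupted)              # I_r = D^{-1}(I_c)
    return reason(recovered, corrupted, query)  # A = F(I_r, I_c, Q)


# Toy stand-in for the self-recovery module: clamp pixels into [0, 1].
def toy_recover(img: Image) -> Image:
    return [[min(1.0, max(0.0, p)) for p in row] for row in img]


# Toy stand-in for the reasoning step: report the recovered mean brightness.
def toy_reason(recovered: Image, corrupted: Image, query: str) -> str:
    pixels = [p for row in recovered for p in row]
    return f"{query} -> mean brightness {sum(pixels) / len(pixels):.2f}"


answer = robust_answer(
    [[1.5, 0.5], [-0.2, 0.5]], "How bright?", toy_recover, toy_reason
)
```

In the framework described here, both steps would be served by the same unified model; the sketch separates them only to make the data flow visible.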

Methodology

Overview of Three-Stage Framework

The framework is built on BAGEL, a pre-trained unified MLLM supporting both multimodal understanding and generation.

Stage I: Supervised Fine-Tuning for Visual Self-Recovery (SFT)

  • Goal: Transform the model's general generative capability into a dedicated visual self-recovery module.
  • The recovery process is conditioned on a specific recovery prompt $P_{\text{rec}}$.
  • Uses the rectified flow formulation (Liu et al., 2023) in latent space.
  • The corrupted image is encoded into a latent representation $Z_c$; the model denoises a noisy version of the clean latent $Z_o$, conditioned on $Z_c$ and $P_{\text{rec}}$.
  • Objective function:
$$L_{\text{SFT}} = \mathbb{E}_{t \sim \mathcal{U}(0,1),\, \epsilon \sim \mathcal{N}(0, I)} \left[ \left\| \epsilon - \epsilon_\Theta(Z_c, Z_o(t), t, P_{\text{rec}}) \right\|^2 \right]$$

where $Z_o(t) = (1-t)Z_o + t\epsilon$ is the noisy latent at timestep $t$, and $\epsilon_\Theta$ is the noise prediction network.
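As an illustration of the SFT objective, the sketch below builds the noisy latent by linear interpolation and estimates the expectation by Monte Carlo sampling. It operates on flat Python lists rather than real latents, and `predict_noise` is a hypothetical callable standing in for the noise prediction network; the sample count and seed are arbitrary choices.

```python
import random


def noisy_latent(z_o, eps, t):
    """Z_o(t) = (1 - t) * Z_o + t * eps, applied element-wise."""
    return [(1.0 - t) * z + t * e for z, e in zip(z_o, eps)]


def sft_loss(z_c, z_o, predict_noise, prompt, n_samples=64, seed=0):
    """Monte Carlo estimate of E ||eps - eps_theta(Z_c, Z_o(t), t, P_rec)||^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        t = rng.random()                          # t ~ U(0, 1)
        eps = [rng.gauss(0.0, 1.0) for _ in z_o]  # eps ~ N(0, I)
        z_t = noisy_latent(z_o, eps, t)
        eps_hat = predict_noise(z_c, z_t, t, prompt)
        total += sum((e - h) ** 2 for e, h in zip(eps, eps_hat))
    return total / n_samples


# Hypothetical stand-in for the network: always predicts zero noise.
def zero_predictor(z_c, z_t, t, prompt):
    return [0.0] * len(z_t)


loss = sft_loss([0.1, 0.2], [0.5, -0.5], zero_predictor, "recover the image")
```

A predictor that recovers the sampled noise exactly would drive this estimate to zero, which is the behavior the objective rewards.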

Stage II: Aligning Higher Visual Quality through Reinforcement Learning

Uses Flow-GRPO (Liu et al., 2025b) to optimize the self-recovery module with a dual-reward objective:

Pixel-Level Structural Reward — SSIM (Structural Similarity Index Measure):

$$R_{\text{pix}}(I_r, I_o) = \text{SSIM}(I_r, I_o) = \frac{1}{N} \sum_{i=1}^N \left[ l(p_r^i, p_o^i) \cdot c(p_r^i, p_o^i) \cdot s(p_r^i, p_o^i) \right]$$

where the three components for luminance, contrast, and structure are:

$$l(p_r, p_o) = \frac{2\mu_r \mu_o + C_1}{\mu_r^2 + \mu_o^2 + C_1}, \quad c(p_r, p_o) = \frac{2\sigma_r \sigma_o + C_2}{\sigma_r^2 + \sigma_o^2 + C_2}, \quad s(p_r, p_o) = \frac{\sigma_{ro} + C_3}{\sigma_r \sigma_o + C_3}$$

with $\mu_r, \mu_o$ the patch means, $\sigma_r, \sigma_o$ the patch standard deviations, $\sigma_{ro}$ the covariance, and $C_1, C_2, C_3$ small constants for numerical stability.
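To make the three SSIM terms concrete, here is a simplified sketch that evaluates them on pre-extracted patches given as flat lists of intensities in [0, 1]. The constants follow the commonly used defaults C1 = (0.01 L)^2, C2 = (0.03 L)^2, C3 = C2 / 2 for dynamic range L; whether the paper uses these values, and how it extracts patches, is an assumption here. Standard implementations typically use sliding Gaussian windows instead of fixed patches.

```python
def patch_stats(a, b):
    """Means, standard deviations, and covariance of two equal-length patches."""
    n = len(a)
    mu_a, mu_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return mu_a, mu_b, var_a ** 0.5, var_b ** 0.5, cov


def ssim_patch(p_r, p_o, dynamic_range=1.0):
    """Product of luminance, contrast, and structure terms for one patch pair."""
    c1 = (0.01 * dynamic_range) ** 2  # assumed default constants
    c2 = (0.03 * dynamic_range) ** 2
    c3 = c2 / 2.0
    mu_r, mu_o, sd_r, sd_o, cov = patch_stats(p_r, p_o)
    lum = (2 * mu_r * mu_o + c1) / (mu_r ** 2 + mu_o ** 2 + c1)
    con = (2 * sd_r * sd_o + c2) / (sd_r ** 2 + sd_o ** 2 + c2)
    struct = (cov + c3) / (sd_r * sd_o + c3)
    return lum * con * struct


def pixel_reward(patches_r, patches_o):
    """R_pix: mean SSIM over N corresponding patches."""
    scores = [ssim_patch(r, o) for r, o in zip(patches_r, patches_o)]
    return sum(scores) / len(scores)


clean = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.5, 0.6, 0.7]]
flipped = [[0.4, 0.3, 0.2, 0.1], [0.7, 0.6, 0.5, 0.5]]
same_score = pixel_reward(clean, clean)
flipped_score = pixel_reward(flipped, clean)
```

Identical patches score essentially 1, while the flipped patches have negative covariance with the reference and therefore score lower.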

Semantic Consistency Reward — uses frozen TinyCLIP to compute cosine similarity:

$$\text{Sim}(M_{\text{CLIP}}(I_r), M_{\text{CLIP}}(I_o)) = \frac{M_{\text{CLIP}}(I_r) \cdot M_{\text{CLIP}}(I_o)}{\|M_{\text{CLIP}}(I_r)\| \, \|M_{\text{CLIP}}(I_o)\|}$$

$$R_{\text{sem}}(I_r, I_o) = \exp\left(-\alpha \cdot \left(1 - \text{Sim}(M_{\text{CLIP}}(I_r), M_{\text{CLIP}}(I_o))\right)\right)$$

where $\alpha > 0$ controls sensitivity to semantic deviations. The reward attains its maximum value of 1 when the similarity is 1 and decays exponentially as the similarity decreases.
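The semantic reward can be sketched directly from its definition. The embeddings below are plain Python lists standing in for the frozen encoder's outputs, and `alpha=5.0` is an illustrative value chosen for this example, not one reported in the source.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_reward(emb_r, emb_o, alpha=5.0):
    """R_sem = exp(-alpha * (1 - Sim)); alpha is an assumed value."""
    return math.exp(-alpha * (1.0 - cosine_similarity(emb_r, emb_o)))


aligned = semantic_reward([1.0, 0.0], [1.0, 0.0])     # similarity 1
orthogonal = semantic_reward([1.0, 0.0], [0.0, 1.0])  # similarity 0
```

With these inputs, `aligned` is 1 and `orthogonal` equals exp(-alpha), so a larger alpha penalizes the same semantic deviation more sharply.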

Optimization: For each $I_c$, sample $G$ trajectories (via stochastic ODE-to-SDE conversion) to obtain $\{I_r^i\}_{i=1}^G$. Advantages are computed via group normalization of the composite rewards, and the policy is updated with a KL divergence penalty.
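Group normalization of rewards can be illustrated as follows. This sketch assumes the composite reward is a weighted sum of the pixel and semantic rewards and that advantages are standardized within the group by mean and standard deviation; the weights and the `eps` stabilizer are hypothetical, since the summary does not specify them.

```python
import statistics


def composite_reward(r_pix, r_sem, w_pix=1.0, w_sem=1.0):
    """Assumed combination rule: weighted sum of the two rewards."""
    return w_pix * r_pix + w_sem * r_sem


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of G sampled recoveries."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Three hypothetical recoveries of the same corrupted image (G = 3).
rewards = [
    composite_reward(0.55, 0.60),
    composite_reward(0.63, 0.80),
    composite_reward(0.70, 0.95),
]
advantages = group_advantages(rewards)
```

Recoveries that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged.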

Stage III: Multimodal Reasoning for Robust Understanding

  • Structure the input as an interleaved sequence of the corrupted image $I_c$ and the recovered image $I_r$, followed by the query $Q$.
  • Train the model to generate the answer $A$ (with a reasoning chain) conditioned on this input.
  • Training objective (next-token prediction):
$$L_{\text{MLLM}} = -\mathbb{E}_{(I_c, I_r, Q, A^*)} \sum_{t=1}^L \log P_\Theta(a_t^* \mid a_{<t}^*, I_c, I_r, Q)$$
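For a single training example, the next-token objective reduces to the negative log-likelihood of the target answer tokens. The sketch below assumes the model's per-step output is available as a dictionary from token to probability; `answer_nll` and the toy distributions are hypothetical illustrations.

```python
import math


def answer_nll(step_distributions, target_tokens):
    """Sum over t of -log P(a_t* | a_<t*, I_c, I_r, Q) for one example."""
    return -sum(
        math.log(dist[token])
        for dist, token in zip(step_distributions, target_tokens)
    )


# Toy per-step distributions for a two-token answer.
steps = [{"a": 0.5, "b": 0.5}, {"cat": 0.25, "dog": 0.75}]
nll = answer_nll(steps, ["a", "cat"])
```

Here `nll` equals -(log 0.5 + log 0.25), and it approaches zero as the model assigns probability close to 1 to every target token.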

Empirical Validation / Results

Experimental Setup

  • Base model: BAGEL (unified MLLM)
  • Training data: SFT on ImageNet-C; RL and multimodal reasoning on Robust-R1 training data
  • Benchmarks:
    • Real-world corruption: R-Bench (MCQ, VQA, CAP tasks across low/mid/high degradation)
    • Adversarial corruption: MMMB, MMStar, RealWorldQA (with synthetic degradations at 25%, 50%, 100% intensity)
  • Baselines: General MLLMs (Qwen2.5-VL-3B, Gemma3-4B, InternVL-4B, BAGEL) and Robust MLLMs (TeCoA, Robust CLIP, Robust LLaVA, Robust-R1)

Main Results – Real-World Corruptions

Table 1. Quantitative evaluation on R-Bench. Each task cell lists low/mid/high degradation.

| Category | Method | MCQ low/mid/high | VQA low/mid/high | CAP low/mid/high | Overall |
|---|---|---|---|---|---|
| General MLLM | Qwen2.5-VL-3B | 0.6411/0.6022/0.5732 | 0.4872/0.4854/0.4904 | 0.3778/0.3704/0.3330 | 0.4845 |
| General MLLM | Gemma3-4B | 0.5823/0.5776/0.5060 | 0.4865/0.4630/0.4419 | 0.4048/0.3746/0.3480 | 0.4649 |
| General MLLM | InternVL-4B | 0.6235/0.6024/0.5914 | 0.4982/0.4539/0.5108 | 0.3667/0.3041/0.2851 | 0.4706 |
| General MLLM | BAGEL | 0.7176/0.6584/0.5793 | 0.6497/0.6127/0.6150 | 0.4685/0.4633/0.4288 | 0.5770 |
| Robust MLLM | TeCoA | 0.4647/0.4223/0.4024 | 0.4687/0.3994/0.4461 | 0.2111/0.2195/0.1937 | 0.3586 |
| Robust MLLM | Robust CLIP | 0.4705/0.4658/0.4024 | 0.4503/0.4339/0.4743 | 0.2290/0.2219/0.1983 | 0.3718 |
| Robust MLLM | Robust LLaVA | 0.3352/0.2608/0.3048 | 0.2607/0.2212/0.2443 | 0.0068/0.0065/0.0067 | 0.1830 |
| Robust MLLM | Robust-R1 | 0.6529/0.6391/0.6097 | 0.4914/0.4909/0.4980 | 0.4068/0.3781/0.3484 | 0.5017 |
| Ours | Robust-U1 | 0.7353/0.7329/0.6768 | 0.7067/0.7164/0.6934 | 0.8272/0.8059/0.7640 | 0.7398 |

Robust-U1 outperforms all baselines across all tasks and intensities. On MCQ, its margin over BAGEL widens as corruption severity increases, from 0.0177 at low degradation to 0.0975 at high degradation.

Main Results – Adversarial Corruptions

Table 3. Quantitative evaluation on anti-degradation for MMMB, MMStar, RealWorldQA. Each cell lists clean/25%/50%/100% corruption intensity.

| Category | Method | MMMB clean/25%/50%/100% | MMStar clean/25%/50%/100% | RealWorldQA clean/25%/50%/100% |
|---|---|---|---|---|
| General MLLM | Qwen2.5-VL-3B | 80.60/79.19/78.68/74.50 | 54.73/52.90/51.86/48.66 | 65.22/64.96/63.39/60.65 |
| General MLLM | Gemma3-4B | 71.01/70.30/70.20/69.14 | 43.93/43.20/42.60/41.33 | 55.42/54.77/53.72/52.81 |
| General MLLM | InternVL-4B | 77.97/77.47/76.66/74.59 | 51.53/50.26/49.60/46.93 | 57.38/58.16/57.64/54.90 |
| General MLLM | BAGEL | 81.92/81.16/80.56/78.48 | 66.13/64.67/61.33/59.60 | 68.76/65.75/67.84/63.14 |
| Robust MLLM | TeCoA | 57.17/65.71/56.11/51.76 | 30.46/30.60/30.73/28.06 | 40.00/39.73/39.47/38.69 |
| Robust MLLM | Robust CLIP | 58.83/58.28/57.97/53.33 | 33.00/32.26/31.80/29.46 | 43.26/42.48/42.61/41.43 |
| Robust MLLM | Robust-R1 | 81.41/79.49/79.04/75.35 | 56.86/54.40/53.60/49.53 | 67.71/66.40/67.05/63.26 |
| Ours | Robust-U1 | 84.75/84.14/83.54/83.18 | 67.20/65.80/64.87/63.87 | 72.81/72.81/71.50/67.46 |

Robust-U1 achieves minimal performance drops: only 1.57 points on MMMB from clean to 100% corruption (vs. 3.44 for BAGEL and 6.06 for Robust-R1).
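The quoted drops follow directly from the MMMB column of Table 3 (clean score minus score at 100% corruption). A short check, using only values reported in the table:

```python
# MMMB scores from Table 3: (clean, 100% corruption).
mmmb = {
    "BAGEL": (81.92, 78.48),
    "Robust-R1": (81.41, 75.35),
    "Robust-U1": (84.75, 83.18),
}

# Absolute drop in points from clean to fully corrupted input.
drops = {name: round(clean - full, 2) for name, (clean, full) in mmmb.items()}
smallest = min(drops, key=drops.get)
```

The resulting `drops` are 3.44 for BAGEL, 6.06 for Robust-R1, and 1.57 for Robust-U1, so `smallest` is Robust-U1.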

Quality of Visual Recovery

Table 5. Quantitative evaluation of visual recovery quality.

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| BAGEL | 14.37 | 0.4722 | 0.5092 |
| + SFT on ImageNet-C | 20.88 | 0.6135 | 0.3444 |
| + RL w. $R_{\text{pix}}$ | 21.45 | 0.6311 | 0.3299 |
| + RL w. $R_{\text{sem}}$ | 21.33 | 0.6285 | 0.3233 |
| Ours (Robust-U1) | 21.49 | 0.6314 | 0.3223 |

Each stage contributes to improved recovery. The full model with both rewards achieves the best value on all three metrics (PSNR 21.49, SSIM 0.6314, LPIPS 0.3223).

Ablation Study

Table 4. Ablation on R-Bench. Each task cell lists low/mid/high degradation.

| Method | MCQ low/mid/high | VQA low/mid/high | CAP low/mid/high | Overall |
|---|---|---|---|---|
| Baseline (BAGEL) | 0.7176/0.6584/0.5793 | 0.6497/0.6127/0.6150 | 0.4685/0.4633/0.4288 | 0.5770 |
| Ours (Robust-U1) | 0.7353/0.7329/0.6768 | 0.7067/0.7164/0.6934 | 0.8272/0.8059/0.7640 | 0.7398 |
| w/o Multimodal Reasoning | 0.7294/0.6957/0.
