# Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

> Robust-U1 enables MLLMs to explicitly self-recover corrupted images, achieving state-of-the-art robust understanding across real-world and adversarial corruptions.

- **Source:** [arXiv](https://arxiv.org/abs/2606.08063)
- **Published:** 2026-06-13
- **Permalink:** https://picx.dev/p/H41geU
- **Whiteboard:** https://picx.dev/p/H41geU/image

## Summary

## Summary (Overview)

- **Novel self-recovery paradigm**: Proposes **Robust-U1**, the first framework that equips Multimodal Large Language Models (MLLMs) with explicit visual self-recovery capability, directly reconstructing clean images from corrupted inputs rather than relying on implicit feature alignment or text-only reasoning.
- **Three-stage training pipeline**: (i) Supervised fine-tuning (SFT) on ImageNet-C for foundational reconstruction, (ii) Reinforcement learning (Flow-GRPO) with dual rewards — pixel-level SSIM and semantic-level CLIP similarity — to align structural and semantic fidelity, (iii) Multimodal reasoning that jointly considers both corrupted and recovered images for robust understanding.
- **State-of-the-art performance**: Achieves significant improvements on R-Bench (overall 0.7398 vs. 0.5770 for base BAGEL) and maintains minimal performance drops under adversarial corruptions on MMMB, MMStar, and RealWorldQA.
- **Self-recovery enhances reasoning**: Quantitative analysis confirms that high-quality visual recovery directly improves downstream reasoning performance, validating self-recovery as a critical mechanism for robust visual understanding.
- **Code released**: Source code available at [github.com/jqtangust/Robust-U1](https://github.com/jqtangust/Robust-U1).

## Introduction and Theoretical Foundation

### Background and Motivation

Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding through large-scale pretraining. However, their real-world deployment is critically hindered by vulnerability to visual corruptions — such as system noise, compression artifacts, and adverse weather — that severely disrupt visual features and cause dramatic performance degradation.

### Limitations of Existing Approaches

Existing robustness enhancement methods fall into two categories:

1. **Black-box feature alignment** (e.g., TeCoA, Robust CLIP, Robust LLaVA): Align features of corrupted and clean images within the visual encoder via adversarial training. *Limitation*: lacks interpretability and fails to explicitly model the corruption process.

2. **White-box text-based reasoning** (e.g., Robust-R1): Uses explicit textual chains to describe corruption types and semantic impacts. *Limitation*: cannot represent pixel-level details; textual descriptions cannot restore lost visual information.

### Research Question

> Can MLLMs recover corrupted visual content by themselves?

The paper proposes that achieving **self-recovery** — where the model actively restores lost pixel-level information rather than merely offering text-based compensation — would establish a more intrinsic and complete form of robustness.

### Theoretical Formulation

Consider a standard multimodal pipeline with clean image $I_o \in \mathbb{R}^{H \times W \times 3}$ and query $Q$:

$$A_o = \mathcal{F}_{\text{MLLM}}(I_o, Q; \Theta)$$

where $\Theta$ denotes model parameters. Under real-world corruption $I_c = \mathcal{D}(I_o)$, where $\mathcal{D}$ is the corruption function, performance degrades significantly.

The proposed robust formulation explicitly incorporates a visual self-recovery process:

$$A = \mathcal{F}_{\text{MLLM}}^{\text{(Robust)}}(I_c, Q; \Theta) = \mathcal{F}_{\text{MLLM}}\left(\underbrace{\mathcal{D}^{-1}(I_c)}_{I_r}, I_c, Q; \Theta\right)$$

where $\mathcal{D}^{-1}: I_c \mapsto I_r$ represents the self-recovery module approximating the inverse corruption process, and the model synthesizes information from both $I_c$ and $I_r$ for robust reasoning.
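
The robust formulation can be read as a two-step call: recover first, then answer from both images. The sketch below only illustrates that data flow. `robust_answer`, `recover`, `answer`, and the toy stand-ins are hypothetical names chosen for this example and are not taken from the Robust-U1 code base.

```python
from typing import Callable, List

Image = List[List[float]]  # toy stand-in for an H x W image

def robust_answer(
    corrupted: Image,
    query: str,
    recover: Callable[[Image], Image],           # hypothetical D^{-1}
    answer: Callable[[Image, Image, str], str],  # hypothetical F_MLLM
) -> str:
    """A = F_MLLM(D^{-1}(I_c), I_c, Q): recover, then reason over both images."""
    recovered = recover(corrupted)               # I_r = D^{-1}(I_c)
    return answer(recovered, corrupted, query)   # consumes I_r and I_c together

# Toy stand-ins so the sketch runs end to end (assumptions, not the real model).
def clamp_recover(img: Image) -> Image:
    """Pretend recovery: clamp out-of-range pixels back into [0, 1]."""
    return [[min(1.0, max(0.0, px)) for px in row] for row in img]

def toy_answer(recovered: Image, corrupted: Image, query: str) -> str:
    """Pretend reasoning: report how many pixels differ between I_r and I_c."""
    changed = sum(
        r != c
        for rec_row, cor_row in zip(recovered, corrupted)
        for r, c in zip(rec_row, cor_row)
    )
    return f"{query} -> {changed} pixel(s) repaired"

result = robust_answer([[0.2, 1.7], [-0.3, 0.5]], "count", clamp_recover, toy_answer)
```

Passing both images to `answer` mirrors the formulation above, in which the model sees $I_c$ alongside $I_r$ rather than the recovered image alone.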

## Methodology

### Overview of Three-Stage Framework

The framework is built on **BAGEL**, a pre-trained unified MLLM supporting both multimodal understanding and generation.

#### Stage I: Supervised Fine-Tuning for Visual Self-Recovery (SFT)

- Goal: Transform the model's general generative capability into a dedicated visual self-recovery module.
- The recovery process is conditioned on a specific recovery prompt $P_{\text{rec}}$.
- Uses the **rectified flow formulation** (Liu et al., 2023) in latent space.
- The corrupted image is encoded into a latent representation $Z_c$; the model denoises a noisy version of the clean latent $Z_o$, conditioned on $Z_c$ and $P_{\text{rec}}$.
- Objective function:

$$L_{\text{SFT}} = \mathbb{E}_{t \sim \mathcal{U}(0,1), \epsilon \sim \mathcal{N}(0, I)} \left[ \| \epsilon - \epsilon_\Theta(Z_c, Z_o(t), t, P_{\text{rec}}) \|^2 \right]$$

where $Z_o(t) = (1-t)Z_o + t\epsilon$ is the noisy latent at timestep $t$, and $\epsilon_\Theta$ is the noise prediction network.
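
As a rough illustration of the objective as written above, the following sketch draws one Monte Carlo sample of $L_{\text{SFT}}$ on toy latent vectors. The predictor is a hypothetical callable standing in for $\epsilon_\Theta$, the recovery prompt is omitted, and the "oracle" predictor exists only to show that a perfect noise estimate drives the loss to zero.

```python
import random

def noisy_latent(z_o, eps, t):
    """Z_o(t) = (1 - t) * Z_o + t * eps, applied element-wise."""
    return [(1.0 - t) * z + t * e for z, e in zip(z_o, eps)]

def sft_loss_sample(z_c, z_o, predict_noise, rng):
    """One sample of ||eps - eps_theta(Z_c, Z_o(t), t)||^2 (prompt omitted)."""
    t = rng.random()                              # t ~ U(0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in z_o]      # eps ~ N(0, I)
    z_t = noisy_latent(z_o, eps, t)
    eps_hat = predict_noise(z_c, z_t, t)
    return sum((e - eh) ** 2 for e, eh in zip(eps, eps_hat))

def make_oracle(z_o):
    """Hypothetical perfect predictor: inverts the interpolation using Z_o."""
    def predict(z_c, z_t, t):
        return [(zt - (1.0 - t) * z) / t for zt, z in zip(z_t, z_o)]
    return predict

rng = random.Random(0)
z_o = [0.5, -1.0, 2.0]   # toy clean latent
z_c = [0.4, -0.8, 1.5]   # toy corrupted latent (conditioning input)

oracle_loss = sft_loss_sample(z_c, z_o, make_oracle(z_o), rng)
zero_loss = sft_loss_sample(z_c, z_o, lambda z_c, z_t, t: [0.0] * len(z_t), rng)
```

In training, the expectation would be approximated by averaging such samples over a batch; a predictor that always outputs zero (`zero_loss`) pays the full squared norm of the sampled noise.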

#### Stage II: Aligning Higher Visual Quality through Reinforcement Learning

Uses **Flow-GRPO** (Liu et al., 2025b) to optimize the self-recovery module with a dual-reward objective:

**Pixel-Level Structural Reward** — SSIM (Structural Similarity Index Measure):

$$R_{\text{pix}}(I_r, I_o) = \text{SSIM}(I_r, I_o) = \frac{1}{N} \sum_{i=1}^N \left[ l(p_r^i, p_o^i) \cdot c(p_r^i, p_o^i) \cdot s(p_r^i, p_o^i) \right]$$

where the three components for luminance, contrast, and structure are:

$$l(p_r, p_o) = \frac{2\mu_r \mu_o + C_1}{\mu_r^2 + \mu_o^2 + C_1}, \quad c(p_r, p_o) = \frac{2\sigma_r \sigma_o + C_2}{\sigma_r^2 + \sigma_o^2 + C_2}, \quad s(p_r, p_o) = \frac{\sigma_{ro} + C_3}{\sigma_r \sigma_o + C_3}$$

with $\mu_r, \mu_o$ patch means, $\sigma_r, \sigma_o$ patch standard deviations, $\sigma_{ro}$ covariance, and $C_1, C_2, C_3$ small constants for numerical stability.
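
The three SSIM components can be computed directly from patch statistics. The sketch below is a simplified illustration on grayscale images stored as nested lists. It assumes non-overlapping square patches, population statistics, and the conventional constants $C_1 = 0.01^2$, $C_2 = 0.03^2$, $C_3 = C_2/2$ for pixel values in $[0, 1]$; the summary does not state the patching scheme or constants used in the paper.

```python
import math

def patch_stats(p, q):
    """Means, standard deviations, and covariance of two flattened patches."""
    n = len(p)
    mu_p, mu_q = sum(p) / n, sum(q) / n
    var_p = sum((x - mu_p) ** 2 for x in p) / n
    var_q = sum((y - mu_q) ** 2 for y in q) / n
    cov = sum((x - mu_p) * (y - mu_q) for x, y in zip(p, q)) / n
    return mu_p, mu_q, math.sqrt(var_p), math.sqrt(var_q), cov

def ssim_patch(p, q, c1=0.01 ** 2, c2=0.03 ** 2):
    """l * c * s for one patch pair; the constants are assumed defaults."""
    c3 = c2 / 2
    mu_r, mu_o, sd_r, sd_o, cov = patch_stats(p, q)
    lum = (2 * mu_r * mu_o + c1) / (mu_r ** 2 + mu_o ** 2 + c1)
    con = (2 * sd_r * sd_o + c2) / (sd_r ** 2 + sd_o ** 2 + c2)
    struct = (cov + c3) / (sd_r * sd_o + c3)
    return lum * con * struct

def ssim(img_r, img_o, patch=2):
    """Mean of per-patch scores over non-overlapping patch x patch tiles."""
    h, w = len(img_r), len(img_r[0])
    scores = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            cells = [(a, b) for a in range(i, i + patch) for b in range(j, j + patch)]
            p = [img_r[a][b] for a, b in cells]
            q = [img_o[a][b] for a, b in cells]
            scores.append(ssim_patch(p, q))
    return sum(scores) / len(scores)

clean = [
    [0.1, 0.9, 0.2, 0.8],
    [0.9, 0.1, 0.8, 0.2],
    [0.3, 0.7, 0.4, 0.6],
    [0.7, 0.3, 0.6, 0.4],
]
inverted = [[1.0 - px for px in row] for row in clean]
```

Here `ssim(clean, clean)` is 1 up to floating-point error, while `ssim(inverted, clean)` is negative because each patch is anti-correlated with its reference. Production SSIM implementations usually use overlapping Gaussian-weighted windows instead of disjoint tiles.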

**Semantic Consistency Reward** — uses frozen TinyCLIP to compute cosine similarity:

$$\text{Sim}(M_{\text{CLIP}}(I_r), M_{\text{CLIP}}(I_o)) = \frac{M_{\text{CLIP}}(I_r) \cdot M_{\text{CLIP}}(I_o)}{\|M_{\text{CLIP}}(I_r)\| \|M_{\text{CLIP}}(I_o)\|}$$

$$R_{\text{sem}}(I_r, I_o) = \exp\left(-\alpha \cdot (1 - \text{Sim}(M_{\text{CLIP}}(I_r), M_{\text{CLIP}}(I_o)))\right)$$

where $\alpha > 0$ controls sensitivity to semantic deviations. The reward reaches its maximum of 1 when the similarity is 1 and decays exponentially as the similarity decreases.
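
A minimal sketch of the semantic reward, operating on embedding vectors that are assumed to have already been produced by the frozen CLIP encoder. The default `alpha=1.0` is a placeholder; the summary does not give the value used in the paper.

```python
import math

def cosine_similarity(u, v):
    """Sim(u, v) = (u . v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def semantic_reward(emb_r, emb_o, alpha=1.0):
    """R_sem = exp(-alpha * (1 - Sim)); alpha is an assumed placeholder value."""
    return math.exp(-alpha * (1.0 - cosine_similarity(emb_r, emb_o)))

same = semantic_reward([3.0, 4.0], [3.0, 4.0])                   # similarity 1
orthogonal = semantic_reward([1.0, 0.0], [0.0, 1.0], alpha=2.0)  # similarity 0
```

With identical embeddings the reward is 1; with orthogonal embeddings it falls to $e^{-\alpha}$, so a larger $\alpha$ penalizes the same semantic drift more heavily.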

**Optimization**: For each $I_c$, the model samples $G$ trajectories (via stochastic ODE-to-SDE conversion) to obtain recovered images $\{I_r^i\}_{i=1}^G$. Advantages are computed by group-normalizing the composite rewards, and the policy is updated with a KL-divergence penalty.
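
The group normalization step can be sketched as follows. The summary does not say how the two rewards are combined, so the weighted sum in `composite_reward` is an assumption, as are the weights, the `eps` stabilizer, and the toy reward values. The KL-penalized policy update itself is not shown.

```python
import statistics

def composite_reward(r_pix, r_sem, w_pix=1.0, w_sem=1.0):
    """Assumed combination: a weighted sum of the pixel and semantic rewards."""
    return w_pix * r_pix + w_sem * r_sem

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (R_i - mean(R)) / (std(R) + eps) within one group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy (R_pix, R_sem) pairs for G = 4 recoveries of the same corrupted image.
samples = [(0.60, 0.90), (0.70, 0.95), (0.50, 0.80), (0.65, 0.85)]
rewards = [composite_reward(p, s) for p, s in samples]
advantages = group_advantages(rewards)
```

Because advantages are centered within the group, a recovery is rewarded only for being better than its siblings from the same $I_c$, which removes the need for a separate learned value baseline.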

#### Stage III: Multimodal Reasoning for Robust Understanding

- Structure input as interleaved sequence of corrupted image $I_c$ and recovered image $I_r$, followed by query $Q$.
- Train model to generate answer $A$ (with reasoning chain) conditioned on this input.
- Training objective (next-token prediction):

$$L_{\text{MLLM}} = -\mathbb{E}_{(I_c, I_r, Q, A^*)} \sum_{t=1}^L \log P_\Theta(a_t^* | a_{<t}^*, I_c, I_r, Q)$$
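
For a single training example, the objective reduces to the summed negative log-probability of the reference answer tokens. The sketch below assumes the per-token probabilities $P_\Theta(a_t^* \mid a_{<t}^*, I_c, I_r, Q)$ have already been obtained from the model; the numbers are made up for illustration.

```python
import math

def sequence_nll(token_probs):
    """-sum_t log P(a_t* | a_<t*, I_c, I_r, Q) for one reference answer."""
    return -sum(math.log(p) for p in token_probs)

# Toy probabilities the model assigns to each reference token in turn.
uncertain = sequence_nll([0.5, 0.25, 0.5])   # equals log(16)
confident = sequence_nll([1.0, 1.0, 1.0])    # perfectly predicted answer
```

Averaging this quantity over the training set gives a Monte Carlo estimate of $L_{\text{MLLM}}$.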

## Empirical Validation / Results

### Experimental Setup

- **Base model**: BAGEL (unified MLLM)
- **Training data**: SFT on ImageNet-C; RL and multimodal reasoning on Robust-R1 training data
- **Benchmarks**:
  - Real-world corruption: **R-Bench** (MCQ, VQA, CAP tasks across low/mid/high degradation)
  - Adversarial corruption: **MMMB**, **MMStar**, **RealWorldQA** (with synthetic degradations at 25%, 50%, 100% intensity)
- **Baselines**: General MLLMs (Qwen2.5-VL-3B, Gemma3-4B, InternVL-4B, BAGEL) and Robust MLLMs (TeCoA, Robust CLIP, Robust LLaVA, Robust-R1)

### Main Results – Real-World Corruptions

**Table 1. Quantitative evaluation on R-Bench. Best results in bold.**

| Category | Method | MCQ low/mid/high | VQA low/mid/high | CAP low/mid/high | Overall |
|----------|--------|------------------|------------------|------------------|---------|
| General MLLM | Qwen2.5-VL-3B | 0.6411/0.6022/0.5732 | 0.4872/0.4854/0.4904 | 0.3778/0.3704/0.3330 | 0.4845 |
| | Gemma3-4B | 0.5823/0.5776/0.5060 | 0.4865/0.4630/0.4419 | 0.4048/0.3746/0.3480 | 0.4649 |
| | InternVL-4B | 0.6235/0.6024/0.5914 | 0.4982/0.4539/0.5108 | 0.3667/0.3041/0.2851 | 0.4706 |
| | BAGEL | 0.7176/0.6584/0.5793 | 0.6497/0.6127/0.6150 | 0.4685/0.4633/0.4288 | 0.5770 |
| Robust MLLM | TeCoA | 0.4647/0.4223/0.4024 | 0.4687/0.3994/0.4461 | 0.2111/0.2195/0.1937 | 0.3586 |
| | Robust CLIP | 0.4705/0.4658/0.4024 | 0.4503/0.4339/0.4743 | 0.2290/0.2219/0.1983 | 0.3718 |
| | Robust LLaVA | 0.3352/0.2608/0.3048 | 0.2607/0.2212/0.2443 | 0.0068/0.0065/0.0067 | 0.1830 |
| | Robust-R1 | 0.6529/0.6391/0.6097 | 0.4914/0.4909/0.4980 | 0.4068/0.3781/0.3484 | 0.5017 |
| **Ours** | **Robust-U1** | **0.7353/0.7329/0.6768** | **0.7067/0.7164/0.6934** | **0.8272/0.8059/0.7640** | **0.7398** |

Robust-U1 outperforms all baselines across all tasks and intensities. On MCQ, its margin over the base BAGEL widens as corruption severity increases, from 0.0177 at low to 0.0975 at high degradation.

### Main Results – Adversarial Corruptions

**Table 3. Quantitative evaluation on anti-degradation for MMMB, MMStar, RealWorldQA. Best results in bold.**

| Category | Method | MMMB clean/25%/50%/100% | MMStar clean/25%/50%/100% | RealWorldQA clean/25%/50%/100% |
|----------|--------|--------------------------|----------------------------|--------------------------------|
| General MLLM | Qwen2.5-VL-3B | 80.60/79.19/78.68/74.50 | 54.73/52.90/51.86/48.66 | 65.22/64.96/63.39/60.65 |
| | Gemma3-4B | 71.01/70.30/70.20/69.14 | 43.93/43.20/42.60/41.33 | 55.42/54.77/53.72/52.81 |
| | InternVL-4B | 77.97/77.47/76.66/74.59 | 51.53/50.26/49.60/46.93 | 57.38/58.16/57.64/54.90 |
| | BAGEL | 81.92/81.16/80.56/78.48 | 66.13/64.67/61.33/59.60 | 68.76/65.75/67.84/63.14 |
| Robust MLLM | TeCoA | 57.17/65.71/56.11/51.76 | 30.46/30.60/30.73/28.06 | 40.00/39.73/39.47/38.69 |
| | Robust CLIP | 58.83/58.28/57.97/53.33 | 33.00/32.26/31.80/29.46 | 43.26/42.48/42.61/41.43 |
| | Robust-R1 | 81.41/79.49/79.04/75.35 | 56.86/54.40/53.60/49.53 | 67.71/66.40/67.05/63.26 |
| **Ours** | **Robust-U1** | **84.75/84.14/83.54/83.18** | **67.20/65.80/64.87/63.87** | **72.81/72.81/71.50/67.46** |

Robust-U1 achieves minimal performance drops: only 1.57 points on MMMB from clean to 100% corruption (vs. 3.44 for BAGEL and 6.06 for Robust-R1).
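
The quoted drops follow directly from the clean and 100% columns of the MMMB results in Table 3, as this small check shows. The dictionary simply restates those table entries.

```python
# (clean, 100% corruption) MMMB scores copied from Table 3.
mmmb = {
    "Robust-U1": (84.75, 83.18),
    "BAGEL": (81.92, 78.48),
    "Robust-R1": (81.41, 75.35),
}

# Drop in points from clean input to full-intensity corruption.
drops = {name: round(clean - full, 2) for name, (clean, full) in mmmb.items()}
```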

### Quality of Visual Recovery

**Table 5. Quantitative evaluation of visual recovery quality. Best results in bold.**

| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|--------|--------|--------|---------|
| BAGEL | 14.37 | 0.4722 | 0.5092 |
| + SFT on ImageNet-C | 20.88 | 0.6135 | 0.3444 |
| + RL w. $R_{\text{pix}}$ | 21.45 | 0.6311 | 0.3299 |
| + RL w. $R_{\text{sem}}$ | 21.33 | 0.6285 | 0.3233 |
| **Ours (Robust-U1)** | **21.49** | **0.6314** | **0.3223** |

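Each stage contributes to improved recovery: SFT accounts for most of the gain (PSNR rises from 14.37 to 20.88), and RL adds a smaller further improvement. The full model with both rewards achieves the best score on all three metrics, though its margin over the single-reward variants is small.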
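
For readers unfamiliar with the PSNR column, the standard definition is $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The sketch below applies it to toy grayscale images in $[0, 1]$; the function and data are illustrative and are not the paper's evaluation code.

```python
import math

def psnr(img_r, img_o, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    diffs = [
        (r - o) ** 2
        for row_r, row_o in zip(img_r, img_o)
        for r, o in zip(row_r, row_o)
    ]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return math.inf          # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

close = psnr([[0.5, 0.5]], [[0.4, 0.6]])   # every pixel off by 0.1
exact = psnr([[0.5, 0.5]], [[0.5, 0.5]])
```

A uniform per-pixel error of 0.1 on a unit range gives roughly 20 dB, comparable in scale to the recovered-image scores in Table 5.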

### Ablation Study

**Table 4. Ablation on R-Bench.**

| Method | MCQ low/mid/high | VQA low/mid/high | CAP low/mid/high | Overall |
|--------|------------------|------------------|------------------|---------|
| Baseline (BAGEL) | 0.7176/0.6584/0.5793 | 0.6497/0.6127/0.6150 | 0.4685/0.4633/0.4288 | 0.5770 |
| **Ours (Robust-U1)** | **0.7353/0.7329/0.6768** | **0.7067/0.7164/0.6934** | **0.8272/0.8059/0.7640** | **0.7398** |
| w/o Multimodal Reasoning | 0.7294/0.6957/0.
