Summary of "Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?"
Summary (Overview)
- Novel self-recovery paradigm: Proposes Robust-U1, the first framework that equips Multimodal Large Language Models (MLLMs) with explicit visual self-recovery capability, directly reconstructing clean images from corrupted inputs rather than relying on implicit feature alignment or text-only reasoning.
- Three-stage training pipeline: (i) Supervised fine-tuning (SFT) on ImageNet-C for foundational reconstruction, (ii) Reinforcement learning (Flow-GRPO) with dual rewards — pixel-level SSIM and semantic-level CLIP similarity — to align structural and semantic fidelity, (iii) Multimodal reasoning that jointly considers both corrupted and recovered images for robust understanding.
- State-of-the-art performance: Achieves significant improvements on R-Bench (overall 0.7398 vs. 0.5770 for base BAGEL) and maintains minimal performance drops under adversarial corruptions on MMMB, MMStar, and RealWorldQA.
- Self-recovery enhances reasoning: Quantitative analysis confirms that high-quality visual recovery directly improves downstream reasoning performance, validating self-recovery as a critical mechanism for robust visual understanding.
- Code released: Source code available at github.com/jqtangust/Robust-U1.
Introduction and Theoretical Foundation
Background and Motivation
Multimodal Large Language Models (MLLMs) have achieved remarkable success in visual understanding through large-scale pretraining. However, their real-world deployment is critically hindered by vulnerability to visual corruptions — such as system noise, compression artifacts, and adverse weather — that severely disrupt visual features and cause dramatic performance degradation.
Limitations of Existing Approaches
Existing robustness enhancement methods fall into two categories:
- Black-box feature alignment (e.g., TeCoA, Robust CLIP, Robust LLaVA): Align features of corrupted and clean images within the visual encoder via adversarial training. Limitation: lacks interpretability and fails to explicitly model the corruption process.
- White-box text-based reasoning (e.g., Robust-R1): Uses explicit textual chains to describe corruption types and semantic impacts. Limitation: cannot represent pixel-level details; textual descriptions cannot restore lost visual information.
Research Question
Can MLLMs recover corrupted visual content by themselves?
The paper proposes that achieving self-recovery — where the model actively restores lost pixel-level information rather than merely offering text-based compensation — would establish a more intrinsic and complete form of robustness.
Theoretical Formulation
Consider a standard multimodal pipeline with clean image $I$ and query $q$:
$$y = \mathcal{M}_\theta(I, q),$$
where $\theta$ denotes model parameters. Under real-world corruption $I_c = \mathcal{C}(I)$, where $\mathcal{C}$ is the corruption function, performance degrades significantly.
The proposed robust formulation explicitly incorporates a visual self-recovery process:
$$y = \mathcal{M}_\theta(I_c, \hat{I}, q), \quad \hat{I} = \mathcal{R}_\theta(I_c),$$
where $\mathcal{R}_\theta$ represents the self-recovery module approximating the inverse corruption process, and the model synthesizes information from both $I_c$ and $\hat{I}$ for robust reasoning.
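To make the data flow concrete, here is a minimal sketch of the recover-then-reason pipeline. The function names `robust_answer`, `toy_recover`, and `toy_reason` are hypothetical stand-ins, and the toy image type (a flat list of floats) is an assumption for illustration only; in the paper both roles are played by the same unified MLLM.

```python
from typing import Callable, List

Image = List[float]  # toy image: flat list of pixel intensities (assumption)


def robust_answer(
    corrupted: Image,
    query: str,
    recover: Callable[[Image], Image],
    reason: Callable[[Image, Image, str], str],
) -> str:
    """Recover the image first, then reason over BOTH views plus the query."""
    recovered = recover(corrupted)
    return reason(corrupted, recovered, query)


def toy_recover(image: Image) -> Image:
    # Hypothetical "recovery": clip out-of-range pixels back into [0, 1].
    return [min(1.0, max(0.0, p)) for p in image]


def toy_reason(corrupted: Image, recovered: Image, query: str) -> str:
    # Hypothetical "reasoning": report how many pixels the recovery changed.
    changed = sum(1 for a, b in zip(corrupted, recovered) if a != b)
    return f"{query} -> {changed} pixel(s) repaired"


result = robust_answer([0.2, 1.4, -0.1], "describe", toy_recover, toy_reason)
print(result)
```

The point of the structure is that the reasoning step receives the corrupted input as well as the recovered one, so it does not depend on the recovery being perfect.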
Methodology
Overview of Three-Stage Framework
The framework is built on BAGEL, a pre-trained unified MLLM supporting both multimodal understanding and generation.
Stage I: Supervised Fine-Tuning for Visual Self-Recovery (SFT)
- Goal: Transform the model's general generative capability into a dedicated visual self-recovery module.
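- The recovery process is conditioned on a specific recovery prompt $p_r$.
- Uses rectified flow formulation (Liu et al., 2023) in latent space.
- The clean image $I$ is encoded into a latent representation $z_0$; the model denoises a noisy version of the clean latent conditioned on the corrupted image $I_c$ and $p_r$.
- Objective function:
$$\mathcal{L}_{\text{SFT}} = \mathbb{E}_{z_0, \epsilon, t}\left[\left\| \epsilon_\theta(z_t, t, I_c, p_r) - \epsilon \right\|_2^2\right],$$
where $z_t$ is the noisy latent at timestep $t$, and $\epsilon_\theta$ is the noise prediction network.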
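As a rough illustration of how a rectified-flow training pair can be built, the sketch below uses the straight-line interpolation between a clean latent and Gaussian noise. The path direction (`t = 0` is clean, `t = 1` is noise), the velocity-style regression target, and all function names are assumptions; the summary does not specify which parametrization the authors use.

```python
import random


def interpolate(clean_latent, noise, t):
    """Noisy latent on the straight path: t=0 gives clean, t=1 gives noise."""
    return [(1 - t) * z + t * e for z, e in zip(clean_latent, noise)]


def velocity_target(clean_latent, noise):
    """Constant velocity along the straight path (one common regression target)."""
    return [e - z for z, e in zip(clean_latent, noise)]


def mse(prediction, target):
    """Mean squared error between two equally sized vectors."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(prediction)


# One toy training example with a 4-dimensional latent.
rng = random.Random(0)
z0 = [0.5, -1.0, 0.25, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in z0]
t = rng.random()
z_t = interpolate(z0, eps, t)        # network input (plus conditioning)
target = velocity_target(z0, eps)    # what the network would regress to
```

In an actual training loop the network's prediction for `z_t`, conditioned on the corrupted image and the recovery prompt, would be passed to `mse` against the chosen target.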
Stage II: Aligning Higher Visual Quality through Reinforcement Learning
Uses Flow-GRPO (Liu et al., 2025b) to optimize the self-recovery module with a dual-reward objective:
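Pixel-Level Structural Reward: SSIM (Structural Similarity Index Measure) between the recovered image $\hat{I}$ and the clean image $I$:
$$R_{\text{SSIM}} = \text{SSIM}(\hat{I}, I) = l(\hat{I}, I) \cdot c(\hat{I}, I) \cdot s(\hat{I}, I),$$
where the three components for luminance, contrast, and structure are:
$$l(x, y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x, y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x, y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3},$$
with patch means $\mu_x, \mu_y$, patch standard deviations $\sigma_x, \sigma_y$, covariance $\sigma_{xy}$, and small constants $C_1, C_2, C_3$ for numerical stability.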
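The luminance, contrast, and structure terms can be computed directly from first- and second-order statistics. The sketch below is a simplified single-window version that treats the whole (flattened) image as one patch; practical SSIM implementations average over sliding windows. The constants follow the commonly used defaults (`k1 = 0.01`, `k2 = 0.03`, `C3 = C2 / 2`), which are assumed here rather than taken from the paper.

```python
import math


def ssim_components(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Return (luminance, contrast, structure) for two equally sized patches."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    c1 = (k1 * data_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * data_range) ** 2  # stabilizes the contrast term
    c3 = c2 / 2                  # stabilizes the structure term

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return luminance, contrast, structure


def ssim(x, y, **kwargs):
    """Single-window SSIM: product of the three components."""
    l, c, s = ssim_components(x, y, **kwargs)
    return l * c * s


clean = [0.1, 0.5, 0.9, 0.3]
inverted = [1.0 - p for p in clean]  # same contrast, opposite structure
print(ssim(clean, clean), ssim(clean, inverted))
```

An identical pair scores 1 (up to floating-point error), while the inverted patch is penalized mainly through the structure term because its covariance with the clean patch is negative.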
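Semantic Consistency Reward: uses a frozen TinyCLIP image encoder $f$ to compute the cosine similarity between the recovered and clean images:
$$R_{\text{CLIP}} = \exp\left(-\lambda\,\bigl(1 - \cos(f(\hat{I}), f(I))\bigr)\right),$$
where $\lambda$ controls sensitivity to semantic deviations. The reward reaches its maximum of 1 when the similarity is 1 and decays exponentially as the similarity decreases.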
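A reward with the stated behavior (equal to 1 at perfect similarity, decaying exponentially below it) can be sketched as follows. The exact functional form, the parameter name `sensitivity`, and its default value are assumptions; the plain lists stand in for image embeddings that a frozen encoder such as TinyCLIP would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_reward(emb_recovered, emb_clean, sensitivity=5.0):
    """Exponentially decaying reward: 1.0 when the embeddings are aligned."""
    sim = cosine_similarity(emb_recovered, emb_clean)
    return math.exp(-sensitivity * (1.0 - sim))


aligned = semantic_reward([2.0, 0.0], [3.0, 0.0])     # same direction
orthogonal = semantic_reward([1.0, 0.0], [0.0, 1.0])  # unrelated content
print(aligned, orthogonal)
```

A larger `sensitivity` makes the reward fall off faster for small semantic drifts, which is the role the summary attributes to the sensitivity parameter.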
Optimization: For each corrupted input $I_c$, sample $G$ trajectories (via stochastic ODE-to-SDE conversion) to get candidate recoveries $\{\hat{I}_i\}_{i=1}^{G}$. Advantages are computed via group normalization of the composite rewards, and the policy is updated with a KL divergence penalty.
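Group normalization here means standardizing each sample's reward against the other samples drawn for the same input. A minimal sketch follows; the equal reward weights, the use of the population standard deviation, and the `eps` stabilizer are assumptions, not details confirmed by the summary.

```python
import statistics


def composite_reward(r_ssim, r_semantic, w_ssim=0.5, w_semantic=0.5):
    """Weighted sum of the two rewards (weights are hypothetical)."""
    return w_ssim * r_ssim + w_semantic * r_semantic


def group_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled trajectories."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four candidate recoveries for one corrupted input: (SSIM, semantic) rewards.
samples = [(0.60, 0.70), (0.65, 0.80), (0.55, 0.60), (0.70, 0.90)]
rewards = [composite_reward(s, c) for s, c in samples]
advantages = group_advantages(rewards)
print(advantages)
```

Samples that beat the group mean get positive advantages and are reinforced; if every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.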
Stage III: Multimodal Reasoning for Robust Understanding
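- Structure input as interleaved sequence of corrupted image $I_c$ and recovered image $\hat{I}$, followed by query $q$.
- Train model to generate answer $a$ (with reasoning chain) conditioned on this input.
- Training objective (next-token prediction):
$$\mathcal{L}_{\text{reason}} = -\sum_{t=1}^{T} \log p_\theta\left(a_t \mid a_{<t}, I_c, \hat{I}, q\right),$$
where $a_t$ is the $t$-th token of the answer sequence of length $T$.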
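A next-token prediction objective reduces to the negative log-likelihood of each target token under the model's predicted distribution. The sketch below averages over the sequence (whether the paper sums or averages is an assumption), and the probability tables are made-up numbers standing in for model outputs conditioned on the two images and the query.

```python
import math


def next_token_nll(step_probs, target_ids):
    """Mean negative log-likelihood of the target tokens.

    step_probs[t][k] is the model's probability of vocabulary item k at step t.
    """
    if len(step_probs) != len(target_ids):
        raise ValueError("one distribution is required per target token")
    total = sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target_ids))
    return -total / len(target_ids)


# Two decoding steps over a toy 2-token vocabulary.
probs = [[0.5, 0.5], [0.25, 0.75]]
targets = [0, 1]
loss = next_token_nll(probs, targets)
print(loss)
```

The loss is zero only when the model assigns probability 1 to every target token, and it grows as probability mass moves away from the reference answer.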
Empirical Validation / Results
Experimental Setup
- Base model: BAGEL (unified MLLM)
- Training data: SFT on ImageNet-C; RL and multimodal reasoning on Robust-R1 training data
- Benchmarks:
- Real-world corruption: R-Bench (MCQ, VQA, CAP tasks across low/mid/high degradation)
- Adversarial corruption: MMMB, MMStar, RealWorldQA (with synthetic degradations at 25%, 50%, 100% intensity)
- Baselines: General MLLMs (Qwen2.5-VL-3B, Gemma3-4B, InternVL-4B, BAGEL) and Robust MLLMs (TeCoA, Robust CLIP, Robust LLaVA, Robust-R1)
Main Results – Real-World Corruptions
Table 1. Quantitative evaluation on R-Bench.
| Category | Method | MCQ low/mid/high | VQA low/mid/high | CAP low/mid/high | Overall |
|---|---|---|---|---|---|
| General MLLM | Qwen2.5-VL-3B | 0.6411/0.6022/0.5732 | 0.4872/0.4854/0.4904 | 0.3778/0.3704/0.3330 | 0.4845 |
| | Gemma3-4B | 0.5823/0.5776/0.5060 | 0.4865/0.4630/0.4419 | 0.4048/0.3746/0.3480 | 0.4649 |
| | InternVL-4B | 0.6235/0.6024/0.5914 | 0.4982/0.4539/0.5108 | 0.3667/0.3041/0.2851 | 0.4706 |
| | BAGEL | 0.7176/0.6584/0.5793 | 0.6497/0.6127/0.6150 | 0.4685/0.4633/0.4288 | 0.5770 |
| Robust MLLM | TeCoA | 0.4647/0.4223/0.4024 | 0.4687/0.3994/0.4461 | 0.2111/0.2195/0.1937 | 0.3586 |
| | Robust CLIP | 0.4705/0.4658/0.4024 | 0.4503/0.4339/0.4743 | 0.2290/0.2219/0.1983 | 0.3718 |
| | Robust LLaVA | 0.3352/0.2608/0.3048 | 0.2607/0.2212/0.2443 | 0.0068/0.0065/0.0067 | 0.1830 |
| | Robust-R1 | 0.6529/0.6391/0.6097 | 0.4914/0.4909/0.4980 | 0.4068/0.3781/0.3484 | 0.5017 |
| Ours | Robust-U1 | 0.7353/0.7329/0.6768 | 0.7067/0.7164/0.6934 | 0.8272/0.8059/0.7640 | 0.7398 |
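Robust-U1 outperforms all baselines across all tasks and intensities. The margin over base BAGEL is largest on captioning (0.8272 vs. 0.4685 at low degradation), and on MCQ it widens as corruption severity increases, from +0.0177 at low to +0.0975 at high degradation.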
Main Results – Adversarial Corruptions
Table 3. Quantitative evaluation on anti-degradation for MMMB, MMStar, RealWorldQA.
| Category | Method | MMMB clean/25%/50%/100% | MMStar clean/25%/50%/100% | RealWorldQA clean/25%/50%/100% |
|---|---|---|---|---|
| General MLLM | Qwen2.5-VL-3B | 80.60/79.19/78.68/74.50 | 54.73/52.90/51.86/48.66 | 65.22/64.96/63.39/60.65 |
| | Gemma3-4B | 71.01/70.30/70.20/69.14 | 43.93/43.20/42.60/41.33 | 55.42/54.77/53.72/52.81 |
| | InternVL-4B | 77.97/77.47/76.66/74.59 | 51.53/50.26/49.60/46.93 | 57.38/58.16/57.64/54.90 |
| | BAGEL | 81.92/81.16/80.56/78.48 | 66.13/64.67/61.33/59.60 | 68.76/65.75/67.84/63.14 |
| Robust MLLM | TeCoA | 57.17/65.71/56.11/51.76 | 30.46/30.60/30.73/28.06 | 40.00/39.73/39.47/38.69 |
| | Robust CLIP | 58.83/58.28/57.97/53.33 | 33.00/32.26/31.80/29.46 | 43.26/42.48/42.61/41.43 |
| | Robust-R1 | 81.41/79.49/79.04/75.35 | 56.86/54.40/53.60/49.53 | 67.71/66.40/67.05/63.26 |
| Ours | Robust-U1 | 84.75/84.14/83.54/83.18 | 67.20/65.80/64.87/63.87 | 72.81/72.81/71.50/67.46 |
Robust-U1 achieves minimal performance drops: only 1.57 points on MMMB from clean to 100% corruption (vs. 3.44 for BAGEL and 6.06 for Robust-R1).
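The quoted drops follow directly from the MMMB column of Table 3 (clean score minus score at 100% corruption). A short check, using only values copied from the table; the variable and function names are illustrative:

```python
# MMMB accuracy: (clean, 100% corruption), copied from Table 3.
mmmb = {
    "Robust-U1": (84.75, 83.18),
    "BAGEL": (81.92, 78.48),
    "Robust-R1": (81.41, 75.35),
}


def absolute_drop(clean, corrupted):
    """Accuracy lost between the clean and fully corrupted settings, in points."""
    return round(clean - corrupted, 2)


drops = {name: absolute_drop(*scores) for name, scores in mmmb.items()}
print(drops)
```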
Quality of Visual Recovery
Table 5. Quantitative evaluation of visual recovery quality.
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| BAGEL | 14.37 | 0.4722 | 0.5092 |
| + SFT on ImageNet-C | 20.88 | 0.6135 | 0.3444 |
| + RL w. | 21.45 | 0.6311 | 0.3299 |
| + RL w. | 21.33 | 0.6285 | 0.3233 |
| Ours (Robust-U1) | 21.49 | 0.6314 | 0.3223 |
Each stage improves recovery quality over the previous one. The full model with both rewards achieves the best score on all three metrics (PSNR 21.49, SSIM 0.6314, LPIPS 0.3223).
Ablation Study
Table 4. Ablation on R-Bench.
| Method | MCQ low/mid/high | VQA low/mid/high | CAP low/mid/high | Overall |
|---|---|---|---|---|
| Baseline (BAGEL) | 0.7176/0.6584/0.5793 | 0.6497/0.6127/0.6150 | 0.4685/0.4633/0.4288 | 0.5770 |
| Ours (Robust-U1) | 0.7353/0.7329/0.6768 | 0.7067/0.7164/0.6934 | 0.8272/0.8059/0.7640 | 0.7398 |
| w/o Multimodal Reasoning | 0.7294/0.6957/0. |