RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Summary (Overview)

  • Problem Formulation: Introduces region-specific image refinement as a new dedicated task: given an input image and a user-specified region (e.g., a scribble mask or bounding box), the goal is to restore fine-grained details (e.g., text, logos, faces) while strictly keeping all non-edited pixels unchanged.
  • Core Method: Proposes RefineAnything, a multimodal diffusion-based model that supports both reference-based and reference-free refinement. Its key innovation is Focus-and-Refine, a strategy that crops, resizes, and refines the target region under a fixed resolution budget before pasting it back with a blended mask, dramatically improving local detail recovery and enforcing background preservation.
  • Technical Contributions: Introduces a Boundary Consistency Loss to reduce seam artifacts during paste-back and constructs Refine-30K (a 30K-sample dataset) and RefineEval (an evaluation benchmark) to support training and evaluation for this new setting.
  • Key Results: On the RefineEval benchmark, RefineAnything significantly outperforms strong baselines (e.g., Kontext, Qwen-Edit) in edited-region fidelity while achieving near-perfect background consistency (MSE_bg = 0.000, SSIM_bg = 0.9997).

Introduction and Theoretical Foundation

Modern image generation models, despite rapid advances, frequently suffer from local detail collapse—distortions in fine-grained elements like printed text, logos, and thin structures. This is a critical failure mode for real-world applications (e.g., e-commerce, advertising, UI design) where small details carry key information.

Existing instruction-driven editing models are ill-suited for this refinement task due to three main issues:

  1. Weak region controllability: Difficulty in precisely specifying where to refine.
  2. Poor micro-detail recovery: Subtle defects (e.g., broken text strokes) are often unresolved.
  3. Background drift: Non-target regions may change unintentionally.

This paper formulates region-specific image refinement as a dedicated problem setting, requiring a tool that is simultaneously region-accurate, detail-effective, and background-preserving. The theoretical motivation stems from a counter-intuitive observation: under a fixed VAE input resolution, simply cropping a small target region and upsampling it to the full-image resolution—without adding new information—can yield substantially better VAE reconstruction quality within that region. This suggests that the bottleneck is not information availability, but how the model allocates its fixed-resolution capacity and attention.
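The resolution-budget argument can be made concrete with a small sketch (hypothetical sizes; the point is the arithmetic, not any specific model): cropping the target region plus a margin and resizing that crop to the fixed VAE input size multiplies the pixels the encoder devotes to the region.

```python
def focus_pixel_gain(region_h, region_w, margin, vae_h, vae_w):
    """Ratio of pixels the fixed-resolution encoder spends on the region
    when the crop (region + margin) is resized to the VAE input size,
    versus processing the full image at that same size."""
    crop_h, crop_w = region_h + 2 * margin, region_w + 2 * margin
    scale = min(vae_h / crop_h, vae_w / crop_w)  # uniform resize of the crop
    return scale * scale                          # area magnification of the region

# A 128x128 region with a 64 px margin under a 1024x1024 VAE input:
# the 256x256 crop is resized 4x, so the region gets 16x more pixels.
print(focus_pixel_gain(128, 128, 64, 1024, 1024))  # → 16.0
```

No new information is added by the resize; the encoder simply spends more of its fixed capacity on the region of interest.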

Methodology

3.1 Architecture

RefineAnything builds upon Qwen-Image [43]. Given an input image $I$, an optional reference image $I_{ref}$, a user scribble mask $M$, and a text instruction $y$, the framework consists of:

  1. A frozen multimodal encoder (Qwen2.5-VL [1]) that produces refinement-guiding conditioning tokens: $c = E_{\phi}(I, I_{ref}, M, y), \; c \in \mathbb{R}^{L \times d}$
  2. A VAE that maps images to latents for fine-grained visual context: $z_I = \text{Enc}_{\psi}(I), \; z_{ref} = \text{Enc}_{\psi}(I_{ref}) \in \mathbb{R}^{C \times H \times W}$
  3. A diffusion backbone (MMDiT blocks from Qwen-Image) that denoises a target latent $z_t$ conditioned on both the multimodal tokens $c$ and the VAE latent branches.

3.2 Focus-and-Refine

The core innovation, motivated by the observation in Fig. 3, involves three steps:

  1. Region Localization and Focus Crop: Compute a tight bounding box $B = \text{BBox}(M) = (x_1, y_1, x_2, y_2)$, expand it with a margin $m$ to get crop box $C = \text{Expand}(B, m)$, and crop/resize:

     $I_c = \text{Crop}(I, C), \quad M_c = \text{Crop}(M, C)$
  2. Focused Generation with Spatial Conditioning: On the cropped view, use $M_c$ as the spatial cue and perform conditional generation:

     $X = [I_c, I_{ref}, M_c], \quad \tilde{I}_c = G(X, y)$

     where $G$ is the RefineAnything model.

  3. Seamless Paste-Back via Blended Mask: Apply morphological dilation and Gaussian smoothing to the cropped mask to create a blended mask $\tilde{M}_c$, then composite:

     $\hat{I}_c = \tilde{M}_c \odot \tilde{I}_c + (1 - \tilde{M}_c) \odot I_c$

     Finally, resize and paste $\hat{I}_c$ back to the full canvas at location $C$.
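The three steps can be sketched end to end with NumPy/SciPy (a minimal sketch: `refine_fn` stands in for the diffusion model $G$, the resize to the model's working resolution is omitted, and the dilation/blur parameters here are illustrative, not the paper's exact kernels):

```python
import numpy as np
from scipy.ndimage import binary_dilation, gaussian_filter

def bbox_of(mask):
    """Tight bounding box (x1, y1, x2, y2) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def expand(box, margin, H, W):
    """Expand the box by a margin, clipped to the image canvas."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(W, x2 + margin), min(H, y2 + margin))

def focus_and_refine(img, mask, refine_fn, margin=64,
                     dilate_iters=3, blur_sigma=2.75):
    """Crop around the mask, refine the crop, paste back with a soft mask."""
    H, W = mask.shape
    x1, y1, x2, y2 = expand(bbox_of(mask), margin, H, W)
    crop = img[y1:y2, x1:x2].astype(float)
    mc = mask[y1:y2, x1:x2]
    refined = refine_fn(crop, mc)                      # stand-in for G(X, y)
    # Blended mask: morphological dilation + Gaussian smoothing of the crop mask.
    soft = binary_dilation(mc, iterations=dilate_iters).astype(float)
    soft = gaussian_filter(soft, sigma=blur_sigma)[..., None]
    blended = soft * refined + (1.0 - soft) * crop     # composite on the crop
    out = img.astype(float).copy()
    out[y1:y2, x1:x2] = blended                        # paste back at location C
    return out
```

Because pixels outside the crop are never written, background preservation outside the blended region is exact by construction.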

3.3 Boundary Consistency Loss

To improve paste-back naturalness, supervision is upweighted near the edit boundary during training. Define a boundary band:

$B_c = \text{Dilate}(M_c; r_{out}) - \text{Erode}(M_c; r_{in})$

Following the flow-matching denoising objective, with base loss map $\ell_{base} = \| v_{\theta}(z_t, t, c, z_I, z_{ref}) - v_t \|_2^2$, the boundary-weighted objective is:

$\mathcal{L}_{\text{boundary}} = \mathbb{E}\left[ \| \ell_{base} \odot (1 + \alpha B_c) \|_1 \right]$
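A minimal NumPy sketch of the boundary band and the weighted objective (the band is assumed to broadcast over any channel dimension of the loss map; the squared error below plays the role of $\ell_{base}$):

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_band(mask, r_out=16, r_in=16):
    """B_c = Dilate(M_c; r_out) - Erode(M_c; r_in), as a 0/1 band."""
    dil = binary_dilation(mask, iterations=r_out)
    ero = binary_erosion(mask, iterations=r_in)
    return (dil & ~ero).astype(float)

def boundary_weighted_loss(v_pred, v_true, band, alpha=9.0):
    """Per-element flow-matching loss, upweighted by (1 + alpha * B_c)."""
    base = (v_pred - v_true) ** 2        # elementwise stand-in for ell_base
    return float(np.mean(base * (1.0 + alpha * band)))
```

With $\alpha = 0$ this reduces to the plain objective; larger $\alpha$ concentrates supervision on the transition zone where paste-back seams would appear.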

3.4 Implementation Details

  • Training: Fine-tune Qwen-Image-Edit with LoRA (rank 256) on attention projections only.
  • Optimizer: AdamW (lr $2 \times 10^{-4}$), batch size 8, 20K steps.
  • Focus-and-Refine Parameters: Crop margin $m = 64$; paste-back mask uses dilation kernel $r = 7$ and Gaussian blur kernel $k = 11$; boundary band uses $r_{out} = r_{in} = 16$; boundary weighting uses $\alpha = 9$.
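For illustration, the LoRA parameterization on a single attention projection amounts to a frozen weight plus a scaled low-rank update (a toy NumPy sketch with made-up dimensions; the paper trains rank 256 on Qwen-Image-Edit's attention projections, and the `alpha` scaling factor here is the standard LoRA hyperparameter, not reported in the source):

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, alpha = 64, 8, 16          # toy sizes; the paper uses rank 256

W = rng.standard_normal((d, d))     # frozen base projection (e.g. a q_proj)
A = rng.standard_normal((rank, d))  # trainable down-projection
B = np.zeros((d, rank))             # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base path plus scaled low-rank update; only A and B are trained.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)
```

Zero-initializing `B` makes the adapted model exactly match the frozen base model at step 0, so fine-tuning starts from the pretrained behavior.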

Empirical Validation / Results

4 Refine-30K Dataset

A new dataset of 30K samples constructed to support training:

  • 20K reference-based samples: Built using a pipeline of VLM grounding (Gemini3), SAM-based segmentation (SAM3), and controlled inpainting degradations (Fig. 5). Each sample provides $(I, I_{ref}, I^\star, M, y)$.
  • 10K reference-free samples: Built from single images using a VLM for salient object localization and degradation validation to ensure meaningful refinement tasks. Each sample provides $(I, I^\star, M, y)$.

5 Experiment

5.1 Benchmarks: RefineEval

A new benchmark with 67 manually curated cases (31 reference-based, 36 reference-free). Degraded inputs are synthesized via inpainting within annotated regions using multiple methods (Flux-fill, SDXL, Qwen-Edit), resulting in 402 test images total.

5.2 Evaluation Metrics

  • Reference-Based: Evaluates (i) edited-region fidelity vs. ground truth using MSE, LPIPS, SSIM, DINO, CLIP; (ii) background preservation vs. input image using MSE_bg, LPIPS_bg, SSIM_bg.
  • Reference-Free: Uses a VLM-based evaluator (Gemini2.5-Pro) to score the refined region on five dimensions: Visual Quality (VQ), Naturalness (Nat.), Aesthetics (Aes.), Fine-detail fidelity (Det.), and Instruction faithfulness (Faith.) on a [1,5] scale.
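The background-preservation metrics restrict the comparison to non-edited pixels; a minimal sketch of MSE_bg (mask convention assumed here: 1 marks the edit region, 0 the background):

```python
import numpy as np

def mse_bg(output, original, mask):
    """MSE computed only over pixels outside the user-specified region."""
    bg = mask == 0
    return float(np.mean((output[bg] - original[bg]) ** 2))
```

LPIPS_bg and SSIM_bg follow the same masking idea with their respective similarity functions.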

5.4 Quantitative Results

Table 1: Evaluation on Reference-Based Image Refinement.

| Method | MSE ↓ | LP ↓ | VGG ↓ | DINO ↑ | CLIP ↑ | SSIM ↑ | MSE_bg ↓ | LP_bg ↓ | SSIM_bg ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Gemini2.5 | 0.049 | 0.250 | 0.592 | 0.717 | 0.817 | 0.423 | 0.201 | 0.103 | 0.7662 |
| Gemini3 | 0.031 | 0.178 | 0.431 | 0.771 | 0.855 | 0.510 | 0.029 | 0.052 | 0.9061 |
| GPT4o | 0.083 | 0.370 | 0.918 | 0.620 | 0.801 | 0.302 | 0.815 | 0.309 | 0.6001 |
| OmniGen2 | 0.155 | 0.602 | 1.691 | 0.384 | 0.717 | 0.219 | 2.094 | 0.624 | 0.4300 |
| BAGEL | 0.045 | 0.253 | 0.611 | 0.682 | 0.803 | 0.494 | 0.033 | 0.046 | 0.9360 |
| Kontext | 0.040 | 0.264 | 0.540 | 0.685 | 0.785 | 0.538 | 0.011 | 0.019 | 0.9660 |
| Qwen-Edit | 0.049 | 0.287 | 0.676 | 0.675 | 0.807 | 0.436 | 0.454 | 0.148 | 0.7530 |
| Ours | 0.020 | 0.155 | 0.401 | 0.793 | 0.885 | 0.591 | 0.000 | 0.000 | 0.9997 |

↓: Smaller is better, ↑: Larger is better. LP = LPIPS.

RefineAnything outperforms the strongest baseline (Kontext), reducing MSE by 50% (0.020 vs. 0.040) and LPIPS by 41% (0.155 vs. 0.264), while achieving near-perfect background consistency.

Table 2: Evaluation on Reference-Free Image Refinement.

| Method | VQ ↑ | Nat. ↑ | Aes. ↑ | Det. ↑ | Faith. ↑ |
|---|---|---|---|---|---|
| OmniGen2 | 2.501 | 2.500 | 2.461 | 2.348 | 2.586 |
| BAGEL | 3.018 | 3.000 | 2.959 | 2.851 | 3.135 |
| Kontext | 1.716 | 2.114 | 1.982 | 1.690 | 1.750 |
| Qwen-Edit | 3.081 | 3.110 | 3.105 | 2.975 | 3.214 |
| Ours | 3.806 | 3.868 | 3.876 | 3.720 | 3.644 |

RefineAnything ranks first on all five VLM-evaluated criteria for reference-free refinement.

5.5 Qualitative Results

Figures 6 and 7 show that RefineAnything effectively restores subtle details (text, faces, logos) while keeping the background strictly unchanged, whereas baselines often suffer from poor background preservation, weak instruction responsiveness, and limited detail recovery.

5.6 Ablation Study

Table 3: Ablation on Focus-and-Refine and Boundary Consistency Loss.

| Method | MSE ↓ | LP ↓ | VGG ↓ | DINO ↑ | CLIP ↑ | SSIM ↑ | MSE_bg ↓ | LP_bg ↓ | SSIM_bg ↑ |
|---|---|---|---|---|---|---|---|---|---|
| w/o focus | 0.021 | 0.177 | 0.449 | 0.779 | 0.869 | 0.578 | 0.005 | 0.022 | 0.9601 |
| w/o loss | 0.023 | 0.191 | 0.482 | 0.736 | 0.858 | 0.563 | 0.000 | 0.000 | 0.9997 |
| Ours | 0.020 | 0.155 | 0.401 | 0.793 | 0.885 | 0.591 | 0.000 | 0.000 | 0.9997 |

  • Focus-and-Refine (Fig. 8): Removing it leads to weaker refinements with unresolved subtle errors.
  • Boundary Consistency Loss (Fig. 9): Removing it leads to visible seams and poor coherence at object boundaries.

Theoretical and Practical Implications

  • Theoretical: Provides a novel perspective on the limitations of fixed-resolution latent diffusion models for local tasks, demonstrating that spatial re-parameterization (crop-and-resize) can be more effective for detail recovery than processing the entire scene, even without additional information.
  • Practical: Delivers a practical, high-precision refinement tool for real-world image generation and editing workflows, particularly in domains where local detail accuracy and strict background preservation are critical (e.g., product imaging, graphic design, content editing).

Conclusion

RefineAnything is the first framework tailored for the region-specific image refinement task. By introducing the Focus-and-Refine strategy and a Boundary Consistency Loss, it significantly improves local detail fidelity and semantic alignment while achieving near-perfect background preservation. The release of the Refine-30K dataset and RefineEval benchmark supports future research in this practical area of high-precision image editing.