RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

Summary (Overview)

  • Problem Formulation: Introduces region-specific image refinement as a new dedicated task: given an input image and a user-specified region (e.g., a scribble mask or bounding box), the goal is to restore fine-grained details (e.g., text, logos, faces) while strictly keeping all non-edited pixels unchanged.
  • Core Method: Proposes RefineAnything, a multimodal diffusion-based model that supports both reference-based and reference-free refinement. Its key innovation is Focus-and-Refine, a strategy that crops, resizes, and refines the target region under a fixed resolution budget before pasting it back with a blended mask, dramatically improving local detail recovery and enforcing background preservation.
  • Technical Contributions: Introduces a Boundary Consistency Loss to reduce seam artifacts during paste-back and constructs Refine-30K (a 30K-sample dataset) and RefineEval (an evaluation benchmark) to support training and evaluation for this new setting.
  • Key Results: On the RefineEval benchmark, RefineAnything significantly outperforms strong baselines (e.g., Kontext, Qwen-Edit) in edited-region fidelity while achieving near-perfect background consistency (MSE_bg = 0.000, SSIM_bg = 0.9997).

Introduction and Theoretical Foundation

Modern image generation models, despite rapid advances, frequently suffer from local detail collapse—distortions in fine-grained elements like printed text, logos, and thin structures. This is a critical failure mode for real-world applications (e.g., e-commerce, advertising, UI design) where small details carry key information.

Existing instruction-driven editing models are ill-suited for this refinement task due to three main issues:

  1. Weak region controllability: Difficulty in precisely specifying where to refine.
  2. Poor micro-detail recovery: Subtle defects (e.g., broken text strokes) are often unresolved.
  3. Background drift: Non-target regions may change unintentionally.

This paper formulates region-specific image refinement as a dedicated problem setting, requiring a tool that is simultaneously region-accurate, detail-effective, and background-preserving. The theoretical motivation stems from a counter-intuitive observation: under a fixed VAE input resolution, simply cropping a small target region and upsampling it to the full-image resolution—without adding new information—can yield substantially better VAE reconstruction quality within that region. This suggests that the bottleneck is not information availability, but how the model allocates its fixed-resolution capacity and attention.
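The resolution-budget argument can be made concrete with a small sketch (hypothetical sizes; the point is the arithmetic, not any specific model): cropping the target region plus a margin and resizing that crop to the fixed VAE input size multiplies the pixels the encoder devotes to the region.

```python
def focus_pixel_gain(region_h, region_w, margin, vae_h, vae_w):
    """Ratio of pixels the fixed-resolution encoder spends on the region
    when the crop (region + margin) is resized to the VAE input size,
    versus processing the full image at that same size."""
    crop_h, crop_w = region_h + 2 * margin, region_w + 2 * margin
    scale = min(vae_h / crop_h, vae_w / crop_w)  # uniform resize of the crop
    return scale * scale                          # area magnification of the region

# A 128x128 region with a 64 px margin under a 1024x1024 VAE input:
# the 256x256 crop is resized 4x, so the region gets 16x more pixels.
print(focus_pixel_gain(128, 128, 64, 1024, 1024))  # → 16.0
```

No new information is added by the resize; the encoder simply spends more of its fixed capacity on the region of interest.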

Methodology

3.1 Architecture

RefineAnything builds upon Qwen-Image [43]. Given an input image $I$, an optional reference image $I_{ref}$, a user scribble mask $M$, and a text instruction $y$, the framework consists of:

  1. A frozen multimodal encoder (Qwen2.5-VL [1]) that produces refinement-guiding conditioning tokens: $c = E_{\phi}(I, I_{ref}, M, y), \; c \in \mathbb{R}^{L \times d}$
  2. A VAE that maps images to latents for fine-grained visual context: $z_I = \text{Enc}_{\psi}(I), \; z_{ref} = \text{Enc}_{\psi}(I_{ref}) \in \mathbb{R}^{C \times H \times W}$
  3. A diffusion backbone (MMDiT blocks from Qwen-Image) that denoises a target latent $z_t$ conditioned on both the multimodal tokens $c$ and the VAE latent branches.

3.2 Focus-and-Refine

The core innovation, motivated by the observation in Fig. 3, involves three steps:

  1. Region Localization and Focus Crop: Compute a tight bounding box $B = \text{BBox}(M) = (x_1, y_1, x_2, y_2)$, expand it with a margin $m$ to get crop box $C = \text{Expand}(B, m)$, and crop/resize:

     $I_c = \text{Crop}(I, C), \quad M_c = \text{Crop}(M, C)$
  2. Focused Generation with Spatial Conditioning: On the cropped view, use $M_c$ as the spatial cue and perform conditional generation:

     $X = [I_c, I_{ref}, M_c], \quad \tilde{I}_c = G(X, y)$

     where $G$ is the RefineAnything model.

  3. Seamless Paste-Back via Blended Mask: Apply morphological dilation and Gaussian smoothing to the cropped mask to create a blended mask $\tilde{M}_c$, then composite:

     $\hat{I}_c = \tilde{M}_c \odot \tilde{I}_c + (1 - \tilde{M}_c) \odot I_c$

     Finally, resize and paste $\hat{I}_c$ back to the full canvas at location $C$.
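The three steps can be sketched end to end with NumPy/SciPy (a minimal sketch: `refine_fn` stands in for the diffusion model $G$, the resize to the model's working resolution is omitted, and the dilation/blur parameters here are illustrative, not the paper's exact kernels):

```python
import numpy as np
from scipy.ndimage import binary_dilation, gaussian_filter

def bbox_of(mask):
    """Tight bounding box (x1, y1, x2, y2) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1

def expand(box, margin, H, W):
    """Expand the box by a margin, clipped to the image canvas."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(W, x2 + margin), min(H, y2 + margin))

def focus_and_refine(img, mask, refine_fn, margin=64,
                     dilate_iters=3, blur_sigma=2.75):
    """Crop around the mask, refine the crop, paste back with a soft mask."""
    H, W = mask.shape
    x1, y1, x2, y2 = expand(bbox_of(mask), margin, H, W)
    crop = img[y1:y2, x1:x2].astype(float)
    mc = mask[y1:y2, x1:x2]
    refined = refine_fn(crop, mc)                      # stand-in for G(X, y)
    # Blended mask: morphological dilation + Gaussian smoothing of the crop mask.
    soft = binary_dilation(mc, iterations=dilate_iters).astype(float)
    soft = gaussian_filter(soft, sigma=blur_sigma)[..., None]
    blended = soft * refined + (1.0 - soft) * crop     # composite on the crop
    out = img.astype(float).copy()
    out[y1:y2, x1:x2] = blended                        # paste back at location C
    return out
```

Because pixels outside the crop are never written, background preservation outside the blended region is exact by construction.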

3.3 Boundary Consistency Loss

To improve paste-back naturalness, supervision is upweighted near the edit boundary during training. Define a boundary band:

$B_c = \text{Dilate}(M_c; r_{out}) - \text{Erode}(M_c; r_{in})$

Following the flow-matching denoising objective, with base loss map $\ell_{base} = \| v_{\theta}(z_t, t, c, z_I, z_{ref}) - v_t \|_2^2$, the boundary-weighted objective is:

$\mathcal{L}_{\text{boundary}} = \mathbb{E}\left[ \| \ell_{base} \odot (1 + \alpha B_c) \|_1 \right]$
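A minimal NumPy sketch of the boundary band and the weighted objective (the band is assumed to broadcast over any channel dimension of the loss map; the squared error below plays the role of $\ell_{base}$):

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_band(mask, r_out=16, r_in=16):
    """B_c = Dilate(M_c; r_out) - Erode(M_c; r_in), as a 0/1 band."""
    dil = binary_dilation(mask, iterations=r_out)
    ero = binary_erosion(mask, iterations=r_in)
    return (dil & ~ero).astype(float)

def boundary_weighted_loss(v_pred, v_true, band, alpha=9.0):
    """Per-element flow-matching loss, upweighted by (1 + alpha * B_c)."""
    base = (v_pred - v_true) ** 2        # elementwise stand-in for ell_base
    return float(np.mean(base * (1.0 + alpha * band)))
```

With $\alpha = 0$ this reduces to the plain objective; larger $\alpha$ concentrates supervision on the transition zone where paste-back seams would appear.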

3.4 Implementation Details

  • Training: Fine-tune Qwen-Image-Edit with LoRA (rank 256) on attention projections only.
  • Optimizer: AdamW (lr $2 \times 10^{-4}$), batch size 8, 20K steps.
  • Focus-and-Refine Parameters: Crop margin $m = 64$; paste-back mask uses dilation kernel $r = 7$ and Gaussian blur kernel $k = 11$; boundary band uses $r_{out} = r_{in} = 16$; boundary weighting uses $\alpha = 9$.
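For illustration, the LoRA parameterization on a single attention projection amounts to a frozen weight plus a scaled low-rank update (a toy NumPy sketch with made-up dimensions; the paper trains rank 256 on Qwen-Image-Edit's attention projections, and the `alpha` scaling factor here is the standard LoRA hyperparameter, not reported in the source):

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, alpha = 64, 8, 16          # toy sizes; the paper uses rank 256

W = rng.standard_normal((d, d))     # frozen base projection (e.g. a q_proj)
A = rng.standard_normal((rank, d))  # trainable down-projection
B = np.zeros((d, rank))             # trainable up-projection, zero-initialized

def lora_forward(x):
    # Base path plus scaled low-rank update; only A and B are trained.
    return x @ W.T + (alpha / rank) * (x @ A.T @ B.T)
```

Zero-initializing `B` makes the adapted model exactly match the frozen base model at step 0, so fine-tuning starts from the pretrained behavior.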

Empirical Validation / Results

4 Refine-30K Dataset

A new dataset of 30K samples constructed to support training:

  • 20K reference-based samples: Built using a pipeline of VLM grounding (Gemini3), SAM-based segmentation (SAM3), and controlled inpainting degradations (Fig. 5). Each sample provides $(I, I_{ref}, I^\star, M, y)$.
  • 10K reference-free samples: Built from single images using a VLM for salient object localization and degradation validation to ensure meaningful refinement tasks. Each sample provides $(I, I^\star, M, y)$.

5 Experiment

5.1 Benchmarks: RefineEval

A new benchmark with 67 manually curated cases (31 reference-based, 36 reference-free). Degraded inputs are synthesized via inpainting within annotated regions using multiple methods (Flux-fill, SDXL, Qwen-Edit), resulting in 402 test images total.

5.2 Evaluation Metrics

  • Reference-Based: Evaluates (i) edited-region fidelity vs. ground truth using MSE, LPIPS, SSIM, DINO, CLIP; (ii) background preservation vs. input image using MSE_bg, LPIPS_bg, SSIM_bg.
  • Reference-Free: Uses a VLM-based evaluator (Gemini2.5-Pro) to score the refined region on five dimensions: Visual Quality (VQ), Naturalness (Nat.), Aesthetics (Aes.), Fine-detail fidelity (Det.), and Instruction faithfulness (Faith.) on a [1,5] scale.
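The background-preservation metrics restrict the comparison to non-edited pixels; a minimal sketch of MSE_bg (mask convention assumed here: 1 marks the edit region, 0 the background):

```python
import numpy as np

def mse_bg(output, original, mask):
    """MSE computed only over pixels outside the user-specified region."""
    bg = mask == 0
    return float(np.mean((output[bg] - original[bg]) ** 2))
```

LPIPS_bg and SSIM_bg follow the same masking idea with their respective similarity functions.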

5.4 Quantitative Results

Table 1: Evaluation on Reference-Based Image Refinement.

| Method | MSE ↓ | LP ↓ | VGG ↓ | DINO ↑ | CLIP ↑ | SSIM ↑ | MSE_bg ↓ | LP_bg ↓ | SSIM_bg ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Gemini2.5 | 0.049 | 0.250 | 0.592 | 0.717 | 0.817 | 0.423 | 0.201 | 0.103 | 0.7662 |
| Gemini3 | 0.031 | 0.178 | 0.431 | 0.771 | 0.855 | 0.510 | 0.029 | 0.052 | 0.9061 |
| GPT4o | 0.083 | 0.370 | 0.918 | 0.620 | 0.801 | 0.302 | 0.815 | 0.309 | 0.6001 |
| OmniGen2 | 0.155 | 0.602 | 1.691 | 0.384 | 0.717 | 0.219 | 2.094 | 0.624 | 0.4300 |
| BAGEL | 0.045 | 0.253 | 0.611 | 0.682 | 0.803 | 0.494 | 0.033 | 0.046 | 0.9360 |
| Kontext | 0.040 | 0.264 | 0.540 | 0.685 | 0.785 | 0.538 | 0.011 | 0.019 | 0.9660 |
| Qwen-Edit | 0.049 | 0.287 | 0.676 | 0.675 | 0.807 | 0.436 | 0.454 | 0.148 | 0.7530 |
| Ours | 0.020 | 0.155 | 0.401 | 0.793 | 0.885 | 0.591 | 0.000 | 0.000 | 0.9997 |

↓: Smaller is better, ↑: Larger is better. LP = LPIPS.

RefineAnything outperforms the strongest baseline (Kontext), reducing MSE by 50% (0.020 vs. 0.040) and LPIPS by 41% (0.155 vs. 0.264), while achieving near-perfect background consistency.

Table 2: Evaluation on Reference-Free Image Refinement.

| Method | VQ ↑ | Nat. ↑ | Aes. ↑ | Det. ↑ | Faith. ↑ |
|---|---|---|---|---|---|
| OmniGen2 | 2.501 | 2.500 | 2.461 | 2.348 | 2.586 |
| BAGEL | 3.018 | 3.000 | 2.959 | 2.851 | 3.135 |
| Kontext | 1.716 | 2.114 | 1.982 | 1.690 | 1.750 |
| Qwen-Edit | 3.081 | 3.110 | 3.105 | 2.975 | 3.214 |
| Ours | 3.806 | 3.868 | 3.876 | 3.720 | 3.644 |

RefineAnything ranks first on all five VLM-evaluated criteria for reference-free refinement.

5.5 Qualitative Results

Figures 6 and 7 show that RefineAnything effectively restores subtle details (text, faces, logos) while keeping the background strictly unchanged, whereas baselines often suffer from poor background preservation, weak instruction responsiveness, and limited detail recovery.

5.6 Ablation Study

Table 3: Ablation on Focus-and-Refine and Boundary Consistency Loss.

| Method | MSE ↓ | LP ↓ | VGG ↓ | DINO ↑ | CLIP ↑ | SSIM ↑ | MSE_bg ↓ | LP_bg ↓ | SSIM_bg ↑ |
|---|---|---|---|---|---|---|---|---|---|
| w/o focus | 0.021 | 0.177 | 0.449 | 0.779 | 0.869 | 0.578 | 0.005 | 0.022 | 0.9601 |
| w/o loss | 0.023 | 0.191 | 0.482 | 0.736 | 0.858 | 0.563 | 0.000 | 0.000 | 0.9997 |
| Ours | 0.020 | 0.155 | 0.401 | 0.793 | 0.885 | 0.591 | 0.000 | 0.000 | 0.9997 |

  • Focus-and-Refine (Fig. 8): Removing it leads to weaker refinements with unresolved subtle errors.
  • Boundary Consistency Loss (Fig. 9): Removing it leads to visible seams and poor coherence at object boundaries.

Theoretical and Practical Implications

  • Theoretical: Provides a novel perspective on the limitations of fixed-resolution latent diffusion models for local tasks, demonstrating that spatial re-parameterization (crop-and-resize) can be more effective for detail recovery than processing the entire scene, even without additional information.
  • Practical: Delivers a practical, high-precision refinement tool for real-world image generation and editing workflows, particularly in domains where local detail accuracy and strict background preservation are critical (e.g., product imaging, graphic design, content editing).

Conclusion

RefineAnything is the first framework tailored for the region-specific image refinement task. By introducing the Focus-and-Refine strategy and a Boundary Consistency Loss, it significantly improves local detail fidelity and semantic alignment while achieving near-perfect background preservation. The release of the Refine-30K dataset and RefineEval benchmark supports future research in this practical area of high-precision image editing.