CARE-Edit: Condition-Aware Routing of Experts for Contextual Image Editing

Summary (Overview)

  • Problem: Unified diffusion-based image editors often use a static, shared backbone for diverse editing tasks (e.g., object removal, style transfer, text-driven edits). This leads to task interference and poor adaptation to heterogeneous demands, resulting in artifacts like color bleeding, identity drift, and unpredictable behavior under multi-condition inputs.
  • Solution: CARE-Edit (Condition-Aware Routing of Experts) employs a lightweight latent-attention router to dynamically assign diffusion tokens to four specialized experts (Text, Mask, Reference, Base) based on multimodal conditions and diffusion timesteps. This enables adaptive allocation of model capacity.
  • Key Innovations: Introduces a Mask Repaint module for refining coarse user masks, a Latent Mixture module for coherently fusing expert outputs, and a Routing Select mechanism using sparse top-K selection.
  • Results: CARE-Edit demonstrates strong performance on instruction-driven and subject-driven contextual editing tasks, outperforming static-fusion methods like OmniControl and OmniGen2 variants. It improves edit faithfulness, boundary cleanliness, and identity/style preservation.
  • Analysis: Empirical studies reveal task-specific behavior of the specialized experts, validating the importance of dynamic, condition-aware processing to mitigate multi-condition conflicts.

Introduction and Theoretical Foundation

Background & Motivation: Diffusion models have revolutionized image editing, enabling tasks like localized object replacement, global style adjustment, and text-driven content insertion. However, most "unified editors" process all edits with a fixed, shared backbone. This static approach struggles to adapt to the heterogeneous demands of different editing tasks (e.g., local vs. global, semantic vs. photometric). Methods like ControlNet and OmniControl variants integrate multi-modal inputs (text, masks, reference images) via simple concatenation or additive adapters. This static fusion cannot dynamically prioritize or suppress conflicting modalities, leading to artifacts.

Theoretical Basis: The core idea is to move from static fusion to dynamic, condition-aware routing. Instead of forcing all signals through one shared pathway, CARE-Edit uses a router to dispatch tokens to specialized experts. This allows the model to selectively integrate multi-condition information across the denoising process, adapting the balance of signals (which changes over the diffusion trajectory) to reduce conflicts.

Methodology

CARE-Edit is built within a Diffusion Transformer (DiT) backbone framework. The methodology centers on three core components:

1. Condition-Aware Routing of Experts: Four heterogeneous experts are embedded as lightweight adapters in the DiT blocks:

  • Text Expert: Performs semantic reasoning and object synthesis via cross-attention with text tokens.
  • Mask Expert: Focuses on spatial precision and boundary refinement guided by the edit mask.
  • Reference Expert: Learns identity- and style-consistent transformations from reference image features.
  • Base Expert: Enforces global coherence and background consistency of the source image.

A router dynamically assigns each token to the most relevant experts. For a token h_i, the router computes expert-specific logits:

$$\alpha_{i,e} = \text{MLP}_e([k_i \,\|\, q]) + b_e$$

where k_i encodes local token information, q is a global task-condition embedding, and b_e is a learnable bias. Routing probabilities are computed via a softmax over the top-K experts:

$$\tilde{\pi}_{i,e} = \frac{\exp(\alpha_{i,e}/\tau)\, \mathbf{1}[e \in S_i]}{\sum_{j \in S_i} \exp(\alpha_{i,j}/\tau)}$$

Here, S_i is the set of the top-K experts for token i, and τ is a routing temperature annealed during training. In addition, a fixed fraction λ_shared of every token's routing weight is allocated to a shared expert, which prevents routing collapse:

$$\tilde{\pi}_{i,e}^{+} = (1 - \lambda_{\text{shared}})\, \tilde{\pi}_{i,e} + \lambda_{\text{shared}}\, \mathbf{1}[e = \text{shared}]$$

The final aggregated output for a token is:

$$h_i'' = h_i' + \sum_e \tilde{\pi}_{i,e}^{+}\, \big(f_e(h'_{i,e}) - h_i'\big)$$
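
The routing equations above can be sketched end to end for a single token. This is a minimal NumPy illustration, not the paper's implementation: the per-expert MLPs are collapsed into one linear map `W`, the experts `f_e` are toy scalings, and all shapes and names (`route_token`, `topk_softmax`) are hypothetical.

```python
import numpy as np

def topk_softmax(logits, k, tau=1.0):
    """Softmax over the top-k logits only; all other experts get probability 0."""
    idx = np.argsort(logits)[-k:]          # indices of the top-k experts S_i
    probs = np.zeros_like(logits)
    z = np.exp(logits[idx] / tau)
    probs[idx] = z / z.sum()
    return probs

def route_token(h, k_i, q, W, b, k=3, tau=1.0, lam_shared=0.1, shared_e=0):
    """One token's condition-aware routing step (toy linear router).

    h   : (d,)  token hidden state h'_i
    k_i : (dk,) local token key;  q : (dq,) global condition embedding
    W   : (E, dk+dq) router weights standing in for the per-expert MLPs
    b   : (E,)  learnable bias b_e
    """
    logits = W @ np.concatenate([k_i, q]) + b            # alpha_{i,e}
    pi = topk_softmax(logits, k, tau)                    # sparse routing probs
    # blend in the always-on shared expert to avoid routing collapse
    pi_plus = (1.0 - lam_shared) * pi
    pi_plus[shared_e] += lam_shared
    # residual aggregation: h'' = h' + sum_e pi+_{i,e} (f_e(h') - h')
    experts = [lambda x, s=s: x * s for s in (0.5, 1.0, 1.5, 2.0)]  # toy f_e
    return h + sum(p * (f(h) - h) for p, f in zip(pi_plus, experts))
```

Note how sparsity comes from masking logits outside S_i before normalizing, while the shared-expert blend keeps every token's output partially grounded in one common pathway.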

2. Mask Repaint Module: User-provided masks can be coarse and misaligned with object boundaries, causing artifacts. This module iteratively refines the mask at each diffusion step t by predicting a soft, boundary-aware update:

$$\Delta m = \sigma\!\left(W_m\, \text{Conv}\big[h'(t) \,\|\, \text{Up}(Z_r) \,\|\, \text{Up}(\hat{M}(t-1))\big]\right)$$

The refined mask is then:

$$\hat{M}(t) = \text{clip}\big(\hat{M}(t-1) + \Delta m,\; 0,\; 1\big)$$

It is trained with a boundary-consistency loss:

$$L_{\text{mask}} = \|\nabla \hat{M}(t) - \nabla M_{\text{gt}}\|_1 + \lambda_{\text{smooth}} \|\nabla^2 \hat{M}(t)\|_1$$
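
The update and loss can be sketched in NumPy with finite-difference gradients. This is a simplified stand-in, not the paper's module: the convolutional predictor is replaced by a per-pixel linear map over an illustrative feature tensor, and `refine_mask`, `boundary_loss`, and `w` are hypothetical names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine_mask(m_prev, features, w):
    """One repaint step: M(t) = clip(M(t-1) + Delta_m, 0, 1).

    m_prev   : (H, W)    previous soft mask M(t-1)
    features : (H, W, C) per-pixel features standing in for Conv[h' || Up(Z_r) || Up(M)]
    w        : (C,)      linear weights standing in for W_m
    """
    dm = sigmoid(features @ w)               # soft, boundary-aware update
    return np.clip(m_prev + dm, 0.0, 1.0)

def grad2d(m):
    """Forward-difference spatial gradients (edge rows/cols padded)."""
    gy = np.diff(m, axis=0, append=m[-1:, :])
    gx = np.diff(m, axis=1, append=m[:, -1:])
    return gy, gx

def boundary_loss(m_hat, m_gt, lam_smooth=0.1):
    """||grad M_hat - grad M_gt||_1 + lam * ||grad^2 M_hat||_1."""
    gy, gx = grad2d(m_hat)
    gy_t, gx_t = grad2d(m_gt)
    l1 = np.abs(gy - gy_t).sum() + np.abs(gx - gx_t).sum()
    lap = np.abs(np.diff(gy, axis=0)).sum() + np.abs(np.diff(gx, axis=1)).sum()
    return l1 + lam_smooth * lap
```

Because the loss compares gradients rather than raw mask values, it penalizes misplaced boundaries directly while the second-order term discourages jagged edges.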

3. Latent Mixture Module: This module fuses the outputs of the active experts. For each expert e, a normalized weight map w_e is derived from the routing probabilities. The fused latent is a convex combination:

$$h'_{\text{fuse}} = \sum_e w_e \odot h'_e$$

To maintain global coherence, this fused output is blended with the base expert's output via a timestep-adaptive gate γ:

$$\gamma = \sigma\!\left(W_\gamma\,[\text{GAP}(h_b) \,\|\, \psi(s)]\right)$$

$$h'_{\text{mix}} = (1-\gamma)\, h'_{\text{fuse}} + \gamma\, h'_b$$

A total-variation regularizer encourages spatial smoothness in the mixture maps:

$$L_{\text{mix}} = \lambda_{\text{tv}} \sum_e \|\nabla w_e\|_1$$
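
The fusion, gating, and TV terms above can be sketched together. This is a shape-level NumPy illustration under assumed layouts (experts stacked on a leading axis, channel-last latents, a scalar gate); `latent_mixture` and `tv_loss` are hypothetical names, not the paper's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def latent_mixture(h_experts, w_maps, h_base, w_gamma, s_embed):
    """Fuse expert latents, then blend with the base expert via gate gamma.

    h_experts : (E, H, W, C) active-expert outputs h'_e
    w_maps    : (E, H, W)    routing-derived weight maps
    h_base    : (H, W, C)    base-expert output h'_b
    w_gamma   : (C + ds,)    gate weights W_gamma;  s_embed : (ds,) psi(s)
    """
    # normalize across experts so the fusion is a convex combination
    w = w_maps / np.clip(w_maps.sum(axis=0, keepdims=True), 1e-8, None)
    h_fuse = np.einsum('ehw,ehwc->hwc', w, h_experts)
    gap = h_base.mean(axis=(0, 1))                              # GAP(h_b)
    gamma = sigmoid(w_gamma @ np.concatenate([gap, s_embed]))   # scalar gate
    return (1.0 - gamma) * h_fuse + gamma * h_base

def tv_loss(w_maps, lam_tv=0.01):
    """Total-variation regularizer: lam * sum_e ||grad w_e||_1."""
    return lam_tv * sum(np.abs(np.diff(w, axis=0)).sum() +
                        np.abs(np.diff(w, axis=1)).sum() for w in w_maps)
```

The gate depends on both the base latent and the timestep embedding, so the base expert can dominate early in denoising (global layout) and yield to the fused experts later (local detail).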

Training: CARE-Edit is trained end-to-end. Only the expert adapters, router parameters, and fusion modules are optimized; the pretrained DiT backbone remains frozen. The total loss combines the standard diffusion denoising loss with auxiliary regularizers:

$$L_{\text{CARE}} = L_{\text{diff}} + \lambda_{\text{load}} L_{\text{load}} + \lambda_{\text{mask}} L_{\text{mask}} + \lambda_{\text{mix}} L_{\text{mix}}$$

A load-balancing regularizer L_load encourages balanced expert utilization. Training follows a curriculum: first on basic single-task data, then on complex multi-task samples.
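
The paper does not spell out the form of L_load; a common choice in sparse-MoE training is the Switch-Transformer-style product of routing fraction and mean probability per expert, sketched below as one plausible instantiation. The function names and λ defaults are assumptions for illustration.

```python
import numpy as np

def load_balance_loss(route_probs, topk_mask):
    """One plausible L_load: E * sum_e f_e * P_e (Switch-Transformer style).

    route_probs : (N, E) dense routing probabilities per token
    topk_mask   : (N, E) 1 where expert e is in token i's top-K set S_i
    """
    n, e = route_probs.shape
    frac = topk_mask.mean(axis=0)     # f_e: share of tokens dispatched to e
    prob = route_probs.mean(axis=0)   # P_e: mean probability mass on e
    return e * float(frac @ prob)     # minimized when both are uniform

def total_loss(l_diff, l_load, l_mask, l_mix,
               lam_load=0.01, lam_mask=0.1, lam_mix=0.01):
    """L_CARE = L_diff + lam_load*L_load + lam_mask*L_mask + lam_mix*L_mix."""
    return l_diff + lam_load * l_load + lam_mask * l_mask + lam_mix * l_mix
```

Under this choice, perfectly uniform routing drives the balance term to its minimum of 1, while any expert that hoards tokens raises both its fraction and its mean probability, increasing the penalty.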

Empirical Validation / Results

Benchmarks & Metrics: CARE-Edit was evaluated on:

  1. Instruction-based editing using the EMU-Edit and MagicBrush test sets.
  2. Subject-driven contextual editing using DreamBench++ (single- and multi-object settings).

Metrics included CLIP image similarity (CLIPim), CLIP text-image alignment (CLIPout), L1 distance (L1), and DINO-based subject consistency (DINO).

Quantitative Results:

Table 1: Results on EMU-Edit and MagicBrush

| Category | Method | Backbone | EMU-Edit Test (CLIPim / CLIPout / L1 / DINO) | MagicBrush Test (CLIPim / CLIPout / L1 / DINO) |
|---|---|---|---|---|
| Task-specific | PnP | SD1.5 | 0.521 / 0.089 / 0.089 / 0.304 | 0.568 / 0.101 / 0.289 / 0.220 |
| Task-specific | Null-Text | SD1.5 | 0.761 / 0.236 / 0.075 / 0.678 | 0.752 / 0.263 / 0.077 / 0.664 |
| Task-specific | InstructPix2Pix | SD1.5 | 0.834 / 0.219 / 0.121 / 0.762 | 0.837 / 0.245 / 0.093 / 0.767 |
| Task-specific | EMU-Edit | -- | 0.859 / 0.231 / 0.094 / 0.819 | 0.897 / 0.261 / 0.052 / 0.879 |
| Unified | FLUX.1 Fill | FLUX.1 Fill | 0.663 / 0.205 / 0.176 / 0.674 | 0.725 / 0.235 / 0.208 / 0.661 |
| Unified | ACE (ACE++) | FLUX.1 Fill | 0.831 / 0.256 / 0.073 / 0.802 | 0.818 / 0.268 / 0.042 / 0.823 |
| Unified | OmniGen2 | FLUX.1 Dev | 0.865 / 0.306 / 0.088 / 0.832 | 0.905 / 0.306 / 0.055 / 0.889 |
| Unified | AnyEdit | SD1.5 | 0.866 / 0.284 / 0.095 / 0.812 | 0.892 / 0.273 / 0.057 / 0.877 |
| Unified | CARE-Edit (Ours) | FLUX.1 Dev | 0.868 / 0.313 / 0.082 / 0.835 | 0.894 / 0.324 / 0.052 / 0.885 |

Table 2: Results on DreamBench++

| Method | Single-Object (DINO-I / CLIP-I / CLIP-T) | Multiple-Object (DINO-I / CLIP-I / CLIP-T) |
|---|---|---|
| DreamBooth | 0.552 / 0.544 / 0.301 | 0.359 / 0.495 / 0.305 |
| BLIP-Diffusion | 0.610 / 0.649 / 0.293 | 0.462 / 0.592 / 0.289 |
| OmniControl | 0.770 / 0.704 / 0.312 | 0.501 / 0.641 / 0.316 |
| UNO | 0.782 / 0.713 / 0.304 | 0.508 / 0.649 / 0.303 |
| OmniGen2 | 0.861 / 0.784 / 0.318 | 0.560 / 0.713 / 0.319 |
| CARE-Edit (Ours) | 0.874 / 0.792 / 0.325 | 0.568 / 0.720 / 0.327 |

CARE-Edit achieves competitive or superior performance across benchmarks, often outperforming strong unified editors like OmniGen2 while being trained on significantly less data (~120K vs. OmniGen2's ~1M samples).

Qualitative Results: Visual comparisons show CARE-Edit produces cleaner, more instruction-faithful edits with sharper boundaries and fewer artifacts than competing editors (see Figures 4 and 5 in the paper). It better preserves subject identity and integrates foreground objects coherently with the background context.

Ablation Study: An ablation on the challenging multi-object DreamBench++ setting isolates the impact of core components.

Table 3: Ablation Study Results

| Variant | DINO-I | CLIP-I | CLIP-T |
|---|---|---|---|
| w/o Experts | 0.485 | 0.652 | 0.296 |
| w/o Latent Mixture | 0.509 | 0.678 | 0.301 |
| w/o Mask Repaint | 0.523 | 0.693 | 0.304 |
| K = 2 | 0.541 | 0.707 | 0.312 |
| K = 4 | 0.562 | 0.716 | 0.325 |
| Full Model (K = 3) | 0.568 | 0.720 | 0.327 |

Removing expert routing causes the largest performance drop, confirming its crucial role. Disabling Latent Mixture or Mask Repaint also degrades results. Setting K=3 (top-3 experts) yields optimal performance.

Empirical Analysis: Analysis of expert activation patterns reveals clear task-specific specialization:

  • The Base Expert maintains robust activation across all tasks, ensuring global consistency.
  • The Mask Expert dominates structure-aware edits (e.g., removal, replacement).
  • The Reference Expert is heavily activated during style transfer to maintain fidelity.

This dynamic, task-aware activation demonstrates that CARE-Edit successfully allocates resources adaptively, unlike static fusion methods.

Visualization of the Base Expert's attention maps over training shows an evolution from copying spatial signals to semantically refining masked areas based on editing intent.

Theoretical and Practical Implications

Significance: CARE-Edit addresses a fundamental limitation in current unified image editors: their static, capacity-agnostic fusion of multiple conditioning signals. By introducing a dynamic, condition-aware routing mechanism, the model can adaptively prioritize different modalities (text, mask, reference) throughout the diffusion process, effectively mitigating conflicts that lead to artifacts.

Impact: The framework demonstrates that a modular, expert-based approach can significantly improve the fidelity and controllability of complex image editing tasks. It shows that specialized, lightweight experts can be efficiently integrated via routing to handle heterogeneous editing demands without retraining a massive backbone. This provides a more scalable and adaptable paradigm for multi-condition image generation and editing.

Conclusion

CARE-Edit presents a condition-aware routing framework that employs heterogeneous experts to tackle multi-condition conflicts in contextual image editing. It improves controllability via mask refinement and reference integration and scales with modest overhead. The method achieves strong performance on diverse editing tasks, outperforming static-fusion unified editors.

Limitations & Future Work: The approach introduces additional hyperparameters (e.g., top-K). While the current expert set covers common tasks, future work could explore extending it to handle broader edit types through dynamic expert loading or expansion.