Summary (Overview)

  • Extreme Parameter Efficiency: Moebius is a 0.22B-parameter image inpainting specialist that matches or surpasses the generation quality of the 10B-level industrial model FLUX.1-Fill-Dev, using less than 2% of its parameters.
  • Novel Lightweight Architecture: The paper introduces the Local-λ Mix Interaction (LλMI) block, which replaces standard attention with linear-complexity Local-λ and Interactive-λ modules that summarize spatial and semantic contexts into fixed-size matrices, drastically reducing parameters while preserving representational capacity.
  • Adaptive Multi-Granularity Distillation: A training strategy that performs coarse-grained, fine-grained, and latent perceptual distillation strictly in latent space, dynamically balancing gradient-based losses to compensate for the capacity drop from extreme compression.
  • Inference Speed: Moebius achieves 26.01 ms/step latency (0.52s total for 20 steps), delivering a >15× total inference speedup over FLUX.1-Fill-Dev (50 steps, 8.05s total).
  • SOTA Performance: On Places2 (Small), Moebius achieves 0.92 FID and 0.091 LPIPS, outperforming FLUX.1-Fill-Dev (0.94 FID, 0.099 LPIPS) and SD3.5 Large-Inp. (3.02 FID, 0.105 LPIPS), and trailing only its teacher PixelHacker (0.82 FID, 0.088 LPIPS) among the methods excerpted here. On CelebA-HQ (512), it achieves 5.39 FID and 0.122 LPIPS, close to its teacher PixelHacker (4.75 FID, 0.115 LPIPS).

Introduction and Theoretical Foundation

Background: Image inpainting aims to reconstruct missing visual content. Recent 10B-level diffusion models (e.g., FLUX.1-Fill-Dev, SD3.5 Large-Inp.) achieve high zero-shot quality but are computationally prohibitive for deployment. The authors ask: Can a highly optimized, lightweight specialist (0.22B parameters) match the performance of 10B-level generalists?

Motivation: Existing efficient models like PixelHacker (0.86B parameters) are still too large for edge deployment. Naively substituting standard operators with lightweight ones (depthwise convolutions, linear attention) triggers a severe representation bottleneck—catastrophic quality degradation in tasks requiring semantic reasoning and spatial-texture alignment.

Theoretical Basis: The paper builds on Latent Diffusion Models (LDM) [32] and Latent Categories Guidance (LCG) [54]. LCG uses semantic embeddings E_LCG ∈ ℝ^(K×D) as global priors injected via cross-attention. The authors identify that existing linear attention (GLA [58]) lacks a formulation for cross-attention, obstructing integration with external priors.

Key Insight: Summarize spatial contexts and global semantic priors into fixed-size linear matrices (denoted λ), enabling linear-complexity self- and cross-attention equivalents. This bypasses the quadratic memory cost of dot-product attention.
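To make the complexity claim concrete, a rough cost count follows. Here N is the number of latent tokens and d_k, d_v are the key and value widths. The values d_k = d_v = 64 are illustrative assumptions, not figures from the paper; N = 4096 corresponds to a 64×64 latent grid.

```latex
\underbrace{\text{softmax}(QK^\top)}_{N \times N}\, V
  \;:\; \mathcal{O}(N^2 d_k)\ \text{time},\quad \mathcal{O}(N^2)\ \text{memory}
\qquad \text{vs.} \qquad
Q\, \underbrace{\big(\text{softmax}(K)^\top V\big)}_{d_k \times d_v}
  \;:\; \mathcal{O}(N d_k d_v)\ \text{time},\quad \mathcal{O}(d_k d_v)\ \text{extra memory}

% Illustrative numbers (assumed d_k = d_v = 64, N = 64 \cdot 64 = 4096):
N^2 = 16{,}777{,}216 \ \text{attention entries}
\qquad \text{vs.} \qquad
d_k d_v = 4{,}096 \ \text{entries in } \lambda
```

Under these assumptions the summary matrix λ stays the same size however large the latent grid becomes, which is where the linear scaling comes from.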

Methodology

Overall Pipeline

Moebius adopts the LDM framework with a U-Net denoising backbone. Input: masked image x_m = x ⊙ (1 - m), where ⊙ is the Hadamard product and m is a binary mask with entries in {0,1}. Latents: z_m = E(x_m), z = E(x). The denoising network ϵ_θ predicts the noise ϵ given the noisy latent z_t and timestep t. LCG embeddings E_LCG are injected via cross-attention. Teacher: PixelHacker (862M parameters). Student: Moebius (226M parameters).
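The masking step can be sketched in a few lines of NumPy. The array shapes and the helper name `mask_image` are illustrative assumptions; the only thing taken from the text is the formula x_m = x ⊙ (1 - m), which implies that m = 1 marks the region to be filled.

```python
import numpy as np

def mask_image(x: np.ndarray, m: np.ndarray) -> np.ndarray:
    """Return x_m = x * (1 - m); pixels where m == 1 are zeroed out."""
    return x * (1.0 - m)

# Toy 4x4 RGB image of ones and a mask covering the top-left 2x2 patch.
# The (H, W, 1) mask shape is an assumption so that it broadcasts over channels.
x = np.ones((4, 4, 3), dtype=np.float32)
m = np.zeros((4, 4, 1), dtype=np.float32)
m[:2, :2] = 1.0

x_m = mask_image(x, m)
```

In the pipeline described above, x_m would then be passed through the encoder E to obtain z_m; the encoder itself is not sketched here.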

Local-λ Mix Interaction (LλMI) Block

The block consists of three submodules (see Fig. 2):

  1. Local-λ (Self-Attention Equivalent): Given input latent X_l ∈ ℝ^(B×H'×W'×C), project to Q_l, K_l, V_l via 1×1 convs. Define:

    $$\lambda^l_c = \text{softmax}(K_l)^\top V_l, \quad \lambda^l_p = \text{Conv3D}^{\text{pos}}_{1\times r\times r}(V_l)$$

    Output:

    $$Y_l = Q_l \lambda^l_c + Q_l \lambda^l_p$$

    where r=15 is the local perception window size.

  2. Interactive-λ (Cross-Attention Equivalent): For global prior E_LCG (K×D), project latent X_i to Q_i, and E_LCG to K_i, V_i. Introduce positional embedding E_pos:

    $$\lambda^i_c = \text{softmax}(K_i)^\top V_i, \quad \lambda^i_p = E_{\text{pos}} V_i$$

    Output:

    $$Y_i = Q_i \lambda^i_c + Q_i \lambda^i_p$$
  3. Mix-FFN: A lightweight FFN with depthwise-augmented structure (instead of dense linear projections) to minimize parameters.
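A minimal NumPy sketch of the Interactive-λ computation may help fix the shapes. Several details are assumptions rather than facts from the paper: the softmax is taken over the prior-token axis, E_pos is given shape (d_k, M) so that E_pos V_i matches the shape of the content λ, a single head is used, and M, d_k, d_v and the function names are hypothetical. M stands in for the number of prior embeddings (K in the text) to avoid clashing with the key matrix.

```python
import numpy as np

def softmax(a: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def interactive_lambda(Q, K, V, E_pos):
    """
    Q:     (N, d_k)  queries from the latent tokens
    K:     (M, d_k)  keys from the global prior
    V:     (M, d_v)  values from the global prior
    E_pos: (d_k, M)  positional embedding (shape is an assumption)
    """
    lam_c = softmax(K, axis=0).T @ V   # content lambda, (d_k, d_v)
    lam_p = E_pos @ V                  # positional lambda, (d_k, d_v)
    Y = Q @ lam_c + Q @ lam_p          # (N, d_v), no N x M attention map is stored
    return Y, lam_c, lam_p

rng = np.random.default_rng(0)
N, M, d_k, d_v = 4096, 20, 16, 32      # hypothetical sizes
Q = rng.standard_normal((N, d_k))
K = rng.standard_normal((M, d_k))
V = rng.standard_normal((M, d_v))
E_pos = rng.standard_normal((d_k, M))

Y, lam_c, lam_p = interactive_lambda(Q, K, V, E_pos)
```

The content path of Local-λ has the same form with K and V projected from the latent itself; its positional path uses the local convolution over V_l and is not reproduced in this sketch.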

The full LλMI block forward pass (Eq. 3):

$$X_1 = \text{Local-}\lambda(\text{LN}(X_{\text{in}})) + X_{\text{in}}$$
$$X_2 = \text{Interactive-}\lambda(\text{LN}(X_1), E_{\text{LCG}}) + X_1$$
$$X_{\text{out}} = \text{Mix-FFN}(\text{LN}(X_2)) + X_2$$
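The pre-norm residual structure of the block can be sketched as a plain function composition. The three sublayers are passed in as callables and replaced by placeholders here; the layer norm without learnable scale and shift, applied over the channel axis, is also an assumption of this sketch.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize over the last (channel) axis; no learnable affine (assumption)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def lmi_block(x_in, e_lcg, local_lam, interactive_lam, mix_ffn):
    """Three pre-norm sublayers, each wrapped in a residual connection."""
    x1 = local_lam(layer_norm(x_in)) + x_in
    x2 = interactive_lam(layer_norm(x1), e_lcg) + x1
    return mix_ffn(layer_norm(x2)) + x2

# Placeholder sublayers that output zeros, so the block reduces to the identity.
zero = lambda h: np.zeros_like(h)
zero_cross = lambda h, e: np.zeros_like(h)

x = np.random.default_rng(0).standard_normal((2, 8, 8, 4))  # (B, H', W', C)
e = np.zeros((20, 4))                                       # stand-in for E_LCG
out = lmi_block(x, e, zero, zero_cross, zero)
```

With zero-output sublayers the residual paths carry the input through unchanged, which is a convenient sanity check before plugging in real modules.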

Adaptive Multi-Granularity Distillation

Conducted entirely in latent space (no pixel-space decoding) for memory efficiency.

Losses:

  • Coarse-grained distillation (16×16 bottleneck): L_C_KD = ‖x̂_C_T - x̂_C_S‖₂²
  • Fine-grained distillation (64×64 output): L_F_KD = ‖x̂_T - x̂_S‖₂²
  • Task supervision: L_task = ‖x₀ - x̂_S‖₂²
  • Latent perceptual distillation: L_perceptual = d_E-LatentLPIPS(x₀, x̂_S)

Adaptive weighting (based on gradient norms relative to L_task):

$$W_{\mathrm{F\_KD}} = \frac{\|G(L_{\text{task}}, \theta_F)\|_2^2}{\|G(L_{\mathrm{F\_KD}}, \theta_F)\|_2^2}, \quad W_{\text{perceptual}} = \frac{\|G(L_{\text{task}}, \theta_F)\|_2^2}{\|G(L_{\text{perceptual}}, \theta_F)\|_2^2}$$

Fine-grained output loss:

$$L_{\text{out}} = L_{\text{task}} + W_{\mathrm{F\_KD}} \cdot L_{\mathrm{F\_KD}} + W_{\text{perceptual}} \cdot L_{\text{perceptual}}$$

Cross-granularity weight:

$$W_{\mathrm{C\_task}} = \frac{\|G(L_{\mathrm{C\_KD}}, \theta_C)\|_2^2}{\|G(L_{\text{out}}, \theta_C)\|_2^2}$$

Total loss:

$$L_{\text{total}} = L_{\mathrm{C\_KD}} + W_{\mathrm{C\_task}} \cdot L_{\text{out}}$$
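The weighting scheme can be illustrated with toy numbers. In this sketch the gradients G(L, θ) are supplied as plain arrays (in practice they would come from an autodiff framework), the weights are treated as constants, and the small `eps` in the denominator is an added safeguard; all three points, along with the function names, are assumptions rather than details from the paper.

```python
import numpy as np

def sq_norm(g: np.ndarray) -> float:
    """Squared L2 norm of a gradient array."""
    return float(np.sum(g ** 2))

def weight(g_ref: np.ndarray, g_other: np.ndarray, eps: float = 1e-12) -> float:
    """Ratio ||g_ref||^2 / ||g_other||^2 (eps avoids division by zero)."""
    return sq_norm(g_ref) / (sq_norm(g_other) + eps)

def output_loss(l_task, l_f_kd, l_perc, g_task, g_f_kd, g_perc):
    """L_out = L_task + W_F_KD * L_F_KD + W_perceptual * L_perceptual."""
    w_f_kd = weight(g_task, g_f_kd)
    w_perc = weight(g_task, g_perc)
    return l_task + w_f_kd * l_f_kd + w_perc * l_perc, w_f_kd, w_perc

def total_loss(l_c_kd, l_out, g_c_kd, g_out):
    """L_total = L_C_KD + W_C_task * L_out."""
    w_c_task = weight(g_c_kd, g_out)
    return l_c_kd + w_c_task * l_out, w_c_task

# Toy gradients with respect to theta_F (values are made up for illustration).
g_task = np.array([3.0, 4.0])    # squared norm 25
g_f_kd = np.array([5.0, 0.0])    # squared norm 25
g_perc = np.array([10.0, 0.0])   # squared norm 100
l_out, w_f_kd, w_perc = output_loss(1.0, 2.0, 4.0, g_task, g_f_kd, g_perc)

# Toy gradients with respect to theta_C.
g_c_kd = np.array([2.0, 0.0])    # squared norm 4
g_out = np.array([4.0, 0.0])     # squared norm 16
l_total, w_c_task = total_loss(0.5, l_out, g_c_kd, g_out)
```

In this toy case the perceptual term has a gradient four times larger (in squared norm) than the task term, so it is scaled by 0.25; each auxiliary loss is rescaled relative to its reference gradient.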

Empirical Validation / Results

Efficiency Profiling

Table 1 (key metrics, reproduced):

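| Model | Params | TFLOPs↓ | Latency (ms/step) | Steps | Total Time (s)↓ |
|---|---|---|---|---|---|
| Moebius | 0.226B | 0.154 | 26.01 | 20 | 0.52 |
| PixelHacker | 0.862B | 0.338 | 46.89 | 20 | 0.94 |
| SD3.5 Large-Inp. | 8.057B | 8.657 | 151.02 | 28 | 4.23 |
| FLUX.1-Fill-Dev | 11.902B | 9.927 | 161.01 | 50 | 8.05 |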

Moebius is >15× faster than FLUX.1-Fill-Dev in total inference time.
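The totals and the speedup follow directly from per-step latency times step count. A quick check using the Table 1 figures (only the variable names are introduced here):

```python
# (latency in ms/step, number of sampling steps), taken from Table 1
profiles = {
    "Moebius": (26.01, 20),
    "PixelHacker": (46.89, 20),
    "SD3.5 Large-Inp.": (151.02, 28),
    "FLUX.1-Fill-Dev": (161.01, 50),
}

# Total inference time in seconds = latency per step * steps / 1000
total_s = {name: ms * steps / 1000.0 for name, (ms, steps) in profiles.items()}

# End-to-end speedup of Moebius over FLUX.1-Fill-Dev
speedup = total_s["FLUX.1-Fill-Dev"] / total_s["Moebius"]

# Parameter ratio (both counts in billions, from Table 1)
param_ratio = 0.226 / 11.902

for name, t in total_s.items():
    print(f"{name}: {t:.2f} s")
print(f"speedup: {speedup:.1f}x, parameter ratio: {param_ratio:.1%}")
```

Most of the end-to-end gap comes from two factors multiplying: a lower per-step latency and fewer sampling steps (20 versus 50).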

Benchmark Results (Selected)

Places2 (Small) – Table 3 excerpt:

| Method | FID↓ | LPIPS↓ |
|---|---|---|
| Moebius | 0.92 | 0.091 |
| FLUX.1-Fill-Dev | 0.94 | 0.099 |
| SD3.5 Large-Inp. | 3.02 | 0.105 |
| PixelHacker (teacher) | 0.82 | 0.088 |

CelebA-HQ (512) – Table 4 excerpt:

| Method | FID↓ | LPIPS↓ |
|---|---|---|
| Moebius | 5.39 | 0.122 |
| FLUX.1-Fill-Dev | 10.13 | 0.141 |
| SD3.5 Large-Inp. | 11.80 | 0.134 |
| PixelHacker (teacher) | 4.75 | 0.115 |

Ablation Study

Table 2 (architectural synergy, Places2 Test):

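| Exp | Architecture | KD | FID | LPIPS | Param | GFLOPs |
|---|---|---|---|---|---|---|
| 1 | GLA-CA-FFN, Conv | | 32.75 | 0.298 | 526M | 314.3 |
| 9 | Lλ-Iλ-MixFFN, DWConv | | 26.43 | 0.258 | 226M | 154.0 |
| 10 | Lλ-Iλ-MixFFN, DWConv | | 33.42 | 0.312 | 226M | 154.0 |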

Table 5 (distillation objectives):

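| L_C_KD | L_F_KD | L_task | L_perceptual | FID | LPIPS |
|---|---|---|---|---|---|
| | | | | 74.20 | 0.367 |
| | | | | 36.17 | 0.291 |
| | | | | 32.59 | 0.273 |
| | | | | 26.43 | 0.258 |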

User Study

Double-blind test (22 participants, 50 cases/scene): Moebius (31.76% preference) matches teacher (32.18%) and significantly beats FLUX.1-Fill-Dev (23.70%) and SD3.5 Large-Inp. (12.36%).

Theoretical and Practical Implications

  • Bridging the Scale Gap: Moebius demonstrates that extreme architectural compression (0.22B vs. 10B) combined with task-specific distillation can match industrial foundation models on the evaluated benchmarks, suggesting that massive scale is not required for specific restoration tasks.
  • Efficiency Paradigm Shift: The LλMI block provides a new design principle for efficient diffusion backbones—replacing quadratic attention with linear-complexity fixed-size matrix interactions that preserve both spatial and semantic reasoning.
  • Latent-Only Distillation: Performing multi-granularity distillation entirely in latent space (including perceptual losses) avoids expensive pixel-space decoding, crucial for lightweight training. The adaptive gradient-based balancing removes manual hyperparameter tuning.
  • Deployment Potential: With >15× inference speedup and <2% of the parameters, Moebius enables high-fidelity inpainting on resource-constrained devices (edge, mobile, real-time applications) without sacrificing quality.
  • Limitations: The study focuses on task-specific fine-tuning (Places2, CelebA-HQ, FFHQ) and does not evaluate zero-shot generalization to arbitrary scenes. The teacher model (PixelHacker) is also a diffusion model, so the approach inherits its biases.

Conclusion

Moebius is a 0.22B-parameter lightweight image inpainting framework that rivals the generation quality of 10B-level industrial models (FLUX.1-Fill-Dev) while being >15× faster in total inference time. The key innovations are: (1) the LλMI block, which uses Local-λ and Interactive-λ modules to summarize spatial and semantic contexts into fixed-size linear matrices, enabling linear-complexity self- and cross-attention; (2) an adaptive multi-granularity distillation strategy performed entirely in latent space, dynamically balancing gradient-based losses to recover representational capacity lost to extreme compression. Extensive experiments across natural and portrait benchmarks, plus real-world object removal, validate that the optimal synergy between architectural efficiency and distillation enables Moebius to set a new standard for high-fidelity, low-latency image inpainting. Future directions may include extending to zero-shot scenarios and further reducing model size for mobile deployment.
