Visual Summary | Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

Summary (Overview)

Extreme Parameter Efficiency: Moebius is a 0.22B-parameter image inpainting specialist that matches or surpasses the generation quality of the 10B-level industrial model FLUX.1-Fill-Dev, using less than 2% of its parameters.
Novel Lightweight Architecture: The paper introduces the Local-λ Mix Interaction (LλMI) block, which replaces standard attention with linear-complexity Local-λ and Interactive-λ modules that summarize spatial and semantic contexts into fixed-size matrices, drastically reducing parameters while preserving representational capacity.
Adaptive Multi-Granularity Distillation: A training strategy that performs coarse-grained, fine-grained, and latent perceptual distillation strictly in latent space, dynamically balancing gradient-based losses to compensate for the capacity drop from extreme compression.
Inference Speed: Moebius achieves 26.01 ms/step latency (0.52s total for 20 steps), delivering a >15× total inference speedup over FLUX.1-Fill-Dev (50 steps, 8.05s total).
SOTA Performance: On Places2 (Small), Moebius achieves 0.92 FID and 0.091 LPIPS, edging out the 10B-level models FLUX.1-Fill-Dev (0.94 FID, 0.099 LPIPS) and SD3.5 Large-Inp. (3.02 FID, 0.105 LPIPS) while staying close to its teacher PixelHacker (0.82 FID, 0.088 LPIPS). On CelebA-HQ (512), it achieves 5.39 FID and 0.122 LPIPS, again close to PixelHacker (4.75 FID, 0.115 LPIPS).

Introduction and Theoretical Foundation

Background: Image inpainting aims to reconstruct missing visual content. Recent 10B-level diffusion models (e.g., FLUX.1-Fill-Dev, SD3.5 Large-Inp.) achieve high zero-shot quality but are computationally prohibitive for deployment. The authors ask: Can a highly optimized, lightweight specialist (0.22B parameters) match the performance of 10B-level generalists?

Motivation: Existing efficient models like PixelHacker (0.86B parameters) are still too large for edge deployment. Naively substituting standard operators with lightweight ones (depthwise convolutions, linear attention) triggers a severe representation bottleneck—catastrophic quality degradation in tasks requiring semantic reasoning and spatial-texture alignment.

Theoretical Basis: The paper builds on Latent Diffusion Models (LDM) [32] and Latent Categories Guidance (LCG) [54]. LCG uses semantic embeddings E_LCG ∈ ℝ^(K×D) as global priors injected via cross-attention. The authors identify that existing linear attention (GLA [58]) lacks a formulation for cross-attention, obstructing integration with external priors.

Key Insight: Summarize spatial contexts and global semantic priors into fixed-size linear matrices (denoted λ), enabling linear-complexity self- and cross-attention equivalents. This bypasses the quadratic memory cost of dot-product attention.

Methodology

Overall Pipeline

Moebius adopts the LDM framework with a U-Net denoising backbone. The input is a masked image x_m = x ⊙ (1 - m), where ⊙ is the Hadamard product and m is a binary mask with entries in {0, 1}. The encoder E maps images to latents: z_m = E(x_m), z = E(x). The denoising network ϵ_θ predicts the noise ϵ from the noisy latent z_t and the timestep t. LCG embeddings E_LCG are injected via cross-attention. Teacher: PixelHacker (862M parameters). Student: Moebius (226M parameters).
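As a minimal sketch of the masking step only, the snippet below builds x_m = x ⊙ (1 - m) with NumPy. The 4×4 image, the square hole, and the variable names are illustrative assumptions; the encoder E and the denoiser are not modeled here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 4x4 RGB image with values in [0, 1] (stand-in for a real photo).
x = rng.random((4, 4, 3))

# Binary mask: 1 marks pixels to be inpainted, 0 marks pixels to keep.
# This convention follows from x_m = x * (1 - m), which zeroes where m == 1.
m = np.zeros((4, 4, 1))
m[1:3, 1:3] = 1.0

# Hadamard (element-wise) product; the mask broadcasts over the channel axis.
x_m = x * (1.0 - m)
```

Pixels inside the hole become zero in x_m, while pixels outside it are passed through unchanged.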

Local-λ Mix Interaction (LλMI) Block

The block consists of three submodules (see Fig. 2):

Local-λ (Self-Attention Equivalent): Given input latent X_l ∈ ℝ^(B×H'×W'×C), project to Q_l, K_l, V_l via 1×1 convs. Define:
$\lambda^l_c = \text{softmax}(K_l)^\top V_l, \quad \lambda^l_p = \text{Conv3D}^\text{pos}_{1\times r\times r}(V_l)$
Output:
$Y_l = Q_l \lambda^l_c + Q_l \lambda^l_p$
where r=15 is the local perception window size.
Interactive-λ (Cross-Attention Equivalent): For global prior E_LCG (K×D), project latent X_i to Q_i, and E_LCG to K_i, V_i. Introduce positional embedding E_pos:
$\lambda^i_c = \text{softmax}(K_i)^\top V_i, \quad \lambda^i_p = E_{\text{pos}} V_i$
Output:
$Y_i = Q_i \lambda^i_c + Q_i \lambda^i_p$
Mix-FFN: A lightweight FFN with depthwise-augmented structure (instead of dense linear projections) to minimize parameters.
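The content path shared by Local-λ and Interactive-λ can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: Q, K, and V are taken as already projected, the softmax is normalized over context positions (the LambdaNetworks convention, which the summary does not specify), the positional terms λ_p are omitted, and all sizes are hypothetical.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def content_lambda(K, V):
    """lambda_c = softmax(K)^T V, with shape (k, v) for any context length."""
    # Assumption: normalize over context positions (axis 0).
    return softmax(K, axis=0).T @ V

rng = np.random.default_rng(0)
n, num_tokens, k, v = 64, 20, 16, 32  # hypothetical sizes (8x8 latent, 20 prior tokens)

Q = rng.standard_normal((n, k))  # queries from the latent, one row per position

# Local-lambda style: keys and values come from the latent itself.
K_l = rng.standard_normal((n, k))
V_l = rng.standard_normal((n, v))
lam_self = content_lambda(K_l, V_l)
Y_self = Q @ lam_self

# Interactive-lambda style: keys and values come from an external prior
# (standing in for projected E_LCG tokens).
K_i = rng.standard_normal((num_tokens, k))
V_i = rng.standard_normal((num_tokens, v))
lam_cross = content_lambda(K_i, V_i)
Y_cross = Q @ lam_cross
```

Both λ matrices have shape (k, v) whether the context holds 64 latent positions or 20 prior tokens, so the n×n attention map is never materialized. By associativity, Q @ (softmax(K)ᵀ V) equals (Q @ softmax(K)ᵀ) @ V, which is the same output computed the quadratic way.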

The full LλMI block forward pass (Eq. 3):

$X_1 = \text{Local-}\lambda(\text{LN}(X_{\text{in}})) + X_{\text{in}}$

$X_2 = \text{Interactive-}\lambda(\text{LN}(X_1), E_{\text{LCG}}) + X_1$

$X_{\text{out}} = \text{Mix-FFN}(\text{LN}(X_2)) + X_2$
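The three equations describe a pre-norm residual stack. The sketch below shows only that wiring: the function name lmi_block, the affine-free layer norm, and the stand-in sublayers are illustrative assumptions rather than the paper's modules.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer norm over the channel axis, without learnable scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def lmi_block(x_in, e_lcg, local_fn, interactive_fn, ffn_fn):
    """Compose the three sublayers with pre-norm residual connections."""
    x1 = local_fn(layer_norm(x_in)) + x_in
    x2 = interactive_fn(layer_norm(x1), e_lcg) + x1
    return ffn_fn(layer_norm(x2)) + x2

# Hypothetical stand-ins for Local-lambda, Interactive-lambda, and Mix-FFN.
rng = np.random.default_rng(0)
x_in = rng.standard_normal((16, 8))    # 16 positions, 8 channels
e_lcg = rng.standard_normal((20, 8))   # 20 prior tokens, 8 channels

x_out = lmi_block(
    x_in,
    e_lcg,
    local_fn=lambda h: 0.1 * h,
    interactive_fn=lambda h, e: h + e.mean(axis=0),
    ffn_fn=lambda h: np.tanh(h),
)
```

Because every sublayer is added to its own input, a sublayer that outputs zeros leaves the signal unchanged, which is the usual motivation for this layout.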

Adaptive Multi-Granularity Distillation

Conducted entirely in latent space (no pixel-space decoding) for memory efficiency.

Losses:

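Coarse-grained distillation (16×16 bottleneck): $L_{\mathrm{C\_KD}} = \|\hat{x}^{C}_{T} - \hat{x}^{C}_{S}\|_2^2$
Fine-grained distillation (64×64 output): $L_{\mathrm{F\_KD}} = \|\hat{x}_{T} - \hat{x}_{S}\|_2^2$
Task supervision: $L_{\text{task}} = \|x_0 - \hat{x}_{S}\|_2^2$
Latent perceptual distillation: $L_{\text{perceptual}} = d_{\text{E-LatentLPIPS}}(x_0, \hat{x}_{S})$
Here the subscripts T and S denote the teacher and student predictions, and the superscript C marks the coarse (bottleneck) features.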

Adaptive weighting (based on gradient norms relative to L_task):

$W_{\mathrm{F\_KD}} = \frac{\|G(L_{\text{task}}, \theta_F)\|_2^2}{\|G(L_{\mathrm{F\_KD}}, \theta_F)\|_2^2}, \quad W_{\text{perceptual}} = \frac{\|G(L_{\text{task}}, \theta_F)\|_2^2}{\|G(L_{\text{perceptual}}, \theta_F)\|_2^2}$

Fine-grained output loss:

$L_{\text{out}} = L_{\text{task}} + W_{\mathrm{F\_KD}} \cdot L_{\mathrm{F\_KD}} + W_{\text{perceptual}} \cdot L_{\text{perceptual}}$

Cross-granularity weight:

$W_{\mathrm{C\_task}} = \frac{\|G(L_{\mathrm{C\_KD}}, \theta_C)\|_2^2}{\|G(L_{\text{out}}, \theta_C)\|_2^2}$

Total loss:

$L_{\text{total}} = L_{\mathrm{C\_KD}} + W_{\mathrm{C\_task}} \cdot L_{\text{out}}$
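To make the weighting concrete, here is a small worked example in plain Python. All loss values and gradient vectors are made up for illustration; in training they would come from backpropagation through θ_F and θ_C. The eps term and the treatment of the weights as constants are assumptions of this sketch, not details given in the summary.

```python
def sq_norm(grad):
    """Squared L2 norm of a flat gradient vector."""
    return sum(g * g for g in grad)

def grad_ratio(grad_ref, grad_other, eps=1e-12):
    """||grad_ref||_2^2 / ||grad_other||_2^2 (eps guards against division by zero)."""
    return sq_norm(grad_ref) / (sq_norm(grad_other) + eps)

# Hypothetical scalar loss values.
losses = {"task": 0.5, "f_kd": 0.2, "perceptual": 0.1, "c_kd": 0.3}

# Hypothetical gradients of each output-level loss with respect to theta_F.
grads_theta_f = {
    "task": [3.0, 4.0],         # squared norm 25
    "f_kd": [1.0, 2.0],         # squared norm 5
    "perceptual": [10.0, 0.0],  # squared norm 100
}

# Output-level weights, both measured relative to the task gradient.
w_f_kd = grad_ratio(grads_theta_f["task"], grads_theta_f["f_kd"])
w_perceptual = grad_ratio(grads_theta_f["task"], grads_theta_f["perceptual"])
l_out = (losses["task"]
         + w_f_kd * losses["f_kd"]
         + w_perceptual * losses["perceptual"])

# Hypothetical gradients with respect to theta_C for the cross-granularity weight.
grad_c_kd_theta_c = [2.0, 0.0]  # gradient of L_C_KD, squared norm 4
grad_out_theta_c = [4.0, 0.0]   # gradient of L_out, squared norm 16
w_c_task = grad_ratio(grad_c_kd_theta_c, grad_out_theta_c)

l_total = losses["c_kd"] + w_c_task * l_out
```

In this toy setting the fine-grained KD loss, whose gradient is weaker than the task gradient, is scaled up by about 5, while the perceptual loss, whose gradient is stronger, is scaled down to about 0.25. Each weighted term therefore contributes a gradient of roughly the same magnitude as the task loss.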

Empirical Validation / Results

Efficiency Profiling

Table 1 (key metrics, reproduced):

Model	Params	TFLOPs↓	Latency (ms/step)	Steps	Total Time (s)↓
Moebius	0.226B	0.154	26.01	20	0.52
PixelHacker	0.862B	0.338	46.89	20	0.94
SD3.5 Large-Inp.	8.057B	8.657	151.02	28	4.23
FLUX.1-Fill-Dev	11.902B	9.927	161.01	50	8.05

Moebius is >15× faster than FLUX.1-Fill-Dev in total inference time.
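The total times and the headline ratios can be rechecked from the Table 1 figures with a few lines of Python. The numbers are copied from the table; the dictionary layout and the helper name total_seconds are just for illustration.

```python
# (latency in ms/step, sampling steps, parameters in billions), from Table 1.
table1 = {
    "Moebius": (26.01, 20, 0.226),
    "PixelHacker": (46.89, 20, 0.862),
    "SD3.5 Large-Inp.": (151.02, 28, 8.057),
    "FLUX.1-Fill-Dev": (161.01, 50, 11.902),
}

def total_seconds(latency_ms, steps):
    """Total sampling time: per-step latency times the number of steps."""
    return latency_ms * steps / 1000.0

totals = {
    name: total_seconds(latency, steps)
    for name, (latency, steps, _params) in table1.items()
}

# Ratios of Moebius against FLUX.1-Fill-Dev.
speedup_vs_flux = totals["FLUX.1-Fill-Dev"] / totals["Moebius"]
per_step_speedup = table1["FLUX.1-Fill-Dev"][0] / table1["Moebius"][0]
param_fraction = table1["Moebius"][2] / table1["FLUX.1-Fill-Dev"][2]

for name, seconds in totals.items():
    print(f"{name}: {seconds:.2f} s")
print(f"total speedup: {speedup_vs_flux:.2f}x, per-step: {per_step_speedup:.2f}x")
print(f"parameter fraction: {param_fraction:.2%}")
```

The total speedup comes from two factors: a per-step latency roughly 6.2× lower and 2.5× fewer sampling steps (20 versus 50). The parameter fraction comes out just under 2%.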

Benchmark Results (Selected)

Places2 (Small) – Table 3 excerpt:

Method	FID↓	LPIPS↓
Moebius	0.92	0.091
FLUX.1-Fill-Dev	0.94	0.099
SD3.5 Large-Inp.	3.02	0.105
PixelHacker (teacher)	0.82	0.088

CelebA-HQ (512) – Table 4 excerpt:

Method	FID↓	LPIPS↓
Moebius	5.39	0.122
FLUX.1-Fill-Dev	10.13	0.141
SD3.5 Large-Inp.	11.80	0.134
PixelHacker (teacher)	4.75	0.115

Ablation Study

Table 2 (architectural synergy, Places2 Test):

Exp	Architecture	KD	FID	LPIPS	Param	GFLOPs
1	GLA-CA-FFN, Conv	✗	32.75	0.298	526M	314.3
9	Lλ-Iλ-MixFFN, DWConv	✓	26.43	0.258	226M	154.0
10	Lλ-Iλ-MixFFN, DWConv	✗	33.42	0.312	226M	154.0

Table 5 (distillation objectives):

L_C_KD	L_F_KD	L_task	L_perceptual	FID	LPIPS
✓				74.20	0.367
✓	✓			36.17	0.291
✓	✓	✓		32.59	0.273
✓	✓	✓	✓	26.43	0.258

User Study

Double-blind test (22 participants, 50 cases/scene): Moebius (31.76% preference) matches teacher (32.18%) and significantly beats FLUX.1-Fill-Dev (23.70%) and SD3.5 Large-Inp. (12.36%).

Theoretical and Practical Implications

Bridging the Scale Gap: Moebius demonstrates that extreme architectural compression (0.22B vs. 10B) combined with task-specific distillation can match industrial foundation models on the evaluated benchmarks, suggesting that massive scale is not a prerequisite for specific restoration tasks.
Efficiency Paradigm Shift: The LλMI block provides a new design principle for efficient diffusion backbones—replacing quadratic attention with linear-complexity fixed-size matrix interactions that preserve both spatial and semantic reasoning.
Latent-Only Distillation: Performing multi-granularity distillation entirely in latent space (including perceptual losses) avoids expensive pixel-space decoding, crucial for lightweight training. The adaptive gradient-based balancing removes manual hyperparameter tuning.
Deployment Potential: With >15× inference speedup and <2% of the parameters, Moebius enables high-fidelity inpainting on resource-constrained devices (edge, mobile, real-time applications) without sacrificing quality.
Limitations: The study focuses on task-specific fine-tuning (Places2, CelebA-HQ, FFHQ) and does not evaluate zero-shot generalization to arbitrary scenes. The teacher model (PixelHacker) is also a diffusion model, so the approach inherits its biases.

Conclusion

Moebius is a 0.22B-parameter lightweight image inpainting framework that rivals the generation quality of 10B-level industrial models (FLUX.1-Fill-Dev) while being >15× faster in total inference time. The key innovations are: (1) the LλMI block, which uses Local-λ and Interactive-λ modules to summarize spatial and semantic contexts into fixed-size linear matrices, enabling linear-complexity self- and cross-attention; (2) an adaptive multi-granularity distillation strategy performed entirely in latent space, dynamically balancing gradient-based losses to recover representational capacity lost to extreme compression. Extensive experiments across natural and portrait benchmarks, plus real-world object removal, validate that the optimal synergy between architectural efficiency and distillation enables Moebius to set a new standard for high-fidelity, low-latency image inpainting. Future directions may include extending to zero-shot scenarios and further reducing model size for mobile deployment.