# Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

> Moebius matches 10B-level inpainting quality with 0.22B parameters and 15× speedup, using novel linear-complexity Local-λ attention blocks.

- **Source:** [arXiv](https://arxiv.org/abs/2606.19195)
- **Published:** 2026-06-20
- **Permalink:** https://picx.dev/p/nHYYy2
- **Whiteboard:** https://picx.dev/p/nHYYy2/image

## Summary (Overview)

- **Extreme Parameter Efficiency**: Moebius is a 0.22B-parameter image inpainting specialist that matches or surpasses the generation quality of the 10B-level industrial model FLUX.1-Fill-Dev, using less than 2% of its parameters.
- **Novel Lightweight Architecture**: The paper introduces the Local-λ Mix Interaction (LλMI) block, which replaces standard attention with linear-complexity Local-λ and Interactive-λ modules that summarize spatial and semantic contexts into fixed-size matrices, drastically reducing parameters while preserving representational capacity.
- **Adaptive Multi-Granularity Distillation**: A training strategy that performs coarse-grained, fine-grained, and latent perceptual distillation strictly in latent space, dynamically balancing gradient-based losses to compensate for the capacity drop from extreme compression.
- **Inference Speed**: Moebius achieves 26.01 ms/step latency (0.52s total for 20 steps), delivering a >15× total inference speedup over FLUX.1-Fill-Dev (50 steps, 8.05s total).
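- **SOTA Performance**: On Places2 (Small), Moebius achieves 0.92 FID and 0.091 LPIPS, outperforming FLUX.1-Fill-Dev (0.94 FID, 0.099 LPIPS) and the other diffusion-based baselines; in the excerpted results, only its teacher PixelHacker scores better (0.82 FID, 0.088 LPIPS). On CelebA-HQ (512), it achieves 5.39 FID and 0.122 LPIPS, close to PixelHacker (4.75 FID, 0.115 LPIPS) and well ahead of FLUX.1-Fill-Dev (10.13 FID, 0.141 LPIPS).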

## Introduction and Theoretical Foundation

**Background**: Image inpainting aims to reconstruct missing visual content. Recent 10B-level diffusion models (e.g., FLUX.1-Fill-Dev, SD3.5 Large-Inp.) achieve high zero-shot quality but are computationally prohibitive for deployment. The authors ask: Can a highly optimized, lightweight specialist (0.22B parameters) match the performance of 10B-level generalists?

**Motivation**: Existing efficient models like PixelHacker (0.86B parameters) are still too large for edge deployment. Naively substituting standard operators with lightweight ones (depthwise convolutions, linear attention) triggers a severe representation bottleneck—catastrophic quality degradation in tasks requiring semantic reasoning and spatial-texture alignment.

**Theoretical Basis**: The paper builds on Latent Diffusion Models (LDM) [32] and Latent Categories Guidance (LCG) [54]. LCG uses semantic embeddings E_LCG ∈ ℝ^(K×D) as global priors injected via cross-attention. The authors identify that existing linear attention (GLA [58]) lacks a formulation for cross-attention, obstructing integration with external priors.

**Key Insight**: Summarize spatial contexts and global semantic priors into fixed-size linear matrices (denoted λ), enabling linear-complexity self- and cross-attention equivalents. This bypasses the quadratic memory cost of dot-product attention.
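
A rough cost comparison makes the saving concrete. The counts below are the standard ones for these matrix products rather than figures quoted from the paper, they ignore the positional terms, and N (number of latent tokens), d_k and d_v (key and value widths) are symbols introduced here for illustration:

$$\text{dot-product attention: } \underbrace{\text{softmax}(QK^\top)}_{N \times N}\,V \;\Rightarrow\; O(N^2 d_k + N^2 d_v) \text{ time},\quad O(N^2) \text{ memory for the attention map}$$
$$\lambda\text{-style summary: } \lambda = \text{softmax}(K)^\top V \in \mathbb{R}^{d_k \times d_v},\quad Y = Q\lambda \;\Rightarrow\; O(N d_k d_v) \text{ time},\quad O(d_k d_v) \text{ memory for } \lambda$$

For example, with N = 64 × 64 = 4096 tokens and a hypothetical d_k = d_v = 64, the attention map has 4096² ≈ 16.8M entries, whereas λ has only 64 × 64 = 4096 entries regardless of N.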

## Methodology

### Overall Pipeline
Moebius adopts the LDM framework with a U-Net denoising backbone. The input is the masked image x_m = x ⊙ (1 - m), where ⊙ is the Hadamard product and m ∈ {0,1}^(H×W) is a binary mask. The encoder E maps images to latents: z_m = E(x_m), z = E(x). The denoising network ϵ_θ predicts the noise ϵ from the noisy latent z_t and timestep t, with LCG embeddings E_LCG injected via cross-attention. The teacher is PixelHacker (862M parameters); the student is Moebius (226M parameters).
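
The masking step can be illustrated with a few lines of NumPy. This is a minimal sketch, not the authors' preprocessing code: the function name `mask_image`, the (H, W) mask layout, and the toy 4×4 image are assumptions made for the example.

```python
import numpy as np

def mask_image(x, m):
    """Return x_m = x * (1 - m); entries with m == 1 are the region to inpaint."""
    # Assumption: m has shape (H, W); a trailing axis lets it broadcast over channels.
    return x * (1.0 - m[..., None])

# Toy 4x4 RGB image with a 2x2 hole in the centre (values are illustrative only).
x = np.ones((4, 4, 3), dtype=np.float32)
m = np.zeros((4, 4), dtype=np.float32)
m[1:3, 1:3] = 1.0

x_m = mask_image(x, m)
print(x_m[..., 0])  # the first channel shows the zeroed-out hole
```

Pixels where m = 1 are zeroed and everything else passes through unchanged, which is what the Hadamard product with (1 - m) expresses.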

### Local-λ Mix Interaction (LλMI) Block
The block consists of three submodules (see Fig. 2):

1. **Local-λ (Self-Attention Equivalent)**: Given input latent X_l ∈ ℝ^(B×H'×W'×C), project to Q_l, K_l, V_l via 1×1 convs. Define:
   
   $$\lambda^l_c = \text{softmax}(K_l)^\top V_l, \quad \lambda^l_p = \text{Conv3D}^\text{pos}_{1\times r\times r}(V_l)$$
   
   Output:
   
   $$Y_l = Q_l \lambda^l_c + Q_l \lambda^l_p$$
   
   where r=15 is the local perception window size.

2. **Interactive-λ (Cross-Attention Equivalent)**: For global prior E_LCG (K×D), project latent X_i to Q_i, and E_LCG to K_i, V_i. Introduce positional embedding E_pos:
   
   $$\lambda^i_c = \text{softmax}(K_i)^\top V_i, \quad \lambda^i_p = E_{\text{pos}} V_i$$
   
   Output:
   
   $$Y_i = Q_i \lambda^i_c + Q_i \lambda^i_p$$

3. **Mix-FFN**: A lightweight FFN with depthwise-augmented structure (instead of dense linear projections) to minimize parameters.
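
The content terms of the first two submodules can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the softmax is assumed to normalise over the context-token axis, the positional terms (λ_p) are omitted, the prior width is assumed equal to the latent width C, the projection weights are random and shared between both paths for brevity, and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def content_lambda(keys, values):
    """Summarise M context tokens into one (d_k, d_v) matrix: softmax(K)^T V."""
    # Assumption: the softmax normalises over the context-token axis (axis 0).
    return softmax(keys, axis=0).T @ values

rng = np.random.default_rng(0)
C, d_k, d_v = 32, 16, 32  # hypothetical channel sizes
W_q, W_k, W_v = (rng.normal(size=(C, d)) for d in (d_k, d_k, d_v))

def local_lambda_content(x):
    """Content term of Local-λ: queries, keys and values all come from x, shape (N, C)."""
    lam = content_lambda(x @ W_k, x @ W_v)
    return (x @ W_q) @ lam, lam

def interactive_lambda_content(x, e_prior):
    """Content term of Interactive-λ: keys and values come from the prior, shape (K, C)."""
    lam = content_lambda(e_prior @ W_k, e_prior @ W_v)
    return (x @ W_q) @ lam, lam

x_small = rng.normal(size=(16 * 16, C))  # 256 latent tokens
x_large = rng.normal(size=(64 * 64, C))  # 4096 latent tokens
e_lcg = rng.normal(size=(20, C))         # stand-in for E_LCG; 20 rows is arbitrary

y_small, lam_small = local_lambda_content(x_small)
y_large, lam_large = local_lambda_content(x_large)
y_inter, lam_inter = interactive_lambda_content(x_large, e_lcg)
print(lam_small.shape, lam_large.shape, lam_inter.shape)
```

The point of the sketch is that λ has the same (d_k, d_v) shape whether it summarises 256 tokens, 4096 tokens, or the prior embeddings, so no N × N attention map is ever materialised.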

The full LλMI block forward pass (Eq. 3):
$$X_1 = \text{Local-}\lambda(\text{LN}(X_{\text{in}})) + X_{\text{in}}$$
$$X_2 = \text{Interactive-}\lambda(\text{LN}(X_1), E_{\text{LCG}}) + X_1$$
$$X_{\text{out}} = \text{Mix-FFN}(\text{LN}(X_2)) + X_2$$
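
The wiring in Eq. 3 is a pre-norm residual stack, which the following sketch mirrors. It is only a structural illustration: `lmi_block` and the stand-in sub-modules are hypothetical names, the layer norm has no learnable scale or shift, and the three callables are placeholders rather than the real Local-λ, Interactive-λ, and Mix-FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise over the channel axis (no learnable affine, for simplicity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def lmi_block(x_in, e_lcg, local_fn, interactive_fn, ffn_fn):
    """Compose three sub-modules with pre-norm residual connections, as in Eq. 3."""
    x1 = local_fn(layer_norm(x_in)) + x_in
    x2 = interactive_fn(layer_norm(x1), e_lcg) + x1
    return ffn_fn(layer_norm(x2)) + x2

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))      # 8 tokens, 4 channels (toy sizes)
e_lcg = rng.normal(size=(3, 4))  # stand-in for the LCG embeddings

# Placeholder sub-modules, chosen only to exercise the wiring.
out = lmi_block(
    x, e_lcg,
    local_fn=lambda h: 0.5 * h,
    interactive_fn=lambda h, e: h + e.mean(axis=0),
    ffn_fn=np.tanh,
)

# With sub-modules that output zeros, the residual paths return the input unchanged.
identity_out = lmi_block(
    x, e_lcg,
    local_fn=np.zeros_like,
    interactive_fn=lambda h, e: np.zeros_like(h),
    ffn_fn=np.zeros_like,
)
```

Because every sub-module sits on a residual branch, each one only needs to learn a correction to its input.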

### Adaptive Multi-Granularity Distillation
Distillation is conducted entirely in latent space, with no pixel-space decoding, to save memory.

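**Losses** (subscripts T and S denote the teacher and student predictions; C marks the coarse bottleneck features):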
- Coarse-grained distillation (16×16 bottleneck): L_C_KD = ‖x̂_C_T - x̂_C_S‖₂²
- Fine-grained distillation (64×64 output): L_F_KD = ‖x̂_T - x̂_S‖₂²
- Task supervision: L_task = ‖x₀ - x̂_S‖₂²
- Latent perceptual distillation: L_perceptual = d_E-LatentLPIPS(x₀, x̂_S)

**Adaptive weighting** (based on gradient norms relative to L_task, where G(L, θ) denotes the gradient of loss L with respect to parameters θ):
$$W_{\text{F\_KD}} = \frac{\|G(L_{\text{task}}, \theta_F)\|_2^2}{\|G(L_{\text{F\_KD}}, \theta_F)\|_2^2}, \quad W_{\text{perceptual}} = \frac{\|G(L_{\text{task}}, \theta_F)\|_2^2}{\|G(L_{\text{perceptual}}, \theta_F)\|_2^2}$$

Fine-grained output loss:
$$L_{\text{out}} = L_{\text{task}} + W_{\text{F\_KD}} \cdot L_{\text{F\_KD}} + W_{\text{perceptual}} \cdot L_{\text{perceptual}}$$

Cross-granularity weight:
$$W_{\text{C\_task}} = \frac{\|G(L_{\text{C\_KD}}, \theta_C)\|_2^2}{\|G(L_{\text{out}}, \theta_C)\|_2^2}$$

Total loss:
$$L_{\text{total}} = L_{\text{C\_KD}} + W_{\text{C\_task}} \cdot L_{\text{out}}$$
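
The weighting scheme above can be traced with a small pure-Python sketch. It takes per-loss gradient vectors as inputs instead of running autograd, and it assumes the weights are treated as constants when forming the gradient of L_out with respect to θ_C. The function names, the `eps` guard, and every number below are made up for illustration; which layers θ_F and θ_C refer to is not specified here.

```python
def sq_norm(grad):
    """Squared L2 norm of a gradient given as a flat list of floats."""
    return sum(g * g for g in grad)

def combine_losses(losses, g_fine, g_coarse, eps=1e-12):
    """losses: scalar loss values; g_fine / g_coarse: gradients w.r.t. θ_F / θ_C."""
    w_f_kd = sq_norm(g_fine["task"]) / (sq_norm(g_fine["f_kd"]) + eps)
    w_perc = sq_norm(g_fine["task"]) / (sq_norm(g_fine["perceptual"]) + eps)
    l_out = losses["task"] + w_f_kd * losses["f_kd"] + w_perc * losses["perceptual"]

    # Gradient of L_out w.r.t. θ_C, treating the weights as constants (assumption).
    g_out_c = [
        t + w_f_kd * f + w_perc * p
        for t, f, p in zip(g_coarse["task"], g_coarse["f_kd"], g_coarse["perceptual"])
    ]
    w_c_task = sq_norm(g_coarse["c_kd"]) / (sq_norm(g_out_c) + eps)
    total = losses["c_kd"] + w_c_task * l_out
    return total, {"w_f_kd": w_f_kd, "w_perceptual": w_perc, "w_c_task": w_c_task}

# Toy numbers, chosen so the ratios are easy to follow by hand.
losses = {"task": 0.2, "f_kd": 0.01, "perceptual": 0.1, "c_kd": 0.5}
g_fine = {"task": [3.0, 4.0], "f_kd": [1.0, 0.0], "perceptual": [0.0, 5.0]}
g_coarse = {"task": [1.0, 0.0], "f_kd": [0.0, 0.0],
            "perceptual": [0.0, 1.0], "c_kd": [2.0, 2.0]}

total, weights = combine_losses(losses, g_fine, g_coarse)
print(total, weights)
```

In this toy case the fine-grained distillation gradient is 25 times smaller (in squared norm) than the task gradient, so its loss is scaled up by about 25, which is how the scheme keeps any single objective from dominating.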

## Empirical Validation / Results

### Efficiency Profiling
Table 1 (key metrics, reproduced):

| Model | Params | TFLOPs↓ | Latency (ms/step)↓ | Steps | Total time (s)↓ |
|-------|--------|---------|-------------------|-------|----------------|
| Moebius | 0.226B | 0.154 | 26.01 | 20 | 0.52 |
| PixelHacker | 0.862B | 0.338 | 46.89 | 20 | 0.94 |
| SD3.5 Large-Inp. | 8.057B | 8.657 | 151.02 | 28 | 4.23 |
| FLUX.1-Fill-Dev | 11.902B | 9.927 | 161.01 | 50 | 8.05 |

Moebius is >15× faster than FLUX.1-Fill-Dev in total inference time.
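
The total-time column and the headline ratios follow directly from the per-step latency and step counts in Table 1. The short script below recomputes them; the dictionary layout and helper name are choices made for this example only.

```python
# (latency in ms/step, sampling steps, parameters in billions), copied from Table 1
table1 = {
    "Moebius": (26.01, 20, 0.226),
    "PixelHacker": (46.89, 20, 0.862),
    "SD3.5 Large-Inp.": (151.02, 28, 8.057),
    "FLUX.1-Fill-Dev": (161.01, 50, 11.902),
}

def total_seconds(latency_ms, steps):
    """Total sampling time in seconds = per-step latency x number of steps."""
    return latency_ms * steps / 1000.0

totals = {name: total_seconds(lat, steps) for name, (lat, steps, _) in table1.items()}
speedup = totals["FLUX.1-Fill-Dev"] / totals["Moebius"]
param_ratio = table1["Moebius"][2] / table1["FLUX.1-Fill-Dev"][2]

for name, seconds in totals.items():
    print(f"{name}: {seconds:.2f} s")
print(f"speedup vs FLUX.1-Fill-Dev: {speedup:.1f}x, parameter ratio: {param_ratio:.1%}")
```

The speedup works out to roughly 15.5× and the parameter ratio to about 1.9%. The step counts differ (20 vs. 50), so part of the end-to-end gain comes from needing fewer sampling steps rather than from per-step latency alone, where the ratio is closer to 6×.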

### Benchmark Results (Selected)
**Places2 (Small)**, Table 3 excerpt (bold highlights the Moebius row; PixelHacker is its teacher):

| Method | FID↓ | LPIPS↓ |
|--------|------|--------|
| Moebius | **0.92** | **0.091** |
| FLUX.1-Fill-Dev | 0.94 | 0.099 |
| SD3.5 Large-Inp. | 3.02 | 0.105 |
| PixelHacker (teacher) | 0.82 | 0.088 |

**CelebA-HQ (512)**, Table 4 excerpt (bold highlights the Moebius row; PixelHacker is its teacher):

| Method | FID↓ | LPIPS↓ |
|--------|------|--------|
| Moebius | **5.39** | **0.122** |
| FLUX.1-Fill-Dev | 10.13 | 0.141 |
| SD3.5 Large-Inp. | 11.80 | 0.134 |
| PixelHacker (teacher) | 4.75 | 0.115 |

### Ablation Study
Table 2 (architectural synergy, Places2 Test; selected rows):

| Exp | Architecture | KD | FID↓ | LPIPS↓ | Params | GFLOPs↓ |
|-----|-------------|----|-----|-------|-------|--------|
| 1 | GLA-CA-FFN, Conv | ✗ | 32.75 | 0.298 | 526M | 314.3 |
| 9 | Lλ-Iλ-MixFFN, DWConv | ✓ | **26.43** | **0.258** | **226M** | **154.0** |
| 10 | Lλ-Iλ-MixFFN, DWConv | ✗ | 33.42 | 0.312 | 226M | 154.0 |

Table 5 (distillation objectives):

| L_C_KD | L_F_KD | L_task | L_perceptual | FID↓ | LPIPS↓ |
|--------|--------|--------|--------------|-----|-------|
| ✓ | | | | 74.20 | 0.367 |
| ✓ | ✓ | | | 36.17 | 0.291 |
| ✓ | ✓ | ✓ | | 32.59 | 0.273 |
| ✓ | ✓ | ✓ | ✓ | **26.43** | **0.258** |

### User Study
In a double-blind study (22 participants, 50 cases per scene), Moebius received 31.76% of preferences, close to its teacher PixelHacker (32.18%) and well ahead of FLUX.1-Fill-Dev (23.70%) and SD3.5 Large-Inp. (12.36%).

## Theoretical and Practical Implications

- **Bridging the Scale Gap**: Moebius demonstrates that extreme architectural compression (0.22B vs. 10B) combined with task-specific distillation can match industrial foundation models on the evaluated benchmarks, suggesting that massive scale is not required for specific restoration tasks.
- **Efficiency Paradigm Shift**: The LλMI block provides a new design principle for efficient diffusion backbones—replacing quadratic attention with linear-complexity fixed-size matrix interactions that preserve both spatial and semantic reasoning.
- **Latent-Only Distillation**: Performing multi-granularity distillation entirely in latent space (including perceptual losses) avoids expensive pixel-space decoding, crucial for lightweight training. The adaptive gradient-based balancing removes manual hyperparameter tuning.
- **Deployment Potential**: With >15× inference speedup and <2% of the parameters, Moebius enables high-fidelity inpainting on resource-constrained devices (edge, mobile, real-time applications) without sacrificing quality.
- **Limitations**: The study focuses on task-specific fine-tuning (Places2, CelebA-HQ, FFHQ) and does not evaluate zero-shot generalization to arbitrary scenes. The teacher model (PixelHacker) is also a diffusion model, so the approach inherits its biases.

## Conclusion

Moebius is a 0.22B-parameter lightweight image inpainting framework that rivals the generation quality of 10B-level industrial models (FLUX.1-Fill-Dev) while being >15× faster in total inference time. The key innovations are: (1) the LλMI block, which uses Local-λ and Interactive-λ modules to summarize spatial and semantic contexts into fixed-size linear matrices, enabling linear-complexity self- and cross-attention; and (2) an adaptive multi-granularity distillation strategy performed entirely in latent space, which dynamically balances gradient-based losses to recover representational capacity lost to extreme compression. Experiments across natural and portrait benchmarks, plus real-world object removal, indicate that combining architectural efficiency with distillation yields high-fidelity, low-latency image inpainting. Future directions may include extending to zero-shot scenarios and further reducing model size for mobile deployment.

---

_Markdown view of https://picx.dev/p/nHYYy2, served by PicX — AI-generated visual whiteboard summaries of research papers._
