Summary (Overview)
- Extreme Parameter Efficiency: Moebius is a 0.22B-parameter image inpainting specialist that matches or surpasses the generation quality of the 10B-level industrial model FLUX.1-Fill-Dev, using less than 2% of its parameters.
- Novel Lightweight Architecture: The paper introduces the Local-λ Mix Interaction (LλMI) block, which replaces standard attention with linear-complexity Local-λ and Interactive-λ modules that summarize spatial and semantic contexts into fixed-size matrices, drastically reducing parameters while preserving representational capacity.
- Adaptive Multi-Granularity Distillation: A training strategy that performs coarse-grained, fine-grained, and latent perceptual distillation strictly in latent space, dynamically balancing gradient-based losses to compensate for the capacity drop from extreme compression.
- Inference Speed: Moebius achieves 26.01 ms/step latency (0.52s total for 20 steps), delivering a >15× total inference speedup over FLUX.1-Fill-Dev (50 steps, 8.05s total).
- SOTA Performance: On Places2 (Small), Moebius achieves 0.92 FID and 0.091 LPIPS, outperforming FLUX.1-Fill-Dev (0.94 FID, 0.099 LPIPS) and SD3.5 Large-Inp. (3.02 FID, 0.105 LPIPS), while its teacher PixelHacker remains slightly ahead (0.82 FID, 0.088 LPIPS). On CelebA-HQ (512), it achieves 5.39 FID and 0.122 LPIPS, close to PixelHacker (4.75 FID, 0.115 LPIPS).
Introduction and Theoretical Foundation
Background: Image inpainting aims to reconstruct missing visual content. Recent 10B-level diffusion models (e.g., FLUX.1-Fill-Dev, SD3.5 Large-Inp.) achieve high zero-shot quality but are computationally prohibitive for deployment. The authors ask: Can a highly optimized, lightweight specialist (0.22B parameters) match the performance of 10B-level generalists?
Motivation: Existing efficient models like PixelHacker (0.86B parameters) are still too large for edge deployment. Naively substituting standard operators with lightweight ones (depthwise convolutions, linear attention) triggers a severe representation bottleneck: quality degrades catastrophically in tasks that require semantic reasoning and spatial-texture alignment.
Theoretical Basis: The paper builds on Latent Diffusion Models (LDM) [32] and Latent Categories Guidance (LCG) [54]. LCG uses semantic embeddings E_LCG ∈ ℝ^(K×D) as global priors injected via cross-attention. The authors identify that existing linear attention (GLA [58]) lacks a formulation for cross-attention, obstructing integration with external priors.
Key Insight: Summarize spatial contexts and global semantic priors into fixed-size linear matrices (denoted λ), enabling linear-complexity self- and cross-attention equivalents. This bypasses the quadratic memory cost of dot-product attention.
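The idea of collapsing a context into a fixed-size matrix can be sketched in a few lines of NumPy. This is a generic illustration in the style of a content lambda (softmax-normalized keys multiplied by values), not the paper's Local-λ or Interactive-λ formulation; the function names, channel sizes, and token counts are all assumptions made for the example.

```python
import numpy as np

def summarize_context(keys, values):
    """Collapse N context tokens into one (d_k, d_v) matrix."""
    # Normalize each key channel over the N positions (softmax along axis 0).
    shifted = keys - keys.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights /= weights.sum(axis=0, keepdims=True)
    # (d_k, N) @ (N, d_v): the result no longer depends on N.
    return weights.T @ values

def apply_summary(queries, summary):
    """Each query reads the summary with one small matrix product."""
    return queries @ summary

rng = np.random.default_rng(0)
d_k, d_v = 8, 16                  # hypothetical channel sizes
n_latent, n_prior = 64 * 64, 20   # hypothetical token counts

queries = rng.standard_normal((n_latent, d_k))

# Self-attention-like use: the context comes from the latent itself.
lam_self = summarize_context(rng.standard_normal((n_latent, d_k)),
                             rng.standard_normal((n_latent, d_v)))
# Cross-attention-like use: the context comes from external prior embeddings.
lam_cross = summarize_context(rng.standard_normal((n_prior, d_k)),
                              rng.standard_normal((n_prior, d_v)))

out_self = apply_summary(queries, lam_self)
out_cross = apply_summary(queries, lam_cross)
print(lam_self.shape, lam_cross.shape, out_self.shape, out_cross.shape)
```

Both summaries have the same shape regardless of how many context tokens were folded in, so the cost grows linearly with the number of tokens and no N×N attention map is ever materialized.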
Methodology
Overall Pipeline
Moebius adopts the LDM framework with a U-Net denoising backbone. The input is the masked image x_m = x ⊙ (1 - m), where ⊙ is the Hadamard product and m is a binary mask with entries in {0,1}. The latents are z_m = E(x_m) and z = E(x). The denoising network ϵ_θ predicts the noise ϵ given the noisy latent z_t and timestep t. LCG embeddings E_LCG are injected via cross-attention. Teacher: PixelHacker (862M parameters). Student: Moebius (226M parameters).
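The masking step x_m = x ⊙ (1 - m) can be illustrated on a toy array. The image size, channel layout, and hole position below are assumptions chosen for the example; the only thing taken from the text is that mask entries equal to 1 mark the region to be removed.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((3, 8, 8))   # toy RGB image, channels first (assumed layout)

mask = np.zeros((1, 8, 8))      # one mask channel, broadcast over RGB
mask[:, 2:6, 2:6] = 1.0         # 1 marks the missing square region

masked = image * (1.0 - mask)   # Hadamard product with the inverted mask

print(masked[0, 3, 3], np.array_equal(masked[:, 0, :], image[:, 0, :]))
```

Pixels under the mask become zero and everything outside it is left untouched, which is what the encoder then maps to z_m.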
Local-λ Mix Interaction (LλMI) Block
The block consists of three submodules (see Fig. 2):
- Local-λ (Self-Attention Equivalent): Given input latent X_l ∈ ℝ^(B×H'×W'×C), project to Q_l, K_l, V_l via 1×1 convs. The module then defines a local λ matrix and its output, where r=15 is the local perception window size.
- Interactive-λ (Cross-Attention Equivalent): For global prior E_LCG (K×D), project latent X_i to Q_i, and E_LCG to K_i, V_i. A positional embedding E_pos is introduced before the output is computed.
- Mix-FFN: A lightweight FFN with depthwise-augmented structure (instead of dense linear projections) to minimize parameters.
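To give a sense of why depthwise operators save parameters, the sketch below compares weight counts for a dense k×k convolution and a depthwise k×k convolution followed by a 1×1 pointwise convolution. It is a generic calculation, not the Mix-FFN layout from the paper; the channel width of 256 and the helper names are hypothetical, and biases are ignored.

```python
def dense_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv plus a 1 x 1 pointwise conv (no bias)."""
    depthwise = c_in * k * k   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 mixing across channels
    return depthwise + pointwise

channels = 256  # hypothetical width, not taken from the paper
dense = dense_conv_params(channels, channels, 3)
separable = depthwise_separable_params(channels, channels, 3)

print(dense, separable, round(dense / separable, 2))
```

For this width the dense layer holds 589,824 weights against 67,840 for the separable pair, a reduction of roughly 8.7×.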
The full LλMI block forward pass (Eq. 3) combines these three submodules.
Adaptive Multi-Granularity Distillation
Distillation is conducted entirely in latent space, with no pixel-space decoding, for memory efficiency.
Losses:
- Coarse-grained distillation (16×16 bottleneck): L_C_KD = ‖x̂_C_T - x̂_C_S‖₂²
- Fine-grained distillation (64×64 output): L_F_KD = ‖x̂_T - x̂_S‖₂²
- Task supervision: L_task = ‖x₀ - x̂_S‖₂²
- Latent perceptual distillation: L_perceptual = d_E-LatentLPIPS(x₀, x̂_S)
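The three squared-L2 terms above can be sketched on toy latents. The tensor shapes follow the 16×16 and 64×64 resolutions named in the list, but the channel count, the noise levels, and the variable names are assumptions; the latent perceptual term is left out because its encoder is not described here.

```python
import numpy as np

def squared_l2(a, b):
    """Squared L2 distance, summed over all elements."""
    return float(np.sum((a - b) ** 2))

rng = np.random.default_rng(0)
channels = 4  # hypothetical latent channel count

x0 = rng.standard_normal((channels, 64, 64))            # clean target latent
teacher_out = x0 + 0.05 * rng.standard_normal(x0.shape)
student_out = x0 + 0.10 * rng.standard_normal(x0.shape)
teacher_mid = rng.standard_normal((channels, 16, 16))   # toy 16x16 bottleneck output
student_mid = teacher_mid + 0.10 * rng.standard_normal(teacher_mid.shape)

loss_coarse = squared_l2(teacher_mid, student_mid)  # L_C_KD
loss_fine = squared_l2(teacher_out, student_out)    # L_F_KD
loss_task = squared_l2(x0, student_out)             # L_task
print(loss_coarse, loss_fine, loss_task)
```

Because every term is computed on latents, no decoder pass is needed to evaluate any of them.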
Adaptive weighting is based on gradient norms relative to L_task. It yields, in turn, a fine-grained output loss, a cross-granularity weight, and the total loss.
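One common way to balance an auxiliary loss against a task loss by gradient norms is to scale the auxiliary term by the ratio of the two norms. The sketch below shows that heuristic only; it is an assumption about the general shape of such a scheme, not the paper's weighting formula, and the function name, `eps`, `max_weight`, and the toy gradients are all hypothetical.

```python
import numpy as np

def adaptive_weight(grad_task, grad_aux, eps=1e-8, max_weight=1e4):
    """Scale factor that brings the auxiliary gradient to the task gradient's norm."""
    ratio = np.linalg.norm(grad_task) / (np.linalg.norm(grad_aux) + eps)
    # Clip so a vanishing auxiliary gradient cannot blow up the weight.
    return float(np.clip(ratio, 0.0, max_weight))

# Toy gradients taken at a shared layer (hypothetical values).
grad_task = np.array([3.0, 4.0])   # norm 5
grad_kd = np.array([0.6, 0.8])     # norm 1

weight_kd = adaptive_weight(grad_task, grad_kd)

loss_task, loss_kd = 2.0, 0.5      # toy loss values
total = loss_task + weight_kd * loss_kd
print(weight_kd, total)
```

In a real training loop the two gradients would be taken with respect to the same parameters, and the weight would be treated as a constant when backpropagating the total loss.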
Empirical Validation / Results
Efficiency Profiling
Table 1 (key metrics, reproduced):
| Model | Params | TFLOPs↓ | Latency (ms/step) | Steps | Total Time(s)↓ |
|---|---|---|---|---|---|
| Moebius | 0.226B | 0.154 | 26.01 | 20 | 0.52 |
| PixelHacker | 0.862B | 0.338 | 46.89 | 20 | 0.94 |
| SD3.5 Large-Inp. | 8.057B | 8.657 | 151.02 | 28 | 4.23 |
| FLUX.1-Fill-Dev | 11.902B | 9.927 | 161.01 | 50 | 8.05 |
Moebius is >15× faster than FLUX.1-Fill-Dev in total inference time.
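The total-time column and the headline speedup follow directly from the per-step latency and step count in Table 1. The short check below recomputes them; the dictionary layout and variable names are just for illustration.

```python
# (latency in ms/step, sampling steps), copied from Table 1
table = {
    "Moebius": (26.01, 20),
    "PixelHacker": (46.89, 20),
    "SD3.5 Large-Inp.": (151.02, 28),
    "FLUX.1-Fill-Dev": (161.01, 50),
}

# Total time in seconds = latency per step * number of steps / 1000
totals = {name: ms * steps / 1000.0 for name, (ms, steps) in table.items()}

total_speedup = totals["FLUX.1-Fill-Dev"] / totals["Moebius"]
per_step_speedup = table["FLUX.1-Fill-Dev"][0] / table["Moebius"][0]
param_ratio = 0.226 / 11.902  # Moebius params / FLUX.1-Fill-Dev params

for name, seconds in totals.items():
    print(f"{name}: {seconds:.2f} s")
print(f"total speedup: {total_speedup:.2f}x, per step: {per_step_speedup:.2f}x")
print(f"parameter ratio: {param_ratio:.1%}")
```

The per-step gain is about 6.2×; the remaining factor of 2.5 comes from running 20 sampling steps instead of 50.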
Benchmark Results (Selected)
Places2 (Small) – Table 3 excerpt:
| Method | FID↓ | LPIPS↓ |
|---|---|---|
| Moebius | 0.92 | 0.091 |
| FLUX.1-Fill-Dev | 0.94 | 0.099 |
| SD3.5 Large-Inp. | 3.02 | 0.105 |
| PixelHacker (teacher) | 0.82 | 0.088 |
CelebA-HQ (512) – Table 4 excerpt:
| Method | FID↓ | LPIPS↓ |
|---|---|---|
| Moebius | 5.39 | 0.122 |
| FLUX.1-Fill-Dev | 10.13 | 0.141 |
| SD3.5 Large-Inp. | 11.80 | 0.134 |
| PixelHacker (teacher) | 4.75 | 0.115 |
Ablation Study
Table 2 (architectural synergy, Places2 Test):
| Exp | Architecture | KD | FID | LPIPS | Param | GFLOPs |
|---|---|---|---|---|---|---|
| 1 | GLA-CA-FFN, Conv | ✗ | 32.75 | 0.298 | 526M | 314.3 |
| 9 | Lλ-Iλ-MixFFN, DWConv | ✓ | 26.43 | 0.258 | 226M | 154.0 |
| 10 | Lλ-Iλ-MixFFN, DWConv | ✗ | 33.42 | 0.312 | 226M | 154.0 |
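The effect of distillation on the final architecture can be read off by comparing Exp 9 and Exp 10, which differ only in the KD column. The snippet below computes the relative changes from the values in Table 2; the variable names are illustrative.

```python
# Values copied from Table 2 (Places2 Test)
exp1 = {"fid": 32.75, "lpips": 0.298, "params_m": 526}   # GLA-CA-FFN, Conv, no KD
exp9 = {"fid": 26.43, "lpips": 0.258, "params_m": 226}   # final architecture, with KD
exp10 = {"fid": 33.42, "lpips": 0.312, "params_m": 226}  # final architecture, no KD

# Relative improvement from adding distillation to the same architecture
fid_gain = (exp10["fid"] - exp9["fid"]) / exp10["fid"]
lpips_gain = (exp10["lpips"] - exp9["lpips"]) / exp10["lpips"]

# Parameter reduction of the final architecture relative to Exp 1
param_cut = 1.0 - exp9["params_m"] / exp1["params_m"]

print(f"FID reduction from KD: {fid_gain:.1%}")
print(f"LPIPS reduction from KD: {lpips_gain:.1%}")
print(f"Parameter reduction vs. Exp 1: {param_cut:.1%}")
```

Distillation lowers FID by about 21% and LPIPS by about 17% at an unchanged 226M parameters, while the architecture itself has about 57% fewer parameters than the Exp 1 baseline.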
Table 5 (distillation objectives):
| L_C_KD | L_F_KD | L_task | L_perceptual | FID | LPIPS |
|---|---|---|---|---|---|
| ✓ | | | | 74.20 | 0.367 |
| ✓ | ✓ | | | 36.17 | 0.291 |
| ✓ | ✓ | ✓ | | 32.59 | 0.273 |
| ✓ | ✓ | ✓ | ✓ | 26.43 | 0.258 |
User Study
Double-blind test (22 participants, 50 cases/scene): Moebius (31.76% preference) is on par with its teacher (32.18%) and is preferred well ahead of FLUX.1-Fill-Dev (23.70%) and SD3.5 Large-Inp. (12.36%).
Theoretical and Practical Implications
- Bridging the Scale Gap: Moebius demonstrates that extreme architectural compression (0.22B vs. 10B) combined with task-specific distillation can match industrial foundation models on the evaluated benchmarks, suggesting that massive scale is not required for specific restoration tasks.
- Efficiency Paradigm Shift: The LλMI block offers a design principle for efficient diffusion backbones: replace quadratic attention with linear-complexity, fixed-size matrix interactions that preserve both spatial and semantic reasoning.
- Latent-Only Distillation: Performing multi-granularity distillation entirely in latent space (including perceptual losses) avoids expensive pixel-space decoding, which matters for lightweight training. The adaptive gradient-based balancing reduces the need for manual loss-weight tuning.
- Deployment Potential: With >15× inference speedup and <2% of the parameters, Moebius enables high-fidelity inpainting on resource-constrained devices (edge, mobile, real-time applications) without sacrificing quality.
- Limitations: The study focuses on task-specific fine-tuning (Places2, CelebA-HQ, FFHQ) and does not evaluate zero-shot generalization to arbitrary scenes. The teacher model (PixelHacker) is also a diffusion model, so the approach inherits its biases.
Conclusion
Moebius is a 0.22B-parameter lightweight image inpainting framework that rivals the generation quality of 10B-level industrial models (FLUX.1-Fill-Dev) while being >15× faster in total inference time. The key innovations are: (1) the LλMI block, which uses Local-λ and Interactive-λ modules to summarize spatial and semantic contexts into fixed-size linear matrices, enabling linear-complexity self- and cross-attention; (2) an adaptive multi-granularity distillation strategy performed entirely in latent space, dynamically balancing gradient-based losses to recover representational capacity lost to extreme compression. Extensive experiments across natural and portrait benchmarks, plus real-world object removal, validate that the optimal synergy between architectural efficiency and distillation enables Moebius to set a new standard for high-fidelity, low-latency image inpainting. Future directions may include extending to zero-shot scenarios and further reducing model size for mobile deployment.