Lance: Unified Multimodal Modeling by Multi-Task Synergy - Summary

Summary (Overview)

  • Core Contribution: Presents Lance, a lightweight (3B activated parameters) native unified multimodal model that supports the full spectrum of image and video tasks—understanding, generation, and editing—within a single framework.
  • Key Principles: Built on unified context learning (shared interleaved multimodal sequences) and decoupled capability pathways (specialized processing for understanding vs. generation).
  • Architectural Innovation: Employs a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL, with separate experts for understanding (LLM_UND) and generation (LLM_GEN), and introduces Modality-Aware Rotary Positional Encoding (MaPE) to mitigate interference between heterogeneous visual tokens.
  • Training Paradigm: Uses a staged multi-task training strategy (Pre-Training, Continual Training, Supervised Fine-Tuning, Reinforcement Learning) with adaptive data scheduling to harness cross-task synergy and promote transfer.
  • Strong Performance: Achieves competitive or state-of-the-art results on image/video generation (GenEval, DPG-Bench, VBench) and understanding (MVBench) benchmarks, outperforming larger open-source unified models while being trained on a resource-efficient 128-GPU budget.

Introduction and Theoretical Foundation

The field of multimodal AI is moving towards native unified models that integrate understanding, reasoning, and generation. However, a fundamental challenge remains: the visual representation requirements for understanding (high-level semantic features aligned with language) and generation (low-level continuous representations preserving texture and dynamics) are inherently misaligned. Existing unified models often struggle to balance these needs or have limited task coverage, largely confined to text-image domains.

Lance is motivated by the observation that models with broader task coverage (see Table 1) are more likely to exhibit emergent generalization on unseen tasks. This suggests multi-task learning is not just capability aggregation but a mechanism for promoting cross-modal and cross-task transfer. Lance's design is grounded in two core principles to address the representational mismatch:

  1. Unified Context Learning: Enables different tasks to interact within a shared interleaved multimodal sequence.
  2. Decoupled Capability Pathways: Mitigates interference by allocating dedicated capacity and representations to understanding and generation.

Methodology

Overall Architecture

Lance employs a dual-expert architecture over a shared interleaved multimodal sequence (see Figure 6).

  • Input Tokenization:
    • Text: Embedded using Qwen2.5-VL's language embedding layer.
    • Understanding Visual Inputs: Encoded by the Qwen2.5-VL ViT encoder to produce compact semantic visual tokens.
    • Generation Visual Inputs: Encoded by the Wan2.2 3D causal VAE encoder into continuous VAE latent tokens (clean and noisy), projected via an MLP connector.
  • Unified Sequence Formulation: Each sample is represented as a sequence $S$: $S = \cdots \oplus B_{\text{text}}(T) \oplus B_{\text{vis}}(V_{\text{vit}}) \oplus B_{\text{vis}}(V_{\text{vae}}^{\text{clean}}) \oplus B_{\text{vis}}(V_{\text{vae}}^{\text{noisy}}) \oplus B_{\text{text}}(T') \oplus \cdots$, where $B_{\text{text}}(T) = [\text{BOT}, T, \text{EOT}]$ and $B_{\text{vis}}(V) = [\text{BOV}, V, \text{EOV}]$.
  • Attention: Uses generalized 3D causal attention—causal across segments, bidirectional within visual segments.
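The sequence wrapping and attention rule above can be illustrated with a small sketch. This is not the authors' implementation: the function names `build_sequence` and `attention_mask` are hypothetical, text segments are assumed to be causal internally, and the BOV/EOV delimiters are assumed to belong to their visual segment. Any special handling of noisy VAE tokens in the actual model is not covered here.

```python
def build_sequence(segments):
    """Wrap each (kind, tokens) segment in its delimiters and flatten.

    kind is "text" or "vis". Returns the flat token list plus, for every
    token, the index and kind of the segment it belongs to.
    """
    tokens, seg_ids, kinds = [], [], []
    for seg_id, (kind, body) in enumerate(segments):
        if kind == "text":
            wrapped = ["BOT", *body, "EOT"]
        else:
            wrapped = ["BOV", *body, "EOV"]
        tokens.extend(wrapped)
        seg_ids.extend([seg_id] * len(wrapped))
        kinds.extend([kind] * len(wrapped))
    return tokens, seg_ids, kinds


def attention_mask(seg_ids, kinds):
    """mask[q][k] is True when query token q may attend to key token k."""
    n = len(seg_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if seg_ids[k] < seg_ids[q]:
                # Causal across segments: every earlier segment is visible.
                mask[q][k] = True
            elif seg_ids[k] == seg_ids[q]:
                # Bidirectional inside a visual segment; causal inside text
                # (the text rule is an assumption of this sketch).
                mask[q][k] = kinds[q] == "vis" or k <= q
    return mask


# Toy example: a prompt, one visual segment, then a short text reply.
segments = [("text", ["a", "cat"]), ("vis", ["v0", "v1"]), ("text", ["ok"])]
tokens, seg_ids, kinds = build_sequence(segments)
mask = attention_mask(seg_ids, kinds)
```

In the toy example, `v0` can attend to `v1` and vice versa, while no token in the first text segment can attend to the visual segment that follows it.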

Decoupled Pathways and Objectives

  • Understanding Expert (LLM_UND): Processes text and semantic visual tokens. Its hidden states are mapped by an LM head and optimized with next-token prediction loss: $\mathcal{L}_{\text{UND}} = -\sum_i \log p_{\theta_{\text{UND}}}(y_i \mid y_{<i}, S)$
  • Generation Expert (LLM_GEN): Processes VAE latent tokens. Its hidden states are projected and passed to a flow prediction head, optimized with: $\mathcal{L}_{\text{GEN}} = \mathbb{E}_{x_0, x_1, t} \left[ \| v_{\theta_{\text{GEN}}}(x_t, S, t) - (x_1 - x_0) \|_2^2 \right]$, where $x_t = t x_1 + (1-t) x_0$ and $t \sim \mathcal{U}(0,1)$.
  • Overall Objective: $\mathcal{L} = \lambda_u \mathcal{L}_{\text{UND}} + \lambda_g \mathcal{L}_{\text{GEN}}$
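As a rough illustration of how the three objectives fit together, the sketch below evaluates them for a single sample using plain Python lists. All function names are hypothetical, the default weights of 1.0 are placeholders (the summary does not state the values of λ_u and λ_g), and the expectation over x_0, x_1, and t is replaced by a single draw.

```python
import math


def interpolate(x0, x1, t):
    """x_t = t * x1 + (1 - t) * x0, applied element-wise."""
    return [t * b + (1 - t) * a for a, b in zip(x0, x1)]


def flow_matching_loss(v_pred, x0, x1):
    """Squared L2 distance between the predicted velocity and (x1 - x0)."""
    target = [b - a for a, b in zip(x0, x1)]
    return sum((p - g) ** 2 for p, g in zip(v_pred, target))


def next_token_loss(target_probs):
    """Negative log-likelihood given the probability assigned to each target token."""
    return -sum(math.log(p) for p in target_probs)


def total_loss(l_und, l_gen, lambda_u=1.0, lambda_g=1.0):
    """Weighted sum of both losses; the default weights are placeholders."""
    return lambda_u * l_und + lambda_g * l_gen


# Toy values: two latent dimensions and two target tokens.
x0, x1 = [0.0, 0.0], [2.0, 4.0]
x_t = interpolate(x0, x1, 0.5)
l_gen = flow_matching_loss([0.0, 0.0], x0, x1)   # a deliberately poor prediction
l_und = next_token_loss([0.5, 0.25])
loss = total_loss(l_und, l_gen)
```

A prediction equal to `x1 - x0` drives the generation term to zero regardless of `t`, which is why the regression target does not depend on the sampled time step.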

Modality-Aware Rotary Positional Encoding (MaPE)

Standard 3D-RoPE assigns positions based on spatiotemporal layout, creating ambiguity when multiple visual token groups (ViT semantic, clean VAE, noisy VAE) coexist. MaPE injects modality awareness by applying a modality-specific offset $\Delta_m$ only along the temporal dimension:

$p^{(m)}_{t,h,w} = \hat{p}^{(m)}_{t,h,w} + [\Delta_m, 0, 0] = [\hat{t}^{(m)}_{t,h,w} + \Delta_m, \hat{h}^{(m)}_{t,h,w}, \hat{w}^{(m)}_{t,h,w}]$

This explicitly separates functional roles while preserving spatial layouts and temporal coherence within groups (see Figure 7).
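A minimal sketch of the offset rule, assuming the base positions are plain (t, h, w) grid indices. The offset values in `OFFSETS` and the helper names are hypothetical; the summary does not say how Δ_m is chosen or how the base positions are laid out in the full sequence.

```python
def grid_positions(num_frames, height, width):
    """Base 3D positions (t, h, w) for a num_frames x height x width token grid."""
    return [(t, h, w)
            for t in range(num_frames)
            for h in range(height)
            for w in range(width)]


def mape_positions(base_positions, delta_m):
    """Shift only the temporal index by the modality offset; keep h and w."""
    return [(t + delta_m, h, w) for (t, h, w) in base_positions]


# Hypothetical offsets, chosen only so the three groups do not overlap in time.
OFFSETS = {"vit": 0, "vae_clean": 1000, "vae_noisy": 2000}

base = grid_positions(2, 2, 2)
positions = {name: mape_positions(base, delta) for name, delta in OFFSETS.items()}
```

Each group keeps the same (h, w) layout and the same frame-to-frame spacing, so spatial structure and temporal order inside a group are unchanged; only the temporal range the group occupies differs.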

Empirical Validation / Results

Experimental Setup

  • Base Model: Implemented upon Qwen2.5-VL 3B.
  • Visual Encoders: Qwen2.5-VL ViT (understanding), Wan2.2 3D causal VAE (generation).
  • Training Stages: Detailed hyperparameters are provided in Table 2.

Table 2: Training Hyperparameters of Lance

| Hyperparameter | PT | CT | SFT | RL |
| --- | --- | --- | --- | --- |
| Learning rate | $1.0 \times 10^{-4}$ | $1.0 \times 10^{-4}$ | $2.5 \times 10^{-5}$ | $2.0 \times 10^{-6}$ |
| LR scheduler | Constant | Constant | Cosine | Constant |
| Training steps | 350k | 80k | .5k | 800 |
| # Seen training tokens | 1.5T | 300B | 72B | 0.5B |
| Max context window | 40k | 70k | 70k | 70k |
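To get a feel for the scale of each stage, the token and step counts in Table 2 can be combined into an average number of tokens per optimizer step. The sketch below is back-of-the-envelope arithmetic, assuming the reported token and step counts describe the same stage; the figures it produces are derived here and not reported in the source. SFT is omitted because its step count is incomplete in the table.

```python
def tokens_per_step(total_tokens, steps):
    """Average number of training tokens consumed per optimizer step."""
    return total_tokens / steps


# (seen tokens, training steps) as listed in Table 2.
stages = {
    "PT": (1.5e12, 350_000),
    "CT": (300e9, 80_000),
    "RL": (0.5e9, 800),
}

per_step = {name: tokens_per_step(tokens, steps)
            for name, (tokens, steps) in stages.items()}

for name, value in per_step.items():
    print(f"{name}: about {value / 1e6:.2f}M tokens per step")
```

Under these assumptions, pre-training and continual training both average a few million tokens per step, while the RL stage is roughly an order of magnitude smaller.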

Main Results

1. Image Generation

  • Quantitative (Table 5): On GenEval, Lance achieves an overall score of 0.90, matching the best among unified models, with strong performance on counting, colors, and position. On DPG-Bench, it obtains competitive results (84.67 overall), excelling in relation modeling.
  • Qualitative (Figure 10): Lance generates higher-quality images with better text alignment and aesthetics compared to open-source unified baselines (Bagel, InternVL-U) and is comparable to the 20B Qwen-Image and commercial Nano Banana.

2. Video Generation

  • Quantitative (Table 6): On VBench, Lance achieves the best Total Score (85.11) among unified models, leading in metrics like object grounding, spatial relations, and scene understanding.
  • Qualitative (Figure 11): Generated videos show strong semantic fidelity, coherent motion, and accurate camera transition following.

3. Multimodal Editing

  • Quantitative (Table 7): On GEdit-Bench, Lance achieves the best average score (7.30) among unified models, excelling in categories like background change, material modification, and subject removal.
  • Qualitative (Figure 12): Demonstrates visually coherent image editing and temporally consistent video editing with natural motion dynamics.

4. Multimodal Understanding

  • Quantitative (Table 8): On MVBench, Lance achieves the highest overall score (62.0) among unified models, a ~11.3% relative improvement over the second-best (Show-o2 7B).
  • Qualitative (Figures 3 & 5): Handles diverse tasks including OCR, knowledge QA, multi-image motion analysis, and detailed video captioning.

Theoretical and Practical Implications

Theoretical Implications:

  • Validates that multi-task synergy is a powerful mechanism for enhancing unified multimodal modeling, as joint training on diverse tasks leads to mutual reinforcement and improved performance even on base capabilities like generation.
  • Demonstrates the effectiveness of the unified context + decoupled pathways design principle for balancing the conflicting requirements of understanding and generation.
  • Shows that capable unified models covering the full image-video task space can be built in a resource-efficient manner (3B params, 128 GPUs), challenging the notion that such performance requires massive scaling.

Practical Implications:

  • Lance provides a practical open-source foundation model for a wide range of multimodal applications—from content creation (generation/editing) to visual analysis (understanding)—within a single, lightweight model.
  • The staged training paradigm and architectural innovations (MaPE, dual-experts) offer a blueprint for developing efficient unified multimodal systems.
  • Strong performance on editing and subject-driven generation tasks highlights potential for controllable and customizable content creation tools.

Conclusion

Lance is a lightweight native unified multimodal model that successfully integrates image and video understanding, generation, and editing. Its key innovations—dual-stream mixture-of-experts architecture, Modality-Aware Rotary Positional Encoding (MaPE), and staged multi-task training—enable it to harness cross-task synergy and achieve strong performance across benchmarks. The work demonstrates that broad multi-task learning is crucial for advancing unified multimodal modeling and that efficient, capable unified models are feasible.

Future Directions:

  • Post-training: Developing video-aware reward models for reinforcement learning.
  • Model Scaling: Scaling capacity, expert count, and context length.
  • Broader Modalities: Incorporating audio, speech, 3D, and embodied signals.
  • Streaming Interaction: Enabling real-time multimodal interaction and closed-loop agents.