Lance: Unified Multimodal Modeling by Multi-Task Synergy - Summary

Summary (Overview)

Core Contribution: Presents Lance, a lightweight (3B activated parameters) native unified multimodal model that supports the full spectrum of image and video tasks—understanding, generation, and editing—within a single framework.
Key Principles: Built on unified context learning (shared interleaved multimodal sequences) and decoupled capability pathways (specialized processing for understanding vs. generation).
Architectural Innovation: Employs a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL, with separate experts for understanding (LLM_UND) and generation (LLM_GEN), and introduces Modality-Aware Rotary Positional Encoding (MaPE) to mitigate interference between heterogeneous visual tokens.
Training Paradigm: Uses a staged multi-task training strategy (Pre-Training, Continual Training, Supervised Fine-Tuning, Reinforcement Learning) with adaptive data scheduling to harness cross-task synergy and promote transfer.
Strong Performance: Achieves competitive or state-of-the-art results on image/video generation (GenEval, DPG-Bench, VBench) and understanding (MVBench) benchmarks and outperforms larger open-source unified models, while training within a 128-GPU budget.

Introduction and Theoretical Foundation

The field of multimodal AI is moving towards native unified models that integrate understanding, reasoning, and generation. However, a fundamental challenge remains: the visual representation requirements for understanding (high-level semantic features aligned with language) and generation (low-level continuous representations preserving texture and dynamics) are inherently misaligned. Existing unified models often struggle to balance these needs or have limited task coverage, largely confined to text-image domains.

Lance is motivated by the observation that models with broader task coverage (see Table 1) are more likely to exhibit emergent generalization on unseen tasks. This suggests multi-task learning is not just capability aggregation but a mechanism for promoting cross-modal and cross-task transfer. Lance's design is grounded in two core principles to address the representational mismatch:

Unified Context Learning: Enables different tasks to interact within a shared interleaved multimodal sequence.
Decoupled Capability Pathways: Mitigates interference by allocating dedicated capacity and representations to understanding and generation.

Methodology

Overall Architecture

Lance employs a dual-expert architecture over a shared interleaved multimodal sequence (see Figure 6).

Input Tokenization:
- Text: Embedded using Qwen2.5-VL's language embedding layer.
- Understanding Visual Inputs: Encoded by the Qwen2.5-VL ViT encoder to produce compact semantic visual tokens.
- Generation Visual Inputs: Encoded by the Wan2.2 3D causal VAE encoder into continuous VAE latent tokens (clean and noisy), projected via an MLP connector.
Unified Sequence Formulation: Each sample is represented as a sequence $S = \cdots \oplus B_{\text{text}}(T) \oplus B_{\text{vis}}(V_{\text{vit}}) \oplus B_{\text{vis}}(V_{\text{vae}}^{\text{clean}}) \oplus B_{\text{vis}}(V_{\text{vae}}^{\text{noisy}}) \oplus B_{\text{text}}(T') \oplus \cdots$, where $\oplus$ denotes concatenation, $B_{\text{text}}(T) = [\text{BOT}, T, \text{EOT}]$, and $B_{\text{vis}}(V) = [\text{BOV}, V, \text{EOV}]$.
Attention: Uses generalized 3D causal attention, which is causal across segments and bidirectional within each visual segment.
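The attention rule above can be illustrated with a small mask builder. This is a minimal sketch, not the paper's implementation: it assumes segments are contiguous and ordered along the sequence, treats text segments as token-level causal, and ignores any special handling of the BOV/EOV delimiters or of clean versus noisy VAE groups. The function name `build_attention_mask` and its arguments are hypothetical.

```python
def build_attention_mask(segment_ids, is_visual):
    """Return an N x N boolean mask; mask[i][j] is True if query i may attend to key j.

    segment_ids: per-token segment index, non-decreasing along the sequence (assumed).
    is_visual:   per-token flag, True for tokens inside a visual segment.
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Causal part: every token sees itself and all earlier tokens,
            # which covers all earlier segments.
            causal = j <= i
            # Bidirectional part: tokens in the same visual segment see each other.
            same_visual_segment = (
                segment_ids[i] == segment_ids[j] and is_visual[i] and is_visual[j]
            )
            mask[i][j] = causal or same_visual_segment
    return mask


# Toy sequence: 2 text tokens, 3 visual tokens, 2 text tokens.
segment_ids = [0, 0, 1, 1, 1, 2, 2]
is_visual = [False, False, True, True, True, False, False]
for row in build_attention_mask(segment_ids, is_visual):
    print("".join("1" if allowed else "0" for allowed in row))
# 1000000
# 1100000
# 1111100
# 1111100
# 1111100
# 1111110
# 1111111
```

In the printed mask, rows 2 to 4 (the visual segment) share the same pattern because each visual token can see the whole segment, while text rows remain lower-triangular.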

Decoupled Pathways and Objectives

Understanding Expert (LLM_UND): Processes text and semantic visual tokens. Its hidden states are mapped by an LM head and optimized with next-token prediction loss: $\mathcal{L}_{\text{UND}} = -\sum_i \log p_{\theta_{\text{UND}}}(y_i | y_{<i}, S)$
Generation Expert (LLM_GEN): Processes VAE latent tokens. Its hidden states are projected and passed to a flow prediction head, optimized with: $\mathcal{L}_{\text{GEN}} = \mathbb{E}_{x_0, x_1, t} \left[ \| v_{\theta_{\text{GEN}}}(x_t, S, t) - (x_1 - x_0) \|_2^2 \right]$, where $x_t = t x_1 + (1-t)x_0$ and $t \sim \mathcal{U}(0,1)$.
Overall Objective: $\mathcal{L} = \lambda_u \mathcal{L}_{\text{UND}} + \lambda_g \mathcal{L}_{\text{GEN}}$
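The three objectives above can be written out numerically as follows. This is a hedged sketch in plain Python for a single sample: `predict_velocity` is a hypothetical stand-in for the generation expert plus flow head (the context $S$ is omitted), the expectation is replaced by one sampled $(x_0, x_1, t)$ triple, and the default weights `lambda_u = lambda_g = 1.0` are placeholders, since the summary does not give their values.

```python
import math


def ntp_loss(logits, targets):
    """L_UND: sum over positions of -log softmax(logits[i])[targets[i]]."""
    total = 0.0
    for row, y in zip(logits, targets):
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[y]
    return total


def flow_matching_loss(predict_velocity, x0, x1, t):
    """L_GEN for one sample: squared L2 error against the target velocity x1 - x0."""
    # Linear interpolation x_t = t * x1 + (1 - t) * x0, as in the formula above.
    x_t = [t * b + (1.0 - t) * a for a, b in zip(x0, x1)]
    v = predict_velocity(x_t, t)
    return sum((vi - (b - a)) ** 2 for vi, a, b in zip(v, x0, x1))


def total_loss(l_und, l_gen, lambda_u=1.0, lambda_g=1.0):
    """L = lambda_u * L_UND + lambda_g * L_GEN (weights are placeholders)."""
    return lambda_u * l_und + lambda_g * l_gen


# Toy check: a predictor that returns the exact target velocity has zero loss.
x0, x1 = [0.0, 1.0], [2.0, -1.0]
oracle = lambda x_t, t: [b - a for a, b in zip(x0, x1)]
l_gen = flow_matching_loss(oracle, x0, x1, t=0.3)
l_und = ntp_loss([[0.0, 0.0]], [0])  # uniform over 2 classes, so the loss is log 2
print(l_gen, l_und, total_loss(l_und, l_gen))
```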

Modality-Aware Rotary Positional Encoding (MaPE)

Standard 3D-RoPE assigns positions based on spatiotemporal layout, creating ambiguity when multiple visual token groups (ViT semantic, clean VAE, noisy VAE) coexist. MaPE injects modality awareness by applying a modality-specific offset $\Delta_m$ only along the temporal dimension:

$p^{(m)}_{t,h,w} = \hat{p}^{(m)}_{t,h,w} + [\Delta_m, 0, 0] = [\hat{t}^{(m)}_{t,h,w} + \Delta_m, \hat{h}^{(m)}_{t,h,w}, \hat{w}^{(m)}_{t,h,w}]$

This explicitly separates functional roles while preserving spatial layouts and temporal coherence within groups (see Figure 7).
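A small sketch of how such position indices could be assigned is shown below. It assumes the base position $\hat{p}$ is the plain (frame, row, column) grid index of each token; the function name `mape_positions` and the offset values in the example are hypothetical and not taken from the paper.

```python
def mape_positions(num_frames, height, width, delta_m, t_start=0):
    """Return (t, h, w) positions for one visual token group.

    The base 3D grid index is kept for h and w; only the temporal
    coordinate is shifted by the modality-specific offset delta_m.
    """
    positions = []
    for t in range(num_frames):
        for h in range(height):
            for w in range(width):
                positions.append((t_start + t + delta_m, h, w))
    return positions


# Hypothetical offsets for the three visual token groups.
offsets = {"vit": 0, "vae_clean": 1000, "vae_noisy": 2000}
groups = {name: mape_positions(2, 2, 2, delta) for name, delta in offsets.items()}

# Spatial coordinates agree across groups; only the temporal coordinate differs.
for name, pos in groups.items():
    print(name, pos[0], pos[-1])
```

Because only the first coordinate moves, relative positions within a group are unchanged, which matches the stated goal of preserving spatial layout and temporal coherence inside each group.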

Empirical Validation / Results

Experimental Setup

Base Model: Implemented upon Qwen2.5-VL 3B.
Visual Encoders: Qwen2.5-VL ViT (understanding), Wan2.2 3D causal VAE (generation).
Training Stages: Detailed hyperparameters are provided in Table 2.

Table 2: Training Hyperparameters of Lance

Hyperparameter	PT	CT	SFT	RL
Learning rate	$1.0 \times 10^{-4}$	$1.0 \times 10^{-4}$	$2.5 \times 10^{-5}$	$2.0 \times 10^{-6}$
LR scheduler	Constant	Constant	Cosine	Constant
Training steps	350k	80k	.5k	800
# Seen training tokens	1.5T	300B	72B	0.5B
Max context window	40k	70k	70k	70k
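As a quick reading aid for Table 2, the average number of tokens processed per optimization step can be derived from the token and step counts. The sketch below assumes decimal units (k = 10^3, B = 10^9, T = 10^12) and that both rows describe the same stage; SFT is left out because its step count is not legible in the table as given.

```python
# (seen training tokens, training steps) per stage, read from Table 2.
STAGES = {
    "PT": (1.5e12, 350_000),
    "CT": (300e9, 80_000),
    "RL": (0.5e9, 800),
}


def tokens_per_step(tokens, steps):
    """Average tokens consumed per training step."""
    return tokens / steps


avg = {name: tokens_per_step(tokens, steps) for name, (tokens, steps) in STAGES.items()}
for name, value in avg.items():
    print(f"{name}: {value / 1e6:.2f}M tokens per step")
# PT: 4.29M tokens per step
# CT: 3.75M tokens per step
# RL: 0.62M tokens per step
```

These are averages implied by the table, not batch sizes reported by the authors.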

Main Results

1. Image Generation

Quantitative (Table 5): On GenEval, Lance achieves an overall score of 0.90, matching the best among unified models, with strong performance on counting, colors, and position. On DPG-Bench, it obtains competitive results (84.67 overall), excelling in relation modeling.
Qualitative (Figure 10): Lance generates higher-quality images with better text alignment and aesthetics compared to open-source unified baselines (Bagel, InternVL-U) and is comparable to the 20B Qwen-Image and commercial Nano Banana.

2. Video Generation

Quantitative (Table 6): On VBench, Lance achieves the best Total Score (85.11) among unified models, leading in metrics like object grounding, spatial relations, and scene understanding.
Qualitative (Figure 11): Generated videos show strong semantic fidelity, coherent motion, and accurate camera transition following.

3. Multimodal Editing

Quantitative (Table 7): On GEdit-Bench, Lance achieves the best average score (7.30) among unified models, excelling in categories like background change, material modification, and subject removal.
Qualitative (Figure 12): Demonstrates visually coherent image editing and temporally consistent video editing with natural motion dynamics.

4. Multimodal Understanding

Quantitative (Table 8): On MVBench, Lance achieves the highest overall score (62.0) among unified models, a ~11.3% relative improvement over the second-best (Show-o2 7B).
Qualitative (Figures 3 & 5): Handles diverse tasks including OCR, knowledge QA, multi-image motion analysis, and detailed video captioning.

Theoretical and Practical Implications

Theoretical Implications:

Validates that multi-task synergy is a powerful mechanism for enhancing unified multimodal modeling, as joint training on diverse tasks leads to mutual reinforcement and improved performance even on base capabilities like generation.
Demonstrates the effectiveness of the unified context + decoupled pathways design principle for balancing the conflicting requirements of understanding and generation.
Shows that capable unified models covering the full image-video task space can be built in a resource-efficient manner (3B params, 128 GPUs), challenging the notion that such performance requires massive scaling.

Practical Implications:

Lance provides a practical open-source foundation model for a wide range of multimodal applications—from content creation (generation/editing) to visual analysis (understanding)—within a single, lightweight model.
The staged training paradigm and architectural innovations (MaPE, dual-experts) offer a blueprint for developing efficient unified multimodal systems.
Strong performance on editing and subject-driven generation tasks highlights potential for controllable and customizable content creation tools.

Conclusion

Lance is a lightweight native unified multimodal model that successfully integrates image and video understanding, generation, and editing. Its key innovations—dual-stream mixture-of-experts architecture, Modality-Aware Rotary Positional Encoding (MaPE), and staged multi-task training—enable it to harness cross-task synergy and achieve strong performance across benchmarks. The work demonstrates that broad multi-task learning is crucial for advancing unified multimodal modeling and that efficient, capable unified models are feasible.

Future Directions:

Post-training: Developing video-aware reward models for reinforcement learning.
Model Scaling: Scaling capacity, expert count, and context length.
Broader Modalities: Incorporating audio, speech, 3D, and embodied signals.
Streaming Interaction: Enabling real-time multimodal interaction and closed-loop agents.