Lance: Unified Multimodal Modeling by Multi-Task Synergy - Summary
Summary (Overview)
- Core Contribution: Presents Lance, a lightweight (3B activated parameters) native unified multimodal model that supports the full spectrum of image and video tasks—understanding, generation, and editing—within a single framework.
- Key Principles: Built on unified context learning (shared interleaved multimodal sequences) and decoupled capability pathways (specialized processing for understanding vs. generation).
- Architectural Innovation: Employs a dual-stream mixture-of-experts architecture initialized from Qwen2.5-VL, with separate experts for understanding (LLM_UND) and generation (LLM_GEN), and introduces Modality-Aware Rotary Positional Encoding (MaPE) to mitigate interference between heterogeneous visual tokens.
- Training Paradigm: Uses a staged multi-task training strategy (Pre-Training, Continual Training, Supervised Fine-Tuning, Reinforcement Learning) with adaptive data scheduling to harness cross-task synergy and promote transfer.
- Strong Performance: Achieves competitive or state-of-the-art results on image/video generation (GenEval, DPG-Bench, VBench) and understanding (MVBench) benchmarks, outperforming larger open-source unified models while being trained within a resource-efficient 128-GPU budget.
Introduction and Theoretical Foundation
The field of multimodal AI is moving towards native unified models that integrate understanding, reasoning, and generation. However, a fundamental challenge remains: the visual representation requirements for understanding (high-level semantic features aligned with language) and generation (low-level continuous representations preserving texture and dynamics) are inherently misaligned. Existing unified models often struggle to balance these needs or have limited task coverage, largely confined to text-image domains.
Lance is motivated by the observation that models with broader task coverage (see Table 1) are more likely to exhibit emergent generalization on unseen tasks. This suggests multi-task learning is not just capability aggregation but a mechanism for promoting cross-modal and cross-task transfer. Lance's design is grounded in two core principles to address the representational mismatch:
- Unified Context Learning: Enables different tasks to interact within a shared interleaved multimodal sequence.
- Decoupled Capability Pathways: Mitigates interference by allocating dedicated capacity and representations to understanding and generation.
Methodology
Overall Architecture
Lance employs a dual-expert architecture over a shared interleaved multimodal sequence (see Figure 6).
- Input Tokenization:
- Text: Embedded using Qwen2.5-VL's language embedding layer.
- Understanding Visual Inputs: Encoded by the Qwen2.5-VL ViT encoder to produce compact semantic visual tokens.
- Generation Visual Inputs: Encoded by the Wan2.2 3D causal VAE encoder into continuous VAE latent tokens (clean and noisy), projected via an MLP connector.
- Unified Sequence Formulation: Each sample is represented as a single interleaved sequence of text tokens, ViT semantic tokens, and VAE latent tokens.
- Attention: Uses generalized 3D causal attention—causal across segments, bidirectional within visual segments.
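The attention rule above (causal across segments, bidirectional within visual segments) can be illustrated with a small mask builder. This is a minimal sketch, not the paper's implementation: the segment labels (`text`, `vit`, `vae_clean`, `vae_noisy`) and the function name are hypothetical, and token-level causality inside text segments is an assumption based on standard language-model practice.

```python
from typing import List, Tuple

# Each segment is (kind, length). The kind labels are illustrative, not names from the paper.
Segment = Tuple[str, int]

# Assumed labels for segments that hold visual tokens.
VISUAL_KINDS = {"vit", "vae_clean", "vae_noisy"}


def build_attention_mask(segments: List[Segment]) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k."""
    # Record, for every token, its segment index and that segment's kind.
    seg_index: List[int] = []
    seg_kind: List[str] = []
    for idx, (kind, length) in enumerate(segments):
        seg_index.extend([idx] * length)
        seg_kind.extend([kind] * length)

    n = len(seg_index)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if seg_index[k] < seg_index[q]:
                # Tokens in earlier segments are always visible (causal across segments).
                mask[q][k] = True
            elif seg_index[k] == seg_index[q]:
                if seg_kind[q] in VISUAL_KINDS:
                    # Full bidirectional attention inside a visual segment.
                    mask[q][k] = True
                else:
                    # Assumption: text stays token-level causal inside its segment.
                    mask[q][k] = k <= q
            # Tokens in later segments remain masked out.
    return mask


# Example layout: 2 text tokens, 3 ViT tokens, 1 text token, 2 noisy VAE tokens.
segments = [("text", 2), ("vit", 3), ("text", 1), ("vae_noisy", 2)]
mask = build_attention_mask(segments)
```

In this layout the three ViT tokens see each other in both directions, while no token can see a segment that comes after its own.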
Decoupled Pathways and Objectives
- Understanding Expert (LLM_UND): Processes text and semantic visual tokens. Its hidden states are mapped by an LM head and optimized with a next-token prediction loss.
- Generation Expert (LLM_GEN): Processes VAE latent tokens. Its hidden states are projected and passed to a flow prediction head, which is optimized with a flow-based objective on the noisy latents.
- Overall Objective: The understanding and generation losses are optimized jointly.
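To make the two training signals concrete, the sketch below pairs a next-token cross-entropy with a flow-matching loss in their common textbook forms. Everything specific here is an assumption rather than the paper's formulation: the linear interpolation path `x_t = (1 - t) * noise + t * clean`, the velocity target `clean - noise`, the function names, and the weighting factor `lam` in the combined loss.

```python
import math
from typing import Callable, List


def next_token_loss(logits: List[List[float]], targets: List[int]) -> float:
    """Mean cross-entropy; logits[i] scores the vocabulary for target token targets[i]."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
    return total / len(targets)


def flow_matching_loss(
    predict_velocity: Callable[[List[float], float], List[float]],
    clean: List[float],
    noise: List[float],
    t: float,
) -> float:
    """Mean squared error between the predicted and the target velocity."""
    # Assumed linear path between noise (t = 0) and the clean latent (t = 1).
    noisy = [(1.0 - t) * e + t * x for x, e in zip(clean, noise)]
    # Under that path, the target velocity is constant in t.
    target = [x - e for x, e in zip(clean, noise)]
    pred = predict_velocity(noisy, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(clean)


def total_loss(ntp: float, flow: float, lam: float = 1.0) -> float:
    """Hypothetical weighted sum; the actual weighting is not given in this summary."""
    return ntp + lam * flow


# Toy usage: uniform logits over a 4-token vocabulary and a zero-velocity predictor.
ntp = next_token_loss([[0.0] * 4, [0.0] * 4], [1, 2])
flow = flow_matching_loss(lambda x, t: [0.0] * len(x), [1.0, 2.0], [0.0, 0.0], 0.5)
loss = total_loss(ntp, flow)
```

The split mirrors the decoupled pathways: the understanding expert only receives the token-level loss, and the generation expert only receives the velocity regression.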
Modality-Aware Rotary Positional Encoding (MaPE)
Standard 3D-RoPE assigns positions based on spatiotemporal layout, which creates ambiguity when multiple visual token groups (ViT semantic, clean VAE, noisy VAE) coexist in one sequence. MaPE injects modality awareness by applying a modality-specific offset only along the temporal dimension. This explicitly separates the functional roles of the groups while preserving spatial layouts and temporal coherence within each group (see Figure 7).
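The idea of a temporal-only, modality-specific offset can be sketched as follows. This is an illustration under stated assumptions, not the paper's definition: the offset values, the modality labels, and the function names are hypothetical, and the sketch ignores where each group sits in the full interleaved sequence.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int, int]  # (t, h, w) indices fed to 3D-RoPE

# Hypothetical offsets; the summary does not state the actual values.
MODALITY_OFFSET: Dict[str, int] = {"vit": 0, "vae_clean": 1000, "vae_noisy": 2000}


def grid_positions(frames: int, height: int, width: int) -> List[Position]:
    """Plain spatiotemporal positions, identical for every group with the same layout."""
    return [(t, h, w) for t in range(frames) for h in range(height) for w in range(width)]


def mape_positions(modality: str, frames: int, height: int, width: int) -> List[Position]:
    """Shift only the temporal index by a per-modality offset; h and w are untouched."""
    offset = MODALITY_OFFSET[modality]
    return [(t + offset, h, w) for (t, h, w) in grid_positions(frames, height, width)]


# Two groups with the same 2x1x2 layout no longer share any position triple.
clean = mape_positions("vae_clean", 2, 1, 2)
noisy = mape_positions("vae_noisy", 2, 1, 2)
```

Without the offset, `clean` and `noisy` would receive identical position triples. With it, the groups are separated along `t` while relative distances inside each group stay the same.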
Empirical Validation / Results
Experimental Setup
- Base Model: Implemented upon Qwen2.5-VL 3B.
- Visual Encoders: Qwen2.5-VL ViT (understanding), Wan2.2 3D causal VAE (generation).
- Training Stages: Detailed hyperparameters are provided in Table 2.
Table 2: Training Hyperparameters of Lance
| Hyperparameter | PT | CT | SFT | RL |
|---|---|---|---|---|
| Learning rate | ||||
| LR scheduler | Constant | Constant | Cosine | Constant |
| Training steps | 350k | 80k | .5k | 800 |
| # Seen training tokens | 1.5T | 300B | 72B | 0.5B |
| Max context window | 40k | 70k | 70k | 70k |
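
As a quick sanity check on Table 2, the average number of tokens consumed per training step follows from dividing seen tokens by training steps. The snippet below does this for PT, CT, and RL. SFT is left out because its step count is incomplete in the table above. The dictionary and function names are illustrative.

```python
# (training steps, seen tokens) copied from Table 2.
STAGES = {
    "PT": (350_000, 1.5e12),
    "CT": (80_000, 300e9),
    "RL": (800, 0.5e9),
}


def tokens_per_step(steps: int, tokens: float) -> float:
    """Average tokens processed per training step."""
    return tokens / steps


per_step = {name: tokens_per_step(steps, tokens) for name, (steps, tokens) in STAGES.items()}

for name, value in per_step.items():
    print(f"{name}: {value / 1e6:.2f}M tokens per step")
```

This works out to roughly 4.29M tokens per step for PT, 3.75M for CT, and 0.63M for RL.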
Main Results
1. Image Generation
- Quantitative (Table 5): On GenEval, Lance achieves an overall score of 0.90, matching the best among unified models, with strong performance on counting, colors, and position. On DPG-Bench, it obtains competitive results (84.67 overall), excelling in relation modeling.
- Qualitative (Figure 10): Lance generates higher-quality images with better text alignment and aesthetics compared to open-source unified baselines (Bagel, InternVL-U) and is comparable to the 20B Qwen-Image and commercial Nano Banana.
2. Video Generation
- Quantitative (Table 6): On VBench, Lance achieves the best Total Score (85.11) among unified models, leading in metrics like object grounding, spatial relations, and scene understanding.
- Qualitative (Figure 11): Generated videos show strong semantic fidelity, coherent motion, and accurate camera transition following.
3. Multimodal Editing
- Quantitative (Table 7): On GEdit-Bench, Lance achieves the best average score (7.30) among unified models, excelling in categories like background change, material modification, and subject removal.
- Qualitative (Figure 12): Demonstrates visually coherent image editing and temporally consistent video editing with natural motion dynamics.
4. Multimodal Understanding
- Quantitative (Table 8): On MVBench, Lance achieves the highest overall score (62.0) among unified models, a ~11.3% relative improvement over the second-best (Show-o2 7B).
- Qualitative (Figures 3 & 5): Handles diverse tasks including OCR, knowledge QA, multi-image motion analysis, and detailed video captioning.
Theoretical and Practical Implications
Theoretical Implications:
- Validates that multi-task synergy is a powerful mechanism for enhancing unified multimodal modeling, as joint training on diverse tasks leads to mutual reinforcement and improved performance even on base capabilities like generation.
- Demonstrates the effectiveness of the unified context + decoupled pathways design principle for balancing the conflicting requirements of understanding and generation.
- Shows that capable unified models covering the full image-video task space can be built in a resource-efficient manner (3B params, 128 GPUs), challenging the notion that such performance requires massive scaling.
Practical Implications:
- Lance provides a practical open-source foundation model for a wide range of multimodal applications—from content creation (generation/editing) to visual analysis (understanding)—within a single, lightweight model.
- The staged training paradigm and architectural innovations (MaPE, dual-experts) offer a blueprint for developing efficient unified multimodal systems.
- Strong performance on editing and subject-driven generation tasks highlights potential for controllable and customizable content creation tools.
Conclusion
Lance is a lightweight native unified multimodal model that successfully integrates image and video understanding, generation, and editing. Its key innovations—dual-stream mixture-of-experts architecture, Modality-Aware Rotary Positional Encoding (MaPE), and staged multi-task training—enable it to harness cross-task synergy and achieve strong performance across benchmarks. The work demonstrates that broad multi-task learning is crucial for advancing unified multimodal modeling and that efficient, capable unified models are feasible.
Future Directions:
- Post-training: Developing video-aware reward models for reinforcement learning.
- Model Scaling: Scaling capacity, expert count, and context length.
- Broader Modalities: Incorporating audio, speech, 3D, and embodied signals.
- Streaming Interaction: Enabling real-time multimodal interaction and closed-loop agents.