NMM Roadmap: Toward Native Multimodal Modeling

Summary (Overview)

Formalizes Native Multimodal Modeling (NMM): The paper establishes a formal taxonomy for NMM, distinguishing between mid-fusion (integrated but modality-aware) and early-fusion (born-native, unified transformer) architectures, and categorizing models by input-output duality into Multi-to-Text (M2T), Multi-to-Target (M2G), and Multi-to-Multi (M2M).
Provides a Comprehensive Industrial Roadmap: It delivers a systematic, end-to-end analysis of the NMM lifecycle, covering architectural coordination (§3), massive data curation (§4), full-stack training recipes (§5), inference & deployment challenges (§6), and holistic evaluation (§7).
Identifies Core Technical Challenges and Solutions: For each model category (M2T, M2G, M2M), the paper unpacks key bottlenecks (e.g., token explosion, semantic-acoustic conflict, modality competition) and summarizes the corresponding technical approaches from state-of-the-art models.
Compiles Extensive Empirical Data: It presents a comprehensive comparison table of recent NMMs (Table 1), a hierarchical taxonomy of technical challenges (Figure 4), and a detailed categorization of training datasets (Table 2).
Outlines Future Directions: The paper concludes with an outlook on the architectural convergence, data generation, training recipes, inference system co-design, and evaluation protocols needed to advance toward truly native world models.

Introduction and Theoretical Foundation

Multimodal modeling is a pivotal step from modality-agnostic large language models (LLMs) toward holistic world models that can perceive and interact with the real world through rich sensory signals. Early approaches relied on late-fusion paradigms (e.g., LLaVA), which modularly assemble pre-trained encoders with frozen language backbones via shallow projectors. This architecture suffers from a fundamental blindness to raw sensory signals and limits deep cross-modal interaction.

In response, the field is shifting toward Native Multimodal Modeling (NMM), where multiple modalities are intrinsically integrated into the core architecture for superior synergy and performance. However, the design space for NMM remains fragmented. This paper provides a formalized roadmap to clarify this transition.

Theoretical Formalization: The paper formally defines architectural nativity, excluding late-fusion from the NMM scope. Let the input modality set be $M = \{ m_1, m_2, \ldots, m_n \}$, with $E_i$ as modality-specific encoders, $P_i$ as projection layers, and $T$ as a unified tokenization operator.

Late-Fusion (Non-Native): $F_{\text{late}} = G_{\text{LLM}} \left( \{P_i(E_i(m_i)) \}_{i=1}^n \right)$, where the backbone remains blind to raw signals and relies on a grafted output head $G$.
Mid-Fusion (First Native Stage): $F_{\text{mid}} = \text{Backbone}(C(E_1(m_1), \ldots, E_n(m_n)))$, where $C$ is a cross-modal alignment operator. Encoder features are injected into a joint multimodal backbone, so the backbone is no longer blind to non-text signals, but the model remains modality-aware because each modality still passes through its own encoder $E_i$.
Early-Fusion (Optimal Native Synergy): $F_{\text{early}} = \text{Transformer}(\bigcup_i T(m_i))$. All modalities are mapped by a unified operator $T$ into a single, shared embedding space from the outset, achieving a born-native, deep synergy.
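
The structural difference between the three formulas can be made concrete with a small symbolic sketch. It only renders each composition for a given list of modalities and counts the modality-specific operators ($E_i$, $P_i$) each regime needs. The function names are illustrative assumptions, not part of the paper or of any model's API.

```python
from typing import List


def late_fusion(modalities: List[str]) -> str:
    """F_late: per-modality encoder E_i and projector P_i, then G_LLM."""
    parts = [f"P_{m}(E_{m}({m}))" for m in modalities]
    return f"G_LLM({', '.join(parts)})"


def mid_fusion(modalities: List[str]) -> str:
    """F_mid: per-modality encoders E_i, merged by the alignment operator C."""
    parts = [f"E_{m}({m})" for m in modalities]
    return f"Backbone(C({', '.join(parts)}))"


def early_fusion(modalities: List[str]) -> str:
    """F_early: one shared tokenizer T for every modality, no E_i or P_i."""
    parts = [f"T({m})" for m in modalities]
    return f"Transformer({' U '.join(parts)})"


def modality_specific_operators(regime: str, n_modalities: int) -> int:
    """Count the per-modality operators (E_i, P_i) that appear in each formula."""
    per_modality = {"late": 2, "mid": 1, "early": 0}[regime]
    return per_modality * n_modalities


if __name__ == "__main__":
    mods = ["text", "image", "audio"]
    builders = [("late", late_fusion), ("mid", mid_fusion), ("early", early_fusion)]
    for name, build in builders:
        count = modality_specific_operators(name, len(mods))
        print(f"{name:5s} {build(mods)}  [{count} modality-specific operators]")
```

Read this way, "modality-aware" corresponds to a nonzero count: mid-fusion drops the projectors but keeps the encoders, and only early-fusion reaches zero because $T$ is shared.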

Furthermore, NMM systems are categorized by input-output duality into three functional paradigms:

Multi-to-Text (M2T): $F_{M2T}: M \rightarrow T$, where $T \in M$ is the text modality (not the tokenization operator $T$ defined above). Asymmetric comprehension.
Multi-to-Target (M2G): $F_{M2G}: M \rightarrow y_k$, where $y_k \in M$ is a single target non-textual modality (e.g., video). Asymmetric generation.
Multi-to-Multi (M2M): $F_{M2M}: M_{\text{in}} \rightarrow M_{\text{out}}$, where both input and output can contain arbitrary combinations of modalities. Symmetric modeling where understanding and generation coexist.
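
As a rough illustration, the three definitions reduce to a rule on the output modality set. The helper below is a simplified sketch, not the paper's procedure: `classify_paradigm` is a hypothetical name, and treating any multi-modality output as M2M is an assumption made here for brevity.

```python
from typing import Iterable


def classify_paradigm(outputs: Iterable[str]) -> str:
    """Map a model's output modalities to M2T, M2G, or M2M."""
    out = {o.strip().lower() for o in outputs}
    if not out:
        raise ValueError("a model must emit at least one modality")
    if out == {"text"}:
        return "M2T"  # text-only output: asymmetric comprehension
    if len(out) == 1:
        return "M2G"  # one non-text target modality: asymmetric generation
    return "M2M"      # several output modalities: symmetric modeling


if __name__ == "__main__":
    # Output modalities as listed for three models in Table 1.
    examples = {
        "Qwen3-VL": ["Text"],
        "HunyuanVideo-1.5": ["Video"],
        "Lance": ["Text", "Image", "Video"],
    }
    for model, outs in examples.items():
        print(model, "->", classify_paradigm(outs))
```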

Methodology

The paper's methodology is a comprehensive survey and analysis based on a formal taxonomy. It systematically investigates the full lifecycle of NMM development:

Model Architecture Analysis (§3): Examines the three paradigms (M2T, M2G, M2M) through the lens of their core technical challenges and the solutions implemented by representative models (see Table 1 and Figure 4).
Data Curation Taxonomy (§4): Organizes the heterogeneous training data for NMM systems into four functional categories: Understanding-Oriented, Generation-Oriented, Interaction-Oriented, and Preference & Alignment Data, detailing representative datasets for each (see Table 2).
Training Strategy Decomposition (§5): Analyzes how training strategies (Pre-Training, Supervised Fine-Tuning, Reinforcement Learning, On-Policy Distillation) are intrinsically coupled with the fusion regime (late, mid, early), tracing the evolution of techniques like differential learning rates and modality-mixture scheduling.
Inference & Deployment Challenges (§6): Identifies and discusses solutions for key serving bottlenecks: sequence explosion in long-context inference, dual challenges of heterogeneity and scale, and real-time streaming/full-duplex deployment.
Evaluation Benchmark Consolidation (§7): Summarizes major evaluation benchmarks for image, audio, and video modalities, covering both understanding and generation tasks (see Table 3).

Empirical Validation / Results

The paper's empirical validation is presented through comprehensive comparisons, taxonomies, and analyses of state-of-the-art models and their performance characteristics.

Key Comparative Table:

Model Category	Model Name	Date	Params (Flagship)	Input Modalities	Output Modalities
Multi-to-Text (M2T)	MiniCPM-V-4.6 [16]	2026.05	1B	Text, Image, Video	Text
	Nemotron3-Nano-Omni [17]	2026.04	30B/3B	Text, Image, Audio, Video	Text
	Kimi K2.5 [21]	2026.01	1T/32B	Text, Image, Video	Text
	Qwen3-VL [5]	2025.09	235B/22B	Text, Image, Video	Text
Multi-to-Target (M2G)	HiDream-O1-Image [28]	2026.05	8B	Text, Image	Image
	OmniVoice [29]	2026.04	0.8B	Text, Image, Audio	Audio
	Kling-Omni [33]	2025.12	-	Text, Image, Video	Video
	HunyuanVideo-1.5 [34]	2025.12	8.3B	Text	Video
Multi-to-Multi (M2M)	Lance [38]	2026.05	3B	Text, Image, Video	Text, Image, Video
	TUNA-2 [40]	2026.04	7B	Text, Image	Text, Image
	Emu3.5* [12]	2025.10	34.1B	Text, Image, Video	Text, Image, Video
	Transfusion [49]	2024.08	7B	Text, Image	Text, Image

Table 1 (excerpt). Comprehensive comparison of recently released Native Multimodal Models. * indicates models employing the discrete unified scheme.

Technical Challenge Taxonomy: The paper extracts and characterizes core bottlenecks and solutions across architectural designs, as summarized in Figure 4. For example:

M2T - Video Comprehension: Challenge: Computational Explosion. Solution: Compression & Feature Aggregation (Kimi K2.5, GLM-5V-Turbo).
M2G - Video Generation: Challenge: Token Explosion. Solution: Extreme Spatiotemporal VAE Compression (LTX-2.3, Wan2.2).
M2M - Fully Discretized Unified: Challenge: Competition-Driven Latency. Solution: Architectural stabilizers like QK-Norm (Chameleon) and Discrete Diffusion Adaptation (Emu3.5).
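
The scale of the token-explosion problem for video is easy to see with back-of-the-envelope arithmetic. The sketch below counts tokens for a clip before and after compression. The patch size, strides, and clip dimensions are hypothetical round numbers chosen for illustration, not the settings of LTX-2.3, Wan2.2, or any other model named above.

```python
import math


def video_token_count(
    frames: int,
    height: int,
    width: int,
    patch: int = 16,
    temporal_stride: int = 1,
    spatial_merge: int = 1,
) -> int:
    """Tokens for a clip: (frames / t-stride) x (H / cell) x (W / cell)."""
    cell = patch * spatial_merge  # pixels covered by one token along each axis
    t = math.ceil(frames / temporal_stride)
    h = math.ceil(height / cell)
    w = math.ceil(width / cell)
    return t * h * w


if __name__ == "__main__":
    # Hypothetical clip: 10 s at 24 fps, 768 x 1280 pixels.
    raw = video_token_count(frames=240, height=768, width=1280)
    # Hypothetical compression: 4x temporal stride, 2x2 spatial merge.
    compressed = video_token_count(
        frames=240, height=768, width=1280, temporal_stride=4, spatial_merge=2
    )
    print(raw)               # 921600
    print(compressed)        # 57600
    print(raw // compressed) # 16
```

Because self-attention cost grows quadratically with sequence length, a 16x reduction in tokens under these assumed settings would cut attention cost far more than 16x, which is why compression is treated as a first-order design decision.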

Training Regime Analysis: The analysis shows that each fusion regime imposes a distinct training signature. A critical finding is that early-fusion PT requires mandatory stabilizers (e.g., z-loss, QK-Norm) that are preconditions for scaling, not mere optimizations. For instance, Chameleon's ablations show that without QK-Norm, the model diverges after ~20% of training. The z-loss regularization is expressed as:

$10^{-5} \cdot \log^2 Z$

where $Z = \sum_i e^{x_i}$ is the softmax partition function over the logits $x_i$. The penalty keeps logits bounded across the heterogeneous token distribution.
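
Both stabilizers can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Chameleon's implementation: `layer_norm` omits the learned gain and bias, the vectors are toy values, and all function names are hypothetical. The z-loss part shows why the penalty is needed (softmax is invariant to a constant shift of the logits, so cross-entropy alone cannot stop them from drifting), and the QK-Norm part shows how normalizing queries and keys bounds the attention logit.

```python
import math
from typing import List


def log_partition(logits: List[float]) -> float:
    """log Z = log sum_i exp(x_i), computed stably by subtracting the max logit."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def z_loss(logits: List[float], coeff: float = 1e-5) -> float:
    """Auxiliary penalty coeff * (log Z)^2 added to the training loss."""
    return coeff * log_partition(logits) ** 2


def layer_norm(v: List[float], eps: float = 1e-6) -> List[float]:
    """Layer normalization without learned gain or bias (a simplification)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def attention_logit(q: List[float], k: List[float], use_qk_norm: bool) -> float:
    """Scaled dot-product logit, optionally normalizing q and k first."""
    if use_qk_norm:
        q, k = layer_norm(q), layer_norm(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0, 0.0]
    shifted = [x + 10.0 for x in base]  # same softmax, larger log Z
    print(z_loss(base), z_loss(shifted))

    q = [100.0, -200.0, 300.0, 50.0]
    k = [400.0, -100.0, 250.0, -75.0]
    print(attention_logit(q, k, use_qk_norm=False))  # unbounded, grows with norms
    print(attention_logit(q, k, use_qk_norm=True))   # magnitude at most sqrt(d)
```

Without a learned gain, each normalized vector has norm at most $\sqrt{d}$, so the scaled logit is bounded by $\sqrt{d}$ in magnitude. A learned gain, as used in practice, would loosen this bound.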

Theoretical and Practical Implications

Theoretical Implications:

Formalizes a Fragmented Field: The paper provides the first principled structural taxonomy for NMM based on integration depth (mid/early-fusion) and input-output duality (M2T/M2G/M2M), offering a common language and framework for evaluating architectural nativity.
Reveals Intrinsic Training-Architecture Coupling: It demonstrates that training strategies are not independent of architecture. Key dimensions like differential learning rates (mid-fusion) or modality-mixture scheduling (early-fusion) are architectural necessities, not stylistic choices.
Highlights the Path to World Models: By defining the pinnacle as symmetric M2M modeling within a unified transformer space, the roadmap charts a clear evolutionary trajectory from modular assembly toward native world models in which "understanding and generation seamlessly coexist."
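
To illustrate what differential learning rates mean mechanically, the sketch below applies one plain SGD step in which each parameter group has its own rate. The group names and the specific rates are assumptions chosen for illustration: the source states only that differential rates are a necessity for mid-fusion, not which component receives which value.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def sgd_step(params: Params, grads: Params, group_lr: Dict[str, float]) -> Params:
    """One SGD update where each named parameter group uses its own learning rate."""
    return {
        name: [w - group_lr[name] * g for w, g in zip(weights, grads[name])]
        for name, weights in params.items()
    }


if __name__ == "__main__":
    # Hypothetical groups for a mid-fusion model.
    params = {"encoder": [1.0], "connector": [1.0], "backbone": [1.0]}
    grads = {"encoder": [1.0], "connector": [1.0], "backbone": [1.0]}
    # Assumed convention: small steps for pretrained parts, larger for new ones.
    group_lr = {"encoder": 1e-5, "connector": 1e-3, "backbone": 1e-4}

    updated = sgd_step(params, grads, group_lr)
    for name, weights in updated.items():
        print(name, weights)
```

With identical gradients, the three groups move by different amounts, which is the whole mechanism: the coupling to architecture lies in deciding which modules form a group and how far apart their rates sit.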

Practical Implications:

Industrial-Grade Development Guide: The end-to-end pipeline analysis from data curation to deployment provides a comprehensive handbook for engineers and researchers building production-ready NMM systems.
Informs Model Selection and Design: The taxonomy and challenge/solution analysis help practitioners select optimal architectures for specific downstream tasks (e.g., choosing an M2G model for video generation vs. an M2T model for dense reasoning).
Identifies Critical Research Directions: The future outlook section outlines concrete open problems in architecture, data, training, inference, and evaluation, guiding community efforts toward solving the most consequential bottlenecks (e.g., unifying AR and diffusion, self-generating data streams, system-algorithm co-design).
Advances Evaluation Practices: By advocating for symmetric M2M benchmarks, temporally-aware metrics, and efficiency-aware protocols, the paper pushes the community beyond static, accuracy-only evaluation toward holistic assessment of deployable, interactive agents.

Conclusion

The paper consolidates the rapid evolution of multimodal AI into a structured roadmap toward Native Multimodal Modeling (NMM). The main takeaways are:

The field is transitioning from late-fusion (non-native assembly) through mid-fusion (integrated but modality-aware) to early-fusion (born-native, unified transformer), with the ultimate goal being symmetric Multi-to-Multi (M2M) modeling.
This architectural shift necessitates corresponding innovations across the entire stack: data curricula must balance modalities and purposes; training recipes become regime-specific; inference systems must handle sequence explosion and real-time streaming; and evaluation must become holistic and temporally-aware.
Key open challenges include unifying understanding/generation objectives, scaling sparse modality-aware MoEs, generating cross-modal interaction data, and co-designing algorithms with streaming deployment systems.
The final vision is the development of native world models—unified backbones that perceive raw sensory streams, maintain persistent state, and act in continuous time, moving AI closer to genuine general intelligence grounded in the physical world.

The formalization, taxonomy, and analysis provided aim to serve as a foundational reference and catalyst for the next phase of research in unified, symmetric, and embodied multimodal intelligence.