LongCat-Next: Lexicalizing Modalities as Discrete Tokens - Summary

Summary (Overview)

  • Introduces the Discrete Native Autoregression (DiNA) Paradigm, a unified framework that extends next-token prediction to native multimodality by representing text, vision, and audio within a shared discrete token space, enabling a single autoregressive objective.
  • Proposes dNaViT (Discrete Native-Resolution Vision Transformer), a unified visual tokenizer that transforms continuous images into hierarchical discrete tokens using Semantic-and-Aligned Encoders (SAE) and Residual Vector Quantization (RVQ), supporting any-resolution understanding and generation with up to 28× compression.
  • Develops LongCat-Next, an industrial-strength native multimodal model built on a Mixture-of-Experts (MoE) backbone, which excels at visual understanding, image generation, and audio tasks (seeing, painting, talking) within a single cohesive framework.
  • Demonstrates strong empirical performance, showing that discrete modeling can overcome its perceived performance ceiling, achieving competitive results with specialized models on benchmarks like MathVista, OCRBench, and audio comprehension while reconciling the traditional conflict between understanding and generation.

Introduction and Theoretical Foundation

The success of Large Language Models (LLMs) is built on the Next Token Prediction (NTP) paradigm and discrete autoregressive modeling. However, most contemporary multimodal systems remain language-centric, treating non-linguistic modalities (vision, audio) as external, loosely-coupled attachments, leading to fragmented architectures.

This paper argues for moving beyond this "language-plus-auxiliary" paradigm toward native multimodal modeling, where all modalities are represented as interoperable token sequences governed by a single shared autoregressive objective. The core challenge is effectively representing continuous, high-dimensional perceptual signals within a discrete token space.

The authors identify a fundamental dual bottleneck in discrete visual modeling:

  1. Capacity of visual representation
  2. Information loss from discretization

To address this, they introduce the principle of Semantic Completeness: a discrete representation $z$ must serve as an approximately lossless proxy for the original signal $I$, satisfying:

$$P(A \mid z, Q) \approx P(A \mid I, Q) \quad \text{(1)}$$

where $A$ is the response to inquiry $Q$ for task $T$. This implies both Discriminative Invariance (preserving semantic attributes for understanding) and Generative Sufficiency (capturing essential semantics for faithful reconstruction/generation).

Methodology

The LongCat-Next system is built upon the Discrete Native Autoregression (DiNA) paradigm.

Model Architecture

The system uses a structural decomposition:

  • Modality-specific tokenizer and de-tokenizer pairs handle conversion between raw signals and discrete IDs.
  • A decoder-only, modality-agnostic MoE backbone (LongCat-Flash-Lite A3B, 68.5B total params) serves as a multi-task learner across modalities.

Vision Tokenizer: Discrete Native-Resolution Vision Transformer (dNaViT)

dNaViT is designed to function analogously to a language tokenizer for vision.

1. Semantic-and-Aligned Encoder (SAE): The SAE provides a semantically rich pre-quantization space. It is trained with a large-scale multi-aspect alignment objective:

$$L_{\text{SAE}} = \mathbb{E}_{(I, Q, A)}\left[-\log P(A \mid z_p, Q)\right] \quad \text{(2)}$$

where $z_p = E_{\text{sae}}(I)$. Existing vision-language encoders (e.g., QwenViT) can be adopted as strong SAE approximations.
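In practice, an objective of the form of Eq. (2) reduces to token-level cross-entropy over the answer tokens, conditioned on the visual features $z_p$ and the question $Q$. A minimal sketch of that reduction (function and variable names here are illustrative, not from the paper):

```python
import math
import numpy as np

def sae_alignment_loss(logits, answer_ids):
    """E[-log P(A | z_p, Q)] as token-level cross-entropy.

    `logits`: (T, V) next-token scores the LM produced at the T answer
    positions, already conditioned on [z_p; Q]; `answer_ids`: (T,) reference
    answer token ids. Both are hypothetical stand-ins for the real pipeline.
    """
    # numerically stabilized log-softmax over the vocabulary
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # mean negative log-likelihood of the reference answer tokens
    return -logp[np.arange(len(answer_ids)), answer_ids].mean()
```

With uniform logits over a vocabulary of size $V$, the loss is $\log V$ per token, which is a quick sanity check on the implementation.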

2. Tokenization via Residual Vector Quantization (RVQ): SAE features are projected and quantized hierarchically across $L$ levels to preserve information:

$$\begin{aligned} r_0 &= f_{\text{proj}}(z), \\ \hat{q}_l &= \mathrm{VQ}(r_{l-1}), \\ r_l &= r_{l-1} - \hat{q}_l, \\ \hat{z} &= \sum_{l=1}^{L} \hat{q}_l \end{aligned} \quad \text{(3)}$$

Codebook entries are updated via Exponential Moving Average (EMA):

$$e_k \leftarrow \frac{m_k}{N_k} \quad \text{(4)}$$

where $m_k$ is the embedding sum and $N_k$ the cluster size for entry $k$. The quantization objective is:

$$L_{\text{quant}} = \lambda_c L_{\text{commit}} + \lambda_s L_{\text{semantic}} \quad \text{(5)}$$
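The RVQ recursion of Eq. (3) and the EMA codebook update of Eq. (4) can be sketched in a few lines of NumPy. This is an illustrative toy (a single vector, Euclidean nearest-neighbor lookup, projection $f_{\text{proj}}$ omitted), not the paper's implementation:

```python
import numpy as np

def rvq_quantize(z, codebooks):
    """Residual VQ (Eq. 3): at each level, snap the current residual to its
    nearest codebook entry, accumulate it into z_hat, and subtract it."""
    r = z.copy()                                   # r_0 (f_proj omitted)
    ids, z_hat = [], np.zeros_like(z)
    for C in codebooks:                            # one codebook per level
        d = ((r[None, :] - C) ** 2).sum(axis=1)    # distances to all entries
        k = int(d.argmin())                        # nearest entry index
        ids.append(k)
        z_hat += C[k]                              # z_hat = sum_l q_hat_l
        r = r - C[k]                               # r_l = r_{l-1} - q_hat_l
    return ids, z_hat

def ema_update(C, counts, sums, ids_batch, residuals, decay=0.99):
    """EMA codebook update (Eq. 4): e_k <- m_k / N_k, where m_k and N_k are
    moving averages of assigned embeddings and cluster sizes."""
    for k, r in zip(ids_batch, residuals):
        counts[k] = decay * counts[k] + (1 - decay)      # N_k
        sums[k] = decay * sums[k] + (1 - decay) * r      # m_k
    nz = counts > 0                                      # skip empty entries
    C[nz] = sums[nz] / counts[nz][:, None]
    return C
```

Because each level quantizes what the previous levels left behind, the identity $z - \hat{z} = r_L$ holds exactly, and adding levels can only refine the approximation.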

3. De-tokenization: A pixel decoder (Vision Transformer) and a flow-matching refiner reconstruct images from discrete tokens. The decoder is trained with:

$$L_{\text{dec}} = \lambda_1 L_{\text{pixel}} + \lambda_2 L_{\text{percep}} + \lambda_3 L_{\text{align}} \quad \text{(6)}$$

The authors propose the concept of Intrinsic Information Recovery, arguing that the residual architecture of modern encoders inherently preserves a latent pathway for low-level signal propagation, enabling reconstruction even without explicit supervision. The final latent representation can be expressed as:

$$z_p = x_0 + \sum_{l=1}^{L} F_l(x_{l-1}) = x_0 + F_1(x_0) + \cdots + F_L(x_{L-1}) \quad \text{(7)}$$
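The additive decomposition in Eq. (7) can be checked mechanically: unrolling a residual stack always yields the raw input plus the per-block updates, which is exactly the latent low-level pathway the argument relies on. A toy illustration (the block functions are arbitrary stand-ins for transformer sublayers):

```python
import numpy as np

def residual_stack(x0, blocks):
    """Unrolls z_p = x_0 + sum_l F_l(x_{l-1}) (Eq. 7): each block adds its
    update onto the running stream, so x_0 survives additively."""
    x = x0
    contributions = []
    for F in blocks:
        dx = F(x)            # F_l(x_{l-1})
        contributions.append(dx)
        x = x + dx           # x_l = x_{l-1} + F_l(x_{l-1})
    # final state decomposes exactly into x0 plus the block contributions
    return x, contributions
```

In the degenerate case where every block outputs zero, the output equals the input exactly; in general, $z_p - \sum_l F_l(x_{l-1}) = x_0$, so low-level signal can propagate even without explicit reconstruction supervision.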

Audio Tokenizer

Audio is processed similarly:

  • A Whisper encoder extracts features.
  • An 8-layer RVQ compresses waveforms into discrete tokens at 12.5 Hz.
  • A decoder and a flow-matching refinement network reconstruct high-fidelity audio. The training objective is:

$$L_{\text{audio}} = \lambda_1 L_{\text{recon}} + \lambda_2 L_{\text{commit}} + \lambda_3 L_{\text{llm}} \quad \text{(8)}$$
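For intuition about the resulting sequence lengths, here is a back-of-envelope token budget under the stated rates (12.5 Hz frame rate, 8 RVQ levels); the one-code-per-level-per-frame layout is an assumption, not a detail confirmed by the paper:

```python
def audio_token_budget(seconds, frame_rate_hz=12.5, rvq_levels=8):
    """Frames at 12.5 Hz, each assumed to carry one code per RVQ level."""
    frames = int(seconds * frame_rate_hz)   # sequence positions
    return frames, frames * rvq_levels      # positions, total discrete codes

frames, codes = audio_token_budget(10)      # a 10-second utterance
```

Under these assumptions, ten seconds of audio occupies 125 sequence positions carrying 1,000 discrete codes, far shorter than raw waveform samples at 16-24 kHz.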

Multimodality Component

  • End-to-End Multimodal Embedding: Visual and audio codebook embeddings are randomly initialized and learned jointly with the model.
  • Multimodality Head: A DepthTransformer decodes hidden states into multi-level tokens for modality-specific reconstruction.
  • Internal Linguistic Guidance for Audio: Introduces special tokens (AS, AE, TE) and a unified training paradigm with stochastic delays to enable both parallel (low-latency) and serial (high linguistic quality) text-guided speech generation.

Infrastructure: VHalf-based Pipeline Parallelism

A profile-guided, V-shaped pipeline schedule co-locates the embedding layer and modality-specific loss modules on the same device to mitigate load imbalance and eliminate cross-stage communication overhead, improving training efficiency.

Empirical Validation / Results

LongCat-Next (A3B model size, trained on >2T tokens) is evaluated extensively against unified models (e.g., Qwen3-Omni) and specialized models across vision, audio, and text.

Visual Understanding

The model achieves highly competitive performance across diverse benchmarks.

Table 1: Comparison on Vision Benchmarks (Selected Results)

| Benchmark | LongCat-Next | Qwen3-Omni-A3B-Instruct | Gemini2.5-Flash-Lite | Qwen3-VL-A3B-Instruct (Specialist) |
| --- | --- | --- | --- | --- |
| **STEM & Reasoning** | | | | |
| MMMU val | 70.6 | 69.1* | 74.9 | 74.2* |
| MathVista mini | 83.1 | 75.9* | 78.2 | 80.1* |
| MathVision | 64.7 | 56.3* | 61.9 | 60.2* |
| **OCR & Document** | | | | |
| OmniDocBench en ↓ | 0.152 | 0.289 | 0.240 | 0.183* |
| CharXiv RQ | 60.1 | 42.8 | 60.0 | 48.9* |
| ChartQA | 88.0 | 86.8* | 79.0 | 86.8* |
| OCRBench | 86.5 | 85.4* | 84.8 | 90.3* |
| **General** | | | | |
| MMStar | 69.3 | 68.5* | 74.9 | 72.1 |
| RealWorldQA | 72.0 | 72.9 | 70.5 | 73.7 |

Note: Lower is better for OmniDocBench (error rate).

Visual Generation

The model demonstrates strong text-to-image capability, competing favorably with specialized models.

Table 2: Comparison with Specialized T2I Models (Selected Results)

| Model | GenEval | DPG | LongText-EN | TIFF (Acc/Corr) |
| --- | --- | --- | --- | --- |
| Emu-3.5 | 72.67 | 89.42* | 97.60* | 89.48 / 88.18* |
| Qwen-Image 2507 | 87.00* | 88.32* | 94.30* | 86.10* / 86.80* |
| FLUX.1-dev | 66.00* | 84.00* | 60.70* | 71.10 / 71.80* |
| LongCat-Next | 84.44 | 84.66 | 93.15 | 82.85 / 84.38 |

Audio

The model excels in automatic speech recognition (ASR), text-to-speech (TTS), audio understanding, and audio-to-text chat.

Key Results:

  • ASR: Achieves Word Error Rate (WER) of 1.63 on LibriSpeech test-clean and 1.47 on AISHELL1.
  • TTS: WER of 1.90 on SeedTTS zh and 1.89 on SeedTTS en.
  • Audio Understanding: Scores 76.40 on MMAU and 85.91 on VocalSound.
  • Outperforms models like Gemini 3.1 Flash-Lite preview and MiMo-Audio on several benchmarks.

Text

The model maintains robust foundational language capabilities, mitigating the "multimodal tax".

  • Agentic Tool Use: Scores 62.06 on Tau2-Telecom, significantly outperforming baselines.
  • Coding: 43.0 accuracy on SWE-Bench.
  • Knowledge: 83.95 on MMLU, 86.80 on C-Eval.

Key Ablation and Analysis Findings

  1. Bridging the Discrete-Continuous Gap: With a Pre-Buffer module and sufficient data scaling, discrete representations can achieve near-parity with continuous baselines in understanding tasks.
  2. Information Recovery: Randomly initialized ViTs can achieve high reconstruction fidelity (PSNR 30.52), suggesting intrinsic architectural properties aid recovery.
  3. Understanding-Generation Synergy: Under DiNA, understanding and generation are two instances of the same predictive process. Training a unified model on a mixed dataset shows that understanding enhances generation without compromising itself.
  4. Modality-Agnostic MoE Dynamics: Native multimodal training induces functional specialization of experts and more efficient capacity usage within the initially modality-agnostic MoE.
  5. Platonic Representation Hypothesis: LongCat-Next exhibits interwoven embeddings across visual and textual tokens in a shared semantic space, unlike the separated clusters of non-native models.

Theoretical and Practical Implications

  • Paradigm Shift: DiNA provides a principled, infrastructure-friendly path toward truly native multimodal intelligence, aligning multimodal modeling with the mature ecosystem of LLMs.
  • Reconciling Objectives: It effectively unifies the traditionally competing goals of understanding and generation under a single autoregressive formulation, mitigating their practical conflict.
  • Scalability and Performance: Demonstrates that discrete tokenization is not an inherent performance limiter but a scalable, industrial-strength foundation capable of matching or surpassing specialized models.
  • Unified Representation: The work offers evidence for the Platonic Representation Hypothesis, suggesting that modalities can be internalized as different expressions of the same underlying concepts within a model's embedding space.
  • Open Source: Releasing the model and tokenizers fosters further research in unified multimodal modeling.

Conclusion

LongCat-Next represents a significant step toward native multimodality within a unified discrete autoregressive framework. By introducing the DiNA paradigm and the dNaViT tokenizer, the work shows that continuous perceptual signals can be effectively lexicalized into discrete tokens, enabling a single model to excel at seeing, painting, and talking.

The results validate that a natively discrete paradigm can overcome perceived bottlenecks, achieve competitive performance, and reconcile understanding with generation. This offers a promising alternative to fragmented, language-centric multimodal architectures, moving closer to a unified model of generalist multimodal intelligence.

Future Directions include optimizing the tokenizer further, extending to any-to-any generation and interleaved reasoning, and co-designing data and discretization strategies for improved representation learning.