SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
Overview
- Core Contribution: Introduces SenseNova-U1, a native unified multimodal paradigm built on the NEO-unify architecture. It treats multimodal understanding and generation as synergistic views of a single underlying process, eliminating the traditional dichotomy between them.
- Key Architectural Innovations: Employs a near-lossless visual interface (32 × 32 patch size) without pretrained vision encoders (VEs) or variational autoencoders (VAEs), a native Mixture-of-Transformers (MoT) backbone with separate understanding and generation streams, and a joint training objective combining autoregressive text loss and pixel-space flow matching.
- Model Variants: Launches two variants: SenseNova-U1-8B-MoT (dense 8B) and SenseNova-U1-A3B-MoT (MoE, 30B total, 3B active).
- Empirical Performance: Demonstrates strong, competitive performance simultaneously across a wide range of understanding (text, VQA, reasoning, spatial intelligence) and generation (text-to-image, editing, infographics, interleaved) benchmarks, rivaling specialized top-tier models in each domain.
- Broader Implications: The unified architecture shows promising capabilities in vision-language-action (VLA) and world modeling (WM), pointing toward a future where multimodal AI emerges from a single, natively unified system rather than connected separate modules.
Introduction and Theoretical Foundation
Recent large vision-language models (VLMs) are fundamentally constrained by a persistent dichotomy between understanding and generation. Understanding typically relies on pretrained vision encoders (VEs), while generation uses latent variational autoencoders (VAEs). This leads to fragmented architectures, cascaded pipelines, and misaligned representation spaces, hindering the emergence of native multimodal intelligence.
The paper argues this divide is a structural limitation. It posits that multimodal intelligence can be unified in a truly native form by building a model that directly engages with native inputs (pixels and words), dispensing with both pretrained VEs and deep decoder heads. The goal is an end-to-end framework where understanding and generation co-evolve as synergistic views of a single process within a shared representation space.
The introduced SenseNova-U1 paradigm, built upon NEO-unify [112], is presented as a first step toward this vision. It incorporates:
- A near-lossless visual interface preserving semantics and pixel detail.
- Unified end-to-end modeling over raw inputs.
- A native Mixture-of-Transformers (MoT) architecture that synergizes understanding and generation.
Methodology
3.1 Near-Lossless Visual Interface
- Patch Encoding Layer: Input images/noise are mapped into visual tokens using two convolutional layers (strides 16 and 2), resulting in a 32 × 32 image patch per token. Text is encoded using the underlying LLM's tokenizer. Visual and textual tokens are projected into a shared embedding space.
- Patch Decoding Layer: The understanding stream uses a linear projection head for text prediction. The generation stream directly predicts pixel patches via an MLP head, bypassing diffusion heads and VAE decoders.
- Dynamic Noise Scale: To handle varying image resolutions, a resolution-adaptive noise scale is introduced to maintain a consistent signal-to-noise ratio (SNR); the scale is computed from the ratio $N / N_{\text{ref}}$, where $N$ is the number of generation tokens and $N_{\text{ref}}$ is a reference count (see the sketch after this list).
- Noise-Scale Conditioning: The normalized noise scale is encoded and combined with the timestep embedding to form the conditioning signal.
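A minimal PyTorch sketch of this interface, under stated assumptions: the module name `PatchEncoder`, the channel widths, and the square-root form of the noise scale are illustrative, not the released implementation. Only the two-conv structure (strides 16 and 2, effective stride 32) and the dependence on the token-count ratio come from the paper.

```python
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Maps raw pixels (or noise) to visual tokens with two conv layers.

    Strides 16 and 2 compose to an effective stride of 32, so each
    output token summarizes one 32x32 image patch, as described above.
    """
    def __init__(self, in_ch: int = 3, mid_ch: int = 512, d_model: int = 2048):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=16, stride=16)
        self.conv2 = nn.Conv2d(mid_ch, d_model, kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 3, H, W) with H, W divisible by 32
        h = self.conv2(self.conv1(x))        # (B, d_model, H/32, W/32)
        return h.flatten(2).transpose(1, 2)  # (B, N_tokens, d_model)

def adaptive_noise_scale(n_tokens: int, n_ref: int = 1024) -> float:
    """Resolution-adaptive noise scale keeping SNR roughly constant.

    Scales with the ratio of generation tokens N to a reference count
    N_ref; the square-root form here is an assumption.
    """
    return (n_tokens / n_ref) ** 0.5

enc = PatchEncoder()
img = torch.randn(1, 3, 256, 256)
tokens = enc(img)                        # (1, 64, 2048): an 8x8 grid of 32x32 patches
scale = adaptive_noise_scale(tokens.shape[1])
```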
3.2 Native Multimodal Unified Modeling
- Improved Native Primitive: Refines the native VLM primitive with a Native Rotary Position Embedding (RoPE) that unifies temporal ($t$) and spatial ($h$, $w$) encoding within a single representation.
- Native Mixture-of-Transformers (MoT): The core backbone unifies understanding (clean image/text) and generation (noise-conditioned) streams within a monolithic framework. All modalities are processed in a single sequence under shared self-attention, but with full parameter decoupling between the two streams (separate projections, normalizations, and FFN blocks); see the sketch after Table 1.
- Model Variants: (Configurations in Table 1)
- SenseNova-U1-8B-MoT: Dense 8B networks in symmetric parallel configuration.
- SenseNova-U1-A3B-MoT: MoE framework. Understanding stream: 128 experts (30B total). Generation stream: 32 experts (8B total). Top-$k$ routing activates 8 experts per token (~3B active parameters).
Table 1: Architectural configurations of SenseNova-U1 variants
| Configuration | SenseNova-U1-8B-MoT | SenseNova-U1-A3B-MoT |
|---|---|---|
| Patch Size | 32 × 32 | 32 × 32 |
| Pre-Buffer | ✓ | ✗ |
| # Layers | 42 | 48 |
| # Heads (Q / KV) | 32 / 8 | 32 / 4 |
| Head Size (T / H / W) | 64 / 32 / 32 | 64 / 32 / 32 |
| Hidden Size | 4,096 | 2,048 |
| # Und / Gen Experts | 1 / 1 | 128 / 32 (A8) |
| # Und / Gen Parameters | 8.2B / 8.2B | 30.0B / 8.2B (A3B) |
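To make the decoupling concrete, here is a minimal PyTorch sketch of one MoT layer under stated assumptions: dimensions, norm choice, and all module names are illustrative; the real model additionally applies Native RoPE and grouped KV heads, and the A3B variant replaces each stream's FFN with a top-8-routed mixture of experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoTLayer(nn.Module):
    """One MoT layer: a single shared self-attention over the joint sequence,
    with fully decoupled per-stream parameters (0 = understanding, 1 = generation)."""
    def __init__(self, d: int = 2048, n_heads: int = 16):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d // n_heads
        # One copy of every parameter per stream.
        self.norm1 = nn.ModuleList(nn.LayerNorm(d) for _ in range(2))
        self.qkv   = nn.ModuleList(nn.Linear(d, 3 * d) for _ in range(2))
        self.proj  = nn.ModuleList(nn.Linear(d, d) for _ in range(2))
        self.norm2 = nn.ModuleList(nn.LayerNorm(d) for _ in range(2))
        self.ffn   = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
            for _ in range(2))

    def forward(self, x: torch.Tensor, stream: torch.Tensor) -> torch.Tensor:
        # x: (B, L, d); stream: (L,) with 0 for understanding tokens, 1 for generation.
        B, L, d = x.shape
        masks = [stream == s for s in (0, 1)]

        # Per-stream pre-norm and QKV projections, then one shared attention pass.
        qkv = x.new_zeros(B, L, 3 * d)
        for s, m in enumerate(masks):
            qkv[:, m] = self.qkv[s](self.norm1[s](x[:, m]))
        q, k, v = qkv.view(B, L, 3, self.n_heads, self.d_head).permute(2, 0, 3, 1, 4)
        attn = F.scaled_dot_product_attention(q, k, v)      # joint sequence, shared attention
        attn = attn.transpose(1, 2).reshape(B, L, d)

        out = x.new_zeros(B, L, d)
        for s, m in enumerate(masks):
            out[:, m] = self.proj[s](attn[:, m])
        x = x + out                                         # attention residual

        ffn_out = x.new_zeros(B, L, d)
        for s, m in enumerate(masks):
            ffn_out[:, m] = self.ffn[s](self.norm2[s](x[:, m]))
        return x + ffn_out                                  # FFN residual
```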
3.3 Joint Training Objective
The model is optimized end-to-end with a weighted sum of understanding and generation losses:
- Autoregressive Text Loss: Standard next-token prediction.
- Pixel-Space Flow Matching: Follows JiT [69] with $x$-prediction and $v$-loss. The noisy sample along the rectified-flow path is $x_t = (1 - t)\,x_0 + t\,\epsilon$. The framework regresses the clean signal $\hat{x}_0$, which is converted to a velocity term $\hat{v} = (x_t - \hat{x}_0)/t$; the velocity-space MSE loss is $\mathcal{L}_{\text{gen}} = \mathbb{E}\big[\lVert \hat{v} - v \rVert_2^2\big]$ with ground-truth velocity $v = \epsilon - x_0$ (see the sketch after this list).
- Classifier-Free Guidance (CFG): A unified formulation modulates the text and visual-context guidance scales independently; the best-performing values for both scales are determined empirically.
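A hedged sketch of this objective and of one plausible dual-scale CFG combination: the `model` stub, function names, and the exact guidance formula are assumptions; the rectified-flow interpolation and the x-to-velocity conversion follow the standard forms implied by the description above.

```python
import torch

def flow_matching_loss(model, x0, cond, eps=1e-3):
    """x0: clean pixel patches (B, N, D); model predicts x0_hat from (x_t, t, cond)."""
    B = x0.shape[0]
    t = torch.rand(B, device=x0.device).clamp(min=eps)  # t in (0, 1]
    noise = torch.randn_like(x0)
    t_ = t.view(B, 1, 1)
    x_t = (1.0 - t_) * x0 + t_ * noise                  # rectified-flow interpolation
    x0_hat = model(x_t, t, cond)                        # regress the clean signal
    v_hat = (x_t - x0_hat) / t_                         # convert x-prediction to velocity
    v = noise - x0                                      # ground-truth velocity
    return ((v_hat - v) ** 2).mean()                    # velocity-space MSE

def dual_cfg(v_uncond, v_text, v_full, w_text, w_image):
    """Combine unconditional, text-only, and text+image-conditioned predictions
    with independent text and visual-context scales (one plausible multi-condition
    CFG form; the paper's exact formula and best scales are not reproduced here)."""
    return v_uncond + w_text * (v_text - v_uncond) + w_image * (v_full - v_text)

# Shape check with a stub predictor that simply echoes x_t:
x0 = torch.randn(2, 64, 3 * 32 * 32)
loss = flow_matching_loss(lambda x_t, t, c: x_t, x0, cond=None)
```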
3.4 Training Procedure
A progressive 6-stage training recipe (Table 2):
- Understanding Warmup: Initialize from pretrained NEO [30]. Two phases: Attention-Fusion (unify QK projections) and Full-Model Continuation.
- Generation Pre-Training: Freeze the understanding branch and pretrain the generation branch on text-to-image data in three phases of increasing resolution (a freezing sketch follows this list).
- Unified Mid-Training: Joint end-to-end training on a mixture of understanding and generation data.
- Unified Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-following data spanning all tasks.
- Post-Training for T2I Generation: Applies reinforcement learning (Flow-GRPO [80]) with rewards for text rendering, style following, and aesthetics (HPSv3 [93]), plus a dynamic-resolution warmup.
- CFG & Step Distillation: Employs Distribution Matching Distillation (DMD2 [157]) to reduce inference steps from 100 to 8.
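A small sketch of how the per-stage branch freezing might look in practice; the `und.`/`gen.` parameter-name prefixes and the stage keys are hypothetical conventions for illustration, not the authors' code.

```python
import torch.nn as nn

def set_stage(model: nn.Module, stage: str) -> None:
    """Toggle requires_grad per training stage, assuming parameters are
    namespaced by stream ("und." = understanding, "gen." = generation)."""
    trainable_prefixes = {
        "understanding_warmup": ("und.",),
        "generation_pretrain": ("gen.",),          # understanding branch frozen
        "unified_mid_training": ("und.", "gen."),  # joint end-to-end training
    }[stage]
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)
```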
3.5 Inference Infrastructure
- Disaggregated Deployment: Uses two specialized engines: LightLLM for understanding/text streaming and LightX2V for image generation. They exchange state via pinned shared memory, allowing independent optimization, resource allocation, and scaling.
- Hybrid Attention Kernel: Efficiently handles the hybrid attention pattern where text rows are causal, but image rows attend to the full text prefix and image span.
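A sketch of the mask this kernel implements, assuming the layout implied above: a single text prefix followed by one contiguous image span. A real fused kernel would exploit this block structure rather than materializing a dense boolean mask.

```python
import torch

def hybrid_mask(n_text: int, n_image: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend), shape (L, L), L = n_text + n_image."""
    L = n_text + n_image
    mask = torch.zeros(L, L, dtype=torch.bool)
    # Text rows: causal attention over the text prefix.
    mask[:n_text, :n_text] = torch.tril(torch.ones(n_text, n_text, dtype=torch.bool))
    mask[n_text:, :n_text] = True   # image rows see the full text prefix
    mask[n_text:, n_text:] = True   # image rows attend bidirectionally within their span
    return mask

# Usable directly as attn_mask in F.scaled_dot_product_attention.
```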
Empirical Validation / Results
5.1 Main Results
SenseNova-U1 is evaluated extensively across understanding and generation benchmarks.
5.1.1 Image Understanding
Table 3 shows strong performance on multimodal understanding benchmarks, rivaling or outperforming top models such as Qwen3-VL and Qwen3.5 across STEM/Reasoning (MMMU, MathVista), General VQA (MMBench), OCR (InfoVQA), and Spatial Intelligence (VSI-Bench, 3DSR-Bench).
5.1.2 Text Understanding
Table 4 shows strong knowledge (MMLU-Pro, C-Eval) and instruction-following (IFEval, IFBench) capabilities, plus competitive agentic and function-calling performance on τ²-Bench and Claw-Eval.
5.1.3 Image Generation
- General Generation: Table 5 (GenEval) shows SenseNova-U1 achieves an overall score of 0.91, leading open-source models in compositional generation.
- Table 6 (DPG-Bench) shows competitive fine-grained instruction following, with the A3B variant achieving the highest Global score.
- Text-centric Generation: Table 11 (CVTG-2K) shows SenseNova-U1-8B achieves the best average word accuracy (0.940) for multi-region text rendering. Table 12 (LongText-Bench) shows strong long-text generation in both English and Chinese.
- Complex Infographic Generation: Table 13 (IGenBench) and Table 14 (BizGenEval) demonstrate leading performance among open-source models for challenging infographic and commercial visual generation.
- Reasoning-centric Generation: Table 15 (WISE) shows a clear advantage, indicating effective use of internal world knowledge and reasoning during image generation.
5.1.4 Image Editing
Table 16 (ImgEdit) and Table 17 (GEdit-Bench) show SenseNova-U1 achieves decent overall editing performance, competitive with specialized models. Table 18 (RISEBench) highlights strong reasoning-driven editing capability, especially with chain-of-thought (CoT), where SenseNova-U1-A3B-MoT-SFT (w/ CoT) reaches 30.0, best among open-source methods.
5.1.5 Interleaved Generation
- Table 19 (OpenING): SenseNova-U1-A3B-MoT-SFT (w/ CoT) achieves the best overall score (9.16), demonstrating high-quality open-ended interleaved image-text generation.
- Unified Reasoning: Table 21 (Uni-MMMU) and Table 22 (RealUnify) show SenseNova-U1 achieves genuine bidirectional synergy between understanding and generation, outperforming other unified models.
5.2 Ablation Studies
- Encoder-Free Design: Table 23 shows NEO-unify (2B) achieves high PSNR/SSIM in image reconstruction, proving the near-lossless interface preserves both semantic and pixel-level information.
- Understanding-Generation Synergy: Figure 12 shows the two capabilities co-evolve effectively within the MoT backbone with minimal conflict during joint training.
- Data-Scaling Efficiency: Figure 13 shows the model delivers strong data-scaling efficiency, with performance improving steadily with more training data.
Theoretical and Practical Implications
- Paradigm Shift: SenseNova-U1 challenges the prevailing fragmented approach to multimodal AI. It demonstrates that a single native architecture can internalize a coherent world abstraction, supporting both analytical (understanding) and creative (generation) intelligence within a shared latent space.
- Architectural Efficiency: Eliminating pretrained VEs/VAEs simplifies system design, reduces inductive biases, and improves computational efficiency. The MoT design enables efficient scaling while minimizing objective interference.
- Emergent Capabilities: Preliminary evidence in VLA and world modeling suggests the unified framework can support embodied, goal-directed intelligence, where perception, reasoning, and action arise natively across modalities without external adapters.
- Practical Applications: The model's strong performance in infographics, text-rich generation, interleaved content, and editing opens up applications in professional visual content creation, illustrated guides, visual storytelling, and other information-dense formats.
Conclusion
SenseNova-U1 sets a new paradigm for unified multimodal understanding and generation. Its strong, simultaneous performance across diverse benchmarks shows that a shared representation can support both analytical and creative intelligence. The work points toward a broader transition in AI: from aligning isolated modality-specific systems to learning perception, reasoning, and generation within a natively unified architecture. This paves the way for future models where multimodal intelligence emerges from a single, coherent underlying process, enabling more advanced capabilities like embodied reasoning and world modeling.
Official Resources: Demo: https://unify.light-ai.top/, Code: https://github.com/OpenSenseNova/SenseNova-U1, Model: https://huggingface.co/collections/sensenova/sensenova-u1, Blog: https://huggingface.co/blog/sensenova/neo-unify