From Pixels to Words – Towards Native One-Vision Models at Scale: NEO-ov

Summary (Overview)

  • Introduces NEO-ov, a fully native vision-language foundation model that learns cross-frame and pixel-word correspondence end-to-end, eliminating external encoders, auxiliary adapters, or post-hoc fusion.
  • Unifies diverse multimodal tasks within a single monolithic backbone, enabling fine-grained and unified spatiotemporal modeling for single-image, multi-image, video understanding, and spatial intelligence.
  • Demonstrates competitive performance, narrowing the gap to modular counterparts while excelling at fine-grained visual perception and spatial reasoning, validating the feasibility and competitiveness of native "one-vision" architectures at scale.
  • Provides systematic architectural analyses and detailed training recipes to facilitate subsequent native multimodal modeling.

Introduction and Theoretical Foundation

Current vision-language models (VLMs) typically adopt a modular encoder-decoder architecture, stitching together separate image/video encoders and language decoders via multi-stage alignment. This design constrains flexibility (it forces a dichotomy between static image encoders and temporal video encoders and struggles with early pixel-word interaction), efficiency (the decoupled components fragment training and incur alignment overhead, and KV caching is not applicable to long inputs), and scalability (capacity must be balanced between the vision encoder and the LLM, which entangles the two).

Native VLMs (e.g., Fuyu, EVE, NEO) emerge as an alternative, jointly modeling visual and textual inputs within a single monolithic framework without explicit vision encoders. However, existing approaches remain constrained by distillation from static visual encoders or are focused on specific domains (e.g., video). NEO-ov aims to advance this direction by extending native modeling to a unified framework spanning single-image, multi-image, video inputs, and spatial intelligence, moving towards a general "one-vision" foundation architecture.

Methodology

Model Architecture

NEO-ov adopts a unified native vision-language backbone. Image I is encoded into visual tokens via a lightweight embedding layer:

$$x_v = \text{Conv}_2(\text{GELU}(\text{Conv}_1(I)) + \text{PE}), \quad x_t = \text{Tokenizer}(T),$$

where $x_v \in \mathbb{R}^{n_v \times d}$, $x_t \in \mathbb{R}^{n_t \times d}$, and $\text{PE}$ denotes 2D RoPE embeddings. $\text{Conv}_1$ extracts patches with stride 16 and $\text{Conv}_2$ aggregates them with stride 2, producing one visual token per $32 \times 32$ image region. Visual tokens are wrapped with <img> and </img>, concatenated with text tokens, and processed by a unified backbone initialized from NEO and Qwen3.
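The token count implied by the two strides can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `visual_token_grid` is hypothetical, and it assumes non-overlapping convolutions (kernel size equal to stride), no padding, and image sides that are multiples of 32, none of which the text above states.

```python
def visual_token_grid(height, width, patch_stride=16, merge_stride=2):
    """Return (rows, cols, n_tokens) of the visual token grid for one image.

    Assumption: both convolutions are non-overlapping and unpadded, so each
    token covers a (patch_stride * merge_stride)-pixel square, i.e. 32 x 32.
    """
    region = patch_stride * merge_stride  # 16 * 2 = 32 pixels per token side
    if height % region or width % region:
        raise ValueError("this sketch assumes sides that are multiples of 32")
    rows, cols = height // region, width // region
    return rows, cols, rows * cols


print(visual_token_grid(256, 256))    # (8, 8, 64)
print(visual_token_grid(1024, 768))   # (32, 24, 768)
```

Under these assumptions a 256×256 image yields 64 visual tokens, so sequence length grows with image area divided by 1024.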

Attention Mechanism

NEO-ov adopts an explicit $THW$-decoupled attention design. For tokens $i$ and $j$, the Query ($Q$) and Key ($K$) features are:

$$q_i = [q_i^T; q_i^H; q_i^W], \quad k_j = [k_j^T; k_j^H; k_j^W].$$

Their correlation is:

$$s_{ij} = \langle q_i^T, k_j^T \rangle + \langle q_i^H, k_j^H \rangle + \langle q_i^W, k_j^W \rangle.$$

The $T$ branch models textual order, cross-image relations, and cross-frame dependencies, while the $H$ and $W$ branches capture 2D spatial structure.
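The decomposition of the score into three inner products can be illustrated numerically. The following is a minimal sketch with made-up vectors. It omits the rotary rotation that each branch would receive in practice and only shows that the summed per-branch scores equal the inner product of the concatenated features. The helper names `dot` and `decoupled_score` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def decoupled_score(q_t, q_h, q_w, k_t, k_h, k_w):
    """s_ij = <q^T, k^T> + <q^H, k^H> + <q^W, k^W> (no rotation applied here)."""
    return dot(q_t, k_t) + dot(q_h, k_h) + dot(q_w, k_w)


# Toy features: the branch widths (2, 1, 1) are arbitrary choices.
q_t, q_h, q_w = [1.0, 2.0], [0.5], [-1.0]
k_t, k_h, k_w = [3.0, 1.0], [2.0], [4.0]

s = decoupled_score(q_t, q_h, q_w, k_t, k_h, k_w)
s_concat = dot(q_t + q_h + q_w, k_t + k_h + k_w)  # list concatenation

print(s, s_concat)  # 2.0 2.0
```

Because the branches occupy disjoint slices of the head dimension, the decoupled score costs the same as an ordinary dot product. The separation matters only in which positional index rotates each slice.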

Native Rotary Position Embedding (Native-RoPE)

Implemented with separate temporal and spatial index modeling:

$$\text{idx}_i = [t_i, h_i, w_i],$$

where $t_i$ denotes the temporal/sequential position, and $h_i$, $w_i$ denote spatial coordinates. Text tokens retain only the temporal index ($h_i = w_i = 0$). Image tokens share the same temporal index within each image and use $h_i$ and $w_i$ to encode spatial positions.
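The index rule can be made concrete with a small sketch. The function `native_rope_indices` and its segment format are hypothetical. In particular, the rule for how far $t$ advances after an image is not given above: this sketch assumes it advances by exactly 1. The <img> and </img> delimiter tokens are left out for brevity.

```python
def native_rope_indices(segments):
    """Assign (t, h, w) to every token of a mixed text/image sequence.

    segments: list of ("text", n_tokens) or ("image", rows, cols).
    Assumption: t advances by 1 per text token and by 1 per whole image.
    """
    indices = []
    t = 0
    for seg in segments:
        if seg[0] == "text":
            for _ in range(seg[1]):
                indices.append((t, 0, 0))  # text keeps only the temporal index
                t += 1
        elif seg[0] == "image":
            _, rows, cols = seg
            for h in range(rows):
                for w in range(cols):
                    indices.append((t, h, w))  # shared t, distinct (h, w)
            t += 1  # assumed step after an image
        else:
            raise ValueError(f"unknown segment type: {seg[0]!r}")
    return indices


idx = native_rope_indices([("text", 2), ("image", 2, 2), ("text", 1)])
for triple in idx:
    print(triple)
```

For this toy input the four image tokens all carry $t = 2$ and differ only in $(h, w)$, while the trailing text token resumes at $t = 3$ with $h = w = 0$.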

Unified Visual Serialization

  • Single Image: One visual segment inserted at the corresponding <img> position.
  • Multi-image Inputs: Each <img> token replaced by an independent visual segment, following textual order:
$$X_{\text{multi}} = [x_{t_1}, \langle\text{img}\rangle x_{v_1} \langle/\text{img}\rangle, \dots, x_{t_m}, \langle\text{img}\rangle x_{v_m} \langle/\text{img}\rangle, q].$$
  • Video Inputs: Represented as a temporally ordered sequence of sampled frames with timestamps:
$$X_{\text{video}} = [p_{\text{global}}, [\tau_1]: \langle\text{img}\rangle x_{v_1} \langle/\text{img}\rangle, \dots, [\tau_f]: \langle\text{img}\rangle x_{v_f} \langle/\text{img}\rangle, q].$$

$p_{\text{global}}$ denotes a global prefix encoding video duration, number of sampled frames, and sampling rate.
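As an illustration of the video layout, the sketch below assembles the sequence as a list of strings. Everything about the concrete text is an assumption: the wording of the global prefix, the timestamp format, and the `x_v{k}` placeholders that stand in for each frame's visual tokens are invented for this example, and `serialize_video` is a hypothetical helper.

```python
def serialize_video(duration_s, timestamps_s, question):
    """Lay out [p_global, [tau_1]: <img>..</img>, ..., [tau_f]: <img>..</img>, q]."""
    if duration_s <= 0 or not timestamps_s:
        raise ValueError("need a positive duration and at least one frame")
    num_frames = len(timestamps_s)
    rate = num_frames / duration_s  # sampled frames per second
    # Assumed prefix wording; the text only says which three facts it encodes.
    p_global = f"duration={duration_s:.1f}s, frames={num_frames}, rate={rate:.2f}fps"
    sequence = [p_global]
    for k, tau in enumerate(timestamps_s, start=1):
        # x_v{k} is a placeholder for the k-th frame's visual tokens.
        sequence.append(f"[{tau:.1f}s]: <img>x_v{k}</img>")
    sequence.append(question)
    return sequence


layout = serialize_video(4.0, [0.0, 2.0], "What happens in the clip?")
for item in layout:
    print(item)
```

The multi-image case follows the same pattern without the prefix and timestamps: each <img> slot is filled in textual order.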

Unified Spatial-Temporal Attention

Each image or sampled frame is treated as an independent visual unit. Tokens within the same visual unit attend bidirectionally, while interactions across different units remain autoregressive. Let $u_i$ denote the visual unit index of token $i$ ($u_i = 0$ for text, $u_i > 0$ for visual). The attention mask is:

$$M_{ij} = 1 \iff (j \le i) \lor (u_i = u_j > 0).$$
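The mask rule translates directly into code. The sketch below builds $M$ as a nested list from a list of unit indices. The function name `build_attention_mask` and the toy sequence are illustrative, and a real implementation would produce a tensor for the attention kernel rather than Python lists.

```python
def build_attention_mask(unit_ids):
    """M[i][j] = 1 iff (j <= i) or (u_i == u_j > 0).

    unit_ids[i] is 0 for text tokens and a positive id for visual tokens,
    shared by all tokens of the same image or frame.
    """
    n = len(unit_ids)
    return [
        [
            1 if (j <= i) or (unit_ids[i] == unit_ids[j] and unit_ids[i] > 0) else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


# text, image 1 (2 tokens), text, image 2 (2 tokens)
unit_ids = [0, 1, 1, 0, 2, 2]
mask = build_attention_mask(unit_ids)
for row in mask:
    print(row)
```

In the printed mask, the first token of each image can see the later token of the same image (bidirectional within a unit), but no token sees a later unit or later text.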

Training Procedure

Three progressive stages:

  1. Pre-Training: Aligns Pre-Buffer with post-LLM using ~20M image-text pairs. Optimizes patch embedding layers, pre-buffer layers, and new QK-related parameters.
  2. Mid-Training: Scales spatial-temporal reasoning on ~60M multimodal samples (resolutions $256^2$ to $4096^2$, videos up to 128 frames). All layers optimized, context length extended from 16K to 36K tokens. Data mix: text-only, image-text, multi-image, video-text (~2:4:1:1 ratio).
  3. Supervised Fine-Tuning: Refines the model using ~6M high-quality instruction-tuning samples (single-image, multi-image, video). Entire model optimized end-to-end.

Implementation: Trained on sixteen 8-GPU nodes (80 GB GPUs). AdamW optimizer, cosine learning-rate decay, warm-up ratio 0.01. Peak learning rates: $2\times10^{-4}$ (Stage 1), $5\times10^{-5}$ (Stages 2 and 3). Uses Qwen3-1.7B and Qwen3-8B as language backbones. Pre-buffer: 12 layers for NEO-ov (2B), 6 layers for NEO-ov (9B). Native RoPE base frequencies: $\theta_T = 1\times10^6$, $\theta_H = \theta_W = 1\times10^4$.
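The learning-rate settings can be sketched as a schedule function. The text gives only the peak rates, the warm-up ratio, and cosine decay. The linear warm-up shape, the decay floor of zero, and the step count of 10,000 below are assumptions made for illustration, and `lr_at` is a hypothetical helper rather than the authors' code.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_ratio=0.01, min_lr=0.0):
    """Linear warm-up (assumed) followed by cosine decay to min_lr (assumed 0)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000       # hypothetical; the step count is not stated
STAGE1_PEAK = 2e-4         # Stage 1 peak rate from the text
STAGE23_PEAK = 5e-5        # Stage 2 and 3 peak rate from the text

for step in (0, 99, 100, 5050, 10_000):
    print(step, lr_at(step, TOTAL_STEPS, STAGE1_PEAK))
```

With a warm-up ratio of 0.01, only the first 1% of steps ramp up. The rest of training follows the cosine curve from the peak toward the floor.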

Empirical Validation / Results

Main Results

Evaluated using VLMEvalKit on image understanding, video understanding, and spatial intelligence benchmarks.

Image Understanding: Table 1 compares NEO-ov with modular and native VLMs on general VQA and OCR benchmarks.

Model | General VQA | Understanding | OCR Recognition
Modular VLMs (Instruct-2B)
Qwen2-VL | 41.1 | 74.9 | 62.6
InternVL3 | 48.6 | 81.1 | 64.3
InternVL3.5 | 53.0 | 78.2 | 62.0
Qwen3-VL | 53.4 | 78.4 | 63.9
Native VLMs (Instruct-2B)
Mono-VL | 33.7 | 65.5
Mono-VL1.5 | 39.1 | 64.0
HoVLE | 32.2 | 73.3
OneCAT | 39.0 | 72.4
NEO | 48.6 | 76.0 | 63.1
NEO-ov | 54.7 | 80.0 | 64.4
Modular VLMs (Instruct-8B)
Qwen2.5-VL | 55.0 | 83.5 | 68.5
InternVL3 | 62.7 | 83.4 | 70.8
InternVL3.5 | 68.1 | 82.7 | 67.5
Qwen3-VL | 69.6 | 84.5 | 71.5
Native VLMs (Instruct-8B)
Fuyu | 27.9 | 10.7 | 43.7
EVE | 32.6 | 52.3
SOLO | 67.7 | 44.7
EVEv2 | 39.3 | 66.3 | 62.4
BREEN | 42.7 | 71.4
VoRA | 32.0 | 61.3 | 60.1
SAIL | 70.1 | 63.9
NEO | 54.6 | 82.1 | 67.3
NEO-ov | 68.1 | 85.1 | 67.8

Key Findings: NEO-ov establishes a new performance frontier for native VLMs at both 2B and 8B scales, surpassing prior native architectures. It demonstrates strong competitiveness against leading modular VLMs, matching or surpassing them on several reasoning and perception benchmarks, particularly in complex reasoning and hallucination suppression.

Multi-Image and Video Understanding: Table 2 shows comparisons on multi-image and video benchmarks.

Model | Multi-Image | Video Understanding
Modular VLMs (Instruct-2B)
VideoLLaMA3 | 44.2
InternVL3.5 | 51.3 | 44.0
Qwen3-VL | 53.8 | 47.4
Native VLMs (Instruct-2B)
ELVA | 41.8
NEO-ov | 53.9 | 56.8
Modular VLMs (Instruct-8B)
LLaVA-Video | 63.3
VideoLLaMA3 | 56.7
InternVL3.5 | 59.5 | 55.8
Qwen3-VL | 69.1 | 64.4
Native VLMs (Instruct-8B)
Fuyu | 28.7
EVE | 29.3
ELVA | 47.1
NEO-ov | 62.8 | 58.2

Key Findings: NEO-ov achieves substantial gains over prior native VLMs and remains highly competitive with modular VLMs, highlighting its strong temporal reasoning and long-context visual understanding capabilities.

Spatial Intelligence: Table 3 compares performance on spatial intelligence benchmarks.

Model | VSI-Bench | MMSI | Mindcube | ViewSpatial | SITE | 3DSR | EmbSpatial | SPAR | Omni-Spatial
Spatial-specialist Models (Instruct-2B)
Cambrian-S (3B) | 56.1 | 27.0 | 38.4 | 41.0 | 31.0 | 41.4 | 63.5 | 33.0 | 41.9
Sensenova-SI | 63.7 | 34.2 | 41.8 | 52.7 | 36.8 | 50.5 | 62.8 | 38.0 | 26.4
General-purpose Models (Instruct-2B)
InternVL3.5 | 53.8 | 25.6 | 42.1 | 37.9 | 34.8 | 31.4 | 61.5 | 32.4 | 44.4
Qwen3-VL | 53.9 | 27.8 | 34.2 | 36.7 | 35.8 | 47.6 | 69.2 | 34.1 | 36.3
NEO-ov | 58.4 | 33.6 | 77.2 | 52.8 | 38.4 | 52.9 | 63.8 | 41.2 | 43.1
Spatial-specialist Models (Instruct-8B)
Cambrian-S | 67.5 | 25.8 | 39.6 | 40.9 | 33.0 | 45.0 | 72.8 | 37.9 | 41.9
Sensenova-SI | 68.8 | 43.3 | 85.7 | 54.7 | 47.7 | 55.5 | 72.0 | 45.8 | 33.0
GeoThinker | 72.6 | 30.9 | 83.0 | 45.9 | 55.9 | 51.9 | 78.8 | 68.2 | 40.1
General-purpose Models (Instruct-8B)
InternVL3.5 | 56.3 | 29.1 | 40.4 | 40.0 | 54.4 | 35.3 | 75.7 | 38.2 | 47.8
Qwen3-VL | 59.4 | 31.2 | 29.6 | 41.9 | 45.4 | 52.9 | 77.8 | 40.3 | 47.0
NEO-ov | 64.8 | 41.3 | 90.0 | 55.2 | 54.3 | 61.7 | 78.8 | 48.8 | 45.0

Key Findings: As a general-purpose native VLM, NEO-ov achieves comparable or better performance than spatial-specialist models and shows clear advantages over other general VLMs, highlighting its ability to capture fine-grained spatial and geometric representations.

Ablation Studies

  • Native Attention vs. Encoder-based Attention (Figure 4): The Pre-Buffer mechanism consistently achieves competitive or superior performance across all benchmarks (VQA, OCR, Video, SI), especially on OCR and SI tasks. This suggests that preserving richer intermediate visual context through native interactions is more effective than relying on compressed encoder representations.
  • Deep Interactions Benefit Spatial Intelligence (Figure 5): NEO shows substantially larger gains from SI supervision than encoder-based models, attributed to its native interaction pattern where pixel-pixel and pixel-word interactions emerge directly in shallow layers.
  • Performance Improvements across Stages (Figure 6): Performance improves consistently from Stage 1 to Stage 2 for both 2B and 9B variants, with pronounced gains at smaller scales, indicating progressive training strengthens general visual understanding.

Theoretical and Practical Implications

  • Validates Native Architecture Feasibility: NEO-ov demonstrates that unified native architectures can achieve competitive performance against modular counterparts, challenging the necessity of specialized visual encoders for advanced multimodal tasks.
  • Unifies Multimodal Tasks: The model provides a single framework for diverse tasks (single-image, multi-image, video, spatial intelligence), suggesting a path towards general-purpose "one-vision" foundation models.
  • Enables Fine-Grained Perception: The native design, with early pixel-word and pixel-pixel interactions, facilitates richer spatial and cross-modal representations, leading to advantages in fine-grained perception and spatial reasoning tasks.
  • Offers Scalable Training Recipe: The three-stage training procedure (pre-training, mid-training, supervised fine-tuning) provides a blueprint for scaling native multimodal models.

Conclusion

NEO-ov is a fully native vision-language foundation model that unifies diverse multimodal tasks within a single monolithic backbone, learning visual perception, temporal dynamics, and cross-modal correspondence directly from raw inputs. Extensive experiments show it achieves competitive performance against strong encoder-based counterparts while showing clear advantages in fine-grained perception and spatial reasoning. The findings suggest unified native architectures provide a promising path toward scalable and general-purpose one-vision foundation models.

Limitations: A gap still exists with top-tier modular systems on certain benchmarks, attributed to data scale and quality. OCR-intensive and document-centric tasks remain relatively underexplored. The broader potential of native multimodal modeling requires further scaling in model capacity, data diversity, and long-context training.