Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Summary (Overview)

Long-Document VQA is the most effective data source for Long-Context Continued Pre-Training (LongPT): it outperforms OCR transcription tasks and improves long-document VQA scores by roughly 5-6 points over the base model with a 5B-token budget.
A balanced sequence-length distribution ("pool-native") is superior to a "long-biased" one focused on near-maximum lengths (128K), indicating that long-context ability requires generalizable key-information retrieval across various lengths and positions.
Information retrieval is the primary bottleneck, favoring training mixtures heavy on extraction tasks (80%) with a modest amount of reasoning data (20%) for task diversity.
Instruction-formatted long-document VQA data largely preserves short-context capabilities, reducing the need for mixing short-context data during LongPT.
The resulting model, MMProLong, extends Qwen2.5-VL-7B from 32K to 128K context, improves long-document VQA by 7.1 points (50.59 to 57.70 on MMLongBench), and generalizes effectively to longer contexts (256K, 512K) and other multimodal tasks (webpage retrieval, video understanding, vision-text compression) without task-specific training.

Introduction and Theoretical Foundation

Long-context modeling is a critical capability for modern Large Vision-Language Models (LVLMs), enabling applications such as long-document understanding, video analysis, and multi-turn agentic workflows. Context windows are scaling rapidly (e.g., to 128K tokens), but practical training recipes for LVLMs, particularly for designing and balancing long-context data mixtures, remain underexplored. This work presents a systematic study of Long-Context Continued Pre-Training (LongPT) for LVLMs, starting from Qwen2.5-VL-7B with its 32K native context and extending it to 128K. The motivation is to establish data-efficient and effective training methodology that goes beyond the limited detail in existing technical reports. The approach treats long documents as a natural source of multimodal data: they combine rich visual layouts with dense textual content, so training tasks synthesized from them teach models to retrieve and reason over long, interleaved image-text sequences.

Methodology

The study uses Qwen2.5-VL-7B as the base model, extending its context from 32K to 128K tokens. The core methodology involves systematic data curation and ablation studies.

1. Data Curation: A document pool of over 1.5 million PDFs (academic papers, books, manuals) is constructed. Pages are rendered to images (DPI=144), and an OCR expert model parses them into layout-aware text blocks. From this pool, two categories of training tasks are synthesized:

Long-Document VQA: A segment-level synthesis pipeline is used. A coherent document segment (8-15 pages) is sampled, a QA pair is generated by a teacher LVLM (Seed 2.0), and this pair is placed back into the full-document context to create a long-context training instance. Three task types are created:
- extract-single: Retrieve information from a single page.
- extract-multi: Aggregate information from multiple pages.
- reasoning: Perform numerical/logical operations over extracted information.
OCR Transcription: The model must transcribe text elements from rendered pages. Two variants are created:
- OCR-full: Transcribe all pages.
- OCR-needle: Transcribe only a small subset of pages (1-3) amidst distractors.
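
The segment-level synthesis step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: `LongContextInstance`, `sample_segment`, `build_instance`, and `toy_teacher` are hypothetical names, and the teacher LVLM is replaced by a stub so the example runs on its own.

```python
import random
from dataclasses import dataclass

@dataclass
class LongContextInstance:
    pages: list      # every page of the full document (the long context)
    segment: range   # page indices the QA pair was generated from
    question: str
    answer: str

def sample_segment(num_pages, rng, min_len=8, max_len=15):
    """Pick a contiguous range of 8-15 pages, clipped to the document length."""
    seg_len = min(rng.randint(min_len, max_len), num_pages)
    start = rng.randint(0, num_pages - seg_len)
    return range(start, start + seg_len)

def build_instance(pages, generate_qa, rng):
    """Generate a QA pair from one segment, then attach it to the full document."""
    segment = sample_segment(len(pages), rng)
    # The teacher only sees the sampled segment, not the whole document.
    question, answer = generate_qa([pages[i] for i in segment])
    return LongContextInstance(pages, segment, question, answer)

def toy_teacher(segment_pages):
    # Hypothetical stand-in for the teacher LVLM.
    first = segment_pages[0]
    return f"What is shown on {first}?", f"content of {first}"

rng = random.Random(0)
doc = [f"page_{i}" for i in range(40)]
inst = build_instance(doc, toy_teacher, rng)
```

The point of the design is that the question is cheap to generate (the teacher reads at most 15 pages), while the trained model must answer it against the full document, so the evidence can sit anywhere in the long context.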

2. Training Setup & Ablations:

Base Model & Scaling: Qwen2.5-VL-7B is used, with its mRoPE base frequency scaled from $1 \times 10^6$ to $4 \times 10^6$ (following Dynamic-NTK) for the 128K context.
Fixed Budget: Each LongPT run uses a fixed budget of 5B tokens, a maximum sequence length of 131,072 tokens, and a global batch size of 4M tokens.
Key Ablation Variables: The study systematically ablates:
1. Task Category: Long-document VQA vs. OCR transcription.
2. Sequence-Length Distribution: "Pool-native" (natural sampling from 32-50 page docs) vs. "Long-biased" (83.9% samples ≥100K tokens).
3. Long-Context Data Mixture: Ratios of information extraction (extract-single + extract-multi) to reasoning tasks.
4. Short-Context Data Mixing: Proportion of short-context data (from LLaVA-OneVision) mixed into LongPT.
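
To make the setup numbers concrete, the sketch below computes standard RoPE inverse frequencies for the two base values and the number of optimizer steps implied by the fixed budget. It rests on assumptions that are not from the source: `HEAD_DIM = 128` is an illustrative head dimension, the split of dimensions across mRoPE's temporal and spatial axes is ignored, and "5B" and "4M" are read as decimal token counts. It shows only the effect of a larger base, not the Dynamic-NTK procedure itself.

```python
import math

def rope_inv_freq(base, head_dim):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(base, head_dim):
    """Wavelength, in positions, of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(base, head_dim)[-1]

HEAD_DIM = 128  # assumed for illustration only

old = longest_wavelength(1e6, HEAD_DIM)
new = longest_wavelength(4e6, HEAD_DIM)
stretch = new / old  # equals 4 ** ((d - 2) / d), slightly below 4

# Optimizer steps implied by the budget, reading "5B" and "4M" as decimal counts.
steps = 5_000_000_000 // 4_000_000
```

Under these assumptions, raising the base by 4x stretches the slowest rotary wavelength by almost the same factor as the context extension (32K to 128K), and the 5B-token budget corresponds to 1,250 steps at a 4M-token batch.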

3. Final Recipe (MMProLong): Based on ablation results, the final LongPT recipe for MMProLong is defined as:

Data: Long-document VQA only.
Distribution: Pool-native.
Mixture: 8:2 extraction-to-reasoning ratio (40% extract-single, 40% extract-multi, 20% reasoning).
Short Data: 0% (pure long-context data).
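
The mixture in this recipe can be expressed as per-instance sampling weights, as in the sketch below. The names `MIXTURE` and `sample_tasks` are hypothetical, and sampling per training instance is an assumption: this summary does not state whether the 8:2 ratio is measured in instances or in tokens.

```python
import random
from collections import Counter

# Final-recipe task weights: 8:2 extraction-to-reasoning, no short-context data.
MIXTURE = {"extract-single": 0.4, "extract-multi": 0.4, "reasoning": 0.2}

def sample_tasks(n, rng):
    """Draw n task types according to the mixture weights."""
    tasks = list(MIXTURE)
    weights = [MIXTURE[t] for t in tasks]
    return rng.choices(tasks, weights=weights, k=n)

rng = random.Random(0)
counts = Counter(sample_tasks(10_000, rng))
extraction_share = (counts["extract-single"] + counts["extract-multi"]) / 10_000
```

With enough draws, `extraction_share` settles near 0.8, matching the extraction-heavy ratio selected by the ablations.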

Empirical Validation / Results

1. Long-Document VQA vs. OCR Transcription: Table 1 shows that long-document VQA tasks are substantially more effective than OCR transcription for LongPT.

Table 1: Comparison of Long-Document VQA and OCR Transcription Data (Average scores over 64K & 128K MMLongBench).

Training Data	64K AVG.	128K AVG.	Overall AVG.	Δ vs. Base
Qwen2.5-VL-7B (Base)	52.24	48.94	50.59	-
`extract-single`	56.86	54.53	55.69	+5.1
`extract-multi`	58.02	55.77	56.90	+6.3
`reasoning`	57.33	55.62	56.47	+5.9
`OCR-full`	31.24	35.11	33.17	-17.4
`OCR-needle`	45.61	42.00	43.80	-6.8
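
As a consistency check, the Overall AVG. and Δ columns of Table 1 can be recomputed from the 64K and 128K columns. The snippet below is a small sketch with the values copied from the table; the variable and function names are illustrative.

```python
# (64K AVG., 128K AVG.) copied from Table 1
BASE = (52.24, 48.94)
RUNS = {
    "extract-single": (56.86, 54.53),
    "extract-multi": (58.02, 55.77),
    "reasoning": (57.33, 55.62),
    "OCR-full": (31.24, 35.11),
    "OCR-needle": (45.61, 42.00),
}

def overall(avg_64k, avg_128k):
    """Overall AVG. is the unweighted mean of the two context lengths."""
    return (avg_64k + avg_128k) / 2

base_overall = overall(*BASE)
deltas = {name: round(overall(*scores) - base_overall, 1)
          for name, scores in RUNS.items()}
```

The recomputed deltas agree with the table: all three VQA variants land between +5.1 and +6.3 points, while both OCR variants fall below the base model.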

2. Sequence-Length Distribution: The pool-native distribution consistently matches or outperforms the long-biased distribution across all three VQA tasks (Figure 2), suggesting that diverse length exposure is key for generalizable retrieval.

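3. Long-Context Data Mixture: Grid search over extraction-to-reasoning ratios (Table 2) shows that moderately extraction-heavy mixtures (6:4 and 8:2) perform best, with 8:2 achieving the highest overall score (57.70). Pure extraction (10:0, 56.94) still edges out pure reasoning (0:10, 56.47), which points to retrieval as the primary bottleneck, while the gain from 10:0 to 8:2 shows that a modest share of reasoning data still helps.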

Table 2: Long-Context Data Mixture Test (Extraction:Reasoning Ratio).

Ratio	64K AVG.	128K AVG.	Overall AVG.
0:10	57.33	55.62	56.47
2:8	58.02	54.24	56.13
4:6	56.35	55.11	55.73
6:4	58.79	55.75	57.27
8:2	59.56	55.84	57.70
10:0	57.49	56.40	56.94
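
Selecting the mixture from Table 2 is a simple argmax, but the winner depends on the column used. The sketch below, with values copied from the table and a hypothetical helper `best_ratio`, makes that explicit.

```python
# (64K AVG., 128K AVG.) per extraction:reasoning ratio, copied from Table 2
TABLE_2 = {
    "0:10": (57.33, 55.62),
    "2:8": (58.02, 54.24),
    "4:6": (56.35, 55.11),
    "6:4": (58.79, 55.75),
    "8:2": (59.56, 55.84),
    "10:0": (57.49, 56.40),
}

SCORERS = {
    "64k": lambda s: s[0],
    "128k": lambda s: s[1],
    "overall": lambda s: (s[0] + s[1]) / 2,
}

def best_ratio(column):
    """Return the ratio with the highest score in the chosen column."""
    return max(TABLE_2, key=lambda ratio: SCORERS[column](TABLE_2[ratio]))

best_overall = best_ratio("overall")
best_64k = best_ratio("64k")
best_128k = best_ratio("128k")
```

The 8:2 mixture wins at 64K and overall, while pure extraction (10:0) has the highest 128K score, so the choice of 8:2 reflects the averaged criterion rather than a win in every column.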

4. Short-Context Performance Preservation: Pure long-context training (0% short data) achieves the best long-document VQA score (57.70) with only a mild drop in short-context average (from 66.47 to 65.48). Adding short data introduces a trade-off, improving short-context scores but reducing long-context performance (Figure 3, Table 3).

5. MMProLong Performance:

Long-Document VQA: MMProLong achieves an overall average of 57.70 on MMLongBench, a 7.11-point improvement over the base Qwen2.5-VL-7B (50.59). It outperforms many larger open-source models under 15B parameters (Table 4).
Generalization to Longer Contexts: Without additional training, MMProLong maintains strong performance at 256K (55.09) and 512K (52.52) contexts, significantly outperforming the base model, which degrades sharply (38.12 at 256K, 19.49 at 512K) (Table 5).

Table 5: Generalization to 256K and 512K Contexts (Average scores over MMLongBench-Doc, LongDocURL, SlideVQA).

Model	256K AVG.	512K AVG.	Overall AVG.
MMProLong	55.09	52.52	53.80
Qwen2.5-VL-7B	38.12	19.49	28.80
Gemma3-12B	47.37	23.51	35.44
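
One way to quantify the degradation is the share of the 256K score that each model keeps at 512K. The snippet below is an illustrative calculation on the table's values; `retention` is a hypothetical helper, not a metric defined in the source.

```python
# (256K AVG., 512K AVG.) copied from the table above
SCORES = {
    "MMProLong": (55.09, 52.52),
    "Qwen2.5-VL-7B": (38.12, 19.49),
    "Gemma3-12B": (47.37, 23.51),
}

def retention(avg_256k, avg_512k):
    """Fraction of the 256K score that survives at 512K."""
    return avg_512k / avg_256k

retained = {model: round(retention(*s), 2) for model, s in SCORES.items()}
```

By this measure MMProLong keeps about 95% of its 256K score at 512K, while both baselines keep roughly half.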

Generalization to Other Tasks: MMProLong shows strong transfer to:
- MM-NIAH (Webpage Needle Retrieval): Average score improves from 20.0 to 49.4 (Figure 4).
- Long-Video Understanding: Improvements on Video-MME, MLVU, and LongVideoBench (Figure 5).
- VTCBench (Vision-Text Compression): Overall score improves from 48.23 to 52.73.
Recipe Transferability: Applying the same recipe to Qwen3-VL-8B also yields improvements on long-document VQA and MM-NIAH, indicating the recipe is not backbone-specific.

Theoretical and Practical Implications

Theoretical Implications:

Long-context capability in LVLMs is not a discrete skill acquired only at a target length but a continuous, generalizable retrieval ability calibrated across diverse lengths and positions.
Instruction-formatted, task-diverse supervision (like VQA) is more effective for teaching this capability than dense alignment tasks (like OCR), as it directly trains the model to locate and use information within long contexts.
The primary challenge in scaling context is key-information retrieval, not necessarily complex reasoning over retrieved content.

Practical Implications:

Provides a data-efficient, practical recipe (LongPT) for extending LVLM context windows with a modest token budget (5B).
Demonstrates that high-quality long-document VQA data can preserve short-context abilities, reducing the need for costly short-data mixing and simplifying training pipelines.
Shows that long-context ability trained on documents transfers to other modalities (videos, webpages), suggesting the learned capability is general.
Offers actionable design principles for data mixture (retrieval-heavy), length distribution (balanced), and task selection (VQA over OCR).

Conclusion

This work establishes a systematic foundation for Long-Context Continued Pre-Training in LVLMs. The key findings are that long-document VQA is a highly effective data source, a balanced length distribution promotes generalization, and retrieval-focused mixtures are optimal. The instantiated model, MMProLong, validates these principles, achieving strong performance within and beyond its training context and generalizing to diverse long-context tasks. The proposed LongPT recipe offers a practical, data-efficient path for building reliable long-context capability in future LVLMs. Limitations include the study's focus on 7B/8B models and the cost of model-based evaluation. Future work should explore scaling these findings to larger models and even longer contexts, and develop more efficient evaluation protocols.