HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Summary (Overview)

  • Identifies a key problem: Vision-Language Models (VLMs) exhibit diverse and compounding failure modes (perception, reasoning, knowledge, hallucination errors) during long Chain-of-Thought (CoT) reasoning, which are not adequately exposed by typical RLVR training data lacking complex, visually-dependent reasoning chains.
  • Proposes a novel solution: Introduces HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data. Each query forms a logically dependent chain of hops where earlier hops establish the instances, sets, or conditions needed for later hops, forcing repeated visual re-grounding and culminating in a specific, unambiguous numerical answer suitable for RLVR.
  • Demonstrates broad generalization: Training VLMs (Qwen3.5-35B-A3B and Qwen3.5-397B-A17B) with this synthesized multi-hop data, alongside original RLVR data, yields improvements on 20 out of 24 diverse benchmarks spanning STEM, General VQA, Document Understanding, and Video Understanding, despite the data being benchmark-agnostic.
  • Validates the multi-hop structure: Ablation studies show that preserving the full multi-hop chain is crucial; using half-multi-hop or single-hop variants reduces performance, with average scores dropping from 70.4 to 66.7 and 64.3 respectively on five representative benchmarks.
  • Shows specific strengths: Gains are particularly pronounced in long-CoT reasoning, exceeding 50 accuracy points in the ultra-long-CoT regime, and the synthesized data covers a broad difficulty range suitable for models of different sizes.

Introduction and Theoretical Foundation

Background & Motivation: VLMs have shown strong multimodal capabilities but still struggle with fine-grained, multi-step vision-language reasoning. Analysis reveals that during long CoT reasoning, errors in perception, reasoning, knowledge, or hallucination can appear at intermediate steps and compound, leading to incorrect final answers. Existing vision-language RLVR training data often lacks complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses unaddressed.

"This observation suggests that relying on, or simply expanding, existing vision-language RLVR training data is insufficient; what is needed is training data that structurally forces the model to seek visual evidence at each step of long-CoT reasoning."

Theoretical Basis: The core idea is to synthesize data that mimics the fragile nature of long-CoT reasoning in practice, forcing the model to repeatedly return to the image, recover correct visual evidence at each step, and use it to determine the next step, thereby training robustness against error cascades.

Methodology

Multi-Hop Vision-Language Reasoning Definition

The target multi-hop queries are structured with three reasoning levels:

  • Level 1: Single-object perception (e.g., read text, identify color, shape, position).
  • Level 2: Multi-object perception (e.g., compare sizes, count objects, determine spatial relations).
  • Level 3: Multi-hop reasoning, chaining multiple Level 1 and Level 2 steps.

Within a Level 3 query, hops are linked via two complementary dimensions:

  1. Perception-level hop: Changes the kind of perception (e.g., Level 1 → Level 2) while remaining grounded in instances/sets/conditions from earlier hops.
  2. Instance-chain hop: Moves to a new instance along an explicit dependency chain (e.g., A → B → C), where the next instance can only be identified from earlier hops.

Each query must satisfy:

  • (i) Be Level 3.
  • (ii) Combine both hop types.
  • (iii) Hops form a logically dependent chain where earlier hops establish the prerequisites for later hops.
  • (iv) Terminate in a specific, unambiguous numerical answer for RLVR verification.
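To make the four constraints concrete, here is a minimal sketch (not from the paper; all field names and the example query are hypothetical) of a multi-hop query encoded as a dependency chain of hops, plus a checker for constraints (i)–(iv):

```python
def satisfies_constraints(query):
    """Check constraints (i)-(iv) on a candidate query (illustrative only)."""
    hops = query["hops"]
    # (i) Level 3: the query chains multiple Level 1/2 steps.
    if len(hops) < 2:
        return False
    # (ii) Combine both hop types: perception-level and instance-chain.
    types = {h["hop_type"] for h in hops[1:]}
    if not {"perception_level", "instance_chain"} <= types:
        return False
    # (iii) Logical dependence: every later hop uses an earlier hop's result.
    for i, h in enumerate(hops[1:], start=1):
        if not any(d < i for d in h["depends_on"]):
            return False
    # (iv) Terminate in a specific, unambiguous numerical answer.
    return isinstance(query["answer"], (int, float))

example_query = {
    "hops": [
        {"level": 1, "hop_type": None, "depends_on": [],
         "step": "Find the red mug on the leftmost shelf."},
        {"level": 1, "hop_type": "instance_chain", "depends_on": [0],
         "step": "Identify the book directly behind that mug."},
        {"level": 2, "hop_type": "perception_level", "depends_on": [1],
         "step": "Count the objects on the same shelf as that book."},
    ],
    "answer": 4,
}

print(satisfies_constraints(example_query))  # → True
```

A single-hop query fails constraint (i), and a chain whose hops never reference earlier results fails constraint (iii), so both would be rejected by this checker.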

Data Synthesis Pipeline (HopChain Framework)

A scalable four-stage pipeline:

  1. Category Identification: Use Qwen3-VL-235B-A22B-Thinking to enumerate semantic categories in an image.
  2. Instance Segmentation: Use SAM3 to generate segmentation masks and bounding boxes for instances of identified categories.
  3. Multi-Hop Query Generation: Use Qwen3-VL-235B-A22B-Thinking to generate multi-hop queries over combinations of 3–6 instances. The model receives the original image and cropped instance patches (used only at design time). Constraints ensure queries are answerable from the image alone, describe objects by spatial/visual attributes, have a numerical answer, and avoid references to segmentation data.
  4. Ground-Truth Annotation & Difficulty Calibration:
    • Ground-truth annotation: Four annotators independently solve each query. Queries are discarded if ambiguous; those where all four agree on a final numerical answer are retained.
    • Difficulty calibration: Retained queries are evaluated on a weaker model (8 sampled responses). Queries where the weaker model achieves 100% accuracy are removed as "too easy."
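Stage 4 can be sketched as a simple filter over candidate queries; the function and field names below are hypothetical, but the logic follows the two rules above (unanimous four-annotator agreement, and removal of queries a weaker model solves in all 8 samples):

```python
def stage4_filter(queries, weaker_model_results):
    """Keep queries that (a) all four annotators agree on and
    (b) a weaker model does not solve in all 8 sampled responses."""
    kept = []
    for q in queries:
        answers = q["annotator_answers"]           # four independent answers
        if len(set(answers)) != 1:                 # ambiguous -> discard
            continue
        q["ground_truth"] = answers[0]             # unanimous answer retained
        n_correct = weaker_model_results[q["id"]]  # correct out of 8 samples
        if n_correct == 8:                         # 100% accuracy -> too easy
            continue
        kept.append(q)
    return kept

queries = [
    {"id": 0, "annotator_answers": [4, 4, 4, 4]},  # unanimous, hard enough
    {"id": 1, "annotator_answers": [2, 3, 2, 2]},  # ambiguous -> dropped
    {"id": 2, "annotator_answers": [7, 7, 7, 7]},  # unanimous but too easy
]
results = {0: 3, 1: 5, 2: 8}
print([q["id"] for q in stage4_filter(queries, results)])  # → [0]
```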

Reinforcement Learning Setup

The synthesized multi-hop data is used for RLVR training using the Soft Adaptive Policy Optimization (SAPO) algorithm.

RLVR Objective for VLMs: The objective aims to maximize:

$$
J(\pi) = \mathbb{E}_{(I, q, a) \sim D,\; o \sim \pi(\cdot \mid I, q)}\big[R(o, a)\big], \quad \text{where} \quad R(o, a) = \begin{cases} 1.0 & \text{if } \mathrm{is\_equivalent}(o, a), \\ 0.0 & \text{otherwise.} \end{cases}
$$

Here, $I$, $q$, and $a$ denote the image, text query, and ground-truth answer from dataset $D$, and $o$ is the response generated by policy $\pi$.
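A minimal sketch of this binary reward, assuming a tolerance-based numeric equivalence check (the paper does not specify the exact implementation of is_equivalent):

```python
def is_equivalent(response_answer, ground_truth, tol=1e-6):
    """Hypothetical numeric equivalence check on the extracted final answer."""
    try:
        return abs(float(response_answer) - float(ground_truth)) <= tol
    except (TypeError, ValueError):
        return False  # non-numeric or malformed answers never match

def reward(response_answer, ground_truth):
    """R(o, a): 1.0 if the response's answer matches the ground truth, else 0.0."""
    return 1.0 if is_equivalent(response_answer, ground_truth) else 0.0

print(reward("4", 4))    # → 1.0
print(reward("4.5", 4))  # → 0.0
```

Because every query terminates in a specific numerical answer, this verifier needs no model-based judging, which is what makes the synthesized data directly usable for RLVR.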

SAPO Objective: SAPO substitutes hard clipping with a temperature-controlled soft gate:

$$
J(\theta) = \mathbb{E}_{(I,q,a)\sim D,\; \{o_i\}_{i=1}^G \sim \pi_{\text{old}}(\cdot \mid I,q)} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} f_{i,t}\big(r_{i,t}(\theta)\big)\, \hat{A}_{i,t} \right],
$$

where:

$$
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid I,q,o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid I,q,o_{i,<t})}, \quad \hat{A}_{i,t} = \hat{A}_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)},
$$

$$
f_{i,t}(x) = \sigma\big(\tau_{i,t}(x-1)\big) \cdot \frac{4}{\tau_{i,t}}, \quad \tau_{i,t} = \begin{cases} \tau_{\text{pos}}, & \text{if } \hat{A}_{i,t} > 0, \\ \tau_{\text{neg}}, & \text{otherwise.} \end{cases}
$$

Here, $i, j$ are sample indices, $t$ is the token index, $\theta$ and $\theta_{\text{old}}$ are the current and old policy parameters, $\tau_{\text{pos}}$ and $\tau_{\text{neg}}$ are temperatures, $R_i$ is computed from Equation (2), and $\sigma(x) = \frac{1}{1+e^{-x}}$ is the sigmoid function.
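The soft gate and the group-normalized advantage can be sketched directly from these definitions; this is an illustrative stdlib-only implementation, not the paper's training code, and it assumes the group's rewards are not all identical (so the std is nonzero):

```python
import math
import statistics

def soft_gate(x, tau):
    """f(x) = sigmoid(tau * (x - 1)) * 4 / tau; replaces PPO-style hard clipping."""
    return (1.0 / (1.0 + math.exp(-tau * (x - 1.0)))) * 4.0 / tau

def group_advantages(rewards):
    """A_hat_i = (R_i - mean) / std, normalized within a group of G rollouts."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards)  # population std; the paper's exact choice is unspecified
    return [(r - mu) / sd for r in rewards]

def sapo_token_weight(ratio, advantage, tau_pos, tau_neg):
    """Per-token contribution f(r) * A_hat, with an advantage-dependent temperature."""
    tau = tau_pos if advantage > 0 else tau_neg
    return soft_gate(ratio, tau) * advantage
```

Note that at an unchanged policy ($r_{i,t} = 1$) the gate evaluates to $\sigma(0) \cdot 4/\tau = 2/\tau$, so smaller temperatures weight on-policy tokens more heavily, while the sigmoid smoothly damps large ratio deviations instead of clipping them.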

Training Settings: Models are evaluated under three settings:

  • Before RLVR: Model after SFT, before RLVR.
  • RLVR w/o Multi-Hop: RLVR training on original RLVR data only.
  • RLVR w/ Multi-Hop: RLVR training on a mixture of original RLVR data and HopChain-synthesized multi-hop data.

Empirical Validation / Results

Main Benchmark Results

Experiments are conducted on Qwen3.5-35B-A3B and Qwen3.5-397B-A17B across 24 benchmarks in 4 categories.

Key Result: RLVR w/ Multi-Hop improves performance on 20 of the 24 benchmarks for both models.

Table 1: Qwen3.5-35B-A3B Results

| Benchmark | Before RLVR | RLVR w/o Multi-Hop | RLVR w/ Multi-Hop |
|---|---|---|---|
| **STEM and Puzzle** | | | |
| MathVision | 61.97 | 73.71 | 76.05 |
| MMMU Pro | 66.07 | 69.25 | 70.64 |
| MMMU | 75.89 | 78.89 | 78.33 |
| Mathvista(mini) | 82.80 | 85.50 | 85.00 |
| BabyVision | 14.95 | 21.91 | 22.68 |
| ZEROBench | 1 | 1 | 3 |
| EMMA(mini) | 41.88 | 53.00 | 58.00 |
| LogicVista | 63.85 | 74.66 | 75.56 |
| **General VQA** | | | |
| MMBench CN-DEV-V1.1 | 87.46 | 90.17 | 90.48 |
| MMBench EN-DEV-V1.1 | 88.47 | 90.63 | 91.49 |
| RealWorldQA | 75.42 | 78.17 | 79.35 |
| MMStar | 75.20 | 78.53 | 78.60 |
| HallusionBench | 66.49 | 66.64 | 66.50 |
| AI2D_TEST | 88.44 | 90.87 | 91.29 |
| ERQA | 44.50 | 48.25 | 51.38 |
| **Text Recognition and Document Understanding** | | | |
| CharXiv | 61.30 | 69.00 | 73.10 |
| DocVQA_VAL | 95.00 | 95.13 | 95.55 |
| InfoVQA_VAL | 86.81 | 87.44 | 90.17 |
| **Video Understanding** | | | |
| VideoMME w/o sub. | 73.41 | 74.63 | 75.00 |
| VideoMMMU | 70.67 | 73.33 | 74.78 |
| MMVUCOT | 63.70 | 65.80 | 68.90 |
| MVBench | 69.18 | 69.95 | 70.73 |
| LVBench | 51.13 | 54.49 | 53.20 |
| MLVU (M-Avg) | 77.92 | 77.69 | 79.53 |

Table 2: Qwen3.5-397B-A17B Results

| Benchmark | Before RLVR | RLVR w/o Multi-Hop | RLVR w/ Multi-Hop |
|---|---|---|---|
| **STEM and Puzzle** | | | |
| MathVision | 77.38 | 81.68 | 83.71 |
| MMMU Pro | 73.03 | 75.06 | 76.47 |
| MMMU | 79.78 | 81.67 | 82.89 |
| Mathvista(mini) | 87.50 | 88.30 | 89.00 |
| BabyVision | 17.01 | 28.61 | 32.22 |
| ZEROBench | 3 | 4 | 8 |
| EMMA(mini) | 58.13 | 66.25 | 69.00 |
| LogicVista | 75.62 | 80.69 | 81.59 |
| **General VQA** | | | |
| MMBench CN-DEV-V1.1 | 89.47 | 91.41 | 91.72 |
| MMBench EN-DEV-V1.1 | 90.71 | 92.49 | 91.56 |
| RealWorldQA | 79.87 | 79.87 | 81.70 |
| MMStar | 80.00 | 81.73 | 80.67 |
| HallusionBench | 65.76 | 67.48 | 67.86 |
| AI2D_TEST | 91.45 | 92.81 | 92.97 |
| ERQA | 53.75 | 60.50 | 60.00 |
| **Text Recognition and Document Understanding** | | | |
| CharXiv | 70.00 | 74.60 | 77.20 |
| DocVQA_VAL | 95.93 | 95.98 | 96.03 |
| InfoVQA_VAL | 90.42 | 90.83 | 92.20 |
| **Video Understanding** | | | |
| VideoMME w/o sub. | 76.56 | 78.30 | 80.41 |
| VideoMMMU | 76.67 | 78.89 | 80.00 |
| MMVUCOT | 70.20 | 72.30 | 72.50 |
| MVBench | 69.63 | 73.03 | 73.31 |
| LVBench | 58.30 | 59.13 | 59.07 |
| MLVU (M-Avg) | 81.46 | 82.43 | 82.52 |

Qualitative Examples: Figure 3 shows cases where RLVR w/ Multi-Hop corrects failures of RLVR w/o Multi-Hop, such as miscounting dots, misjudging spatial relations, or misreading chart values.

Ablation on Hop Structure

To test the necessity of the full multi-hop chain, three training-query settings were compared on five representative benchmarks (MathVision, MMMU Pro, RealWorldQA, ERQA, VideoMMMU):

  • RLVR w/ Single Hop: Each training query reduced to only its final hop.
  • RLVR w/ Half-Multi-Hop: First half of chain removed, only latter half kept.
  • RLVR w/ Multi-Hop: Full query kept.

Result: The full multi-hop setting performs best consistently. Averaged across the five benchmarks:

  • RLVR w/ Multi-Hop: 70.4
  • RLVR w/ Half-Multi-Hop: 66.7
  • RLVR w/ Single Hop: 64.3

Analysis

Analysis by Reasoning Length: Figure 6 shows that the advantage of RLVR w/ Multi-Hop persists and is often larger in the ultra-long-response regime, supporting the claim that HopChain strengthens robust chained reasoning over long outputs.

Difficulty Coverage Across Model Scales: Figure 7 shows the distribution of query success rates (based on 8 independent sampled responses per query) for both models.

  • For Qwen3.5-35B-A3B: 71.34% of queries are "Partially Correct" (1–7 of 8 responses correct), 15.57% "All Correct", and 13.10% "All Incorrect".
  • For Qwen3.5-397B-A17B: 51.49% "Partially Correct", 39.99% "All Correct", and 8.52% "All Incorrect".

This indicates the synthesized data spans a broad difficulty range suitable for RLVR training across different model sizes.
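The bucketing behind these percentages can be reproduced from per-query pass counts; a small sketch (function and bucket names chosen here for illustration):

```python
from collections import Counter

def difficulty_buckets(pass_counts, n_samples=8):
    """Classify each query by how many of n_samples sampled responses were correct."""
    def bucket(k):
        if k == n_samples:
            return "All Correct"
        if k == 0:
            return "All Incorrect"
        return "Partially Correct"
    counts = Counter(bucket(k) for k in pass_counts)
    total = len(pass_counts)
    return {b: 100.0 * c / total for b, c in counts.items()}

print(difficulty_buckets([8, 3, 0, 5]))
# → {'All Correct': 25.0, 'Partially Correct': 50.0, 'All Incorrect': 25.0}
```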

Error-Type Analysis: Figure 2 compares the error-type distribution of baseline failures (RLVR w/o Multi-Hop) with the distribution of errors corrected by RLVR w/ Multi-Hop. The distributions are broadly similar:

  • Baseline errors: Perception (largest), Reasoning, Knowledge, Hallucination, Other.
  • Corrected errors: Perception (still largest), Reasoning, Knowledge, Hallucination, Other.

Figure 8 further shows subtype breakdowns for corrected Perception and Reasoning errors, covering chart/text misreads, object misidentification, spatial/counting errors, and logic/math/temporal/causal errors. This indicates broad, generalizable improvements across diverse failure modes, not just a narrow patch.

Theoretical and Practical Implications

  • Addresses a fundamental weakness: Provides a method to systematically expose and train against the compounding failure modes inherent in long CoT reasoning for VLMs.
  • Scalable data synthesis framework: HopChain offers a pipeline to generate large-scale, benchmark-agnostic training data that forces repeated visual grounding and logically dependent reasoning chains, a valuable resource for improving VLM robustness.
  • Generalizable gains: Demonstrates that improving fundamental reasoning capabilities via a proxy task (multi-hop data) transfers broadly across diverse domains (STEM, VQA, documents, videos), suggesting a path towards more general and robust VLMs.
  • Practical training strategy: Augmenting existing RLVR data with HopChain-synthesized multi-hop data is an effective and composable way to boost VLM performance without benchmark-specific tuning.

Conclusion

HopChain effectively addresses the challenge of diverse and compounding failure modes in long CoT reasoning for VLMs by synthesizing multi-hop data that forces repeated visual re-grounding. Training with this data yields broad and generalizable gains across numerous benchmarks, with the full multi-hop structure being crucial for these gains. The improvements are particularly strong for long-CoT reasoning and cover a broad range of error types.

Future Directions: The current pipeline depends on successful instance segmentation (SAM3), excluding images with few or no segmentable objects. A natural next step is to reduce this dependency by introducing complementary data-construction routes for such images, while preserving the core design principle of chained visual grounding.