Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs - Summary

Summary (Overview)

  • Spatial-on-Demand Efficiency: Introduces AwaRes, a framework that processes a low-resolution global image view by default and uses tool-calling to request only the specific high-resolution image crops needed to answer a query, dramatically reducing computational costs.
  • Automatic Data Curation: Proposes a novel pipeline to generate training data without manual spatial annotations, using an LLM-as-a-Judge (LaaJ) to determine when crops are needed and an oracle grounding model to localize where to crop.
  • Two-Stage Training: Employs a cold-start Supervised Fine-Tuning (SFT) stage to teach the tool protocol, followed by multi-turn Group Relative Policy Optimization (GRPO) with a composite reward that optimizes for both answer correctness and crop usage efficiency.
  • Strong Performance-Efficiency Trade-off: Achieves accuracy nearly matching full high-resolution processing (80.3% vs. 80.46% on average) while using only 36% of the visual tokens, outperforming fixed-budget pruning and adaptive resolution escalation baselines.

Introduction and Theoretical Foundation

Vision-Language Models (VLMs) require high-resolution inputs for detail-sensitive tasks (e.g., document QA, chart understanding), but this leads to a massive increase in visual tokens and computational cost. Existing efficiency methods fall short: token pruning methods use irregular patterns that hinder deployment, and resolution escalation methods (e.g., VisionThink) retrieve the entire high-resolution image when needed, wasting computation on irrelevant regions.

The key insight is that the need for high fidelity is spatially sparse—many questions require fine detail in only a small portion of the image (e.g., a single chart value, a table cell). Therefore, determining where to look is as important as whether to look.

AwaRes (VLM that is Aware to Resolution) addresses this by implementing a spatial-on-demand inference framework. It uses a simple tool-calling interface: the model first sees a low-resolution image, and if more detail is needed, it requests specific high-resolution crops. This multi-turn structure is compatible with KV-caching, making it practical for deployment. The core challenge is training a Coupled-Decision Policy (CDP) that jointly decides (i) whether additional resolution is needed and (ii) where to acquire it by selecting a subset of crops.

Methodology

The method formalizes a multi-turn interaction and uses a two-stage training pipeline with automatically curated data.

3.1 Problem Setup

Given an image-question-answer triple $(I, q, a^*)$, the model is first shown a low-resolution view $I_{low}$. It then chooses an action based on a policy:

$$\pi_\theta(C \mid q, I_{low}), \quad C \subseteq \mathcal{C}$$

where $\mathcal{C}$ is a predefined set of crop candidates. The actions are:

  1. Direct answer ($C = \emptyset$): Produce the answer $\hat{a}$ using only $(q, I_{low})$.
  2. Crop request + answer ($C \neq \emptyset$): Emit a tool call for a subset $C_{req} \subseteq \mathcal{C}$. The tool returns high-resolution crops $\{I_c^{high}\}_{c \in C_{req}}$, which are appended to the context, and the model then produces $\hat{a}$.
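The two actions map onto a simple two-turn inference loop. The sketch below is illustrative only: `vlm_generate` and `crop_tool` are hypothetical stand-ins for a real VLM backend and crop extractor, and the reply schema is an assumption, not taken from the paper.

```python
# Illustrative sketch of the spatial-on-demand loop. `vlm_generate` and
# `crop_tool` are hypothetical placeholders, not the paper's actual API.

def answer_with_crops(question, image_low, image_full, vlm_generate, crop_tool):
    # Turn 1: the model sees only the low-resolution global view.
    reply = vlm_generate(question=question, images=[image_low])
    if reply["type"] == "answer":  # C = empty set: direct answer
        return reply["text"]
    # C != empty set: the model emitted a tool call naming a crop subset.
    crops = [crop_tool(image_full, region) for region in reply["crops"]]
    # Turn 2: high-resolution crops are appended to the context.
    reply = vlm_generate(question=question, images=[image_low] + crops)
    return reply["text"]
```

Because the second turn only appends crops to the existing context, the first turn's KV cache can be reused, which is what makes the protocol deployment-friendly.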

3.2 Data Curation: Automatic Supervision for Crop Requests

A three-stage pipeline creates training trajectories without manual annotations:

  1. Resolution-Sufficiency Labeling (When to Crop): A base VLM $T$ generates answers from both low-res and full-res inputs: $\hat{a}_{low} = T(q, I_{low})$ and $\hat{a}_{full} = T(q, I)$. An LLM-as-a-Judge (LaaJ) compares both predictions to the ground truth $a^*$. If $\hat{a}_{low}$ is judged correct (or ties), the example is labeled LR (no crop needed); otherwise it is labeled HR.
  2. Crop Target Construction (Where to Crop): For HR examples, an oracle grounding model $G$ (Qwen3-VL) localizes the evidence, producing a bounding box $b$. This box is mapped to the discrete crop set $\mathcal{C}$ (four quadrants, center, four half-image regions, and the full image). The target crop subset is $C^* = \{ c \in \mathcal{C} \mid \mathrm{IoU}(b, c) \geq \tau \}$ with $\tau = 0.5$.
  3. Supervised Tool-Use Trajectories: Creates two transcript types:
    • Direct-answer (LR): Single-turn output of $a^*$.
    • Tool-call-then-answer (HR): First turn: a tool call selecting $C^*$. Second turn: output $a^*$ conditioned on $I_{low}$ and the retrieved crops $\{I_c^{high}\}_{c \in C^*}$.
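Step 2 of the pipeline can be sketched concretely. The crop names and normalized coordinates below are assumptions chosen to match the described layout (four quadrants, center, four halves, full image); the paper's exact candidate geometry may differ.

```python
# Sketch of the "where to crop" mapping: threshold IoU between the oracle
# box b and each candidate crop at tau = 0.5. Boxes are (x0, y0, x1, y1)
# in normalized [0, 1] image coordinates; the exact candidate coordinates
# are illustrative assumptions.

CROP_CANDIDATES = {
    # four quadrants
    "top_left": (0.0, 0.0, 0.5, 0.5),
    "top_right": (0.5, 0.0, 1.0, 0.5),
    "bottom_left": (0.0, 0.5, 0.5, 1.0),
    "bottom_right": (0.5, 0.5, 1.0, 1.0),
    # center
    "center": (0.25, 0.25, 0.75, 0.75),
    # four half-image regions
    "left_half": (0.0, 0.0, 0.5, 1.0),
    "right_half": (0.5, 0.0, 1.0, 1.0),
    "top_half": (0.0, 0.0, 1.0, 0.5),
    "bottom_half": (0.0, 0.5, 1.0, 1.0),
    # full image
    "full": (0.0, 0.0, 1.0, 1.0),
}

def iou(a, b):
    """Intersection-over-union of two normalized boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def target_crops(evidence_box, tau=0.5):
    """C* = {c in C | IoU(b, c) >= tau}."""
    return [name for name, box in CROP_CANDIDATES.items()
            if iou(evidence_box, box) >= tau]
```

For example, an evidence box covering most of the left half of the image overlaps the `left_half` candidate with high IoU and is mapped to it, while a box much smaller than every candidate can fall below the threshold for all of them.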

3.3 Cold-Start Supervised Reference Policy (SFT)

The model is fine-tuned on the mixture of trajectories from §3.2 to learn the tool protocol, producing a reference policy $\pi_{\text{ref}}$. The loss is a weighted negative log-likelihood:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} w_t \log \pi_\theta(y_t \mid h_t)$$

The tool-call turn weight $w_t$ is upweighted (e.g., to 5) to stabilize learning of the critical first-turn CDP decision.
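The weighted objective can be sketched minimally. The token-level interface below is an assumption for illustration; a real implementation would operate on logits tensors inside the training framework.

```python
# Minimal sketch of the weighted SFT objective. The per-token
# (logprob, turn-label) interface is an illustrative assumption.

def weighted_sft_loss(token_logprobs, token_turns, tool_turn_weight=5.0):
    """L_SFT = -sum_t w_t * log pi(y_t | h_t), with w_t upweighted
    (here to 5, as in the paper) on tokens of the tool-call turn."""
    loss = 0.0
    for logprob, turn in zip(token_logprobs, token_turns):
        w = tool_turn_weight if turn == "tool_call" else 1.0
        loss -= w * logprob
    return loss
```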

3.4 Multi-Turn GRPO

To refine efficiency and correct over-requesting tendencies from SFT, Group Relative Policy Optimization (GRPO) is applied, initialized from $\pi_{\text{ref}}$ and regularized towards it via a KL penalty.

  • Reward Design: A scalar reward for a completed trajectory $\tau$ is:

    $$R(\tau) = R_{\text{ans}}(\hat{a}, a^*) - C_{\text{tool}}(C, y)$$

    • Answer Reward ($R_{\text{ans}}$): Semantic correctness measured by cosine similarity between sentence-transformer embeddings of $\hat{a}$ and $a^*$.
    • Tool-Use Cost ($C_{\text{tool}}$): Asymmetric penalty:

    $$C_{\text{tool}}(C, y) = \begin{cases} \alpha_{\text{miss}} & \text{if } y = \text{HR and } C = \emptyset \text{ (missed tool call)} \\ \alpha_{\text{use}} + \lambda \|C\| & \text{if } C \neq \emptyset \text{ (tool usage)} \\ 0 & \text{if } y = \text{LR and } C = \emptyset \end{cases}$$

    where $\|C\|$ is the total fraction of image area covered by the selected crops. This encourages recall (penalizing missed calls) and efficiency (penalizing large crop areas).
  • GRPO Optimization: For each prompt $x$, a group of $G$ trajectories $\{\tau_1, \dots, \tau_G\}$ is sampled, and the advantage of each is computed relative to the group:

    $$\hat{A}_i = \frac{R(\tau_i) - \mu_G}{\sigma_G + \epsilon}$$

    The objective is a PPO-style clipped objective with KL regularization:

    $$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \min\left( r_t^{(i)} \hat{A}_i,\ \operatorname{clip}\left(r_t^{(i)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_i \right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$

    where $r_t^{(i)} = \frac{\pi_\theta(a_t^{(i)} \mid x, a_{<t}^{(i)})}{\pi_{\text{old}}(a_t^{(i)} \mid x, a_{<t}^{(i)})}$.
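The reward and group-relative advantage can be sketched as follows. The coefficient values ($\alpha_{\text{miss}}$, $\alpha_{\text{use}}$, $\lambda$) are illustrative placeholders; the summary does not state the values used in training.

```python
import statistics

def tool_cost(crop_area, used_tool, label_hr,
              alpha_miss=0.5, alpha_use=0.1, lam=0.2):
    """Asymmetric tool-use cost C_tool. Coefficient values are
    illustrative assumptions. `crop_area` is the fraction of image
    area covered by the requested crops (||C|| in the text)."""
    if used_tool:                       # any tool usage: base cost + area penalty
        return alpha_use + lam * crop_area
    if label_hr:                        # HR example answered without a crop
        return alpha_miss               # missed tool call: largest penalty
    return 0.0                          # LR example answered directly: free

def group_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize each trajectory's reward by the
    group's mean and (population) standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The asymmetry is visible directly: skipping a needed crop costs more than requesting one, while requesting large areas is discouraged by the $\lambda$ term.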

Empirical Validation / Results

Evaluation is conducted on six benchmarks spanning document understanding and general visual QA. The base model is Qwen2.5-VL-7B-Instruct. Efficiency is measured by the Retained Token Ratio (RTR), $RTR_i = T_i / T_{\text{full}}$, where $T_i$ is the number of tokens processed by AwaRes on example $i$ and $T_{\text{full}}$ is the token count for full-resolution processing.
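The metric is simple to compute per example. The decomposition of $T_i$ into low-res tokens plus requested crop tokens is an assumption consistent with the framework, not a formula stated in the summary.

```python
def retained_token_ratio(low_res_tokens, crop_tokens, full_res_tokens):
    """RTR_i = T_i / T_full, where T_i is assumed to be the low-res
    global view's tokens plus any requested high-res crop tokens."""
    return (low_res_tokens + crop_tokens) / full_res_tokens
```

For instance, a direct answer on a view with a quarter of the full-resolution tokens and no crops gives RTR = 0.25, the value reported for the low-resolution baseline in Table 1.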

Table 1: Main results across vision-language benchmarks. Each cell reports Acc↑ (RTR↓).

| Model | ChartQA | DocVQA | OCRBench | POPE | RealWorld | V*Bench | Average |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Full-Res) | 79.80 (1.00) | 94.00 (1.00) | 81.10 (1.00) | 87.87 (1.00) | 68.80 (1.00) | 71.20 (1.00) | 80.46 (1.00) |
| Qwen2.5-VL-7B-LR | 65.00 (0.25) | 91.00 (0.25) | 70.70 (0.25) | 84.41 (0.25) | 66.00 (0.25) | 63.20 (0.25) | 73.39 (0.25) |
| VisionThink | 79.90 (1.15) | 90.35 (0.32) | 80.10 (0.83) | 86.70 (0.34) | 66.60 (0.55) | 71.73 (0.49) | 79.23 (0.61) |
| AwaRes | 80.64 (0.32) | 94.43 (0.28) | 81.30 (0.42) | 85.73 (0.27) | 68.50 (0.43) | 71.20 (0.42) | 80.30 (0.36) |

Key Findings:

  • Accuracy & Efficiency: AwaRes nearly matches full-resolution accuracy (80.30% vs. 80.46%) while using only 36% of the visual tokens on average.
  • Vs. Baselines: Outperforms fixed-budget pruning methods (e.g., VisionZip at 70% RTR has 76.47% accuracy) and the adaptive baseline VisionThink (79.23% accuracy, 0.61 RTR).
  • Latency: AwaRes achieves sub-second average latency across benchmarks, while VisionThink suffers from long reasoning traces (e.g., 4.3s vs. 0.6s on ChartQA).
  • Policy Evolution: The GRPO stage successfully corrects the SFT model's tendency to over-request crops, shifting the policy towards more selective tool use.

Ablation Studies confirmed the importance of:

  • LaaJ for Labeling: High agreement (96.88%) with an alternative judge (DeepSeek-V3.2), while ANLS-based labeling degraded performance.
  • Tool-Turn Weighting ($w_t = 5$): Improved cold-start accuracy and tool-call reliability.
  • Two-Stage Training: GRPO-only training failed to learn effective tool use, while SFT-only led to over-cropping (high RTR). The combined approach was essential.
  • Reward Components: The asymmetric tool cost and area penalty ($\lambda$) were necessary to achieve low RTR.

Theoretical and Practical Implications

  • Theoretical: Demonstrates the feasibility and effectiveness of learning a spatially-aware, adaptive perception policy within a VLM. The CDP formulation treats the "when" and "where" decisions as inherently coupled, which is more aligned with the task structure than separate modules.
  • Practical: Provides a deployment-friendly efficiency solution. The tool-calling interface and multi-turn KV-cache reuse integrate smoothly with existing inference stacks (e.g., vLLM). The automatic data curation pipeline removes the need for costly manual spatial annotations, enabling scalable training. AwaRes offers a direct path to maintaining high-detail VLM capabilities under tight compute and latency budgets.

Conclusion

AwaRes presents a spatial-on-demand inference framework that resolves the accuracy-efficiency trade-off in high-resolution VLM inference by selectively retrieving only necessary high-resolution crops via tool-calling. Trained with an automatic curation pipeline and a two-stage SFT+GRPO approach, it matches full-resolution performance at a fraction of the cost.

Future Directions:

  1. Extending crop selection from a discrete set to continuous bounding box predictions.
  2. Generalizing the approach to video understanding, exploiting temporal sparsity.
  3. Exploring progressive multi-step perception strategies that allocate resolution dynamically.