Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs - Summary
Summary (Overview)
- Spatial-on-Demand Efficiency: Introduces AwaRes, a framework that processes a low-resolution global image view by default and uses tool-calling to request only the specific high-resolution image crops needed to answer a query, dramatically reducing computational costs.
- Automatic Data Curation: Proposes a novel pipeline to generate training data without manual spatial annotations, using an LLM-as-a-Judge (LaaJ) to determine when crops are needed and an oracle grounding model to localize where to crop.
- Two-Stage Training: Employs a cold-start Supervised Fine-Tuning (SFT) stage to teach the tool protocol, followed by multi-turn Group Relative Policy Optimization (GRPO) with a composite reward that optimizes for both answer correctness and crop usage efficiency.
- Strong Performance-Efficiency Trade-off: Achieves accuracy nearly matching full high-resolution processing (80.3% vs. 80.46% on average) while using only 36% of the visual tokens, outperforming fixed-budget pruning and adaptive resolution escalation baselines.
Introduction and Theoretical Foundation
Vision-Language Models (VLMs) require high-resolution inputs for detail-sensitive tasks (e.g., document QA, chart understanding), but this leads to a massive increase in visual tokens and computational cost. Existing efficiency methods fall short: token pruning methods use irregular patterns that hinder deployment, and resolution escalation methods (e.g., VisionThink) retrieve the entire high-resolution image when needed, wasting computation on irrelevant regions.
The key insight is that the need for high fidelity is spatially sparse—many questions require fine detail in only a small portion of the image (e.g., a single chart value, a table cell). Therefore, determining where to look is as important as whether to look.
AwaRes (a VLM that is Aware of Resolution) addresses this by implementing a spatial-on-demand inference framework. It uses a simple tool-calling interface: the model first sees a low-resolution image, and if more detail is needed, it requests specific high-resolution crops. This multi-turn structure is compatible with KV-caching, making it practical for deployment. The core challenge is training a Coupled-Decision Policy (CDP) that jointly decides (i) whether additional resolution is needed and (ii) where to acquire it by selecting a subset of crops.
Methodology
The method formalizes a multi-turn interaction and uses a two-stage training pipeline with automatically curated data.
3.1 Problem Setup
Given an image-question-answer triple $(I, q, a)$, the model is first shown a low-resolution view $I^{\mathrm{LR}}$. It then chooses an action from its policy $\pi_\theta(\cdot \mid I^{\mathrm{LR}}, q)$, where $\mathcal{C} = \{c_1, \dots, c_K\}$ is a predefined set of crop candidates. The actions are:
- Direct answer: produce the answer $\hat{a}$ using only $I^{\mathrm{LR}}$.
- Crop request + answer: emit a tool call for a subset $S \subseteq \mathcal{C}$. The tool returns high-resolution crops $\{I^{\mathrm{HR}}_c\}_{c \in S}$, which are appended to the context, and the model then produces $\hat{a}$.
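The two actions above can be sketched as a minimal two-turn driver loop. The `policy` and `crop_tool` callables below, and their signatures, are hypothetical stand-ins for the real model and tooling, not the paper's actual interface:

```python
# Minimal sketch of the two-action protocol. `policy` returns either
# ("answer", text) or ("crop", regions); `crop_tool` fetches a view of
# the image. Both are illustrative stand-ins.

def spatial_on_demand(policy, crop_tool, image, question):
    low_res = crop_tool(image, "low_res_global")          # cheap first view
    action, payload = policy(views=[low_res], question=question, turn=1)
    if action == "answer":                                # direct answer
        return payload, []
    # Crop request: fetch only the selected high-resolution regions and
    # run a second turn with the crops appended to the context.
    crops = [crop_tool(image, region) for region in payload]
    _, answer = policy(views=[low_res] + crops, question=question, turn=2)
    return answer, payload
```

In a real deployment the second call would reuse the first turn's KV-cache, appending only the new crop tokens, which is what makes the multi-turn structure cheap.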
3.2 Data Curation: Automatic Supervision for Crop Requests
A three-stage pipeline creates training trajectories without manual annotations:
- Resolution-Sufficiency Labeling (When to Crop): A base VLM $f$ generates answers from both low-res and full-res inputs: $\hat{a}_{\mathrm{LR}} = f(I^{\mathrm{LR}}, q)$ and $\hat{a}_{\mathrm{HR}} = f(I^{\mathrm{HR}}, q)$. An LLM-as-a-Judge (LaaJ) compares both predictions to the ground truth $a$. If $\hat{a}_{\mathrm{LR}}$ is judged correct (or ties with $\hat{a}_{\mathrm{HR}}$), label the example LR (no crop needed); otherwise, label it HR.
- Crop Target Construction (Where to Crop): For HR examples, an oracle grounding model (Qwen3-VL) localizes the evidence, producing a bounding box $b$. This box is mapped onto the discrete crop set $\mathcal{C}$ (four quadrants, center, four half-image regions, and the full image). The target subset is the tightest candidate containing the evidence: $S^\star = \{\arg\min_{c \in \mathcal{C}_b} \mathrm{area}(c)\}$, where $\mathcal{C}_b = \{c \in \mathcal{C} : b \subseteq c\}$.
- Supervised Tool-Use Trajectories: Creates two transcript types:
- Direct-answer (LR): a single-turn transcript that outputs $a$ directly from $(I^{\mathrm{LR}}, q)$.
- Tool-call-then-answer (HR): first turn emits a tool call selecting $S^\star$; second turn outputs $a$ conditioned on $I^{\mathrm{LR}}$, $q$, and the retrieved crops $\{I^{\mathrm{HR}}_c\}_{c \in S^\star}$.
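The crop vocabulary and the oracle-box-to-crop mapping can be sketched as follows. The exact box coordinates, the "tightest covering candidate" rule, and the containment threshold `tau` are assumptions consistent with the description above, not values taken from the paper:

```python
# Assumed ten-candidate crop vocabulary (four quadrants, center, four
# halves, full image) as normalized (x0, y0, x1, y1) boxes.
CROP_CANDIDATES = {
    "top_left":     (0.0, 0.0, 0.5, 0.5),
    "top_right":    (0.5, 0.0, 1.0, 0.5),
    "bottom_left":  (0.0, 0.5, 0.5, 1.0),
    "bottom_right": (0.5, 0.5, 1.0, 1.0),
    "center":       (0.25, 0.25, 0.75, 0.75),
    "left_half":    (0.0, 0.0, 0.5, 1.0),
    "right_half":   (0.5, 0.0, 1.0, 1.0),
    "top_half":     (0.0, 0.0, 1.0, 0.5),
    "bottom_half":  (0.0, 0.5, 1.0, 1.0),
    "full":         (0.0, 0.0, 1.0, 1.0),
}

def overlap_fraction(box, crop):
    """Fraction of the evidence box's area that falls inside a crop."""
    x0 = max(box[0], crop[0]); y0 = max(box[1], crop[1])
    x1 = min(box[2], crop[2]); y1 = min(box[3], crop[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    return inter / area if area > 0 else 0.0

def target_crops(box, tau=0.99):
    """Return the smallest-area candidate that (nearly) contains the box."""
    covering = [(name, (c[2] - c[0]) * (c[3] - c[1]))
                for name, c in CROP_CANDIDATES.items()
                if overlap_fraction(box, c) >= tau]
    covering.sort(key=lambda nc: nc[1])          # prefer the tightest crop
    return [covering[0][0]] if covering else ["full"]
```

Preferring the tightest covering candidate keeps the supervised targets aligned with the efficiency objective: smaller crops mean fewer retrieved tokens.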
3.3 Cold-Start Supervised Reference Policy (SFT)
The model is fine-tuned on the mixture of trajectories from §3.2 to learn the tool protocol, producing a reference policy $\pi_{\mathrm{ref}}$. The loss is a weighted negative log-likelihood over output tokens:

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{t} w_t \log \pi_\theta(y_t \mid y_{<t}, x),$$

where $x$ denotes the multimodal context (low-res view, question, and any retrieved crops), with $w_t = \lambda_{\mathrm{tool}}$ for tokens in the tool-call turn and $w_t = 1$ otherwise. The tool-call turn is upweighted (e.g., $\lambda_{\mathrm{tool}} = 5$) to stabilize learning of the critical first-turn CDP decision.
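The turn-weighted loss can be sketched directly, assuming per-token NLLs already grouped by turn; the NLL values in the test and the default weight of 5 are illustrative:

```python
# Sketch of the turn-weighted SFT loss: tokens of the tool-call turn are
# upweighted by lam_tool, all other tokens get weight 1.

def weighted_sft_loss(turn_token_nlls, tool_turn_idx, lam_tool=5.0):
    """Weighted mean NLL over a transcript.

    turn_token_nlls: list of turns, each a list of per-token NLLs.
    tool_turn_idx:   index of the tool-call turn within the transcript.
    """
    total, weight_sum = 0.0, 0.0
    for turn, nlls in enumerate(turn_token_nlls):
        w = lam_tool if turn == tool_turn_idx else 1.0
        for nll in nlls:
            total += w * nll
            weight_sum += w
    return total / weight_sum
```

With the upweighting, errors on the short tool-call turn dominate the gradient, which is the intended stabilization of the first-turn decision.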
3.4 Multi-Turn GRPO
To refine efficiency and correct over-requesting tendencies from SFT, Group Relative Policy Optimization (GRPO) is applied, initialized from and regularized towards it via a KL penalty.
- Reward Design: The scalar reward for a completed trajectory $\tau$ combines correctness and tool cost: $R(\tau) = R_{\mathrm{ans}} + R_{\mathrm{tool}}$.
- Answer Reward ($R_{\mathrm{ans}}$): Semantic correctness, measured by the cosine similarity between sentence-transformer embeddings of $\hat{a}$ and $a$.
- Tool-Use Cost ($R_{\mathrm{tool}}$): An asymmetric penalty that charges a fixed cost when a needed crop is not requested and a cost proportional to $\rho$ when crops are requested, where $\rho$ is the total fraction of image area covered by the selected crops. This encourages recall (penalizing missed calls) and efficiency (penalizing large crop areas).
- GRPO Optimization: For each prompt $x$, a group of $G$ trajectories $\{\tau_i\}_{i=1}^{G}$ is sampled. The advantage of each trajectory is computed relative to the group:

  $$A_i = \frac{R(\tau_i) - \mathrm{mean}\{R(\tau_j)\}_{j=1}^{G}}{\mathrm{std}\{R(\tau_j)\}_{j=1}^{G}}.$$

  The objective is a PPO-style clipped objective with KL regularization towards $\pi_{\mathrm{ref}}$:

  $$\mathcal{J}(\theta) = \mathbb{E}\Big[\tfrac{1}{G}\textstyle\sum_{i=1}^{G} \min\big(r_i A_i,\ \mathrm{clip}(r_i, 1-\epsilon, 1+\epsilon)\, A_i\big)\Big] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

  where $r_i = \pi_\theta(\tau_i \mid x) / \pi_{\theta_{\mathrm{old}}}(\tau_i \mid x)$.
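The reward and the group-relative advantage can be sketched as below. The penalty weights (`miss_penalty` for a missed call, `area_penalty` for crop area) and the generic cosine on raw vectors are illustrative assumptions, not the paper's tuned values:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def trajectory_reward(pred_emb, gold_emb, crop_area, missed_call,
                      miss_penalty=0.5, area_penalty=0.2):
    """Composite reward: answer similarity plus asymmetric tool cost."""
    r_ans = cosine(pred_emb, gold_emb)
    r_tool = -miss_penalty if missed_call else -area_penalty * crop_area
    return r_ans + r_tool

def group_advantages(rewards):
    """GRPO advantage: standardize rewards within the sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Because advantages are standardized within each group, no learned value function is needed; trajectories that crop less while staying correct earn above-mean reward and are reinforced.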
Empirical Validation / Results
Evaluation is conducted on six benchmarks spanning document understanding and general visual QA. The base model is Qwen2.5-VL-7B-Instruct. Efficiency is measured by the Retained Token Ratio (RTR): $\mathrm{RTR} = N_{\mathrm{AwaRes}} / N_{\mathrm{full}}$, where $N_{\mathrm{AwaRes}}$ is the number of visual tokens processed by AwaRes and $N_{\mathrm{full}}$ is the number for full-resolution processing.
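A back-of-envelope RTR accounting helps read the table below. It assumes (as the LR baseline's flat 0.25 RTR suggests) that the global low-res view costs a quarter of the full-resolution tokens and that each crop adds full-resolution tokens in proportion to its area; this cost model is an inference, not a formula stated in the paper:

```python
# Hypothetical RTR accounting: low-res base cost plus per-crop cost,
# both as fractions of the full-resolution visual token count.

def retained_token_ratio(crop_area_fractions, low_res_ratio=0.25):
    return low_res_ratio + sum(crop_area_fractions)
```

Under this model, a direct answer costs 0.25 and a single quadrant crop brings the total to 0.50, so average RTRs near 0.3 imply that most queries either skip cropping or request small regions.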
Table 1: Main results across vision-language benchmarks.
| Model | ChartQA Acc↑ (RTR↓) | DocVQA Acc↑ (RTR↓) | OCRBench Acc↑ (RTR↓) | POPE Acc↑ (RTR↓) | RealWorld Acc↑ (RTR↓) | V*Bench Acc↑ (RTR↓) | Average Acc↑ (RTR↓) |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Full-Res) | 79.80 (1.00) | 94.00 (1.00) | 81.10 (1.00) | 87.87 (1.00) | 68.80 (1.00) | 71.20 (1.00) | 80.46 (1.00) |
| Qwen2.5-VL-7B-LR | 65.00 (0.25) | 91.00 (0.25) | 70.70 (0.25) | 84.41 (0.25) | 66.00 (0.25) | 63.20 (0.25) | 73.39 (0.25) |
| VisionThink | 79.90 (1.15) | 90.35 (0.32) | 80.10 (0.83) | 86.70 (0.34) | 66.60 (0.55) | 71.73 (0.49) | 79.23 (0.61) |
| AwaRes | 80.64 (0.32) | 94.43 (0.28) | 81.30 (0.42) | 85.73 (0.27) | 68.50 (0.43) | 71.20 (0.42) | 80.30 (0.36) |
Key Findings:
- Accuracy & Efficiency: AwaRes nearly matches full-resolution accuracy (80.30% vs. 80.46%) while using only 36% of the visual tokens on average.
- Vs. Baselines: Outperforms fixed-budget pruning methods (e.g., VisionZip at 70% RTR has 76.47% accuracy) and the adaptive baseline VisionThink (79.23% accuracy, 0.61 RTR).
- Latency: AwaRes achieves sub-second average latency across benchmarks, while VisionThink suffers from long reasoning traces (e.g., 4.3s vs. 0.6s on ChartQA).
- Policy Evolution: The GRPO stage successfully corrects the SFT model's tendency to over-request crops, shifting the policy towards more selective tool use.
Ablation Studies confirmed the importance of:
- LaaJ for Labeling: High agreement (96.88%) with an alternative judge (DeepSeek-V3.2), while ANLS-based labeling degraded performance.
- Tool-Turn Weighting: Upweighting the tool-call turn in the SFT loss improved cold-start accuracy and tool-call reliability.
- Two-Stage Training: GRPO-only training failed to learn effective tool use, while SFT-only led to over-cropping (high RTR). The combined approach was essential.
- Reward Components: The asymmetric tool-use cost and the penalty on total cropped image area were necessary to achieve low RTR.
Theoretical and Practical Implications
- Theoretical: Demonstrates the feasibility and effectiveness of learning a spatially-aware, adaptive perception policy within a VLM. The CDP formulation treats the "when" and "where" decisions as inherently coupled, which is more aligned with the task structure than separate modules.
- Practical: Provides a deployment-friendly efficiency solution. The tool-calling interface and multi-turn KV-cache reuse integrate smoothly with existing inference stacks (e.g., vLLM). The automatic data curation pipeline removes the need for costly manual spatial annotations, enabling scalable training. AwaRes offers a direct path to maintaining high-detail VLM capabilities under tight compute and latency budgets.
Conclusion
AwaRes presents a spatial-on-demand inference framework that resolves the accuracy-efficiency trade-off in high-resolution VLM inference by selectively retrieving only necessary high-resolution crops via tool-calling. Trained with an automatic curation pipeline and a two-stage SFT+GRPO approach, it matches full-resolution performance at a fraction of the cost.
Future Directions:
- Extending crop selection from a discrete set to continuous bounding box predictions.
- Generalizing the approach to video understanding, exploiting temporal sparsity.
- Exploring progressive multi-step perception strategies that allocate resolution dynamically.