Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs - Summary

Summary (Overview)

  • Spatial-on-Demand Efficiency: Introduces AwaRes, a framework that processes a low-resolution global image view by default and uses tool-calling to request only the specific high-resolution image crops needed to answer a query, dramatically reducing computational costs.
  • Automatic Data Curation: Proposes a novel pipeline to generate training data without manual spatial annotations, using an LLM-as-a-Judge (LaaJ) to determine when crops are needed and an oracle grounding model to localize where to crop.
  • Two-Stage Training: Employs a cold-start Supervised Fine-Tuning (SFT) stage to teach the tool protocol, followed by multi-turn Group Relative Policy Optimization (GRPO) with a composite reward that optimizes for both answer correctness and crop usage efficiency.
  • Strong Performance-Efficiency Trade-off: Achieves accuracy nearly matching full high-resolution processing (80.3% vs. 80.46% on average) while using only 36% of the visual tokens, outperforming fixed-budget pruning and adaptive resolution escalation baselines.

Introduction and Theoretical Foundation

Vision-Language Models (VLMs) require high-resolution inputs for detail-sensitive tasks (e.g., document QA, chart understanding), but this leads to a massive increase in visual tokens and computational cost. Existing efficiency methods fall short: token pruning methods use irregular patterns that hinder deployment, and resolution escalation methods (e.g., VisionThink) retrieve the entire high-resolution image when needed, wasting computation on irrelevant regions.

The key insight is that the need for high fidelity is spatially sparse—many questions require fine detail in only a small portion of the image (e.g., a single chart value, a table cell). Therefore, determining where to look is as important as whether to look.

AwaRes (VLM that is Aware to Resolution) addresses this by implementing a spatial-on-demand inference framework. It uses a simple tool-calling interface: the model first sees a low-resolution image, and if more detail is needed, it requests specific high-resolution crops. This multi-turn structure is compatible with KV-caching, making it practical for deployment. The core challenge is training a Coupled-Decision Policy (CDP) that jointly decides (i) whether additional resolution is needed and (ii) where to acquire it by selecting a subset of crops.

Methodology

The method formalizes a multi-turn interaction and uses a two-stage training pipeline with automatically curated data.

3.1 Problem Setup

Given an image-question-answer triple $(I, q, a^*)$, the model is first shown a low-resolution view $I_{low}$. It then chooses an action based on a policy:

$$\pi_\theta(C \mid q, I_{low}), \quad C \subseteq \mathcal{C}$$

where $\mathcal{C}$ is a predefined set of crop candidates. The actions are:

  1. Direct answer ($C = \emptyset$): Produce the answer $\hat{a}$ using only $(q, I_{low})$.
  2. Crop request + answer ($C \neq \emptyset$): Emit a tool call for a subset $C_{req} \subseteq \mathcal{C}$. The tool returns high-resolution crops $\{I_c^{high}\}_{c \in C_{req}}$, which are appended to the context, and the model then produces $\hat{a}$.
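The two actions map onto a simple two-turn inference loop. The sketch below is illustrative only: `vlm_generate` and `crop_tool` are hypothetical stand-ins for a real VLM backend and crop extractor, and the reply schema is an assumption, not taken from the paper.

```python
# Illustrative sketch of the spatial-on-demand loop. `vlm_generate` and
# `crop_tool` are hypothetical placeholders, not the paper's actual API.

def answer_with_crops(question, image_low, image_full, vlm_generate, crop_tool):
    # Turn 1: the model sees only the low-resolution global view.
    reply = vlm_generate(question=question, images=[image_low])
    if reply["type"] == "answer":  # C = empty set: direct answer
        return reply["text"]
    # C != empty set: the model emitted a tool call naming a crop subset.
    crops = [crop_tool(image_full, region) for region in reply["crops"]]
    # Turn 2: high-resolution crops are appended to the context.
    reply = vlm_generate(question=question, images=[image_low] + crops)
    return reply["text"]
```

Because the second turn only appends crops to the existing context, the first turn's KV cache can be reused, which is what makes the protocol deployment-friendly.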

3.2 Data Curation: Automatic Supervision for Crop Requests

A three-stage pipeline creates training trajectories without manual annotations:

  1. Resolution-Sufficiency Labeling (When to Crop): A base VLM $T$ generates answers from both low-res and full-res inputs: $\hat{a}_{low} = T(q, I_{low})$ and $\hat{a}_{full} = T(q, I)$. An LLM-as-a-Judge (LaaJ) compares both predictions to the ground truth $a^*$. If $\hat{a}_{low}$ is judged correct (or ties), the example is labeled LR (no crop needed); otherwise it is labeled HR.
  2. Crop Target Construction (Where to Crop): For HR examples, an oracle grounding model $G$ (Qwen3-VL) localizes the evidence, producing a bounding box $b$. This box is mapped to the discrete crop set $\mathcal{C}$ (four quadrants, center, four half-image regions, and the full image). The target crop subset is $C^* = \{ c \in \mathcal{C} \mid \mathrm{IoU}(b, c) \geq \tau \}$ with $\tau = 0.5$.
  3. Supervised Tool-Use Trajectories: Creates two transcript types:
    • Direct-answer (LR): Single-turn output of $a^*$.
    • Tool-call-then-answer (HR): First turn: a tool call selecting $C^*$. Second turn: output $a^*$ conditioned on $I_{low}$ and the retrieved crops $\{I_c^{high}\}_{c \in C^*}$.
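Step 2 of the pipeline can be sketched concretely. The crop names and normalized coordinates below are assumptions chosen to match the described layout (four quadrants, center, four halves, full image); the paper's exact candidate geometry may differ.

```python
# Sketch of the "where to crop" mapping: threshold IoU between the oracle
# box b and each candidate crop at tau = 0.5. Boxes are (x0, y0, x1, y1)
# in normalized [0, 1] image coordinates; the exact candidate coordinates
# are illustrative assumptions.

CROP_CANDIDATES = {
    # four quadrants
    "top_left": (0.0, 0.0, 0.5, 0.5),
    "top_right": (0.5, 0.0, 1.0, 0.5),
    "bottom_left": (0.0, 0.5, 0.5, 1.0),
    "bottom_right": (0.5, 0.5, 1.0, 1.0),
    # center
    "center": (0.25, 0.25, 0.75, 0.75),
    # four half-image regions
    "left_half": (0.0, 0.0, 0.5, 1.0),
    "right_half": (0.5, 0.0, 1.0, 1.0),
    "top_half": (0.0, 0.0, 1.0, 0.5),
    "bottom_half": (0.0, 0.5, 1.0, 1.0),
    # full image
    "full": (0.0, 0.0, 1.0, 1.0),
}

def iou(a, b):
    """Intersection-over-union of two normalized boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def target_crops(evidence_box, tau=0.5):
    """C* = {c in C | IoU(b, c) >= tau}."""
    return [name for name, box in CROP_CANDIDATES.items()
            if iou(evidence_box, box) >= tau]
```

For example, an evidence box covering most of the left half of the image overlaps the `left_half` candidate with high IoU and is mapped to it, while a box much smaller than every candidate can fall below the threshold for all of them.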

3.3 Cold-Start Supervised Reference Policy (SFT)

The model is fine-tuned on the mixture of trajectories from §3.2 to learn the tool protocol, producing a reference policy $\pi_{\text{ref}}$. The loss is a weighted negative log-likelihood:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} w_t \log \pi_\theta(y_t \mid h_t)$$

The tool-call turn weight $w_t$ is upweighted (e.g., to 5) to stabilize learning of the critical first-turn CDP decision.
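The weighted objective can be sketched minimally. The token-level interface below is an assumption for illustration; a real implementation would operate on logits tensors inside the training framework.

```python
# Minimal sketch of the weighted SFT objective. The per-token
# (logprob, turn-label) interface is an illustrative assumption.

def weighted_sft_loss(token_logprobs, token_turns, tool_turn_weight=5.0):
    """L_SFT = -sum_t w_t * log pi(y_t | h_t), with w_t upweighted
    (here to 5, as in the paper) on tokens of the tool-call turn."""
    loss = 0.0
    for logprob, turn in zip(token_logprobs, token_turns):
        w = tool_turn_weight if turn == "tool_call" else 1.0
        loss -= w * logprob
    return loss
```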

3.4 Multi-Turn GRPO

To refine efficiency and correct over-requesting tendencies from SFT, Group Relative Policy Optimization (GRPO) is applied, initialized from $\pi_{\text{ref}}$ and regularized towards it via a KL penalty.

  • Reward Design: A scalar reward for a completed trajectory $\tau$ is:

    $$R(\tau) = R_{\text{ans}}(\hat{a}, a^*) - C_{\text{tool}}(C, y)$$

    • Answer Reward ($R_{\text{ans}}$): Semantic correctness measured by cosine similarity between sentence-transformer embeddings of $\hat{a}$ and $a^*$.
    • Tool-Use Cost ($C_{\text{tool}}$): Asymmetric penalty:

    $$C_{\text{tool}}(C, y) = \begin{cases} \alpha_{\text{miss}} & \text{if } y = \text{HR and } C = \emptyset \text{ (missed tool call)} \\ \alpha_{\text{use}} + \lambda \|C\| & \text{if } C \neq \emptyset \text{ (tool usage)} \\ 0 & \text{if } y = \text{LR and } C = \emptyset \end{cases}$$

    where $\|C\|$ is the total fraction of image area covered by the selected crops. This encourages recall (penalizing missed calls) and efficiency (penalizing large crop areas).
  • GRPO Optimization: For each prompt $x$, a group of $G$ trajectories $\{\tau_1, \dots, \tau_G\}$ is sampled, and the advantage of each is computed relative to the group:

    $$\hat{A}_i = \frac{R(\tau_i) - \mu_G}{\sigma_G + \epsilon}$$

    The objective is a PPO-style clipped objective with KL regularization:

    $$\mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \min\left( r_t^{(i)} \hat{A}_i,\ \operatorname{clip}\left(r_t^{(i)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_i \right) - \beta\, D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right) \right]$$

    where $r_t^{(i)} = \frac{\pi_\theta(a_t^{(i)} \mid x, a_{<t}^{(i)})}{\pi_{\text{old}}(a_t^{(i)} \mid x, a_{<t}^{(i)})}$.
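The reward and group-relative advantage can be sketched as follows. The coefficient values ($\alpha_{\text{miss}}$, $\alpha_{\text{use}}$, $\lambda$) are illustrative placeholders; the summary does not state the values used in training.

```python
import statistics

def tool_cost(crop_area, used_tool, label_hr,
              alpha_miss=0.5, alpha_use=0.1, lam=0.2):
    """Asymmetric tool-use cost C_tool. Coefficient values are
    illustrative assumptions. `crop_area` is the fraction of image
    area covered by the requested crops (||C|| in the text)."""
    if used_tool:                       # any tool usage: base cost + area penalty
        return alpha_use + lam * crop_area
    if label_hr:                        # HR example answered without a crop
        return alpha_miss               # missed tool call: largest penalty
    return 0.0                          # LR example answered directly: free

def group_advantages(rewards, eps=1e-6):
    """GRPO advantage: normalize each trajectory's reward by the
    group's mean and (population) standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

The asymmetry is visible directly: skipping a needed crop costs more than requesting one, while requesting large areas is discouraged by the $\lambda$ term.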

Empirical Validation / Results

Evaluation is conducted on six benchmarks spanning document understanding and general visual QA. The base model is Qwen2.5-VL-7B-Instruct. Efficiency is measured by the Retained Token Ratio (RTR), $RTR_i = T_i / T_{\text{full}}$, where $T_i$ is the number of tokens processed by AwaRes on example $i$ and $T_{\text{full}}$ is the token count for full-resolution processing.
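The metric is simple to compute per example. The decomposition of $T_i$ into low-res tokens plus requested crop tokens is an assumption consistent with the framework, not a formula stated in the summary.

```python
def retained_token_ratio(low_res_tokens, crop_tokens, full_res_tokens):
    """RTR_i = T_i / T_full, where T_i is assumed to be the low-res
    global view's tokens plus any requested high-res crop tokens."""
    return (low_res_tokens + crop_tokens) / full_res_tokens
```

For instance, a direct answer on a view with a quarter of the full-resolution tokens and no crops gives RTR = 0.25, the value reported for the low-resolution baseline in Table 1.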

Table 1: Main results across vision-language benchmarks. Each cell reports Acc↑ (RTR↓).

| Model | ChartQA | DocVQA | OCRBench | POPE | RealWorld | V*Bench | Average |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Full-Res) | 79.80 (1.00) | 94.00 (1.00) | 81.10 (1.00) | 87.87 (1.00) | 68.80 (1.00) | 71.20 (1.00) | 80.46 (1.00) |
| Qwen2.5-VL-7B-LR | 65.00 (0.25) | 91.00 (0.25) | 70.70 (0.25) | 84.41 (0.25) | 66.00 (0.25) | 63.20 (0.25) | 73.39 (0.25) |
| VisionThink | 79.90 (1.15) | 90.35 (0.32) | 80.10 (0.83) | 86.70 (0.34) | 66.60 (0.55) | 71.73 (0.49) | 79.23 (0.61) |
| AwaRes | 80.64 (0.32) | 94.43 (0.28) | 81.30 (0.42) | 85.73 (0.27) | 68.50 (0.43) | 71.20 (0.42) | 80.30 (0.36) |

Key Findings:

  • Accuracy & Efficiency: AwaRes nearly matches full-resolution accuracy (80.30% vs. 80.46%) while using only 36% of the visual tokens on average.
  • Vs. Baselines: Outperforms fixed-budget pruning methods (e.g., VisionZip at 70% RTR has 76.47% accuracy) and the adaptive baseline VisionThink (79.23% accuracy, 0.61 RTR).
  • Latency: AwaRes achieves sub-second average latency across benchmarks, while VisionThink suffers from long reasoning traces (e.g., 4.3s vs. 0.6s on ChartQA).
  • Policy Evolution: The GRPO stage successfully corrects the SFT model's tendency to over-request crops, shifting the policy towards more selective tool use.

Ablation Studies confirmed the importance of:

  • LaaJ for Labeling: High agreement (96.88%) with an alternative judge (DeepSeek-V3.2), while ANLS-based labeling degraded performance.
  • Tool-Turn Weighting ($w_t = 5$): Improved cold-start accuracy and tool-call reliability.
  • Two-Stage Training: GRPO-only training failed to learn effective tool use, while SFT-only led to over-cropping (high RTR). The combined approach was essential.
  • Reward Components: The asymmetric tool cost and area penalty ($\lambda$) were necessary to achieve low RTR.

Theoretical and Practical Implications

  • Theoretical: Demonstrates the feasibility and effectiveness of learning a spatially-aware, adaptive perception policy within a VLM. The CDP formulation treats the "when" and "where" decisions as inherently coupled, which is more aligned with the task structure than separate modules.
  • Practical: Provides a deployment-friendly efficiency solution. The tool-calling interface and multi-turn KV-cache reuse integrate smoothly with existing inference stacks (e.g., vLLM). The automatic data curation pipeline removes the need for costly manual spatial annotations, enabling scalable training. AwaRes offers a direct path to maintaining high-detail VLM capabilities under tight compute and latency budgets.

Conclusion

AwaRes presents a spatial-on-demand inference framework that resolves the accuracy-efficiency trade-off in high-resolution VLM inference by selectively retrieving only necessary high-resolution crops via tool-calling. Trained with an automatic curation pipeline and a two-stage SFT+GRPO approach, it matches full-resolution performance at a fraction of the cost.

Future Directions:

  1. Extending crop selection from a discrete set to continuous bounding box predictions.
  2. Generalizing the approach to video understanding, exploiting temporal sparsity.
  3. Exploring progressive multi-step perception strategies that allocate resolution dynamically.