# Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

> AwaRes enables VLMs to request only high-resolution image crops needed for a query, cutting visual tokens by 64% while matching full-resolution accuracy.

- **Source:** [arXiv](https://arxiv.org/abs/2603.16932)
- **Published:** 2026-03-25
- **Permalink:** https://picx.dev/p/kJJOVr
- **Whiteboard:** https://picx.dev/p/kJJOVr/image

## Summary

## Summary (Overview)
*   **Spatial-on-Demand Efficiency:** Introduces **AwaRes**, a framework that processes a low-resolution global image view by default and uses **tool-calling** to request only the specific high-resolution image crops needed to answer a query, dramatically reducing computational costs.
*   **Automatic Data Curation:** Proposes a novel pipeline to generate training data without manual spatial annotations, using an **LLM-as-a-Judge (LaaJ)** to determine when crops are needed and an **oracle grounding model** to localize *where* to crop.
*   **Two-Stage Training:** Employs a **cold-start Supervised Fine-Tuning (SFT)** stage to teach the tool protocol, followed by **multi-turn Group Relative Policy Optimization (GRPO)** with a composite reward that optimizes for both answer correctness and crop usage efficiency.
*   **Strong Performance-Efficiency Trade-off:** Achieves accuracy nearly matching full high-resolution processing (**80.30% vs. 80.46%** on average) while using only **36%** of the visual tokens, outperforming both fixed-budget pruning and adaptive resolution-escalation baselines.

## Introduction and Theoretical Foundation
Vision-Language Models (VLMs) require high-resolution inputs for detail-sensitive tasks (e.g., document QA, chart understanding), but this leads to a massive increase in visual tokens and computational cost. Existing efficiency methods fall short: **token pruning** methods use irregular patterns that hinder deployment, and **resolution escalation** methods (e.g., VisionThink) retrieve the *entire* high-resolution image when needed, wasting computation on irrelevant regions.

The key insight is that the need for high fidelity is **spatially sparse**: many questions require fine detail in only a small portion of the image (e.g., a single chart value, a table cell). Therefore, determining *where* to look is as important as *whether* to look.

**AwaRes** (VLM that is **Awa**re to **Res**olution) addresses this by implementing a **spatial-on-demand** inference framework. It uses a simple tool-calling interface: the model first sees a low-resolution image, and if more detail is needed, it requests specific high-resolution crops. This multi-turn structure is compatible with **KV-caching**, making it practical for deployment. The core challenge is training a **Coupled-Decision Policy (CDP)** that jointly decides (i) *whether* additional resolution is needed and (ii) *where* to acquire it by selecting a subset of crops.

## Methodology
The method formalizes a multi-turn interaction and uses a two-stage training pipeline with automatically curated data.

### 3.1 Problem Setup
Given an image-question-answer triple $(I, q, a^*)$, the model is first shown a low-resolution view $I_{low}$. It then chooses an action based on a policy:
$$
\pi_\theta(C | q, I_{low}), \quad C \subseteq \mathcal{C}
$$
where $\mathcal{C}$ is a predefined set of crop candidates. The actions are:
1.  **Direct answer ($C = \emptyset$)**: Produce answer $\hat{a}$ using only $(q, I_{low})$.
2.  **Crop request + answer ($C \neq \emptyset$)**: Emit a tool call for the selected subset $C \subseteq \mathcal{C}$. The tool returns the high-resolution crops $\{I_{c}^{high}\}_{c \in C}$, which are appended to the context, and the model then produces $\hat{a}$.
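
The two actions above can be sketched as a small control loop. This is an illustrative sketch only, not the authors' implementation: the crop identifiers, the `Answer` and `CropRequest` types, `run_episode`, and the toy model are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

# Hypothetical crop identifiers; the text only says the candidate set is predefined.
CROP_CANDIDATES = ("top_left", "top_right", "bottom_left", "bottom_right", "center")


@dataclass
class Answer:
    text: str


@dataclass
class CropRequest:
    crop_ids: List[str]


Context = List[Tuple[str, object]]
ModelOutput = Union[Answer, CropRequest]


def run_episode(
    model: Callable[[Context], ModelOutput],
    fetch_crop: Callable[[str], object],
    question: str,
    low_res_image: object,
) -> Tuple[str, List[str]]:
    """Run one episode and return (answer text, requested crop ids)."""
    context: Context = [("question", question), ("image_low", low_res_image)]
    first = model(context)
    if isinstance(first, Answer):
        return first.text, []  # direct answer: C is empty

    unknown = [c for c in first.crop_ids if c not in CROP_CANDIDATES]
    if unknown:
        raise ValueError(f"unknown crop ids: {unknown}")

    # Crops are appended, so the first-turn prefix stays unchanged and can be cached.
    for crop_id in first.crop_ids:
        context.append((f"crop_high:{crop_id}", fetch_crop(crop_id)))

    second = model(context)
    if not isinstance(second, Answer):
        raise RuntimeError("expected a final answer after the crop turn")
    return second.text, list(first.crop_ids)


def toy_model(context: Context) -> ModelOutput:
    """Stand-in policy: ask for one crop first, answer once a crop is present."""
    if any(key.startswith("crop_high:") for key, _ in context):
        return Answer("42")
    return CropRequest(["top_left"])


answer, crops = run_episode(toy_model, lambda c: f"<pixels of {c}>", "What is the value?", "<low-res>")
```

Because the second turn only appends to the first-turn context, an inference stack that caches the shared prefix does not need to re-encode the low-resolution view.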

### 3.2 Data Curation: Automatic Supervision for Crop Requests
A three-stage pipeline creates training trajectories without manual annotations:
1.  **Resolution-Sufficiency Labeling (When to Crop)**: A base VLM $T$ generates answers from both low-res and full-res inputs: $\hat{a}_{low} = T(q, I_{low})$, $\hat{a}_{full} = T(q, I)$. An **LLM-as-a-Judge (LaaJ)** compares both predictions to the ground truth $a^*$. If the judge finds $\hat{a}_{low}$ correct, or rates the two predictions as tied, the example is labeled **LR** (no crop needed); otherwise it is labeled **HR**.
2.  **Crop Target Construction (Where to Crop)**: For **HR** examples, an oracle grounding model $G$ (Qwen3-VL) localizes the evidence, producing a bounding box $b$. This box is mapped to the discrete crop set $\mathcal{C}$ (four quadrants, center, four half-image regions, full-image). The target crop subset is:
    $$C^* = \{ c \in \mathcal{C} \ | \ \text{IoU}(b, c) \geq \tau \}$$
    where $\tau = 0.5$.
3.  **Supervised Tool-Use Trajectories**: Creates two transcript types:
    *   **Direct-answer (LR)**: Single-turn output of $a^*$.
    *   **Tool-call-then-answer (HR)**: First turn: tool call selecting $C^*$. Second turn: output $a^*$ conditioned on $I_{low}$ and the retrieved crops $\{I_{c}^{high}\}_{c \in C^*}$.
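
The box-to-crop mapping in step 2 can be sketched as follows. The ten candidate names come from the list in the text, but their exact coordinates (especially the `center` window) are assumptions, as are the helper names `iou`, `target_crops`, and `resolution_label`.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]

# Ten candidates as listed in the text; the coordinates below are assumed,
# in particular "center" as the middle half-width, half-height window.
CROPS: Dict[str, Box] = {
    "top_left": (0.0, 0.0, 0.5, 0.5),
    "top_right": (0.5, 0.0, 1.0, 0.5),
    "bottom_left": (0.0, 0.5, 0.5, 1.0),
    "bottom_right": (0.5, 0.5, 1.0, 1.0),
    "center": (0.25, 0.25, 0.75, 0.75),
    "left_half": (0.0, 0.0, 0.5, 1.0),
    "right_half": (0.5, 0.0, 1.0, 1.0),
    "top_half": (0.0, 0.0, 1.0, 0.5),
    "bottom_half": (0.0, 0.5, 1.0, 1.0),
    "full": (0.0, 0.0, 1.0, 1.0),
}


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def target_crops(box: Box, tau: float = 0.5) -> List[str]:
    """C* = all candidates whose IoU with the grounded box is at least tau."""
    return [name for name, crop in CROPS.items() if iou(box, crop) >= tau]


def resolution_label(low_res_correct: bool, judged_tie: bool) -> str:
    """LR if the low-res answer suffices according to the judge, else HR."""
    return "LR" if (low_res_correct or judged_tie) else "HR"


print(target_crops((0.05, 0.05, 0.5, 0.5)))    # ['top_left']
print(target_crops((0.0, 0.0, 1.0, 0.6)))      # ['top_half', 'full']
print(target_crops((0.10, 0.10, 0.15, 0.15)))  # []
```

Under this literal reading of the IoU rule, a very small evidence box matches no candidate at $\tau = 0.5$; the summary does not say how such cases are handled.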

### 3.3 Cold-Start Supervised Reference Policy (SFT)
The model is fine-tuned on the mixture of trajectories from §3.2 to learn the tool protocol and produce a reference policy $\pi_{ref}$. The loss is a weighted negative log-likelihood:
$$
\mathcal{L}_{\text{SFT}}(\theta) = -\sum_{t=1}^{T} w_t \log \pi_\theta(y_t | h_t)
$$
The tool-call turn weight $w_t$ is upweighted (e.g., to 5) to stabilize learning of the critical first-turn CDP decision.
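
As a rough illustration of the weighted loss, the sketch below sums per-token negative log-likelihoods with a larger weight on tool-call tokens. The weight of 1 for all other tokens is an assumption (the summary only gives the upweighted value), and `weighted_nll` is a hypothetical helper that takes precomputed log-probabilities rather than calling a model.

```python
import math
from typing import Sequence


def weighted_nll(
    token_log_probs: Sequence[float],
    is_tool_turn: Sequence[bool],
    tool_weight: float = 5.0,
) -> float:
    """Return -sum_t w_t * log pi(y_t | h_t), upweighting tool-call tokens."""
    if len(token_log_probs) != len(is_tool_turn):
        raise ValueError("inputs must have the same length")
    total = 0.0
    for log_prob, tool in zip(token_log_probs, is_tool_turn):
        weight = tool_weight if tool else 1.0  # assumed: non-tool tokens use weight 1
        total -= weight * log_prob
    return total


# Two tool-call tokens with probability 0.5 each, one answer token with probability 0.25.
log_probs = [math.log(0.5), math.log(0.5), math.log(0.25)]
loss = weighted_nll(log_probs, [True, True, False])
print(round(loss, 4))  # 12 * ln(2) = 8.3178
```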

### 3.4 Multi-Turn GRPO
To refine efficiency and correct over-requesting tendencies from SFT, Group Relative Policy Optimization (GRPO) is applied, initialized from $\pi_{ref}$ and regularized towards it via a KL penalty.

*   **Reward Design**: A scalar reward for a completed trajectory $\tau$ (here $\tau$ denotes a trajectory, not the IoU threshold from §3.2) is:
    $$R(\tau) = R_{\text{ans}}(\hat{a}, a^*) - C_{\text{tool}}(C, y)$$
    *   **Answer Reward ($R_{\text{ans}}$)**: Semantic correctness measured by cosine similarity between sentence-transformer embeddings of $\hat{a}$ and $a^*$.
    *   **Tool-Use Cost ($C_{\text{tool}}$)**: Asymmetric penalty:
        $$
        C_{\text{tool}}(C, y) = \begin{cases}
        \alpha_{\text{miss}} & \text{if } y = \text{HR and } C = \emptyset \text{ (missed tool-call)} \\
        \alpha_{\text{use}} + \lambda \|C\| & \text{if } C \neq \emptyset \text{ (tool usage)} \\
        0 & \text{if } y = \text{LR and } C = \emptyset
        \end{cases}
        $$
        where $y \in \{\text{LR}, \text{HR}\}$ is the resolution-sufficiency label from §3.2 and $\|C\|$ is the total fraction of image area covered by the selected crops. The cost rewards recall (missed tool calls are penalized) and efficiency (larger crop areas cost more).
*   **GRPO Optimization**: For each prompt $x$, a group of $G$ trajectories $\{\tau_1, ..., \tau_G\}$ is sampled. The advantage for each is computed relative to the group:
    $$\hat{A}_i = \frac{R(\tau_i) - \mu_G}{\sigma_G + \epsilon}$$
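    where $\mu_G$ and $\sigma_G$ are the mean and standard deviation of the rewards within the group, and $\epsilon$ is a small constant for numerical stability. The objective is a PPO-style clipped objective with KL regularization (the $\epsilon$ inside $\text{clip}$ is the clipping range, a separate hyperparameter):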
    $$
    \mathcal{L}_{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\tau_i|} \sum_{t=1}^{|\tau_i|} \min\left( r_t^{(i)} \hat{A}_i, \ \text{clip}(r_t^{(i)}, 1-\epsilon, 1+\epsilon) \hat{A}_i \right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}}) \right]
    $$
    where $r_t^{(i)} = \frac{\pi_\theta(a_t^{(i)} | x, a_{<t}^{(i)})}{\pi_{\text{old}}(a_t^{(i)} | x, a_{<t}^{(i)})}$.
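
The reward, group-relative advantage, and clipped term can be sketched numerically. The coefficient values below are placeholders, since the summary gives no numbers for $\alpha_{\text{miss}}$, $\alpha_{\text{use}}$, or $\lambda$. The answer score is passed in as a float instead of being computed from embeddings, and using the population standard deviation is an assumption.

```python
import statistics
from typing import List, Sequence

# Placeholder coefficients; the summary does not report numeric values.
ALPHA_MISS = 0.5
ALPHA_USE = 0.05
LAMBDA_AREA = 0.1


def tool_cost(label: str, used_tool: bool, crop_area: float) -> float:
    """Asymmetric tool-use cost; crop_area is ||C||, the covered image fraction."""
    if used_tool:
        return ALPHA_USE + LAMBDA_AREA * crop_area
    return ALPHA_MISS if label == "HR" else 0.0


def trajectory_reward(answer_score: float, label: str, used_tool: bool, crop_area: float) -> float:
    """R = R_ans - C_tool, with R_ans supplied by the caller."""
    return answer_score - tool_cost(label, used_tool, crop_area)


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the group's mean and standard deviation."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # assumed: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """Per-token PPO-style term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# One HR prompt, three sampled trajectories: missed call, one quadrant, full image.
rewards = [
    trajectory_reward(0.4, "HR", used_tool=False, crop_area=0.0),
    trajectory_reward(0.9, "HR", used_tool=True, crop_area=0.25),
    trajectory_reward(0.9, "HR", used_tool=True, crop_area=1.0),
]
advantages = group_advantages(rewards)
```

With these placeholder values, the single-quadrant trajectory gets the highest advantage: it answers as well as the full-image trajectory but pays a smaller area penalty, while the missed call is penalized on both terms.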

## Empirical Validation / Results
Evaluation is conducted on six benchmarks spanning document understanding and general visual QA. The base model is Qwen2.5-VL-7B-Instruct. Efficiency is measured by **Retained Token Ratio (RTR)**: $RTR_i = T_i / T_{\text{full}}$, where $T_i$ is tokens processed by AwaRes and $T_{\text{full}}$ is tokens for full-resolution processing.
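
As a quick check of the metric, the sketch below computes RTR for a single example and averages the per-benchmark AwaRes RTR values reported in Table 1. The helper name `retained_token_ratio` and the token counts in the single-example call are illustrative assumptions.

```python
from typing import Sequence


def retained_token_ratio(tokens_used: int, tokens_full: int) -> float:
    """RTR = T_i / T_full for one example."""
    if tokens_full <= 0:
        raise ValueError("tokens_full must be positive")
    return tokens_used / tokens_full


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


# Hypothetical token counts: a low-res view costing a quarter of the full budget.
print(retained_token_ratio(256, 1024))  # 0.25

# AwaRes RTR per benchmark, copied from Table 1
# (ChartQA, DocVQA, OCRBench, POPE, RealWorld, V*Bench).
AWARES_RTR = [0.32, 0.28, 0.42, 0.27, 0.43, 0.42]
avg_rtr = mean(AWARES_RTR)
print(f"average RTR: {avg_rtr:.2f}")          # average RTR: 0.36
print(f"token reduction: {1 - avg_rtr:.0%}")  # token reduction: 64%
```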

**Table 1: Main results across vision-language benchmarks.**
| Model | ChartQA Acc↑ (RTR↓) | DocVQA Acc↑ (RTR↓) | OCRBench Acc↑ (RTR↓) | POPE Acc↑ (RTR↓) | RealWorld Acc↑ (RTR↓) | V*Bench Acc↑ (RTR↓) | **Average Acc↑ (RTR↓)** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Qwen2.5-VL-7B (Full-Res)** | 79.80 (1.00) | 94.00 (1.00) | 81.10 (1.00) | 87.87 (1.00) | 68.80 (1.00) | 71.20 (1.00) | **80.46 (1.00)** |
| Qwen2.5-VL-7B-LR | 65.00 (0.25) | 91.00 (0.25) | 70.70 (0.25) | 84.41 (0.25) | 66.00 (0.25) | 63.20 (0.25) | 73.39 (0.25) |
| VisionThink | 79.90 (1.15) | 90.35 (0.32) | 80.10 (0.83) | 86.70 (0.34) | 66.60 (0.55) | 71.73 (0.49) | 79.23 (0.61) |
| **AwaRes** | **80.64 (0.32)** | **94.43 (0.28)** | **81.30 (0.42)** | 85.73 (0.27) | **68.50 (0.43)** | **71.20 (0.42)** | **80.30 (0.36)** |

**Key Findings:**
*   **Accuracy & Efficiency:** AwaRes nearly matches full-resolution accuracy (80.30% vs. 80.46%) while using only **36%** of the visual tokens on average.
*   **Vs. Baselines:** Outperforms fixed-budget pruning methods (e.g., VisionZip at 70% RTR has 76.47% accuracy) and the adaptive baseline VisionThink (79.23% accuracy, 0.61 RTR).
*   **Latency:** AwaRes achieves **sub-second average latency** across benchmarks, while VisionThink is slowed by long reasoning traces (e.g., 4.3s for VisionThink vs. 0.6s for AwaRes on ChartQA).
*   **Policy Evolution:** The GRPO stage successfully corrects the SFT model's tendency to over-request crops, shifting the policy towards more selective tool use.

**Ablation Studies** confirmed the importance of:
*   **LaaJ for Labeling:** High agreement (96.88%) with an alternative judge (DeepSeek-V3.2), while ANLS-based labeling degraded performance.
*   **Tool-Turn Weighting ($w_t=5$)**: Improved cold-start accuracy and tool-call reliability.
*   **Two-Stage Training:** GRPO-only training failed to learn effective tool use, while SFT-only led to over-cropping (high RTR). The combined approach was essential.
*   **Reward Components:** The asymmetric tool-cost and area penalty ($\lambda$) were necessary to achieve low RTR.

## Theoretical and Practical Implications
*   **Theoretical:** Demonstrates the feasibility and effectiveness of learning a **spatially-aware, adaptive perception policy** within a VLM. The CDP formulation treats the "when" and "where" decisions as inherently coupled, which is more aligned with the task structure than separate modules.
*   **Practical:** Provides a **deployment-friendly** efficiency solution. The tool-calling interface and multi-turn KV-cache reuse integrate smoothly with existing inference stacks (e.g., vLLM). The automatic data curation pipeline removes the need for costly manual spatial annotations, enabling scalable training. AwaRes offers a direct path to maintaining high-detail VLM capabilities under tight compute and latency budgets.

## Conclusion
AwaRes presents a spatial-on-demand inference framework that addresses the accuracy-efficiency trade-off in high-resolution VLM inference by retrieving only the necessary high-resolution crops via tool-calling. Trained with an automatic curation pipeline and a two-stage SFT+GRPO approach, it nearly matches full-resolution accuracy (80.30% vs. 80.46% on average) while using 36% of the visual tokens.

**Future Directions:**
1.  Extending crop selection from a discrete set to **continuous bounding box predictions**.
2.  Generalizing the approach to **video understanding**, exploiting temporal sparsity.
3.  Exploring **progressive multi-step** perception strategies that allocate resolution dynamically.

---

_Markdown view of https://picx.dev/p/kJJOVr, served by PicX — AI-generated visual whiteboard summaries of research papers._
