SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Summary (Overview)

  • Key Problem: Agentic Multimodal LLMs (MLLMs) achieve superior reasoning through iterative tool invocation, but this creates a strict stateful dependency chain (agentic depth). This sequential bottleneck leads to high per-query latency and collapses system-level concurrency.
  • Core Insight: A significant fraction of queries directed at agentic MLLMs do not require deep tool-assisted reasoning and can be answered correctly by a lightweight, tool-free MLLM from the original image alone.
  • Proposed Solution: SpecEyes, an agentic-level speculative acceleration framework. It uses a small non-agentic model as a speculative planner to answer queries early, governed by a novel cognitive gating mechanism based on answer separability, and organizes execution in a heterogeneous parallel funnel.
  • Main Results: On V* Bench, HR-Bench, and POPE, SpecEyes achieves an average speedup of 1.1–3.35× over agentic baselines (DeepEyes, Thyme) while preserving or even improving accuracy (up to +6.7%), and boosts serving throughput under concurrent workloads.

Introduction and Theoretical Foundation

Agentic Multimodal LLMs (MLLMs) represent a paradigm shift from static, single-pass perception to dynamic, iterative interaction with the visual world. Models like OpenAI o3 and Gemini Agentic Vision actively invoke external tools (e.g., zoom, OCR) in loops of perception, reasoning, and tool-calling. While this enables fine-grained reasoning, it introduces a severe efficiency crisis.

The core problem is the stateful bottleneck. Each query triggers a cascade of tool-calling steps, whose count defines the agentic depth $D$. Each step depends causally on the observation from the previous step, creating a strict Markovian data dependency:

$$p(a_{d+1} \mid s_0, a_0, \ldots, s_d) = p(a_{d+1} \mid s_d, t_d(s_d)) \neq p(a_{d+1} \mid s_0)$$

This dependency imposes a dual disaster:

  1. Latency Explosion: End-to-end response time grows linearly with $D$: $L_{\text{agent}}(q) = \sum_{d=0}^{D(q)} \left( c_{\text{llm}} + c_{\text{tool}}(t_d) \right)$.
  2. Concurrency Collapse: The per-query state mutation nullifies GPU batching, forcing the model to process queries one step at a time, leaving hardware parallelism idle.

Existing efficient reasoning methods (e.g., token-level speculative decoding, token pruning) operate within the fixed agentic loop and do not eliminate the repeated tool invocations that dominate latency.

SpecEyes makes a conceptual leap by lifting speculation from the token level to the agentic level. The key observation is that many queries do not require deep tool-assisted reasoning. The framework proposes a heterogeneous "think fast, think slow" architecture: a small, non-agentic model ("fast thinking") speculatively answers queries, while the large agentic model ("slow thinking") is reserved only for queries that genuinely need multi-step tool interaction.

Methodology

3.1 Modeling the Stateful Bottleneck

An agentic MLLM is formalized as a stateful system $A = (S, T, \pi)$ with state space $S$, tool set $T$, and policy $\pi$. The state evolves over $D$ steps: $s_{d+1} = f(s_d, t_d(s_d))$. This causal dependency makes the pipeline inherently sequential, bounding system throughput.

3.2 SpecEyes: Agentic-Level Speculative Reasoning

SpecEyes is a four-phase pipeline designed to bypass expensive tool chains whenever possible.

Phase I: Heuristic Tool-Use Judgment. The large agentic model $M_L$ performs a lightweight binary classification to judge whether tool invocation is necessary:

$$g(q, I) = M_L(q, I; P_{\text{judge}}) \in \{0, 1\}$$

where $g = 0$ indicates the query is answerable from the global image alone.

Phase II: Speculative Prediction. For queries with $g = 0$, the small non-agentic model $M_S$ generates an answer $\hat{y}_S$ and its full output logit distribution, statelessly and concurrently:

$$\hat{y}_S, \; \{\ell^{(n)}\}_{n=1}^{|\hat{y}_S|} = M_S(q, I)$$

Phase III: Cognitive Gating. The logits are passed to a gating function $S_{\text{sep}}$ (detailed in Sec. 3.3), which computes a confidence score. A decision is made:

$$\text{decision} = \begin{cases} \text{accept } \hat{y}_S, & \text{if } S_{\text{sep}}(\hat{y}_S) \geq \tau \\ \text{fallback to } M_L, & \text{if } S_{\text{sep}}(\hat{y}_S) < \tau \end{cases}$$

Phase IV: Agentic Fallback. Queries rejected by the gate are routed to $M_L$ for full stateful execution: $\hat{y}_L = M_L(q, I)$.
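The four phases above amount to a simple dispatch loop. A minimal sketch in Python; `judge`, `small_model`, `separability`, and `agentic_model` are hypothetical stand-in callables for $M_L$'s judgment prompt, $M_S$, $S_{\text{sep}}$, and the full agentic loop, not the paper's actual API:

```python
def spec_eyes(query, image, judge, small_model, separability, agentic_model, tau):
    """Sketch of the four-phase SpecEyes pipeline (Sec. 3.2).

    All callables are illustrative placeholders; tau is the gating threshold.
    """
    # Phase I: heuristic tool-use judgment by the large model (g = 1 -> tools needed).
    if judge(query, image) == 1:
        return agentic_model(query, image)
    # Phase II: stateless speculative prediction by the small non-agentic model.
    answer, logits = small_model(query, image)
    # Phase III: cognitive gating on answer separability.
    if separability(logits) >= tau:
        return answer                      # accept the speculative answer
    # Phase IV: agentic fallback with full stateful tool-calling execution.
    return agentic_model(query, image)
```

Note that only Phase IV touches the expensive stateful loop; Phases I–III are stateless and batchable.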

Let $\beta \in [0,1]$ be the tool-free screening ratio from Phase I and $\alpha \in [0,1]$ be the gate acceptance rate from Phase III. The expected per-query latency under SpecEyes is:

$$\mathbb{E}[L_{\text{SpecEyes}}] = c_J + \beta c_S + (1 - \beta\alpha) L_{\text{agent}}$$

where $c_J + \beta c_S \ll L_{\text{agent}}$. Significant speedups occur when $\beta\alpha$ is large.
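The latency model is easy to evaluate numerically. A sketch with hypothetical cost values — the paper does not report $c_J$, $c_S$, or $L_{\text{agent}}$ in seconds, so the numbers below are purely illustrative:

```python
def expected_latency(beta, alpha, c_judge, c_small, l_agent):
    """Expected per-query latency under SpecEyes (Sec. 3.2).

    beta : fraction of queries screened as tool-free in Phase I
    alpha: fraction of those accepted by the cognitive gate in Phase III
    """
    return c_judge + beta * c_small + (1.0 - beta * alpha) * l_agent

# Hypothetical costs in seconds (illustrative, not from the paper).
c_j, c_s, l_a = 0.05, 0.2, 4.0
baseline = expected_latency(0.0, 0.0, 0.0, 0.0, l_a)   # pure agentic execution
spec = expected_latency(0.7, 0.9, c_j, c_s, l_a)       # beta = 0.7, alpha = 0.9
print(f"speedup: {baseline / spec:.2f}x")
```

With these assumed rates, $\beta\alpha = 0.63$, so roughly two-thirds of the agentic cost is skipped and the cheap speculative terms dominate the remainder.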

3.3 Cognitive Gating via Answer Separability

In place of poorly calibrated softmax-based confidence, SpecEyes introduces an answer separability score.

For the $n$-th generated token with sorted logits $\ell^{(n)}_{[1]} \geq \ell^{(n)}_{[2]} \geq \cdots$, the token-level separability is:

$$S^{(n)}_{\text{sep}} = \frac{\ell^{(n)}_{[1]} - \mu^{(n)}_K}{\sigma^{(n)}_K + \epsilon}$$

where $\mu^{(n)}_K$ and $\sigma^{(n)}_K$ are the mean and standard deviation of the top-$K$ logits. This metric is scale-invariant and explicitly models the competitive landscape among candidate tokens.
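The token-level score is a direct computation over raw logits. A minimal NumPy sketch of the formula above; the function name and interface are illustrative:

```python
import numpy as np

def token_separability(logits: np.ndarray, k: int = 64, eps: float = 1e-8) -> float:
    """Answer-separability score for one generated token (Sec. 3.3).

    Measures how far the top logit stands out from the mean of the
    top-k logits, in units of their standard deviation.
    """
    top_k = np.sort(logits)[-k:]            # k largest logits, ascending
    mu, sigma = top_k.mean(), top_k.std()
    return float((top_k[-1] - mu) / (sigma + eps))
```

Because the score is a z-score over the top-$K$ logits, adding a constant to all logits leaves it unchanged, which is the scale-invariance property noted above.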

The token-level scores are aggregated into an answer-level confidence. Three strategies are considered:

$$S^{\text{mean}}_{\text{sep}} = \frac{1}{|\hat{y}_S|} \sum_{n=1}^{|\hat{y}_S|} S^{(n)}_{\text{sep}}, \quad S^{\text{min}}_{\text{sep}} = \min_{n \in [|\hat{y}_S|]} S^{(n)}_{\text{sep}}, \quad S^{\text{bottom}}_{\text{sep}} = \frac{1}{|\mathcal{B}|} \sum_{n \in \mathcal{B}} S^{(n)}_{\text{sep}}$$

where $\mathcal{B}$ contains the bottom-$r$ fraction of tokens. $S^{\text{min}}_{\text{sep}}$ is adopted as the default because it acts as a worst-case guard: it triggers fallback if any token exhibits low separability, which most tightly bounds the overall error probability.
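The three aggregation strategies can be sketched as one dispatch function; `bottom_frac` plays the role of the bottom-$r$ fraction, and its default here is an illustrative choice rather than the paper's setting:

```python
import numpy as np

def aggregate(scores, strategy="min", bottom_frac=0.2):
    """Aggregate per-token separability scores into an answer-level score.

    'min' (the paper's default) acts as a worst-case guard: one
    low-separability token is enough to drive the score down and
    trigger fallback to the agentic model.
    """
    s = np.asarray(scores, dtype=float)
    if strategy == "mean":
        return float(s.mean())
    if strategy == "min":
        return float(s.min())
    if strategy == "bottom":
        b = max(1, int(np.ceil(bottom_frac * len(s))))   # bottom-r fraction
        return float(np.sort(s)[:b].mean())
    raise ValueError(f"unknown strategy: {strategy}")
```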

3.4 Heterogeneous Parallelism for Throughput

SpecEyes organizes the four phases into a parallel funnel. Phases I and II are stateless and fully batch-parallelizable. Only the residual set of queries $\mathcal{R}$, of size $|\mathcal{R}| = (1 - \beta\alpha)B$, falls back to sequential agentic execution in Phase IV.

This design yields a system throughput speedup of approximately:

$$\Theta_{\text{SpecEyes}} / \Theta_{\text{agent}} \approx \frac{1}{1 - \beta\alpha}$$

which is governed by the screening ratio $\beta$ and the gate acceptance rate $\alpha$.
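Plugging in hypothetical rates illustrates the formula (these are assumed values, not the paper's measurements):

```python
def throughput_speedup(beta, alpha):
    """Approximate throughput gain over the pure agentic baseline (Sec. 3.4).

    beta : tool-free screening ratio (Phase I)
    alpha: gate acceptance rate (Phase III)
    """
    return 1.0 / (1.0 - beta * alpha)
```

For example, with an assumed $\beta = 0.7$ and $\alpha = 0.9$, only 37% of queries remain in the sequential fallback path, giving roughly a 2.7× throughput gain.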

Empirical Validation / Results

Setups: Evaluated on V* Bench (Direct Attributes, Relative Position), HR-Bench (4K, 8K), and POPE (Adversarial, Popular, Random). The small model $M_S$ is Qwen3-VL-2B. The large agentic models $M_L$ are DeepEyes and Thyme, capped at 5 tool-use steps.

Main Results: The key results are summarized in Table 1. SpecEyes (min) consistently delivers the best accuracy-speed trade-off.

Table 1: Main results on V* Bench, HR-Bench, and POPE. *Spd.* means wall-clock speedup.

| Method | V* (Attr.) Acc./Spd. | V* (Pos.) Acc./Spd. | HR-Bench (4K) Acc./Spd. | HR-Bench (8K) Acc./Spd. | POPE (Adv.) Acc./Spd. | POPE (Pop.) Acc./Spd. | POPE (Rand.) Acc./Spd. | Avg. Acc./Spd. |
|---|---|---|---|---|---|---|---|---|
| **Based on DeepEyes** | | | | | | | | |
| DeepEyes (Baseline) | 90.43 / 1.00× | 82.89 / 1.00× | 75.85 / 1.00× | 71.43 / 1.00× | 78.43 / 1.00× | 81.90 / 1.00× | 88.83 / 1.00× | 81.39 / 1.00× |
| SpecReason | 80.19 / 0.61× | 73.91 / 0.38× | 80.43 / 0.44× | 72.54 / 0.42× | 49.10 / 0.38× | 51.55 / 0.38× | 60.20 / 0.37× | 66.85 / 0.43× |
| SpecEyes (min) | 90.43 / 1.53× | 89.47 / 1.90× | 75.85 / 1.13× | 71.80 / 1.08× | 85.13 / 2.13× | 87.00 / 2.15× | 90.13 / 2.19× | 84.26 / 1.73× |
| **Based on Thyme** | | | | | | | | |
| Thyme (Baseline) | 86.96 / 1.00× | 82.89 / 1.00× | 77.72 / 1.00× | 72.43 / 1.00× | 81.32 / 1.00× | 84.53 / 1.00× | 90.17 / 1.00× | 82.29 / 1.00× |
| SpecReason | 89.57 / 0.48× | 75.00 / 0.53× | 80.01 / 0.52× | 81.02 / 0.51× | 84.62 / 0.46× | 85.97 / 0.43× | 90.27 / 0.46× | 83.78 / 0.48× |
| SpecEyes (min) | 87.83 / 1.32× | 82.89 / 1.42× | 78.47 / 1.01× | 73.31 / 0.95× | 85.87 / 1.77× | 88.30 / 1.78× | 91.27 / 1.70× | 83.99 / 1.42× |
  • With DeepEyes, SpecEyes (min) achieves a 1.73× average speedup while improving accuracy from 81.39% to 84.26%. POPE benefits most (2.13–2.19× speedup).
  • With Thyme, SpecEyes (min) yields a 1.42× average speedup while raising accuracy from 82.29% to 83.99%.
  • SpecReason consistently decelerates inference (0.37–0.61×) and suffers significant accuracy drops on POPE.

Analysis of Confidence Calibration: Kernel Density Estimate (KDE) plots show that $S^{\text{min}}_{\text{sep}}$ achieves the largest peak separation ($\Delta$) between correct and incorrect samples, confirming its superior discriminability for gating compared to log-probability ($S_{\text{log}}$) or mean/bottom separability.

Ablation Studies:

  1. Gating Threshold: Lowering the threshold increases speedup at the cost of accuracy, but a broad operating region exists where SpecEyes improves both metrics.
  2. Batch Size: Larger batches amortize the stateless speculative stage, improving speedup with diminishing returns as the stateful fallback becomes the bottleneck.
  3. Top-$K$ in Separability: Increasing $K$ acts as a control knob, improving speedup but degrading accuracy. $K = 64$ is set as a balanced default.

Theoretical and Practical Implications

  • Theoretical: The paper formalizes the stateful bottleneck of agentic MLLMs and introduces agentic-level speculation as a new paradigm for efficiency. The answer separability metric provides a principled, calibration-free approach for confidence estimation in sequence generation.
  • Practical: SpecEyes offers a deployable solution to the latency and concurrency crisis of state-of-the-art agentic MLLMs. It enables substantial speedups (1.1–3.35×) and throughput gains while preserving or enhancing accuracy, making agentic models more viable for real-time applications and high-load serving scenarios. The framework is model-agnostic and compatible with existing agentic MLLM backbones.

Conclusion

SpecEyes is an agentic-level speculative acceleration framework that breaks the sequential bottleneck of tool-invoking MLLMs. By using a lightweight model for speculative planning, a cognitive gate based on answer separability for reliable switching, and a heterogeneous parallel architecture for throughput, it achieves significant latency reduction and concurrency improvement without sacrificing accuracy.

Future Work: The current speculative model operates at agentic depth $D = 0$ (fully tool-free). A natural extension is multi-depth speculation, allowing the speculative model a bounded number of lightweight tool calls before gating, which could further reduce fallbacks for queries requiring moderate tool assistance.