SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning
Summary (Overview)
- Key Problem: Agentic Multimodal LLMs (MLLMs) achieve superior reasoning through iterative tool invocation, but this creates a strict stateful dependency chain (agentic depth). This sequential bottleneck leads to high per-query latency and collapses system-level concurrency.
- Core Insight: A significant fraction of queries directed at agentic MLLMs do not require deep tool-assisted reasoning and can be answered correctly by a lightweight, tool-free MLLM from the original image alone.
- Proposed Solution: SpecEyes, an agentic-level speculative acceleration framework. It uses a small non-agentic model as a speculative planner to answer queries early, governed by a novel cognitive gating mechanism based on answer separability, and organizes execution in a heterogeneous parallel funnel.
- Main Results: On V* Bench, HR-Bench, and POPE, SpecEyes achieves speedups of 1.1–3.35× over agentic baselines (DeepEyes, Thyme) while preserving or even improving accuracy (up to +6.7%), and boosts serving throughput under concurrent workloads.
Introduction and Theoretical Foundation
Agentic Multimodal LLMs (MLLMs) represent a paradigm shift from static, single-pass perception to dynamic, iterative interaction with the visual world. Models like OpenAI o3 and Gemini Agentic Vision actively invoke external tools (e.g., zoom, OCR) in loops of perception, reasoning, and tool-calling. While this enables fine-grained reasoning, it introduces a severe efficiency crisis.
The core problem is the stateful bottleneck. Each query triggers a cascade of tool-calling steps, whose count defines the agentic depth $D$. Each step depends causally on the observation from the previous step, creating a strict Markovian data dependency:

$$s_{t+1} = f(s_t, o_t), \qquad o_t = \tau_t(a_t), \qquad a_t = \pi(s_t), \qquad t = 1, \dots, D,$$

where $s_t$ is the reasoning state, $a_t$ the chosen action, $\tau_t \in \mathcal{T}$ the invoked tool, and $o_t$ its observation. This dependency imposes a dual penalty:
- Latency Explosion: End-to-end response time grows linearly with the agentic depth $D$: $T_{\text{e2e}} = \sum_{t=1}^{D} T_{\text{step}}^{(t)} \approx D \cdot \bar{T}_{\text{step}}$.
- Concurrency Collapse: The per-query state mutation nullifies GPU batching, forcing the model to process queries one step at a time, leaving hardware parallelism idle.
Existing efficient reasoning methods (e.g., token-level speculative decoding, token pruning) operate within the fixed agentic loop and do not eliminate the repeated tool invocations that dominate latency.
SpecEyes makes a conceptual leap by lifting speculation from the token level to the agentic level. The key observation is that many queries do not require deep tool-assisted reasoning. The framework proposes a heterogeneous "think fast, think slow" architecture: a small, non-agentic model ("fast thinking") speculatively answers queries, while the large agentic model ("slow thinking") is reserved only for queries that genuinely need multi-step tool interaction.
Methodology
3.1 Modeling the Stateful Bottleneck
An agentic MLLM is formalized as a stateful system with state space $\mathcal{S}$, tool set $\mathcal{T}$, and policy $\pi$. The state evolves over $D$ steps: $s_{t+1} = f(s_t, o_t)$ with $a_t = \pi(s_t)$ and $o_t = \tau_t(a_t)$, $\tau_t \in \mathcal{T}$. This causal dependency makes the pipeline inherently sequential, bounding system throughput.
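The sequential loop this formalization describes can be sketched in a few lines; `policy` and `tools` below are hypothetical stand-ins for the MLLM backbone and its tool set, not the paper's actual interfaces:

```python
def agentic_loop(image, query, policy, tools, max_depth=5):
    """Sequential agentic execution: each step's observation feeds the next state.

    `policy` returns either a final answer or a tool action; it and `tools`
    are illustrative stand-ins for the agentic MLLM and its tool set.
    """
    state = {"image": image, "query": query, "observations": []}
    for _ in range(max_depth):          # agentic depth D is bounded
        action = policy(state)
        if action["type"] == "answer":  # terminate early when confident
            return action["content"]
        # strict Markovian dependency: o_t must arrive before step t+1 starts
        obs = tools[action["tool"]](**action["args"])
        state["observations"].append(obs)
    return policy(state, force_answer=True)["content"]
```

Because each iteration blocks on the previous tool observation, no step of one query can overlap with a later step of the same query; this is exactly the dependency chain that SpecEyes tries to bypass.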
3.2 SpecEyes: Agentic-Level Speculative Reasoning
SpecEyes is a four-phase pipeline designed to bypass expensive tool chains whenever possible.
Phase I: Heuristic Tool-Use Judgment. The large agentic model $\mathcal{M}_L$ performs a lightweight binary classification to judge whether tool invocation is necessary:

$$g = \mathcal{M}_L^{\text{judge}}(I, q) \in \{0, 1\},$$

where $g = 1$ indicates the query is answerable from the global image alone.
Phase II: Speculative Prediction. For queries with $g = 1$, the small non-agentic model $\mathcal{M}_S$ generates an answer and its full output logit distribution statelessly and concurrently: $(\hat{y}, \mathbf{z}_{1:n}) = \mathcal{M}_S(I, q)$.
Phase III: Cognitive Gating. The logits are passed to a gating function $C(\cdot)$ (detailed in Sec. 3.3) which computes a confidence score. A decision is made:

$$\text{output} = \begin{cases} \hat{y} & \text{if } C(\mathbf{z}_{1:n}) \ge \theta, \\ \text{fallback to } \mathcal{M}_L & \text{otherwise,} \end{cases}$$

where $\theta$ is the gating threshold.
Phase IV: Agentic Fallback. Queries rejected by the gate are routed to $\mathcal{M}_L$ for full stateful execution: $y = \mathcal{M}_L^{\text{agentic}}(I, q)$.
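Putting the four phases together, the control flow can be sketched as follows; all callables and names here are illustrative stand-ins, not the paper's actual code:

```python
def spec_eyes(image, query, judge, small_model, gate, large_agentic, threshold):
    """Agentic-level speculation: try the cheap stateless path first and
    fall back to the full tool loop only when the gate rejects.

    `judge`, `small_model`, `gate`, and `large_agentic` are hypothetical
    stand-ins for the models described in the text.
    """
    # Phase I: lightweight tool-use judgment by the large model
    if judge(image, query):  # True => answerable from the global image alone
        # Phase II: stateless speculative prediction by the small model
        answer, logits = small_model(image, query)
        # Phase III: cognitive gating on answer separability
        if gate(logits) >= threshold:
            return answer, "speculative"
    # Phase IV: agentic fallback with full stateful tool execution
    return large_agentic(image, query), "agentic"
```

Note that only the final branch touches mutable per-query state; everything above it is stateless and therefore batchable across queries.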
Let $\rho$ be the tool-free screening ratio from Phase I and $\alpha$ be the gate acceptance rate from Phase III. The expected per-query latency under SpecEyes is:

$$\mathbb{E}[T] \approx T_{\text{judge}} + \rho \, T_{\text{spec}} + (1 - \rho\alpha) \, T_{\text{agentic}},$$

where $T_{\text{judge}}, T_{\text{spec}} \ll T_{\text{agentic}}$. Significant speedups occur when the accepted fraction $\rho\alpha$ is large.
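A quick numeric sanity check of the decomposition $\mathbb{E}[T] = T_{\text{judge}} + \rho T_{\text{spec}} + (1-\rho\alpha)T_{\text{agentic}}$, using illustrative timings and rates (not measured values from the paper):

```python
def expected_latency(t_judge, t_spec, t_agentic, rho, alpha):
    """E[T] = T_judge + rho * T_spec + (1 - rho * alpha) * T_agentic.

    rho   : tool-free screening ratio (Phase I)
    alpha : gate acceptance rate (Phase III)
    Only unscreened or gate-rejected queries pay the full agentic cost.
    """
    return t_judge + rho * t_spec + (1.0 - rho * alpha) * t_agentic

# Illustrative numbers: speculative path ~10x cheaper than the tool chain.
baseline = expected_latency(0.0, 0.0, 10.0, rho=0.0, alpha=0.0)  # pure agentic
spec = expected_latency(0.5, 1.0, 10.0, rho=0.7, alpha=0.9)      # 0.5+0.7+3.7 = 4.9
speedup = baseline / spec                                        # ~2.0x
```

With 70% of queries screened as tool-free and 90% of speculative answers accepted, roughly half the agentic cost disappears despite the added judgment and speculation overhead.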
3.3 Cognitive Gating via Answer Separability
Instead of unreliable softmax-based confidence, SpecEyes introduces the answer separability score.
For the $i$-th generated token with sorted logits $z_{(1)} \ge z_{(2)} \ge \cdots$, the token-level separability is:

$$s_i = \frac{z_{(1)} - \mu_k}{\sigma_k + \epsilon},$$

where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of the top-$k$ logits $\{z_{(1)}, \dots, z_{(k)}\}$ and $\epsilon$ is a small constant for numerical stability. This metric is scale-invariant and explicitly models the competitive landscape among the top candidates.
The token-level scores are aggregated into an answer-level confidence. Three strategies are considered:

$$C_{\text{mean}} = \frac{1}{n}\sum_{i=1}^{n} s_i, \qquad C_{\min} = \min_{1 \le i \le n} s_i, \qquad C_{\text{bottom-}p} = \frac{1}{|\mathcal{B}_p|}\sum_{i \in \mathcal{B}_p} s_i,$$

where $\mathcal{B}_p$ contains the bottom-$p$ fraction of token scores. $C_{\min}$ is adopted as the default because it acts as a worst-case guard, triggering fallback if any token exhibits low separability, which most tightly bounds the overall error probability.
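A NumPy sketch of this metric (top logit minus the mean of the top-$k$ logits, normalized by their standard deviation) and the worst-case min-aggregation; the `k` and `eps` values are illustrative defaults, not the paper's settings:

```python
import numpy as np

def token_separability(logits, k=5, eps=1e-6):
    """Separability of one token: gap between the top logit and the mean of
    the top-k logits, normalized by their std (scale- and shift-invariant)."""
    top_k = np.sort(np.asarray(logits, dtype=float))[-k:]  # k largest logits
    mu, sigma = top_k.mean(), top_k.std()
    return (top_k[-1] - mu) / (sigma + eps)

def answer_confidence(all_logits, k=5, agg="min"):
    """Aggregate per-token scores; 'min' is the worst-case guard default."""
    scores = [token_separability(z, k) for z in all_logits]
    if agg == "min":
        return min(scores)
    if agg == "mean":
        return float(np.mean(scores))
    raise ValueError(f"unknown aggregation: {agg}")
```

A sharply peaked distribution (one dominant logit) scores higher than one with several near-tied competitors, which is exactly the "competitive landscape" the metric is meant to capture; min-aggregation then lets a single ambiguous token veto the speculative answer.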
3.4 Heterogeneous Parallelism for Throughput
SpecEyes organizes the four phases into a parallel funnel. Phases I and II are stateless and fully batch-parallelizable. Only the residual set of queries $\mathcal{R}$, of size $|\mathcal{R}| = (1 - \rho\alpha)N$ for a batch of $N$ queries, falls back to sequential agentic execution in Phase IV.
This design yields a system throughput speedup of approximately

$$\text{Speedup} \approx \frac{1}{1 - \rho\alpha},$$

which is governed by the screening ratio $\rho$ and the gate acceptance rate $\alpha$.
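Since the stateless phases batch freely, throughput is roughly the reciprocal of the residual fraction $1 - \rho\alpha$ that still runs sequentially (here $\rho$ is the Phase I screening ratio and $\alpha$ the gate acceptance rate); a quick check with illustrative rates:

```python
def throughput_speedup(rho, alpha):
    """Approximate system speedup ~ 1 / (1 - rho * alpha): only the residual
    fraction of queries still executes the sequential agentic stage."""
    residual = 1.0 - rho * alpha
    return 1.0 / residual

# e.g. screening 70% of queries with a 90% gate acceptance rate leaves a
# 37% residual, i.e. roughly a 2.7x throughput gain under this model.
```

This first-order model ignores the (batched, comparatively cheap) cost of the judgment and speculation stages, so it is an upper bound on the attainable gain.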
Empirical Validation / Results
Setups: Evaluated on V* Bench (Direct Attributes, Relative Position), HR-Bench (4K, 8K), and POPE (Adversarial, Popular, Random). The small model is Qwen3-VL-2B. The large agentic models are DeepEyes and Thyme, capped at 5 tool-use steps.
Main Results: The key results are summarized in Table 1. SpecEyes (min) consistently delivers the best accuracy-speed trade-off.
Table 1: Main results on V* Bench, HR-Bench, and POPE. Spd. means wall-clock speedup.
| Method | V* (Attr.) Acc./Spd. | V* (Pos.) Acc./Spd. | HR-Bench (4K) Acc./Spd. | HR-Bench (8K) Acc./Spd. | POPE (Adv.) Acc./Spd. | POPE (Pop.) Acc./Spd. | POPE (Rand.) Acc./Spd. | Avg. Acc./Spd. |
|---|---|---|---|---|---|---|---|---|
| Based on DeepEyes | ||||||||
| DeepEyes (Baseline) | 90.43 / 1.00× | 82.89 / 1.00× | 75.85 / 1.00× | 71.43 / 1.00× | 78.43 / 1.00× | 81.90 / 1.00× | 88.83 / 1.00× | 81.39 / 1.00× |
| SpecReason | 80.19 / 0.61× | 73.91 / 0.38× | 80.43 / 0.44× | 72.54 / 0.42× | 49.10 / 0.38× | 51.55 / 0.38× | 60.20 / 0.37× | 66.85 / 0.43× |
| SpecEyes (min) | 90.43 / 1.53× | 89.47 / 1.90× | 75.85 / 1.13× | 71.80 / 1.08× | 85.13 / 2.13× | 87.00 / 2.15× | 90.13 / 2.19× | 84.26 / 1.73× |
| Based on Thyme | ||||||||
| Thyme (Baseline) | 86.96 / 1.00× | 82.89 / 1.00× | 77.72 / 1.00× | 72.43 / 1.00× | 81.32 / 1.00× | 84.53 / 1.00× | 90.17 / 1.00× | 82.29 / 1.00× |
| SpecReason | 89.57 / 0.48× | 75.00 / 0.53× | 80.01 / 0.52× | 81.02 / 0.51× | 84.62 / 0.46× | 85.97 / 0.43× | 90.27 / 0.46× | 83.78 / 0.48× |
| SpecEyes (min) | 87.83 / 1.32× | 82.89 / 1.42× | 78.47 / 1.01× | 73.31 / 0.95× | 85.87 / 1.77× | 88.30 / 1.78× | 91.27 / 1.70× | 83.99 / 1.42× |
- With DeepEyes, SpecEyes (min) achieves a 1.73× average speedup while improving accuracy from 81.39% to 84.26%. POPE benefits most (2.13–2.19× speedup).
- With Thyme, SpecEyes (min) yields a 1.42× average speedup while raising accuracy from 82.29% to 83.99%.
- SpecReason consistently decelerates inference (0.37–0.61× with DeepEyes, 0.43–0.53× with Thyme) and suffers severe accuracy drops on POPE when paired with DeepEyes.
Analysis of Confidence Calibration: Kernel Density Estimate (KDE) plots show that $C_{\min}$ achieves the largest peak separation between correct and incorrect samples, confirming its superior discriminability for gating compared to log-probability or the mean/bottom-$p$ separability variants.
Ablation Studies:
- Gating Threshold: Lowering the threshold increases speedup at the cost of accuracy, but a broad operating region exists where SpecEyes improves both metrics.
- Batch Size: Larger batches amortize the stateless speculative stage, improving speedup with diminishing returns as the stateful fallback becomes the bottleneck.
- Top-$k$ in Separability: Increasing $k$ acts as a control knob, improving speedup but degrading accuracy; a moderate value is set as the balanced default.
Theoretical and Practical Implications
- Theoretical: The paper formalizes the stateful bottleneck of agentic MLLMs and introduces agentic-level speculation as a new paradigm for efficiency. The answer separability metric provides a principled, calibration-free approach for confidence estimation in sequence generation.
- Practical: SpecEyes offers a deployable solution to the latency and concurrency crisis of state-of-the-art agentic MLLMs. It enables substantial speedups (1.1–3.35×) and throughput gains while preserving or enhancing accuracy, making agentic models more viable for real-time applications and high-load serving scenarios. The framework is model-agnostic and compatible with existing agentic MLLM backbones.
Conclusion
SpecEyes is an agentic-level speculative acceleration framework that breaks the sequential bottleneck of tool-invoking MLLMs. By using a lightweight model for speculative planning, a cognitive gate based on answer separability for reliable switching, and a heterogeneous parallel architecture for throughput, it achieves significant latency reduction and concurrency improvement without sacrificing accuracy.
Future Work: The current speculative model operates at agentic depth 0 (fully tool-free). A natural extension is multi-depth speculation, allowing the speculative model a bounded number of lightweight tool calls before gating, which could further reduce fallbacks for queries requiring moderate tool assistance.