Agent Explorative Policy Optimization for Multimodal Agentic Reasoning

Summary (Overview)

Identifies the "Thinking-Acting Gap": A key asymmetry in agentic reasoning where internal "thinking" is a safe, self-contained default, while external "tool use" (acting) is a high-variance, under-trained auxiliary behavior. The gap shows up during RL training in two ways: tool use is attempted in only ~30% of rollouts, and when it is attempted, the tool-using subgroup fails entirely on ~40% of questions. Both effects suppress the learning signal for tool-call tokens.
Proposes AXPO (Agent eXplorative Policy Optimization): An RL algorithm designed to close this gap. Its core mechanism is tool-call resampling: for failed tool-using subgroups, AXPO fixes the preceding "thinking" prefix and resamples the tool call and its continuation, concentrating exploration precisely at the high-variance action boundary.
Demonstrates consistent performance gains: Across nine multimodal benchmarks and three model scales (2B, 4B, 8B) of Qwen3-VL-Thinking, SFT+AXPO outperforms the standard SFT+GRPO baseline. Notably, the 8B model trained with AXPO surpasses the 4x larger 32B base model on Pass@4, indicating strong parameter efficiency.

Introduction and Theoretical Foundation

Large vision-language models (VLMs) with extended reasoning (e.g., chain-of-thought) excel at many tasks but struggle with real-world problems requiring external tools (e.g., web search, code execution, detailed image analysis). Multimodal Agentic Reasoning addresses this by having models autonomously interleave internal thinking with external tool use.

This paradigm introduces a fundamental asymmetry, termed the Thinking-Acting Gap:

Thinking: The model's native, self-contained mode. Errors can be corrected by exploring different internal chains of thought.
Tool Use (Acting): An auxiliary, high-variance mode. A short token sequence triggers an unpredictable external response. Small errors (e.g., in code syntax) can lead to complete failure.

While Supervised Fine-Tuning (SFT) teaches the mechanics of tool use, it fails to teach when and how to use tools optimally beyond demonstration data. Reinforcement Learning (RL) with outcome rewards is a natural next step. However, standard RL methods like GRPO (Group Relative Policy Optimization) are suboptimal due to the Thinking-Acting Gap, which suppresses the learning signal for tool-call tokens. This paper diagnoses the gap's symptoms and proposes AXPO to directly address them by restructuring exploration during RL.

Methodology

The agentic reasoning process is formalized as follows. Given an input $x$ (question + image), the policy $\pi_\theta$ generates a trajectory $\tau$ that interleaves thinking segments $t_t$, actions (tool calls) $a_t$, and observations $o_t = \text{exec}(a_t)$ until it produces a final answer.
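The trajectory structure can be pictured with a small data model. This is only an illustrative sketch: the names `Step`, `Trajectory`, `rollout`, `toy_policy`, and `toy_exec` are hypothetical and not taken from the paper, and the toy policy stands in for $\pi_\theta$ while `toy_exec` stands in for $\text{exec}$.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    thinking: str                      # thinking segment t_t
    action: Optional[str] = None       # tool call a_t (None if the model answers directly)
    observation: Optional[str] = None  # o_t = exec(a_t)


@dataclass
class Trajectory:
    steps: List[Step] = field(default_factory=list)
    answer: Optional[str] = None

    @property
    def used_tool(self) -> bool:
        return any(s.action is not None for s in self.steps)


# Hypothetical policy interface: returns (thinking, action or None, answer or None).
PolicyStep = Callable[[str, Trajectory], Tuple[str, Optional[str], Optional[str]]]


def rollout(x: str, policy_step: PolicyStep,
            exec_tool: Callable[[str], str], max_steps: int = 8) -> Trajectory:
    """Interleave thinking, tool calls, and observations until a final answer."""
    traj = Trajectory()
    for _ in range(max_steps):
        thinking, action, answer = policy_step(x, traj)
        if action is None:  # no tool call: this step ends with the final answer
            traj.steps.append(Step(thinking))
            traj.answer = answer
            break
        traj.steps.append(Step(thinking, action, exec_tool(action)))
    return traj


def toy_policy(x: str, traj: Trajectory):
    # Toy stand-in for the policy: call a tool once, then answer.
    if not traj.steps:
        return "I should zoom in on the label.", "crop(0, 0, 10, 10)", None
    return "The crop shows the value.", None, "42"


def toy_exec(action: str) -> str:
    # Toy stand-in for the external tool environment.
    return f"result of {action}"


traj = rollout("What number is on the label?", toy_policy, toy_exec)
```

Here `traj` holds one tool-using step followed by one answering step, which is the shape of rollout that the rest of the method reasons about.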

Baseline: GRPO

For an input $x$, the policy generates a group of $N$ rollouts $\{\tau_i\}_{i=1}^N$.
Each receives a binary outcome reward $r_i \in \{0,1\}$ based on answer correctness.
A group-normalized advantage is computed: $A_i = \frac{r_i - \text{mean}(\{r_j\}_{j=1}^N)}{\text{std}(\{r_j\}_{j=1}^N)}$
This advantage is assigned uniformly to every token in $\tau_i$ for policy updates via PPO.
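The group-normalized advantage above can be sketched in a few lines. Two details here are assumptions rather than statements from the source: the small `eps` added to the denominator (the formula as written is undefined when all rewards in a group are equal) and the use of the population standard deviation.

```python
import statistics
from typing import List, Sequence


def grpo_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """A_i = (r_i - mean(r)) / std(r), computed over one group of rollouts.

    eps is an assumed stabilizer; population std is an assumed convention.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One correct rollout out of four: the correct one gets a positive advantage.
mixed = grpo_advantages([1, 0, 0, 0])

# All rollouts wrong: every advantage is zero, so the group gives no gradient.
all_wrong = grpo_advantages([0, 0, 0, 0])
```

The second case is the relevant one for the Thinking-Acting Gap: a subgroup with no successes gives its tokens nothing positive to learn from.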

AXPO: Agent eXplorative Policy Optimization

AXPO augments the GRPO loop with targeted exploration:

Trigger Condition: Activated only for all-wrong tool-using subgroups (groups where the subset of rollouts that used a tool all failed, though no-tool rollouts in the same group may have succeeded). These are the cases where tool-call tokens receive zero or negative advantage.
Tool-Call Resampling: For a failed source rollout $\tau_{\text{src}}$, AXPO fixes its thinking prefix $t^{\text{src}}_1$ (up to the opening <tool_call> tag) and draws $K$ new continuations: $\{y^{\text{res}}_k\}_{k=1}^K \sim \pi_\theta(\cdot | x, t^{\text{src}}_1)$. Each resampled trajectory $\tau^{\text{res}}_k = (t^{\text{src}}_1, y^{\text{res}}_k)$ is executed to completion and receives a reward $r^{\text{res}}_k$.
Prefix Selection: Among triggered prefixes, AXPO ranks them by the mean policy probability over the tool-call tokens in the source rollout (a proxy for the policy's confidence in that tool call) and resamples from the lowest-confidence, i.e. most uncertain, prefixes first.
Advantage Calculation: To avoid gradient conflict, advantages are calculated separately:
- For Resampled Continuations: A per-prefix GRPO advantage is computed over the $K$ resamples and applied only to the continuation tokens $y^{\text{res}}_k$: $\hat{A}^{\text{res}}_k(t^{\text{src}}_1) = \frac{r^{\text{res}}_k - \text{mean}(\{r^{\text{res}}_j\}_{j=1}^K)}{\text{std}(\{r^{\text{res}}_j\}_{j=1}^K)}$
- For the Source Prefix: The source rollout's prefix tokens receive an advantage based on a binary recovery reward $r^{\text{prefix}}(t^{\text{src}}_1) = \mathbb{1}[\exists k : r^{\text{res}}_k = 1]$, which fires if any resampled continuation succeeds. This reward replaces the source's original reward in its group's GRPO normalization.
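The four steps above can be sketched on reward lists alone, without a model. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, the `eps` stabilizer and population standard deviation are assumptions, and the mapping of advantages onto individual tokens is omitted.

```python
import statistics
from typing import List, Sequence, Tuple


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    # Group-normalized advantage; eps and population std are assumptions.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def is_triggered(rewards: Sequence[float], used_tool: Sequence[bool]) -> bool:
    """Trigger: the group has tool-using rollouts and all of them failed."""
    tool_rewards = [r for r, u in zip(rewards, used_tool) if u]
    return bool(tool_rewards) and not any(tool_rewards)


def rank_prefixes(tool_call_probs: Sequence[Sequence[float]]) -> List[int]:
    """Order prefixes by mean tool-call token probability, lowest first."""
    means = [statistics.fmean(p) for p in tool_call_probs]
    return sorted(range(len(means)), key=means.__getitem__)


def axpo_advantages(rewards: Sequence[float], src_idx: int,
                    resample_rewards: Sequence[float]) -> Tuple[List[float], float, float]:
    """Return (continuation advantages, recovery reward, source-prefix advantage)."""
    # Per-prefix advantage over the K resampled continuations only.
    continuation_adv = group_advantages(resample_rewards)
    # Recovery reward: 1 if any resampled continuation succeeded.
    r_prefix = 1.0 if any(r == 1 for r in resample_rewards) else 0.0
    # The recovery reward replaces the source rollout's reward in its own group.
    patched = list(rewards)
    patched[src_idx] = r_prefix
    prefix_adv = group_advantages(patched)[src_idx]
    return continuation_adv, r_prefix, prefix_adv


# Group of four rollouts: the two tool-using rollouts (indices 1 and 2) both failed.
rewards = [1, 0, 0, 0]
used_tool = [False, True, True, False]
triggered = is_triggered(rewards, used_tool)

# Illustrative per-token probabilities of the tool call for three candidate prefixes.
order = rank_prefixes([[0.9, 0.8], [0.4, 0.5], [0.7, 0.6]])

# K = 4 resamples from the prefix of rollout 1; one of them succeeds.
cont_adv, r_prefix, prefix_adv = axpo_advantages(rewards, 1, [0, 1, 0, 0])
```

In this toy case the source rollout's original reward of 0 becomes a recovery reward of 1, so its prefix ends up with a positive advantage, while the continuation tokens are scored only against the other resamples from the same prefix.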

Theoretical Motivation: Proposition 1 shows that, for a fixed budget of $N$ samples, tool-call resampling is at least as likely as raw sampling to recover a successful tool-using rollout, because it spends no samples on non-tool-using rollouts. Formally, for a prefix $t^{\text{src}}_1$ with success probability $p(t^{\text{src}}_1) \geq q p_{\text{tool}}$ (where $q$ is the tool-use rate and $p_{\text{tool}}$ is the per-tool-using-rollout success rate):

$\underbrace{1 - \left(1 - p(t^{\text{src}}_1)\right)^N}_{\text{resampling}} \geq \underbrace{1 - \left(1 - q p_{\text{tool}}\right)^N}_{\text{raw}}$

with strict inequality when $p(t^{\text{src}}_1) > q p_{\text{tool}}$ and $q p_{\text{tool}} \in (0,1)$.
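A quick numeric check makes the inequality concrete. The values of `q`, `p_tool`, `p_prefix`, and `n` below are illustrative assumptions, not figures from the paper; the only structure taken from the text is that a raw rollout is a successful tool-using rollout with probability $q p_{\text{tool}}$, while a resampled continuation succeeds with probability $p(t^{\text{src}}_1)$.

```python
def recovery_prob(p_success: float, n: int) -> float:
    """Probability that at least one of n independent samples succeeds."""
    return 1.0 - (1.0 - p_success) ** n


q = 0.3         # assumed tool-use rate
p_tool = 0.2    # assumed success rate of a tool-using rollout
p_prefix = 0.2  # assumed success probability when resampling from a fixed prefix
n = 8           # shared sampling budget

raw = recovery_prob(q * p_tool, n)      # about 0.39
resampled = recovery_prob(p_prefix, n)  # about 0.83
```

With these assumed numbers, most of the raw budget goes to rollouts that never call a tool, which is why the raw recovery probability is so much lower for the same `n`.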

Empirical Validation / Results

Diagnosing the Thinking-Acting Gap (Fig. 3):

Symptom 1 (Under-attempt): Under GRPO, only ~20-35% of rollouts attempt tool use.
Symptom 2 (All-wrong): When attempted, the tool-using subgroup fails entirely on ~40% of questions, compared to ~25% for no-tool subgroups.
Justification for Resampling: Resampling from a fixed thinking prefix yields diverse tool calls (2.9–3.4 semantic clusters per 16 samples), confirming the tool call as a valid divergence point.
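The two symptoms are simple statistics over rollout groups. The sketch below shows one way to compute them; the function name, the input layout, and the choice to average the all-wrong rate only over questions where the subgroup is non-empty are assumptions, and the example groups are made up.

```python
from typing import Dict, List, Tuple

Group = List[Tuple[bool, int]]  # one question: (used_tool, reward) per rollout


def gap_diagnostics(groups: List[Group]) -> Dict[str, float]:
    total = sum(len(g) for g in groups)
    tool_rollouts = sum(1 for g in groups for used, _ in g if used)

    def all_wrong_rate(flag: bool) -> float:
        # Rewards of the subgroup (tool-using or not) for each question.
        subgroups = [[r for used, r in g if used == flag] for g in groups]
        subgroups = [s for s in subgroups if s]  # skip questions with an empty subgroup
        if not subgroups:
            return float("nan")
        return sum(1 for s in subgroups if not any(s)) / len(subgroups)

    return {
        "tool_use_rate": tool_rollouts / total,      # Symptom 1
        "tool_all_wrong_rate": all_wrong_rate(True),  # Symptom 2
        "no_tool_all_wrong_rate": all_wrong_rate(False),
    }


# Three made-up questions with four rollouts each.
groups = [
    [(True, 0), (True, 0), (False, 1), (False, 0)],    # tool subgroup all wrong
    [(True, 1), (False, 0), (False, 0), (False, 0)],   # no-tool subgroup all wrong
    [(False, 1), (False, 1), (False, 0), (False, 0)],  # no tool use at all
]
stats = gap_diagnostics(groups)
```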

Main Results (Table 1, Figure 1): SFT+AXPO outperforms SFT+GRPO across all three model scales (2B, 4B, 8B) on average Pass@1 and Pass@4 over nine benchmarks. The 8B model with AXPO achieves 99% of the Pass@1 and surpasses the Pass@4 of the 4x larger 32B base model.

Table 1: Main results — Pass@1 (%) on nine multimodal benchmarks.

Method	MathVision	DynaMath	Math-VR	V*	VisualProbe	HRBen-4K	HRBen-8K	HR-MMSearch	MM Search	Avg.
Qwen3-VL-8B-Thinking
Base	47.1	75.9	54.9	77.7	31.8	72.8	66.1	21.0	42.7	54.4
SFT+GRPO	55.3	78.2	60.4	87.7	40.1	79.5	74.9	24.4	44.0	60.5
SFT+AXPO (Ours)	56.1	79.0	60.6	87.8	45.8	83.3	77.0	25.9	45.0	62.3
Δ vs. SFT+GRPO	+0.8	+0.8	+0.2	+0.1	+5.7	+3.8	+2.1	+1.5	+1.0	+1.8
Qwen3-VL-32B-Thinking (Base)	56.5	83.3	64.1	89.1	40.3	85.3	78.9	22.8	46.1	62.9
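The Avg. column, the Δ row, and the "99% of Pass@1" figure can be reproduced from the per-benchmark numbers in Table 1. The short check below uses only values from the table; the variable names are arbitrary.

```python
# Pass@1 scores from Table 1, in column order (MathVision ... MM Search).
scores = {
    "8B Base":     [47.1, 75.9, 54.9, 77.7, 31.8, 72.8, 66.1, 21.0, 42.7],
    "8B SFT+GRPO": [55.3, 78.2, 60.4, 87.7, 40.1, 79.5, 74.9, 24.4, 44.0],
    "8B SFT+AXPO": [56.1, 79.0, 60.6, 87.8, 45.8, 83.3, 77.0, 25.9, 45.0],
    "32B Base":    [56.5, 83.3, 64.1, 89.1, 40.3, 85.3, 78.9, 22.8, 46.1],
}

# Average over the nine benchmarks, rounded as in the table.
avg = {name: round(sum(vals) / len(vals), 1) for name, vals in scores.items()}

# Per-benchmark gain of AXPO over GRPO (the Δ row).
delta = [round(a - g, 1) for a, g in zip(scores["8B SFT+AXPO"], scores["8B SFT+GRPO"])]

# Ratio of the 8B AXPO average to the 32B base average.
ratio = avg["8B SFT+AXPO"] / avg["32B Base"]
```

The largest gains fall on VisualProbe and the two HRBen columns, while the math benchmarks move by less than one point.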

Ablation Studies (Table 2): Removing any core component of AXPO (prefix fixing, targeting all-wrong subgroups, uncertainty ranking, prefix credit, or separated advantage groups) degrades performance, validating the design.

Comparison to Alternatives (Table 3): AXPO outperforms other RL recipes, including reward shaping (tool penalty/bonus), simply doubling the rollout budget, and alternative RL algorithms (RLTF, CISPO, ARPO).

Training Dynamics Analysis (Figures 4 & 5):

During training, AXPO increases the tool-use rate and reduces the all-wrong rate of tool-using subgroups compared to GRPO.
Resampling recovers correct trajectories from ~12% of all-wrong subgroups per step.
In evaluation, only AXPO advances simultaneously on both axes of tool-use frequency and conditional success rate (quality), whereas other methods trade one for the other.

Theoretical and Practical Implications

Theoretical: Provides a formal analysis (Proposition 1) of why targeted exploration at the tool-call boundary is more sample-efficient for improving tool-use capabilities than uniform sampling. It frames the agentic RL problem through the lens of the Thinking-Acting Gap.
Practical: AXPO improves the tool-using proficiency of VLMs without requiring a larger model. In the reported results, the 8B model nearly matches the 32B base model on average Pass@1 (62.3 vs. 62.9) and surpasses it on Pass@4, which points to lower deployment cost and latency for comparable agentic reasoning performance. The method is benchmarked across diverse tasks (Reasoning, Perception, Search), showing broad applicability.

Conclusion

The paper identifies the Thinking-Acting Gap as a key challenge in training VLMs for agentic reasoning. The proposed AXPO algorithm narrows this gap through tool-call resampling, which concentrates exploration on the high-variance action boundary. Empirically, AXPO improves over the SFT+GRPO baseline at all three model scales and over the alternative RL recipes it is compared against, allowing an 8B model to approach or exceed a 32B base model. This work advances the training methodology for capable, tool-using multimodal agents.

Limitations & Future Work: The approach assumes the availability of verifiable outcome rewards for RL. Future directions could explore scaling to larger models, applying AXPO to a broader set of tools, and investigating its combination with other advanced RL techniques.