TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

Summary (Overview)

  • TMAS Framework: A novel multi-agent framework for scaling test-time compute that organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations.
  • Hierarchical Memory: Introduces two complementary memory banks: an Experience Bank for low-level reliable intermediate conclusions and local feedback, and a Guideline Bank for high-level strategies to steer subsequent rollouts away from redundant reasoning patterns.
  • Hybrid Reward RL: A tailored reinforcement learning scheme with three complementary objectives: preserving basic reasoning capability, enhancing experience utilization, and encouraging exploration beyond previously attempted strategies.
  • Strong Iterative Scaling: Demonstrates superior performance compared to existing test-time scaling baselines on challenging reasoning benchmarks, with hybrid reward RL further improving scaling effectiveness and stability across iterations.
  • Synergy-Driven Gains: Ablation studies confirm that the synergy between experience and guideline modules is essential, with each contributing complementary gains to the iterative scaling process.

Introduction and Theoretical Foundation

Test-time scaling (TTS) has evolved from increasing computation within a single generation to organizing reasoning across multiple refinement rounds or candidate trajectories. Recent structured approaches like PaCoRe and verify-refine paradigms (e.g., DeepSeek-Math-V2) advance this by aggregating historical information or using explicit feedback. However, these methods often weakly coordinate parallel trajectories or rely on noisy historical information without explicit selection, limiting their ability to balance exploration and exploitation.

TMAS aims to extend these paradigms with explicit cross-trajectory collaboration by addressing three key challenges:

  1. Multi-agent synergy: Coordinating specialized agents and managing information flow across trajectories and iterations.
  2. Hierarchical memory management: Preserving both global solution strategies and reliable local reasoning states (verified anchors, intermediate conclusions) for effective sharing and reuse.
  3. Exploration–exploitation balance: Ensuring models explore diverse solution paths while exploiting reliable accumulated experience.

The theoretical foundation builds on the observation that effective multi-agent reasoning requires not only structural design but also targeted training to coordinate exploration and exploitation during iterative reasoning.

Methodology

Overall Framework

TMAS integrates parallel exploration with sequential exploitation. At each iteration, it explores multiple reasoning paths in parallel and accumulates useful signals for subsequent refinement. It employs five specialized agents and a memory-bank-based communication mechanism.

Key Components:

  • Experience Bank ($E_t$): Stores low-level, trajectory-specific reasoning signals (verified intermediate conclusions, concrete skills, verifier-identified errors).
  • Guideline Bank ($G_t$): Stores high-level strategic memory (global solution directions, key structural insights, previously explored strategies) to promote non-redundant exploration.

These hierarchical memories serve as the communication substrate for multi-agent synergy.
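
To make the two memory granularities concrete, here is a minimal sketch that treats each bank as a container of free-text entries serialized into agent prompts. The class and method names (ExperienceBank, GuidelineBank, render) are illustrative placeholders, not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExperienceBank:
    """Low-level memory E_t: verified intermediate conclusions, concrete
    skills, and verifier-identified errors from individual trajectories."""
    entries: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Serialized into the Solution Agent's exploitation prompt.
        return "\n".join(f"- {e}" for e in self.entries)

@dataclass
class GuidelineBank:
    """High-level memory G_t: global solution directions, structural
    insights, and previously explored strategies."""
    entries: List[str] = field(default_factory=list)

    def render(self) -> str:
        # Serialized into the Solution Agent's exploration prompt.
        return "\n".join(f"- {g}" for g in self.entries)
```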

Multi-Agent Inference System

The inference process runs for $T$ iterations per problem $Q$. Each iteration $t$ consists of parallel solution generation, verification, summarization, and memory update; a minimal sketch of one iteration follows the agent definitions below.

Agent Definitions and Roles:

  1. Solution Agent ($A_{sol}$): Generates $N$ candidate solution trajectories $\{c_{t,i}\}_{i=1}^N$. It uses an exploration coefficient $\epsilon$ to balance exploitation and exploration:

    $$c_{t,i} \sim \begin{cases} A_{sol}(Q, R_{t-1}, E_{t-1}), & \text{with probability } 1-\epsilon, \\ A_{sol}(Q, G_{t-1}), & \text{with probability } \epsilon. \end{cases}$$

    The first branch exploits previous rollouts and experience; the second encourages non-redundant exploration guided by high-level guidelines.

  2. Verification Agent ($A_{ver}$): Evaluates each candidate $c_{t,i}$ through $M$ independent verification passes, producing a verification set $V_{t,i} = \{A_{ver}^{(m)}(Q, c_{t,i})\}_{m=1}^M$. Each output provides analytical feedback and scalar scores.

  3. Summary Agent ($A_{sum}$): Aggregates the verification results for each candidate into a concise summary $s_{t,i} = A_{sum}(Q, c_{t,i}, V_{t,i})$.

  4. Experience Agent ($A_{exp}$): Updates the experience bank via $E_t = A_{exp}(Q, R_t, E_{t-1})$, extracting reusable experience from the rollout set $R_t = \{r_{t,i}\}_{i=1}^N$, where $r_{t,i} = (c_{t,i}, s_{t,i})$.

  5. Guideline Agent ($A_{guide}$): Updates the guideline bank via $G_t = A_{guide}(Q, R_t, G_{t-1})$, abstracting distinct high-level solution strategies attempted across parallel rollouts.
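
As referenced above, the following is a minimal sketch of one TMAS iteration under these definitions, assuming each agent is exposed as a plain function wrapping a prompted LLM call. The role names and signatures in `agents` are assumptions for illustration, not the paper's API.

```python
import random
from typing import Callable, Dict, List, Tuple

def tmas_iteration(
    Q: str,
    R_prev: List[Tuple[str, str]],   # previous rollout set R_{t-1}
    E_prev: str,                     # experience bank E_{t-1} (serialized)
    G_prev: str,                     # guideline bank G_{t-1} (serialized)
    agents: Dict[str, Callable],     # role name -> prompted-LLM callable (assumed)
    N: int = 8,
    M: int = 8,
    eps: float = 0.2,
):
    """One TMAS refinement iteration: parallel solution generation,
    verification, summarization, and hierarchical memory updates."""
    rollouts = []
    for _ in range(N):
        if random.random() < eps:
            # Exploration branch: condition only on high-level guidelines G_{t-1}.
            c = agents["solution"](Q, guidelines=G_prev)
        else:
            # Exploitation branch: condition on prior rollouts R_{t-1} and experience E_{t-1}.
            c = agents["solution"](Q, rollouts=R_prev, experience=E_prev)
        # M independent verification passes per candidate.
        V = [agents["verification"](Q, c) for _ in range(M)]
        # Condense verifier feedback into a concise summary s_{t,i}.
        s = agents["summary"](Q, c, V)
        rollouts.append((c, s))
    # Update the two memory banks from this iteration's rollout set R_t.
    E_next = agents["experience"](Q, rollouts, E_prev)
    G_next = agents["guideline"](Q, rollouts, G_prev)
    return rollouts, E_next, G_next
```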

Hybrid Reward System with RLVR

To align the model with TMAS's collaborative process, a hybrid reward system is designed based on GRPO (Group Relative Policy Optimization). The standard GRPO clipped objective is:

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{Q,\{o_i\}} \left[ \frac{1}{\sum_i |o_i|} \sum_{i=1}^{N} \sum_{t=1}^{|o_i|} \min\left( \rho_{i,t} A_i,\ \mathrm{clip}(\rho_{i,t},\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}})\, A_i \right) \right],$$

where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid Q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t} \mid Q, o_{i,<t})$ is the token-level importance ratio and $A_i = (\tilde{r}_i - \mu)/(\sigma + \delta)$ is the group-normalized advantage. The hybrid reward modifies $\tilde{r}_i$ with three components (a minimal sketch of the reward shaping follows the list below):

  1. Standard Correctness Reward: Preserves core reasoning capability: $\tilde{r}_i = 1$ if the final answer is correct and $\tilde{r}_i = -1$ otherwise.

  2. Experience Utilization Reward: Encourages effective use of the experience bank. Rollouts are partitioned into Base ($B_{\text{base}}$) and Bank ($B_{\text{bank}}$) groups, and the base accuracy $p_{\text{base}}$ serves as a difficulty proxy:

    $$p_{\text{base}} = \frac{1}{|B_{\text{base}}|} \sum_{i \in B_{\text{base}}} \mathbb{I}[r_i = 1].$$

    The reshaped reward is:

    $$\tilde{r}_i = \begin{cases} r_i + \beta(1 - p_{\text{base}}), & i \in B_{\text{bank}},\ r_i = 1, \\ r_i, & \text{otherwise}, \end{cases}$$

    where $\beta$ is the maximum bonus coefficient.

  3. Novel Strategy Exploration Reward: Encourages discovery of new solution strategies beyond the summarized guideline memory. For each rollout, $r_i \in \{+1, -1\}$ indicates correctness and $n_i \in \{0,1\}$ indicates guideline novelty ($n_i = 1$ for a novel strategy, $n_i = 0$ for a previously observed one). The reward is defined as:

    $$\tilde{r}_i = \begin{cases} +1.0, & r_i = +1,\ n_i = 1, \\ +0.2, & r_i = +1,\ n_i = 0, \\ -0.5, & r_i = -1,\ n_i = 1, \\ -1.0, & r_i = -1,\ n_i = 0. \end{cases}$$
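
As noted above, here is a minimal sketch of the reward shaping and GRPO-style group normalization. Two assumptions are worth flagging: the Bank group is taken to mean rollouts conditioned on the experience bank, and the utilization and novelty components are treated as applying to separate training subsets rather than being combined; the value of `beta` and the helper names are illustrative.

```python
import statistics
from typing import List, Optional

def shaped_reward(r_i: int, in_bank_group: bool, p_base: float,
                  novel: Optional[bool] = None, beta: float = 0.5) -> float:
    """Reshape a correctness reward r_i in {+1, -1}.

    Assumptions: `in_bank_group` marks rollouts conditioned on the experience
    bank; when a guideline-novelty label is supplied, the four-level novelty
    schedule is used instead of the utilization bonus; beta is illustrative.
    """
    if novel is not None:
        # Novel strategy exploration reward (four-level schedule).
        return {(+1, True): 1.0, (+1, False): 0.2,
                (-1, True): -0.5, (-1, False): -1.0}[(r_i, novel)]
    if in_bank_group and r_i == 1:
        # Experience utilization bonus, scaled by problem difficulty 1 - p_base.
        return r_i + beta * (1.0 - p_base)
    return float(r_i)

def group_advantages(rewards: List[float], delta: float = 1e-6) -> List[float]:
    """GRPO group-normalized advantages A_i = (r_i - mu) / (sigma + delta)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + delta) for r in rewards]
```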

Empirical Validation / Results

Experimental Setup

  • Benchmarks: IMO-AnswerBench-50 (filtered subset) and HLE-Math-100 (mathematics subset of Humanity's Last Exam).
  • Base Models: Qwen3-30B-A3B-Thinking-2507 and Qwen3-4B-Thinking-2507.
  • Baselines: Majority Vote (MV), Self-Refine, Verify-Refine (V-R), PaCoRe, RSE.
  • TMAS Configuration: $N=8$ parallel solution trajectories, $M=8$ verification passes per trajectory, $\epsilon=0.2$, maximum of 20 iterations (see the configuration sketch after this list).
  • RL Training: Hybrid reward RL applied to Qwen3-4B-Thinking-2507 backbone.
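
For reference, a small configuration sketch mirroring the reported settings; the class and field names are placeholders rather than the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class TMASConfig:
    # Values mirror the reported setup; names are illustrative.
    n_solutions: int = 8        # N parallel solution trajectories
    n_verifications: int = 8    # M verification passes per trajectory
    epsilon: float = 0.2        # exploration coefficient
    max_iterations: int = 20    # maximum number of refinement iterations
```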

Main Results

Table 1. Performance comparison across different methods and representative refinement iterations on IMO-AnswerBench-50 and HLE-Math-100.

| Method | It1 | It3 |
| --- | --- | --- |
| Qwen3-30B-Thinking-2507 | | |
| MV@64 | 24.00 | - |
| Self-Refine | 9.06 | 12.75 |
| V-R | 10.56 | 16.31 |
| PaCoRe | 26.56 | 29.31 |
| RSE | 25.31 | 25.38 |
| TMAS | 22.06 | 28.56 |
| Qwen3-4B-Thinking-2507 | | |
| MV@64 | 6.00 | - |
| Self-Refine | 5.50 | 5.44 |
| V-R | 6.00 | 7.69 |
| PaCoRe | 7.62 | 10.75 |
| RSE | 11.38 | 13.31 |
| TMAS | 6.62 | 12.88 |
| w/ Hybrid-RL | 15.38 | 22.69 |

Key Findings:

  1. TMAS demonstrates stronger iterative scaling ability: While baselines plateau, TMAS continues to benefit from additional refinement rounds, achieving the best late-stage performance (e.g., 40.50% on IMO-AnswerBench-50 with Qwen3-30B).
  2. Hybrid reward RL unlocks superior and sustained iterative scaling: TMAS+Hybrid-RL consistently outperforms TMAS without RL and other baselines. It not only achieves higher peak accuracy but also mitigates performance degradation in later iterations compared to Vanilla-RL.
  3. RL narrows the performance gap between model sizes: Hybrid RL reduces the performance gap between the 4B and 30B TMAS models from 23.44 to 9.62 points on IMO-AnswerBench-50 (59.0% reduction) and from 17.97 to 7.22 points on HLE-Math-100 (59.8% reduction).

Ablation and Analysis

Table 2. Component ablation study on IMO-AnswerBench-50.

| Method | It0 |
| --- | --- |
| TMAS | 10.88 |
| w/o guideline | 6.31 |
| w/o experience | 9.44 |
| w/o both | 8.61 |

Key Insights from Ablation and Sensitivity Analysis:

  • Experience and guidelines drive complementary iterative gains: Removing either module degrades performance. The "w/o guideline" variant suffers most in early iterations, indicating guidelines help steer quickly. The "w/o experience" variant shows weaker gains later, indicating experience is critical for sustained refinement.
  • Moderate exploration coefficient ($\epsilon = 0.2$) optimizes the balance: Both purely exploitative ($\epsilon = 0$) and overly exploratory ($\epsilon = 1.0$) settings yield suboptimal outcomes.
  • Optimal verification budget prevents noise: Verification is essential (count=0 yields lowest accuracy), but an intermediate count of 8 achieves the best results. Increasing to 16 provides no benefit and can degrade performance.
  • Performance gains saturate with excessive parallel solutions: Too few solutions limit diversity, but increasing beyond 8 (e.g., to 12) yields limited or unstable gains, indicating difficulty in integrating additional trajectories.

Theoretical and Practical Implications

  • Structured Test-Time Scaling: TMAS provides a blueprint for organizing test-time compute as a coordinated multi-agent process with explicit information flow, moving beyond weakly coupled parallel trajectories.
  • Hierarchical Memory Design: The separation of low-level experience and high-level guideline memory offers a principled approach to managing reusable reasoning signals at different granularities, addressing the exploration-exploitation trade-off.
  • Tailored RL for Multi-Agent Systems: The hybrid reward scheme demonstrates that training objectives beyond final correctness (experience utilization, novel strategy exploration) are crucial for aligning models with complex iterative, collaborative inference frameworks.
  • Efficient Use of Compute: The findings on optimal verification counts and parallel solution budgets provide practical guidance for configuring similar systems to maximize refinement efficiency without wasting computation.
  • Scalability for Smaller Models: The significant performance boost from hybrid RL on a 4B model suggests that targeted training can enable smaller models to approach the performance of much larger models through more effective iterative test-time computation, potentially reducing resource requirements.

Conclusion

TMAS is a multi-agent test-time scaling framework that coordinates solution generation, verification, feedback summarization, experience extraction, and guideline updating into a unified iterative inference process. Its hierarchical memory mechanisms and hybrid reward RL scheme enable stronger iterative scaling than existing baselines. The framework effectively balances exploration and exploitation, translating additional test-time compute into improved performance on challenging reasoning problems.

Limitations and Future Work:

  • Due to computational constraints, TMAS has not been evaluated on frontier models like GPT-5.5, where the upper bound of multi-agent synergy could be further examined.
  • The current RL pipeline requires an external model to pre-construct cold-start trajectories and memory-based training data. Future work can dynamically incorporate trajectories and memory signals from previous iterations into the RL data pool for continuous adaptation.