Summary of "Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding"

Summary (Overview)

  • Novel Framework: Introduces CoRD (Collaborative Reasoning Decoding), a paradigm shift from post-hoc curation to step-wise collaborative decoding where heterogeneous Large Reasoning Models (LRMs) jointly construct reasoning trajectories.
  • Key Mechanisms: Employs three core components: prompt-guided step segmentation for consistent step units, predictive perplexity scoring for step-level quality evaluation, and beam search to preserve diverse, high-potential reasoning paths.
  • Superior Performance: CoRD generates higher-quality reasoning data, leading to student models that approach or surpass teacher-level performance on mathematical reasoning benchmarks (AIME24, AIME25) with fewer, structured supervision signals.
  • Efficient Collaboration: Achieves synergistic collaboration among teachers without the substantial computational overhead of methods like Monte Carlo Tree Search (MCTS), demonstrating better use of compute budget compared to curation-based approaches.
  • Strong Generalization: The method generalizes effectively to out-of-domain (TaTQA) and open-ended, domain-specific (PubMedQA) reasoning tasks.

Introduction and Theoretical Foundation

The rapid progress in Large Reasoning Models (LRMs) like DeepSeek-R1 has unlocked complex problem-solving via Long Chain-of-Thought (Long-CoT) reasoning enabled by test-time scaling. However, the high computational cost of LRMs makes reasoning distillation into smaller models essential for practical deployment.

Existing approaches for distillation face significant challenges in the Long-CoT setting:

  1. Curation-based methods (e.g., S1, LIMO) follow a generate-then-select strategy. Multiple teachers generate complete reasoning traces independently, and the best one is selected post-hoc. This wastes computation on discarded candidates and fails to leverage collaborative potential among teachers during the reasoning process itself.
  2. Process Reward Models (PRMs) and Monte Carlo Tree Search (MCTS) are effective for short reasoning but become impractical for Long-CoT. PRMs may prematurely eliminate paths that could self-correct, and MCTS suffers from an exponentially growing search space.

The core problem is the lack of dynamic, step-wise collaboration among heterogeneous teachers to compose novel solution paths. CoRD addresses this by reformulating reasoning distillation as an incremental, collaborative decoding process.

Methodology

CoRD instantiates step-wise collaboration through three core components.

1. Prompt-guided Step Segmentation

To enable consistent cross-model comparison and collaboration, CoRD inserts explicit markers (### Step) into the initial prompt to guide LRMs to structure their reasoning into semantically coherent and functionally distinct steps. This yields more consistent step granularity than alternatives such as line-break or prefix-based segmentation.
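
A minimal sketch of how marker-based segmentation could be wired up. Only the `### Step` marker comes from the summary above; the prompt wording and the helper names `build_prompt` and `split_steps` are illustrative assumptions, not the paper's actual prompt or code.

```python
import re

STEP_MARKER = "### Step"  # marker named in the text; everything else is hypothetical


def build_prompt(question: str) -> str:
    """Append an instruction asking the model to open each step with the marker.

    The instruction wording is an assumption made for illustration.
    """
    return (
        f"{question}\n\n"
        "Solve the problem step by step. Begin every reasoning step with "
        f"'{STEP_MARKER} <n>:' on its own line.\n"
    )


def split_steps(trace: str) -> list[str]:
    """Split a generated trace into step units at each line that starts with the marker."""
    # A zero-width lookahead keeps the marker attached to the step it opens.
    parts = re.split(rf"(?m)^(?={re.escape(STEP_MARKER)})", trace)
    return [part.strip() for part in parts if part.strip()]
```

Any text a model emits before its first marker is returned as a separate leading element, so a caller would need to decide whether to keep or discard it.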

2. Perplexity-based Step Selection

At each decoding step $t$, each teacher $k$ proposes a candidate next reasoning step $s_t^{(k)}$ conditioned on the current shared prefix $\tau_{<t}$. The quality of the extended trajectory $\tau_{<t} \oplus s_t^{(k)}$ is evaluated using a predictive perplexity score from a separate meta-prover (MP) model.

The score $S(\tau_{<t} \oplus s_t^{(k)})$ is derived from the meta-prover's conditional probability of the ground-truth answer $A = (a_1, ..., a_M)$:

$$p_{\text{meta}}(A \mid \tau_{<t} \oplus s_t^{(k)}) = \prod_{m=1}^{M} p_{\text{meta}}(a_m \mid \tau_{<t} \oplus s_t^{(k)}, a_{<m})$$

$$S(\tau_{<t} \oplus s_t^{(k)}) = \exp\left( \frac{1}{M} \log p_{\text{meta}}(A \mid \tau_{<t} \oplus s_t^{(k)}) \right)$$

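Because $S$ is the geometric mean of the meta-prover's per-token probabilities for the answer (the reciprocal of the usual perplexity), a higher score indicates that the step better predicts the correct answer. The step with the highest score, $s_t^*$, is selected from the decoding vocabulary $\mathcal{V}_t = \{s_t^{(1)}, s_t^{(2)}, ..., s_t^{(K)}\}$.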
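
The scoring and selection rule above can be sketched as follows. This assumes the meta-prover has already returned per-token log-probabilities of the ground-truth answer for each candidate trajectory; the function names and the dictionary layout are hypothetical, and no real model is called.

```python
import math


def predictive_score(answer_token_logprobs: list[float]) -> float:
    """Compute S = exp((1/M) * sum_m log p_meta(a_m | trajectory, a_<m)).

    `answer_token_logprobs` holds one log-probability per answer token,
    as a meta-prover would assign them (supplied by the caller here).
    """
    if not answer_token_logprobs:
        raise ValueError("the answer must contain at least one token")
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))


def select_step(candidates: dict[str, list[float]]) -> str:
    """Return the teacher id whose proposed step yields the highest score."""
    return max(candidates, key=lambda teacher: predictive_score(candidates[teacher]))
```

For example, a candidate whose two answer tokens each receive probability 0.5 scores 0.5, while one with token probabilities 0.9 and 0.4 scores $\sqrt{0.36} = 0.6$ and would be selected.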

3. Step-wise Decoding with Beam Search

To avoid the short-sightedness of greedy decoding and the high cost of MCTS rollouts, CoRD integrates beam search. It maintains the top-$B$ most promising partial reasoning trajectories at each step.

Let the beam from the previous step be $\mathcal{B}_{t-1} = \{\tau_{<t}^{(1)}, \tau_{<t}^{(2)}, ..., \tau_{<t}^{(B)}\}$. Each prefix is extended with candidates from its decoding vocabulary, producing $B \times K$ proposals. The beam is updated by selecting the top-$B$ extended trajectories with the highest predictive perplexity scores:

$$\mathcal{C}_t = \{ \tau_{<t}^{(b)} \oplus s_t^{(k)} \mid \tau_{<t}^{(b)} \in \mathcal{B}_{t-1}, s_t^{(k)} \in \mathcal{V}_t^{(b)} \}$$

$$\mathcal{B}_t = \text{Top-}B(\mathcal{C}_t)$$
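
The beam update above can be sketched as a short loop. Teachers and the scorer are passed in as plain callables, so no real LRM or meta-prover is involved. The stopping test `is_finished` and the choice to carry finished trajectories forward unchanged are assumptions of this sketch, since the summary does not specify them.

```python
from typing import Callable, Sequence

Trajectory = list[str]                   # ordered reasoning steps
Teacher = Callable[[Trajectory], str]    # prefix -> proposed next step
Scorer = Callable[[Trajectory], float]   # trajectory -> predictive score


def cord_beam_search(
    teachers: Sequence[Teacher],
    score: Scorer,
    beam_size: int,
    max_steps: int,
    is_finished: Callable[[Trajectory], bool],
) -> list[Trajectory]:
    """Step-wise multi-teacher beam search: expand, score, keep the top B."""
    beam: list[Trajectory] = [[]]  # start from a single empty prefix
    for _ in range(max_steps):
        candidates: list[Trajectory] = []
        for prefix in beam:
            if is_finished(prefix):
                candidates.append(prefix)  # assumption: finished paths stay in the pool
                continue
            # Each teacher proposes one next step, giving up to B x K proposals.
            for teacher in teachers:
                candidates.append(prefix + [teacher(prefix)])
        # Keep the B highest-scoring extended trajectories.
        candidates.sort(key=score, reverse=True)
        beam = candidates[:beam_size]
        if all(is_finished(trajectory) for trajectory in beam):
            break
    return beam
```

Setting `beam_size=1` reduces this loop to the greedy variant, which keeps only the single best extension at each step.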

Computational Complexity Analysis

  • CoRD (Beam Search): $\mathcal{O}(T K M B)$
  • Greedy Decoding (Beam=1): $\mathcal{O}(T K M)$
  • MCTS: $\mathcal{O}(T K \log(T) M B)$
  • Curation: $\mathcal{O}(T K B)$

Where $T$ is trajectory length, $K$ is number of teachers, $M$ is meta-prover cost, and $B$ is beam size/rollouts. CoRD is more efficient than MCTS and, while more costly than curation, yields substantially higher-quality reasoning.
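
To make the relative growth rates concrete, the sketch below plugs sample values into the four expressions. Big-O notation hides constant factors, so the numbers only indicate how the terms scale against each other, not real compute. The function name, the sample values, the use of the natural logarithm, and the treatment of $M$ as a unitless relative cost are all assumptions.

```python
import math


def cost_estimates(T: int, K: int, M: float, B: int) -> dict[str, float]:
    """Evaluate the asymptotic cost expressions with all constant factors set to 1."""
    return {
        "CoRD": T * K * M * B,
        "Greedy": T * K * M,
        "MCTS": T * K * math.log(T) * M * B,  # natural log is an assumption
        "Curation": T * K * B,
    }


# Hypothetical setting: 50 steps, 3 teachers, relative meta-prover cost 2, beam size 4.
estimates = cost_estimates(T=50, K=3, M=2.0, B=4)
```

Under these expressions the MCTS term exceeds the CoRD term by a factor of $\log T$, and the CoRD term exceeds the curation term by a factor of $M$, whatever values are chosen.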

Empirical Validation / Results

Experiments were conducted on mathematical reasoning benchmarks (AIME24, AIME25) using the LIMO-v1 dataset for distillation. Teacher pools included homogeneous (QwQ-32B with different temperatures) and heterogeneous (QwQ-32B, R1-Qwen-32B, Phi4-Reasoning-Plus) configurations.

Key Results Table

Table 2: Quality of the generated reasoning across three distillation pipelines.

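| Teacher Config. | Distillation Pipeline | Answer Accuracy | Predictive Perplexity |
| --- | --- | --- | --- |
| Homo. | Curation | 77.4 | 0.664 |
| Homo. | Integration | 88.6 | 0.215 |
| Homo. | CoRD | 90.0 | 0.726 |
| Hetero. | Curation | 84.8 | 0.652 |
| Hetero. | Integration | 91.2 | 0.223 |
| Hetero. | CoRD | 93.1 | 0.774 |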

Table 3: Distillation performance comparison (Pass@1). Excerpt for R1-Qwen-32B student:

| Distillation Pipeline | AIME24 | AIME25 |
| --- | --- | --- |
| w/o Distillation | 71.6 | 53.8 |
| Curation-Hetero | 75.0 | 62.1 |
| Integration-Hetero | 12.7 | 9.0 |
| CoRD-Hetero | 79.6 | 70.2 |

  • Reasoning Quality: CoRD achieves the highest answer accuracy and predictive perplexity for generated reasoning, with advantages magnified under heterogeneous teachers.
  • Student Performance: Students distilled with CoRD consistently achieve the highest Pass@1 scores, with the 32B student surpassing all individual teacher models on both benchmarks.
  • Collaboration Dynamics: Analysis of teacher selection hit rates (Figure 2) shows specialized allocation—R1-Qwen-32B and QwQ-32B dominate early problem formulation, while Phi4-Reasoning-Plus takes over in later synthesis phases.
  • Comparison to SOTA: CoRD-generated reasoning data leads to better student performance than datasets from S1k-1.1 and LIMO-v1/v2 (Figure 3).
  • Component Ablation: Each component (prompt-guided segmentation, predictive perplexity scoring, beam search) outperforms its alternatives (Tables 4, 5, 6).
  • Generalization: CoRD shows strong performance on out-of-domain (MATH500, TaTQA) and open-ended (PubMedQA) tasks (Table 7).

Theoretical and Practical Implications

  • Paradigm Shift: CoRD demonstrates that dynamic, step-wise collaboration is more effective than static, post-hoc curation for distilling complex Long-CoT reasoning. It transforms reasoning from a one-shot selection problem into an incremental generation process.
  • Quality over Quantity: The predictive perplexity metric, which measures how well reasoning guides toward the correct answer, is shown to be a stronger correlate of final student performance than simple answer accuracy. This highlights the importance of preserving the deliberative process itself, not just the final outcome.
  • Efficient Synergy: The framework provides a computationally efficient method to harness the complementary strengths of heterogeneous LRMs, enabling the creation of reasoning trajectories that no single teacher could produce in isolation.
  • Practical Distillation: CoRD enables the creation of high-performance, smaller student models that can match or exceed teacher capabilities, making advanced reasoning more accessible and deployable.

Conclusion

CoRD presents a novel framework for Long-CoT reasoning distillation that redefines the process as a collaborative, step-wise decoding task among multiple teacher LRMs. By integrating prompt-guided segmentation, predictive perplexity scoring, and beam search, it efficiently produces high-quality reasoning data that leads to student models achieving near or superior-to-teacher performance. The method's effectiveness across diverse benchmarks underscores the importance of fine-grained collaboration and progress-aware evaluation in scaling reasoning distillation.

Limitations & Future Work: The evaluation is primarily monolingual (English); future work will explore cross-lingual transfer. The current setup uses only Supervised Fine-Tuning (SFT); integrating preference learning (e.g., DPO) could further enhance performance.