Summary of "Distilling Long-CoT Reasoning through Collaborative Step-wise Multi-Teacher Decoding"
Summary (Overview)
- Novel Framework: Introduces CoRD (Collaborative Reasoning Decoding), a paradigm shift from post-hoc curation to step-wise collaborative decoding where heterogeneous Large Reasoning Models (LRMs) jointly construct reasoning trajectories.
- Key Mechanisms: Employs three core components: prompt-guided step segmentation for consistent step units, predictive perplexity scoring for step-level quality evaluation, and beam search to preserve diverse, high-potential reasoning paths.
- Superior Performance: CoRD generates higher-quality reasoning data, leading to student models that approach or surpass teacher-level performance on mathematical reasoning benchmarks (AIME24, AIME25) with fewer, structured supervision signals.
- Efficient Collaboration: Achieves synergistic collaboration among teachers without the substantial computational overhead of methods like Monte Carlo Tree Search (MCTS), demonstrating better use of compute budget compared to curation-based approaches.
- Strong Generalization: The method generalizes effectively to out-of-domain (TaTQA) and open-ended, domain-specific (PubMedQA) reasoning tasks.
Introduction and Theoretical Foundation
The rapid progress in Large Reasoning Models (LRMs) like DeepSeek-R1 has unlocked complex problem-solving via Long Chain-of-Thought (Long-CoT) reasoning enabled by test-time scaling. However, the high computational cost of LRMs makes reasoning distillation into smaller models essential for practical deployment.
Existing approaches for distillation face significant challenges in the Long-CoT setting:
- Curation-based methods (e.g., S1, LIMO) follow a generate-then-select strategy. Multiple teachers generate complete reasoning traces independently, and the best one is selected post-hoc. This wastes computation on discarded candidates and fails to leverage collaborative potential among teachers during the reasoning process itself.
- Process Reward Models (PRMs) and Monte Carlo Tree Search (MCTS) are effective for short reasoning but become impractical for Long-CoT. PRMs may prematurely eliminate paths that could self-correct, and MCTS suffers from an exponentially growing search space.
The core problem is the lack of dynamic, step-wise collaboration among heterogeneous teachers to compose novel solution paths. CoRD addresses this by reformulating reasoning distillation as an incremental, collaborative decoding process.
Methodology
CoRD instantiates step-wise collaboration through three core components.
1. Prompt-guided Step Segmentation
To enable consistent cross-model comparison and collaboration, CoRD inserts explicit markers (### Step) into the initial prompt to guide LRMs to structure their reasoning into semantically coherent and functionally distinct steps. This ensures consistent step granularity compared to alternatives like line-break or prefix-based segmentation.
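A minimal sketch of how such markers could be used, assuming a hypothetical prompt template and a simple regex split. The summary gives only the marker text (`### Step`), so the prompt wording and the helper names `build_prompt` and `split_steps` are illustrative assumptions rather than CoRD's actual implementation.

```python
import re

STEP_MARKER = "### Step"  # marker named in the text; everything else is assumed


def build_prompt(question: str) -> str:
    """Hypothetical prompt template asking the model to label each step."""
    return (
        f"{question}\n\n"
        "Solve the problem step by step. Begin every step with "
        f"'{STEP_MARKER} <n>:' on its own line.\n"
    )


def split_steps(response: str) -> list[str]:
    """Split a model response into step strings at each marker."""
    # The lookahead keeps the marker attached to the step that follows it.
    parts = re.split(rf"(?m)^(?={re.escape(STEP_MARKER)})", response)
    return [p.strip() for p in parts if p.strip()]
```

Because every teacher is asked for the same marker, responses from different models can be cut into comparable units with one parser, which is what step-wise comparison needs.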
2. Perplexity-based Step Selection
At each decoding step, each teacher proposes a candidate next reasoning step conditioned on the current shared prefix. The quality of each extended trajectory (the prefix plus a candidate step) is evaluated using a predictive perplexity score from a separate meta-prover (MP) model.
The score is derived from the meta-prover's conditional probability of the ground-truth answer given the extended trajectory.
A higher score indicates that the step better predicts the correct answer. The highest-scoring step is selected from the decoding vocabulary, i.e., the set of candidate steps proposed by the teachers.
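One plausible way to compute such a score, sketched under stated assumptions: the meta-prover returns per-token log-probabilities of the ground-truth answer for each candidate trajectory, and the score is the geometric-mean token probability (the inverse of the usual perplexity), so higher is better and the value lies in (0, 1]. The exact formula in the paper may differ; the function names and the input format are hypothetical.

```python
import math


def predictive_score(answer_token_logprobs: list[float]) -> float:
    """Geometric-mean probability of the answer tokens (inverse perplexity).

    Assumed form: exp(mean(log p)). A value of 1.0 means the answer is
    predicted with certainty; values near 0 mean it is very unlikely.
    """
    if not answer_token_logprobs:
        raise ValueError("need at least one answer token")
    return math.exp(sum(answer_token_logprobs) / len(answer_token_logprobs))


def select_step(candidates: dict[str, list[float]]) -> str:
    """Return the candidate step whose extended trajectory scores highest.

    `candidates` maps each proposed step to the answer-token log-probs the
    meta-prover assigns once that step is appended to the shared prefix.
    """
    return max(candidates, key=lambda step: predictive_score(candidates[step]))
```

Averaging over answer tokens keeps scores comparable across answers of different lengths, which is one reason a perplexity-style quantity is a natural choice here.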
3. Step-wise Decoding with Beam Search
To avoid the short-sightedness of greedy decoding and the high cost of MCTS rollouts, CoRD integrates beam search. It maintains a fixed number of the most promising partial reasoning trajectories (the beam size) at each step.
Each prefix in the beam from the previous step is extended with the candidates from its decoding vocabulary, so the number of proposals is the beam size times the number of candidates per prefix. The beam is then updated by keeping the extended trajectories with the highest predictive perplexity scores, up to the beam size.
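The beam update can be sketched as follows, assuming each teacher is a callable that maps a prefix to one proposed next step and that a scoring function returns the predictive score of a trajectory (higher is better). The `Teacher` and `Scorer` interfaces and the name `beam_step` are hypothetical, chosen only for illustration.

```python
from typing import Callable, Sequence

Teacher = Callable[[str], str]   # prefix -> proposed next step (assumed interface)
Scorer = Callable[[str], float]  # trajectory -> predictive score, higher is better


def beam_step(beam: Sequence[str], teachers: Sequence[Teacher],
              score: Scorer, beam_size: int) -> list[str]:
    """Extend every prefix with one step per teacher and keep the best."""
    proposals = [prefix + teacher(prefix) for prefix in beam for teacher in teachers]
    # Sort by score, highest first, and keep at most `beam_size` trajectories.
    return sorted(proposals, key=score, reverse=True)[:beam_size]


# Toy usage: two "teachers" that append a fixed token, scored by counting "b".
toy_teachers = [lambda p: "a", lambda p: "b"]
toy_beam = beam_step([""], toy_teachers, lambda t: t.count("b"), beam_size=2)
```

Keeping more than one prefix lets a trajectory that scores slightly lower now survive long enough to recover later, which greedy selection would rule out.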
Computational Complexity Analysis
- CoRD (Beam Search):
- Greedy Decoding (Beam=1):
- MCTS:
- Curation:
Each cost is expressed in terms of the trajectory length, the number of teachers, the meta-prover cost, and the beam size (or the number of rollouts for MCTS). CoRD is more efficient than MCTS and, while more costly than curation, yields substantially higher-quality reasoning.
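As a rough illustration only, and not the analysis from the paper, the sketch below counts meta-prover scoring calls under a simplified model. The assumptions are: beam search scores every (prefix, teacher) proposal at every step, greedy decoding is the beam-size-1 case, and curation evaluates each teacher's complete trace once. MCTS is left out because its cost depends on rollout details not stated here. All function names are hypothetical.

```python
def scoring_calls_beam(num_steps: int, num_teachers: int, beam_size: int) -> int:
    """Meta-prover calls for step-wise beam search under the simplified model."""
    return num_steps * beam_size * num_teachers


def scoring_calls_greedy(num_steps: int, num_teachers: int) -> int:
    """Greedy decoding is beam search with a single retained prefix."""
    return scoring_calls_beam(num_steps, num_teachers, beam_size=1)


def scoring_calls_curation(num_teachers: int) -> int:
    """Post-hoc curation evaluates one complete trace per teacher."""
    return num_teachers
```

Under this model, 20 steps with 3 teachers and a beam of 4 would need 240 scoring calls, against 60 for greedy decoding and 3 for curation, which matches the qualitative ordering described above.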
Empirical Validation / Results
Experiments were conducted on mathematical reasoning benchmarks (AIME24, AIME25) using the LIMO-v1 dataset for distillation. Teacher pools included homogeneous (QwQ-32B with different temperatures) and heterogeneous (QwQ-32B, R1-Qwen-32B, Phi4-Reasoning-Plus) configurations.
Key Results Table
Table 2: Quality of the generated reasoning across three distillation pipelines.
| Teacher Config. | Distillation Pipeline | Answer Accuracy | Predictive Perplexity |
|---|---|---|---|
| Homo. | Curation | 77.4 | 0.664 |
| Homo. | Integration | 88.6 | 0.215 |
| Homo. | CoRD | 90.0 | 0.726 |
| Hetero. | Curation | 84.8 | 0.652 |
| Hetero. | Integration | 91.2 | 0.223 |
| Hetero. | CoRD | 93.1 | 0.774 |
Table 3: Distillation performance comparison (Pass@1). Excerpt for R1-Qwen-32B student:
| Distillation Pipeline | AIME24 | AIME25 |
|---|---|---|
| w/o Distillation | 71.6 | 53.8 |
| Curation-Hetero | 75.0 | 62.1 |
| Integration-Hetero | 12.7 | 9.0 |
| CoRD-Hetero | 79.6 | 70.2 |
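To make the size of the gains explicit, the short sketch below computes each pipeline's change in Pass@1 relative to the undistilled student, using only the numbers in the excerpt above. The helper name `gains_over_baseline` is an illustrative assumption.

```python
# Pass@1 scores copied from the Table 3 excerpt (R1-Qwen-32B student).
scores = {
    "w/o Distillation": (71.6, 53.8),
    "Curation-Hetero": (75.0, 62.1),
    "Integration-Hetero": (12.7, 9.0),
    "CoRD-Hetero": (79.6, 70.2),
}


def gains_over_baseline(table, baseline="w/o Distillation"):
    """Per-benchmark change in Pass@1 relative to the undistilled student."""
    base = table[baseline]
    return {
        name: tuple(round(s - b, 1) for s, b in zip(vals, base))
        for name, vals in table.items()
        if name != baseline
    }


gains = gains_over_baseline(scores)  # values are (AIME24, AIME25) differences
```

With these numbers, CoRD-Hetero improves on the undistilled student by 8.0 points on AIME24 and 16.4 on AIME25, compared with 3.4 and 8.3 for Curation-Hetero.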
- Reasoning Quality: CoRD achieves the highest answer accuracy and the highest predictive perplexity score (higher is better) for generated reasoning, with advantages magnified under heterogeneous teachers.
- Student Performance: Students distilled with CoRD consistently achieve the highest Pass@1 scores, with the 32B student surpassing all individual teacher models on both benchmarks.
- Collaboration Dynamics: Analysis of teacher selection hit rates (Figure 2) shows specialized allocation: R1-Qwen-32B and QwQ-32B dominate early problem formulation, while Phi4-Reasoning-Plus takes over in later synthesis phases.
- Comparison to SOTA: CoRD-generated reasoning data leads to better student performance than datasets from S1k-1.1 and LIMO-v1/v2 (Figure 3).
- Component Ablation: Each component (prompt-guided segmentation, predictive perplexity scoring, beam search) outperforms its alternatives (Tables 4, 5, 6).
- Generalization: CoRD shows strong performance on out-of-domain (MATH500, TaTQA) and open-ended (PubMedQA) tasks (Table 7).
Theoretical and Practical Implications
- Paradigm Shift: CoRD demonstrates that dynamic, step-wise collaboration is more effective than static, post-hoc curation for distilling complex Long-CoT reasoning. It transforms reasoning from a one-shot selection problem into an incremental generation process.
- Quality over Quantity: The predictive perplexity metric, which measures how well reasoning guides toward the correct answer, is shown to be a stronger correlate of final student performance than simple answer accuracy. This highlights the importance of preserving the deliberative process itself, not just the final outcome.
- Efficient Synergy: The framework provides a computationally efficient method to harness the complementary strengths of heterogeneous LRMs, enabling the creation of reasoning trajectories that no single teacher could produce in isolation.
- Practical Distillation: CoRD enables the creation of high-performance, smaller student models that can match or exceed teacher capabilities, making advanced reasoning more accessible and deployable.
Conclusion
CoRD presents a novel framework for Long-CoT reasoning distillation that redefines the process as a collaborative, step-wise decoding task among multiple teacher LRMs. By integrating prompt-guided segmentation, predictive perplexity scoring, and beam search, it efficiently produces high-quality reasoning data that leads to student models achieving near or superior-to-teacher performance. The method's effectiveness across diverse benchmarks underscores the importance of fine-grained collaboration and progress-aware evaluation in scaling reasoning distillation.
Limitations & Future Work: The evaluation is primarily monolingual (English); future work will explore cross-lingual transfer. The current setup uses only Supervised Fine-Tuning (SFT); integrating preference learning (e.g., DPO) could further enhance performance.