TAPS: Task Aware Proposal Distributions for Speculative Sampling - Summary
Summary (Overview)
- Task-Specific Training Improves Performance: Draft models trained on a domain-specific dataset (e.g., MathInstruct for math, ShareGPT for conversation) achieve significantly higher acceptance lengths on matched downstream tasks compared to mismatched training.
- Mixed-Data Training Offers Robustness but Not Dominance: Training on a mixture of data from multiple domains improves cross-task robustness, but larger mixtures do not uniformly outperform smaller ones across different decoding temperatures.
- Inference-Time Composition Outperforms Weight Merging: When multiple specialized drafters are available, combining them at inference time via confidence-based routing or merged-tree verification yields much higher acceptance lengths than naive checkpoint weight averaging.
- Confidence is a Superior Routing Signal: For selecting between specialized drafters, the draft model's confidence is a more discriminative and useful signal than token entropy, which is better suited as a diagnostic tool for understanding rejections.
- Speculative Decoding is Depth-Aware: Acceptance rates decline with speculative depth, and the dominance of the task-matched specialist becomes more pronounced at deeper levels, especially on reasoning-heavy tasks.
Introduction and Theoretical Foundation
Autoregressive decoding is a major bottleneck for Large Language Model (LLM) inference. Speculative decoding addresses this by using a lightweight draft model to propose several future tokens, which a larger target model then verifies in parallel. This can improve throughput without altering the target model's final output distribution, as guaranteed by a lossless verification rule.
The achievable speedup depends on the quality of the draft model's proposal distribution: the more often drafted tokens are accepted, the fewer serial target-model calls are needed. While prior work has focused on improving draft architectures (e.g., EAGLE, HASS) or verification procedures, the role of the draft model's training data distribution has been under-studied. This paper investigates whether aligning the drafter's training data with the target workload improves speculative decoding performance. It also explores practical questions arising in an ecosystem with multiple specialized model checkpoints: is it better to mix data during training, merge models in weight space, or compose them at inference time?
The research is structured around five key questions (RQs):
- RQ1: Does task-specific draft training improve performance on matched tasks?
- RQ2: Can mixed-data training recover cross-domain robustness?
- RQ3: How should multiple specialized drafters be combined?
- RQ4: What signals (confidence, entropy) explain routing and acceptance behavior?
- RQ5: How does speculative depth affect the balance between exploration and exploitation?
Methodology
The study employs a controlled setup to isolate the effects of training data and composition strategies.
Common Setup:
- Target Model (Verifier): Meta-Llama-3-8B-Instruct.
- Draft Model Architecture: A lightweight 1-layer LLaMA-style decoder with ~0.8B parameters, sharing the tokenizer and vocabulary with the target model.
- Evaluation Benchmarks: MT-Bench (conversation), GSM8K, MATH-500, and SVAMP (mathematical reasoning).
- Primary Metric: Average acceptance length $\tau$, defined as the expected number of consecutively accepted draft tokens per verifier call: $\tau = \mathbb{E}[\#\text{accepted draft tokens per verification step}]$.
- Speculative Backbones: Experiments are conducted using two modern drafting frameworks:
- EAGLE-2: A feature-level drafter that predicts future hidden states of the target model. Its training objective combines feature regression with a token-level cross-entropy term: $\mathcal{L} = \mathcal{L}_{\text{reg}} + w_{\text{cls}}\,\mathcal{L}_{\text{cls}}$, where $\mathcal{L}_{\text{reg}}$ penalizes the distance between predicted and true target features and $\mathcal{L}_{\text{cls}}$ is the cross-entropy between the induced token distributions.
- HASS: Focuses on reducing objective and context mismatch between training and decoding. It uses a harmonized Top-$K$ distillation loss $\mathcal{L} = -\sum_{x \in \mathcal{T}_K} p(x)\,\log q(x)$, where $\mathcal{T}_K$ is the set of top-$K$ tokens under the target model $p$ and $q$ is the draft distribution.
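The Top-$K$ distillation objective can be sketched numerically. The helper below is a simplified illustration (the function name, normalization, and epsilon are assumptions, not the paper's implementation):

```python
import numpy as np

def topk_distillation_loss(p, q, k=10):
    """Cross-entropy between target distribution p and draft distribution q,
    restricted to the target model's top-k tokens (a simplified sketch of a
    HASS-style harmonized Top-K loss; the exact formulation may differ)."""
    topk = np.argsort(p)[-k:]                  # T_K: target's top-k token ids
    return float(-np.sum(p[topk] * np.log(q[topk] + 1e-12)))
```

Restricting the sum to the target's top-$K$ tokens focuses the drafter's capacity on exactly the tokens the verifier is likely to accept.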
Draft Training Variants:
- Single-Domain: Checkpoints trained solely on 70k examples from MathInstruct (math) or ShareGPT (conversation).
- Mixed-Data: Checkpoints trained on balanced mixtures: Mixed 35k+35k and Mixed 70k+70k.
- Composition Strategies: Methods for utilizing the two single-domain specialists:
- Checkpoint Averaging: Naive weight-space merging: $\theta_{\text{avg}} = \lambda\,\theta_{\text{math}} + (1-\lambda)\,\theta_{\text{conv}}$ (with $\lambda \in [0,1]$; equal weighting $\lambda = 0.5$ corresponds to plain averaging).
- Confidence Routing: At inference, generate a draft tree from each specialist and select the tree with the higher mean node confidence for verification: $d^{*} = \arg\max_{d} \frac{1}{|T_d|}\sum_{v \in T_d} c_v$, where $c_v$ is the drafter's probability of the token proposed at node $v$ of tree $T_d$.
- Merged-Tree Verification: Pack both specialist trees under a shared root (with cross-subtree attention masked) and verify them in a single parallel pass, increasing proposal diversity.
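As a concrete illustration, confidence routing can be sketched as follows. This is a minimal sketch with an assumed data representation (each draft tree reduced to its per-node draft probabilities), not the paper's code:

```python
import numpy as np

def mean_node_confidence(tree_probs):
    """Mean of the drafter's probability for the token at each tree node."""
    return float(np.mean(tree_probs))

def route_tree(specialist_trees):
    """Pick the specialist whose draft tree has the highest mean node
    confidence; only that tree is sent to the verifier.
    specialist_trees: dict mapping specialist name -> per-node probabilities."""
    return max(specialist_trees,
               key=lambda name: mean_node_confidence(specialist_trees[name]))
```

On a math prompt, the MathInstruct drafter typically produces higher node confidences, so the router would forward its tree to the verifier.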
Lossless Verification Rule: The core speculative decoding algorithm is fixed. A token $x$ drafted from $q$ is accepted with probability $\min\!\big(1, \tfrac{p(x)}{q(x)}\big)$, where $p$ is the target distribution.
If rejected, a replacement token is sampled from the corrected (residual) distribution $\mathrm{norm}\big(\max(0,\ p - q)\big)$.
This ensures the final output distribution is identical to the target model's.
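The accept/reject rule can be written down directly. The sketch below is a single-token illustration under assumed variable names (not the paper's implementation):

```python
import numpy as np

def verify_token(x, p, q, rng):
    """Lossless speculative verification of one drafted token.
    x: token id drawn from the draft distribution q.
    p, q: target and draft distributions (1-D probability arrays).
    Accept x with prob min(1, p[x]/q[x]); otherwise resample from the
    residual distribution norm(max(0, p - q))."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual))
```

By construction, the returned token is distributed exactly according to $p$, which is why acceptance lengths can be compared across drafters without affecting output quality.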
Empirical Validation / Results
The key results are consolidated in Table 1, with supporting evidence from routing tables and depth analyses.
Table 1: Main Results by Research Question (Acceptance Length)
| Model Variant | Method | MT-Bench (T=0) | GSM8K (T=0) | MATH-500 (T=0) | SVAMP (T=0) | Average (T=0) | MT-Bench (T=1) | GSM8K (T=1) | MATH-500 (T=1) | SVAMP (T=1) | Average (T=1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RQ1. Single-Domain |||||||||||
| MathInstruct | HASS | 2.90 | 5.02 | 5.35 | 3.13 | 4.10 | 2.31 | 4.75 | 4.63 | 2.46 | 3.54 |
| MathInstruct | EAGLE-2 | 2.54 | 5.04 | 5.28 | 4.81 | 4.42 | 2.43 | 4.71 | 4.61 | 4.53 | 4.07 |
| ShareGPT | HASS | 3.98 | 4.09 | 3.98 | 4.44 | 4.12 | 3.50 | 4.03 | 3.61 | 3.95 | 3.77 |
| ShareGPT | EAGLE-2 | 3.57 | 3.72 | 3.81 | 3.71 | 3.70 | 3.38 | 3.72 | 3.43 | 3.65 | 3.54 |
| RQ2. Mixed-Data | |||||||||||
| Mixed 35k+35k | HASS | 3.92 | 4.77 | 5.02 | 4.15 | 4.47 | 3.46 | 4.66 | 4.47 | 4.57 | 4.29 |
| Mixed 35k+35k | EAGLE-2 | 3.37 | 4.12 | 4.44 | 4.16 | 4.02 | 3.10 | 4.08 | 4.02 | 4.03 | 3.81 |
| Mixed 70k+70k | HASS | 4.13 | 5.53 | 5.67 | 5.38 | 5.18 | 3.17 | 4.16 | 3.42 | 4.01 | 3.69 |
| Mixed 70k+70k | EAGLE-2 | 3.75 | 4.68 | 4.85 | 4.64 | 4.48 | 2.99 | 3.76 | 3.20 | 3.08 | 3.26 |
| RQ3. Composition | |||||||||||
| Averaged | HASS | 2.29 | 2.80 | 3.12 | 2.13 | 2.59 | 2.10 | 2.78 | 2.90 | 2.69 | 2.62 |
| Averaged | EAGLE-2 | 2.07 | 2.53 | 2.57 | 2.50 | 2.42 | 2.01 | 2.49 | 2.42 | 2.45 | 2.34 |
| Confidence Routed | HASS | 3.93 | 5.01 | 5.37 | 4.89 | 4.80 | 3.51 | 4.72 | 4.55 | 4.71 | 4.37 |
| Confidence Routed | EAGLE-2 | 3.63 | 4.91 | 5.25 | 4.71 | 4.63 | 3.36 | 4.65 | 4.62 | 4.46 | 4.27 |
| Merged Trees | HASS | 4.05 | 5.42 | 5.65 | 5.31 | 5.11 | 3.76 | 5.21 | 4.98 | 5.05 | 4.75 |
| Merged Trees | EAGLE-2 | 3.93 | 5.32 | 5.63 | 5.25 | 5.03 | 3.55 | 5.01 | 4.79 | 4.93 | 4.57 |
RQ1: Task-Specific Training Improves Matched-Domain Acceptance
- Answer: Yes. Clear specialization is observed. ShareGPT-trained drafters excel on MT-Bench, while MathInstruct-trained drafters dominate on mathematical reasoning benchmarks (GSM8K, MATH-500, SVAMP). This pattern is consistent for both HASS and EAGLE-2 backbones.
RQ2: Mixed-Data Training Recovers Robustness, But Not Monotonically
- Answer: Partially. Mixed-data checkpoints improve average performance and robustness across tasks. However, the larger Mixed 70k+70k variant performs best at temperature 0 but can degrade at temperature 1, being outperformed by the smaller Mixed 35k+35k mixture. This indicates mixed training broadens coverage but requires tuning for the intended decoding regime.
RQ3: Inference-Time Composition is Substantially Stronger than Weight Merging
- Answer: Yes, decisively. Checkpoint averaging (weight-space merging) is the weakest method across all settings. In contrast, inference-time composition strategies are highly effective:
- Confidence Routing improves over the best single-domain baseline.
- Merged-Tree Verification achieves the highest acceptance length overall for both backbones at both temperatures, benefiting from increased proposal diversity.
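The cross-subtree masking behind merged-tree verification can be illustrated with a toy mask builder. This is a simplified sketch that treats each subtree as a linear chain; real draft trees branch, and the paper's mask construction is not shown:

```python
import numpy as np

def merged_tree_mask(n_a, n_b):
    """Boolean attention mask for two specialist subtrees (sizes n_a, n_b)
    packed under one shared root at position 0: every node attends to the
    root and causally within its own subtree, never across subtrees."""
    n = 1 + n_a + n_b
    mask = np.zeros((n, n), dtype=bool)
    mask[:, 0] = True                        # all nodes see the shared root
    for lo, hi in [(1, 1 + n_a), (1 + n_a, n)]:
        for i in range(lo, hi):
            mask[i, lo:i + 1] = True         # causal within own subtree
    return mask
```

Because the two subtrees never attend to each other, a single parallel verifier pass scores both proposals, and the accepted prefix can come from either specialist.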
RQ4: Confidence is a More Useful Routing Signal than Entropy
- Answer: Yes. While both accepted and rejected tokens show a consistent entropy pattern (rejected tokens have higher entropy), confidence is far more discriminative for making routing decisions between specialists, as shown in Table 2.
Table 2: Routing Decisions by Benchmark (EAGLE-2)
| Benchmark | Confidence → MathInstruct | Confidence → ShareGPT | Entropy → MathInstruct | Entropy → ShareGPT |
|---|---|---|---|---|
| MT-Bench | 15 (18.8%) | 65 (81.2%) | 42 (52.5%) | 38 (47.5%) |
| GSM8K | 1198 (90.8%) | 121 (9.2%) | 720 (54.6%) | 599 (45.4%) |
| MATH-500 | 485 (97.0%) | 15 (3.0%) | 312 (62.4%) | 188 (37.6%) |
| SVAMP | 279 (93.0%) | 21 (7.0%) | 159 (53.0%) | 141 (47.0%) |
RQ5: Speculative Depth Affects Exploration-Exploitation Balance
- Answer: Yes. Analysis of acceptance rates by depth (Figure 8, Appendix Tables 5-7) reveals that:
- Acceptance rates decline with increasing speculative depth for all variants.
- At shallow depths, mixed-data drafts often perform best, suggesting an exploration benefit from broader coverage.
- At deeper depths, the task-matched specialist becomes increasingly dominant, especially on reasoning tasks, indicating a shift towards exploitation based on sustained drafter-verifier agreement.
Theoretical and Practical Implications
- Draft Model as a Systems Choice: The quality of speculative decoding depends not only on the drafting architecture but critically on the alignment between the draft model's training distribution and the target workload. A mismatched drafter leads to predictably weaker performance on specific task families.
- Superiority of Inference-Time Composition: When multiple specialized drafters are available, they should be kept separate and composed at inference time (via routing or tree merging). Naive weight-space averaging fails to preserve the specialized behaviors needed for high acceptance length.
- Informed Deployment Strategies: The findings provide guidance for real-world deployment:
- For a known, specialized workload, train or fine-tune the drafter on matched data.
- For serving diverse workloads, consider mixed-data training (tuned for temperature) or, if resources allow, maintaining separate specialists with inference-time routing.
- Confidence is a simple and effective signal for router design.
- Theoretical Correctness: The paper provides formal proofs (Appendix A.4) that both confidence routing and merged-tree verification preserve the target model's output distribution under the lossless speculative decoding constraint, ensuring correctness is maintained.
Conclusion
The study demonstrates that task-aware training of draft models significantly improves speculative decoding performance on matched domains. Furthermore, inference-time composition of specialized drafters (via confidence routing or merged-tree verification) is a substantially more effective strategy than weight-space averaging, achieving the highest acceptance lengths overall.
These results establish that proposal quality in speculative decoding is a function of both the draft architecture and the draft training distribution. The drafter should therefore be treated as a configurable systems component that can and should be optimized for the target workload, opening avenues for future work on adaptive and learned composition strategies.