Summary (Overview)

  • Core problem: Existing diffusion-based speculative decoding methods use a fixed block size for all inputs, which is suboptimal because the optimal number of tokens to generate in parallel varies across samples.
  • Key insight: The optimal block size is sample-dependent but exhibits a strong local structure—it concentrates within a narrow range around the training block size, making the problem a small discrete classification task.
  • Proposed method (BlockPilot): A lightweight predictor that uses the predictive distribution of the last token after the target model’s prefilling stage to select an instance‑specific block size from a local candidate set. Prediction is performed once per sample and introduces minimal overhead.
  • Results: On Qwen3-4B at temperature 1, BlockPilot achieves an average acceptance length of 5.92 and a 4.20× speedup over standard autoregressive decoding, outperforming both EAGLE‑3 and all fixed-block DFlash variants across multiple models and benchmarks.
  • Contribution: First work to treat decoding policy (block size) as a learnable component in diffusion‑based speculative decoding, enabling consistent efficiency gains without modifying the draft or target models.

Introduction and Theoretical Foundation

Speculative decoding accelerates large language model inference by using a lightweight draft model to generate candidate tokens, which are then verified in parallel by the target model. Recently, diffusion-based speculative decoding (e.g., DFlash) has achieved state‑of‑the‑art performance by employing a diffusion language model (dLLM) as the drafter, generating a block of tokens in a single forward pass via block‑level diffusion. However, existing methods use a fixed inference block size inherited from training, assuming it is optimal for all inputs.

The authors challenge this assumption. They observe that the optimal block size—the number of tokens parallel‑drafted per step—depends on input‑specific characteristics such as contextual determinism and token‑level predictability. A larger block may yield higher parallelism but risks lower acceptance due to error accumulation, while a smaller block underutilizes the parallel capacity. Thus, block size should be treated as a sample‑dependent control variable, not a static hyperparameter.

Theoretical foundation: The average per‑token latency for a block‑wise speculative decoding step is

\[ L(B) = \frac{T_{\text{draft}}(B) + T_{\text{verify}}(B)}{\tau(B)} \]

where $\tau(B)$ is the expected number of accepted tokens per step. Since both $T_{\text{draft}}$ and $T_{\text{verify}}$ grow sublinearly with $B$, efficiency is primarily governed by $\tau(B)$. Maximizing $\tau(B)$ for each sample directly improves the end‑to‑end speedup $\eta(B) = L_{\text{AR}} / L(B)$.
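
As a rough illustration of the latency and speedup formulas, the sketch below evaluates them for three block sizes. All timings and acceptance lengths in `profiles`, and the autoregressive latency `L_AR`, are hypothetical numbers chosen for the example; they are not measurements from the paper.

```python
def per_token_latency(t_draft_ms, t_verify_ms, tau):
    """L(B) = (T_draft(B) + T_verify(B)) / tau(B), in ms per accepted token."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return (t_draft_ms + t_verify_ms) / tau


def speedup(l_ar_ms, t_draft_ms, t_verify_ms, tau):
    """eta(B) = L_AR / L(B)."""
    return l_ar_ms / per_token_latency(t_draft_ms, t_verify_ms, tau)


# Hypothetical per-step profile for one sample: B -> (T_draft, T_verify, tau).
L_AR = 20.0  # assumed autoregressive latency in ms per token
profiles = {
    8: (4.0, 22.0, 4.0),
    16: (5.0, 24.0, 5.8),
    32: (7.0, 28.0, 5.5),
}

for B, (t_draft, t_verify, tau) in profiles.items():
    latency = per_token_latency(t_draft, t_verify, tau)
    eta = speedup(L_AR, t_draft, t_verify, tau)
    print(f"B={B:2d}  L(B)={latency:.2f} ms  speedup={eta:.2f}x")

# The block size with the highest speedup for this (made-up) sample.
best_block = max(profiles, key=lambda B: speedup(L_AR, *profiles[B]))
```

In this made-up profile the step cost grows slowly with the block size, so the block size with the largest acceptance length also gives the lowest per-token latency.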

Methodology

Key findings from exhaustive block‑size sweeps:

  1. Instance‑wise variability: Only a fraction of samples achieve optimal performance with the training block size; many prefer different sizes (Fig. 3a).
  2. Local interval property: Although the optimal block size $B^*$ varies, its distribution is strongly localized around the training block size $B$; nearly all optimal values lie within $[B-3, B+3]$ (Fig. 3b–c). This reduces the search space to a small discrete interval.
  3. Classification formulation: The problem is cast as a supervised classification over the local candidate set $\mathcal{B} = \{B-k, \dots, B+k\}$.
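
The classification setup above can be sketched in a few lines. The helper names `candidate_block_sizes` and `optimal_block_size`, the tie-breaking rule, and the acceptance lengths in `sweep` are assumptions made for illustration, not details taken from the paper.

```python
def candidate_block_sizes(train_block_size, k):
    """Local candidate set {B-k, ..., B+k}, clipped so every size is at least 1."""
    lowest = max(1, train_block_size - k)
    return list(range(lowest, train_block_size + k + 1))


def optimal_block_size(tau_by_block):
    """Return the block size with the largest acceptance length.

    Ties are broken toward the smaller block size (an assumption of this sketch).
    """
    return max(sorted(tau_by_block), key=lambda b: tau_by_block[b])


candidates = candidate_block_sizes(16, 2)

# Hypothetical sweep result for one sample: block size -> acceptance length.
sweep = {14: 5.1, 15: 5.6, 16: 5.4, 17: 5.6, 18: 4.9}

best = optimal_block_size(sweep)
label = candidates.index(best)  # class index used as the supervised target
print(candidates, best, label)
```

With a radius of 2 around a training block size of 16, the task has only five classes, which is what makes a small classifier sufficient.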

BlockPilot framework:

  • Input representation: The predictive probability distribution of the last token after prefilling is used as a compact state representation. This token attends to the full context and encodes information about future generation reliability.
  • Predictor architecture: A lightweight two‑layer MLP (hidden dimension 2048) takes the full distribution as input and outputs logits over the candidate block sizes. The predicted block size is $\hat{B}(x) = \arg\max_{b \in \mathcal{B}} f(p(x))_b$.
  • Training data construction: For each training sample, the prefilling distribution $p(x)$ is extracted, then the acceptance length $\tau(b;x)$ is evaluated for every candidate $b$. The optimal size $B^*(x) = \arg\max_b \tau(b;x)$ serves as the label. Data is collected from ShareGPT, WSC, and COPA.
  • Loss function: Standard cross‑entropy loss over the small candidate set.
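
A minimal sketch of the predictor and its loss, written in plain Python with toy dimensions so it runs anywhere. The ReLU activation, the uniform weight initialization, the absence of a training loop, and the toy sizes (vocabulary 6, hidden 4) are all assumptions of this sketch; the source only specifies a two‑layer MLP with hidden dimension 2048 and cross‑entropy loss.

```python
import math
import random


def make_mlp(vocab_size, hidden_dim, num_classes, seed=0):
    """Randomly initialized two-layer MLP parameters (toy initialization)."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": matrix(hidden_dim, vocab_size), "b1": [0.0] * hidden_dim,
        "W2": matrix(num_classes, hidden_dim), "b2": [0.0] * num_classes,
    }


def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def forward(params, p):
    """Map the prefilling distribution p(x) to logits over candidate block sizes."""
    hidden = [max(0.0, v) for v in linear(params["W1"], params["b1"], p)]  # ReLU (assumed)
    return linear(params["W2"], params["b2"], hidden)


def cross_entropy(logits, label_idx):
    """Numerically stable -log softmax(logits)[label_idx]."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label_idx]


def predict_block_size(params, p, candidates):
    """B_hat(x) = argmax over candidates of f(p(x))_b."""
    logits = forward(params, p)
    best_idx = max(range(len(candidates)), key=lambda i: logits[i])
    return candidates[best_idx]


candidates = [14, 15, 16, 17, 18]
p = [0.70, 0.10, 0.10, 0.05, 0.03, 0.02]   # toy prefilling distribution
params = make_mlp(vocab_size=len(p), hidden_dim=4, num_classes=len(candidates))

loss = cross_entropy(forward(params, p), label_idx=1)  # label 1 means B* = 15
predicted = predict_block_size(params, p, candidates)
```

In practice one would use an autodiff framework and a real vocabulary-sized input; the point here is only the shape of the mapping from $p(x)$ to a class over the candidate set.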

Inference: The predictor is invoked once after prefilling to select the block size, which is then used for all subsequent drafting steps. The overhead is minimal: the predictor has only 0.32B parameters and adds ~7.34 ms latency (Table 1).
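
The once-per-sample control flow described above might look like the following sketch. `decode_with_fixed_choice`, `toy_predictor`, and `toy_step` are hypothetical stand-ins: the real draft and verify steps are replaced by a stub that reports how many tokens were accepted, and the sketch assumes each verification step yields at least one token.

```python
def decode_with_fixed_choice(prefill_probs, predictor, step_fn, max_new_tokens):
    """Pick a block size once after prefilling, then reuse it for every step."""
    block_size = predictor(prefill_probs)  # invoked exactly once per sample
    produced, steps = 0, 0
    while produced < max_new_tokens:
        accepted = step_fn(block_size)     # stub for one draft + verify step
        produced += max(1, accepted)       # assume at least one token per step
        steps += 1
    return block_size, produced, steps


calls = {"predictor": 0}


def toy_predictor(prefill_probs):
    """Hypothetical rule: use a larger block when the last token is confident."""
    calls["predictor"] += 1
    return 16 if max(prefill_probs) > 0.5 else 14


def toy_step(block_size):
    """Hypothetical acceptance model: at most 5 tokens accepted per step."""
    return min(block_size, 5)


result = decode_with_fixed_choice([0.7, 0.2, 0.1], toy_predictor, toy_step, 20)
print(result, calls["predictor"])
```

Because the predictor runs outside the decoding loop, its cost is paid once per sample regardless of how many drafting steps follow.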

Empirical Validation / Results

Experiments are conducted on Qwen3‑4B, Qwen3‑8B, Llama‑3.1‑8B‑Instruct, and Qwen3‑Coder‑30B‑A3B across Math (GSM8K, MATH‑500, AIME24), Code (HumanEval, MBPP, SWE‑Bench), and Chat (MT‑Bench) benchmarks. Baselines include standard autoregressive decoding, EAGLE‑3, and DFlash with fixed block sizes {4,8,16,32}.

Main Results (Table 2)

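| Model | Temp | Method | Avg. Speedup | Avg. τ |
|---|---|---|---|---|
| Qwen3‑4B | 0 | EAGLE‑3 | 1.70× | 2.95 |
| Qwen3‑4B | 0 | DFlash(16) | 3.99× | 6.31 |
| Qwen3‑4B | 0 | BlockPilot | 4.17× | 6.59 |
| Qwen3‑4B | 1 | EAGLE‑3 | 1.70× | 2.88 |
| Qwen3‑4B | 1 | DFlash(16) | 3.80× | 5.35 |
| Qwen3‑4B | 1 | BlockPilot | 4.20× | 5.92 |
| Qwen3‑8B | 0 | EAGLE‑3 | 1.75× | 2.93 |
| Qwen3‑8B | 0 | DFlash(16) | 4.42× | 6.13 |
| Qwen3‑8B | 0 | BlockPilot | 4.66× | 6.46 |
| Qwen3‑8B | 1 | EAGLE‑3 | 1.65× | 2.77 |
| Qwen3‑8B | 1 | DFlash(16) | 3.55× | 5.00 |
| Qwen3‑8B | 1 | BlockPilot | 3.94× | 5.55 |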

BlockPilot consistently outperforms all fixed-block configurations. Notably, DFlash(32) often underperforms DFlash(16) due to increased drafting errors, while BlockPilot adaptively picks a suitable size for each sample.
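
To put the Table 2 figures in relative terms, the short script below computes BlockPilot's percentage gain over DFlash(16) in both speedup and acceptance length. The numbers are copied from the reported results; the `relative_gain` helper is just arithmetic added for illustration.

```python
# (model, temperature) -> method -> (avg. speedup, avg. acceptance length tau)
results = {
    ("Qwen3-4B", 0): {"DFlash(16)": (3.99, 6.31), "BlockPilot": (4.17, 6.59)},
    ("Qwen3-4B", 1): {"DFlash(16)": (3.80, 5.35), "BlockPilot": (4.20, 5.92)},
    ("Qwen3-8B", 0): {"DFlash(16)": (4.42, 6.13), "BlockPilot": (4.66, 6.46)},
    ("Qwen3-8B", 1): {"DFlash(16)": (3.55, 5.00), "BlockPilot": (3.94, 5.55)},
}


def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return (new / old - 1.0) * 100.0


gains = {}
for key, methods in results.items():
    base_speed, base_tau = methods["DFlash(16)"]
    pilot_speed, pilot_tau = methods["BlockPilot"]
    gains[key] = (relative_gain(pilot_speed, base_speed),
                  relative_gain(pilot_tau, base_tau))
    model, temp = key
    print(f"{model} T={temp}: speedup +{gains[key][0]:.1f}%, tau +{gains[key][1]:.1f}%")
```

On these four settings the gain is roughly 4 to 5 percent at temperature 0 and roughly 10 to 11 percent at temperature 1, and the speedup gain tracks the gain in acceptance length closely in each row.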

Ablation Studies

  • Predictor configuration: A two‑layer MLP with hidden size 2048 provides the best trade‑off; larger sizes yield diminishing returns.
  • Candidate interval radius $k$: $k=2$ (range $[B-2, B+2]$) gives the best results; $k=1$ is too restrictive, and $k=3$ increases prediction difficulty.
  • Input preprocessing: Using the raw softmax distribution from prefilling is superior to normalization or additional softmax, as it preserves confidence information.
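
One way to see why an additional softmax can hurt is that softmax applied to values already in [0, 1] produces a much flatter distribution. The sketch below demonstrates this on toy logits; the logits are made up, and this is an illustration of the general effect rather than the paper's own analysis.

```python
import math


def softmax(values):
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


logits = [6.0, 2.0, 1.0, 0.0]   # toy logits with one clearly dominant token
p_raw = softmax(logits)         # the raw prefilling distribution
p_twice = softmax(p_raw)        # the same distribution after an extra softmax

print(f"raw:   max={max(p_raw):.3f}  entropy={entropy(p_raw):.3f}")
print(f"twice: max={max(p_twice):.3f}  entropy={entropy(p_twice):.3f}")
```

The raw distribution puts most of its mass on one token, while the twice-softmaxed one is close to uniform, so the confidence signal the predictor relies on is largely erased.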

Theoretical and Practical Implications

  • Theoretical: The work formalizes block size as a learnable policy variable in diffusion‑based speculative decoding, revealing that inference efficiency depends not only on model architecture but also on instance‑adaptive decision‑making. The locality property ($B^*$ lies near $B$) justifies a small, structured action space and makes learning tractable.
  • Practical: BlockPilot is plug‑and‑play—it requires no changes to the draft or target model, and only adds a single lightweight forward pass after prefilling. The speedups (e.g., 4.20× on Qwen3‑4B) represent significant practical acceleration for deployment. The small memory overhead (~0.62 GB) is negligible compared to backbone models.
  • Limitations: The current approach assumes the optimal block size can be predicted from the prefilling state alone; future work may explore dynamic block sizes across decoding steps.

Conclusion

BlockPilot demonstrates that instance‑adaptive block size selection can substantially improve the efficiency of diffusion‑based speculative decoding. By leveraging the locality of optimal block sizes and a lightweight predictor trained on prefilling representations, it achieves state‑of‑the‑art speedups across multiple model scales and tasks. The method is simple, efficient, and seamlessly integrates into existing frameworks. Future directions include extending adaptive policies to other decoding hyperparameters and exploring per‑step dynamic adjustments.
