Summary (Overview)

  • Problem: In Prefill-Decode (PD) disaggregated Mixture-of-Experts (MoE) serving, decode latency is dominated by the number of distinct experts activated per step, not just batch size. Existing load-balancing routers ignore expert locality, leading to higher latency when requests with dissimilar expert usage are colocated.
  • Key insight: Expert activation is structured by domain (task, language) and predictable from prefill activations (correlation 0.70–0.92). This enables a second routing axis: expert locality.
  • Solution: ELDR builds an expert signature from prefill activations, clusters signatures offline with Hungarian-balanced K-means (one centroid per decoder), and uses locality-band routing online to send each request to the least-loaded worker among those whose centroid is within a similarity band.
  • Results: On three MoE models (Qwen3-30B-A3B, GPT-OSS-120B, Gemma-4-26B-A4B) and two workloads (task, language), ELDR reduces median TPOT by 5.9–13.9% over the best load-balancing baseline, with model outputs unchanged. A signature cache co-indexed with the KV cache keeps signatures exact under prefix caching.
  • Scalability: Generalizes to 235B models with expert parallelism (40 GPUs) and across decoder pool sizes (8–24 decoders).

Introduction and Theoretical Foundation

Background

Large Language Model (LLM) serving is moving toward Prefill-Decode (PD) disaggregation, where compute-bound prefill and memory-bandwidth-bound decode run on separate worker pools (xPrefill yDecode). This avoids phase interference but makes routing critical: each request is assigned a prefill worker and later a decode worker. Prefill-side routing often exploits KV cache affinity; decode-side routing only balances load.

For dense models, equal load implies equal work. For Mixture-of-Experts (MoE) models, this is false. MoE decode is memory-bandwidth bound: each step loads the weights of every distinct expert the batch activates. The union of active experts, not token count, governs latency. Growing active experts from 16 to 128 raises MoE-layer latency 4.7× at fixed batch size, while batch size barely moves it (Fig. 2).

Key Observations

  1. Expert specialization by domain: Task (code, math, medical, legal) and language (English, Chinese, Russian, French) requests activate distinct expert subsets (Fig. 1). Same-domain batches activate 17–21% (task) / 3–10% (language) fewer experts per step than mixed batches (Fig. 4).
  2. Prefill predicts decode: Per-expert prefill and decode activation correlate at 0.70–0.92 across models (Fig. 3). The expert footprint is visible at the prefill→decode handoff, the moment the router must act.
  3. Opportunity: A router that colocates similar requests shrinks each worker’s per-step expert union, reducing latency. Existing load-only routers scatter them.

Challenges

  • Expert signature design: Raw activation counts must be transformed into a vector space where proximity reflects decode-time expert overlap.
  • Locality vs. load: Locality-only routing overloads popular domains (e.g., English+Chinese = 75% of WildChat, Fig. 5); load-only routing scatters similar requests. They require different information: aggregate (locality) vs. instantaneous (load).
  • Prefix cache coherence: Prefix hits skip prefill, so expert activations for cached tokens are missing. Need a mechanism to recover the full signature.

Methodology

Expert Signature

ELDR summarizes each request’s prefill expert activations into a compact expert signature (s_r). The signature is designed so that cosine distance between signatures predicts decode-time expert overlap.

Design principles:

  • Discrete counts: Use top-(k) token counts per layer, not continuous gate scores (which assign mass to never-loaded experts).
  • Downweight common experts: Apply inverse document frequency (IDF) to each (layer, expert) cell: [ w(\ell, e) = \log\left(\frac{|C|+1}{\text{df}(\ell, e)+1}\right) ] where (\text{df}(\ell, e)) counts calibration requests in which expert (e) fires at layer (\ell).
  • Keep informative layers: Greedy selection of layers that maximize rank correlation (\rho) (Eq. 1).

Signature construction (4 steps):

  1. Count per-layer per-expert tokens: (c_r(\ell) \in \mathbb{N}^E).
  2. Reweight: (\tilde{c}_r(\ell,e) = c_r(\ell,e) \cdot w(\ell,e)).
  3. Select layers (\mathcal{S}) (greedy, keep (N^*) layers where cumulative (\rho) peaks).
  4. L2-normalize: [ s_r = x_r / |x_r|2, \quad x_r = \bigoplus{\ell \in \mathcal{S}} \tilde{c}_r(\ell) ]

Signature quality metric: [ \rho = \text{Spearman}\left( \text{cos-dist}(s_i, s_j),; \text{cos-dist}(p_i, p_j) \right) ] where (p_i) is the decode-time activation probability vector. High (\rho) means nearby signatures correspond to similar decode-time expert usage.

Validation (Fig. 7): Count·IDF achieves highest mean (\rho) (0.76) vs. continuous softmax (0.67) or binary (0.47). Layer masking improves (\rho) by 0.005–0.032 (Fig. 8).

Decode Clustering and Routing

Offline: Balanced K-means

  • Given (K) decode workers, cluster calibration signatures into (K) groups using Hungarian-balanced K-means: each centroid takes at most (\lceil N/K \rceil) points, minimizing total cosine distance. This ensures locality and balanced cluster sizes, preventing load imbalance.
  • Result: one centroid per decoder, capturing semantic structure (Fig. 9: task domains and languages occupy distinct regions).

Online: Locality-band routing

  • For each request, compute cosine similarity to all (K) centroids. Let (s^* = \max_k s_k).
  • Among workers with (s_k \geq s^* - \tau) (the locality band), select the one with the smallest load (number of in-flight decode requests).
  • Default (\tau = 0.1). (\tau=0) is pure top-1 (locality, ignores load); (\tau=1) is pure shortest-queue (load, ignores locality). The band adapts to signature confidence: confident signatures narrow the band, ambiguous ones widen it.

Prefix Cache Coherence

ELDR stores expert signatures at KV cache block granularity: each block carries its tokens’ expert footprint. A request’s signature is the sum over its blocks: [ s_r = \sum_{b \in \mathcal{B}(r)} \text{sig}[b] ] This is exact whether blocks come from a cache hit (from an earlier request) or fresh computation. The signature cache is a preallocated GPU tensor, one int8 per (block, MoE layer, expert), sized <1% of KV cache. No additional allocator or eviction state needed.

Empirical Validation / Results

Setup

  • Testbed: 5-node AMD MI300X cluster (8 GPUs/node, 192 GB HBM, 400 Gbps InfiniBand).
  • Models: Qwen3-30B-A3B, GPT-OSS-120B, Gemma-4-26B-A4B (all TP=1). Also Qwen3-235B-A22B (TP=4, EP=4).
  • Workloads: Task (11,668 prompts: code, math, medical, legal) and Language (14,000 WildChat prompts, skewed by language).
  • Topology: 8P16D (8 prefiller, 16 decoder, 24 GPUs) unless noted.
  • Baselines: Random, Round-Robin, Join-Shortest-Queue (JSQ), Power-of-Two-Choices (P2C), and Domain (oracle label-based). All share same prefill policy.

Main Results

Task workload (Fig. 11):

  • ELDR reduces median TPOT by 7.0–13.9% and tail TPOT by 3.4–6.0% over the best load balancer.
  • Outperforms Domain (which reduces median TPOT 6.8–9.7% over load balancers) by an additional 1.4–6.9% median and 1.6–4.5% tail.

Language workload (Fig. 12):

  • ELDR reduces median TPOT by 5.9–10.0% over best load balancer; tail TPOT reduces by up to 9.6% at peak.
  • Domain collapses: language labels are coarse proxies; ELDR’s finer signature clusters capture intra-language sub-structure.

Overhead

  • Offline fit: <10 seconds on CPU (greedy mask + balanced K-means on 1,000 calibration prompts).
  • Online per-request: 0.86 ms, 1.2% of median TTFT (Table 1). Signature cache: 0.24% of HBM.

Table 1: ELDR runtime overhead (Qwen3-30B-A3B, task, 8P16D, 60 req/s, median TTFT 69 ms)

ComponentLocusTime% TTFT
record() hookprefill host0.02 ms<0.1
reduce() scatterprefill GPU0.48 ms0.7
stage_sigs() D2Hprefill GPU0.21 ms0.3
pop_sig fetchscheduler7.0 μs<0.1
Route (τ-JSQ)proxy0.15 ms0.2
Total0.86 ms1.2

Design Validation

  • Active-expert reduction: ELDR reduces mean active experts per decode step by 22.0% vs. RR (Fig. 13).
  • Signature choice: Count·IDF outperforms gate-prob by 3pp average TPOT P50 reduction (Fig. 14).
  • Balanced K-means: Vanilla K-means regresses tail TPOT (+17.4% worst); balanced recovers both median and tail (Fig. 15).
  • Locality band width: (\tau=0.1) removes tail regression seen at (\tau=0) while preserving median gains (Fig. 16).
  • Prefix cache composition: ELDR’s TPOT advantage is preserved when prefix caching is on (Fig. 17).

Generalization

  • Decoder pool size: Median TPOT reduction grows monotonically from 8.0% (8P8D) to 10.2% (8P24D) (Table 2).
  • Large MoE with EP: On Qwen3-235B-A22B (40 GPUs, EP=4), ELDR reduces median TPOT by 2.7–4.3% and tail TPOT by 0.6–2.0% (Fig. 18).

Table 2: Topology generalization (Qwen3-30B-A3B, language, mean %Δ vs RR over 20–100 qps)

Metric8P8D8P16D8P24D
TPOT P50−8.0%−9.8%−10.2%
TPOT P99−2.5%−0.8%−1.1%

Theoretical and Practical Implications

  • Theoretical: Establishes expert locality as a first-order latency knob in MoE serving, decoupled from load. Shows that prefill activations provide a predictable signal for decode-time expert overlap, with strong rank correlation (ρ up to 0.76). The split of locality (aggregate, offline) and load (instantaneous, online) is a principled way to handle conflicting objectives.
  • Practical: ELDR is lossless – it changes only which worker serves a request, not token-level expert selection, so model outputs are identical to standard top-k gating. It composes with prefix caching, expert parallelism, and existing load balancers. Implementation in vLLM adds only 0.86 ms overhead per request. The offline fit is cheap (<10 s) and can adapt to workload shifts.
  • Impact: For PD-disaggregated MoE serving, ELDR reduces TPOT (especially median) significantly, improving user-perceived responsiveness. The approach is orthogonal to intra-worker techniques (e.g., expert load balancing) and can be combined.

Conclusion

Decode routing in PD-disaggregated MoE serving has traditionally optimized only load. ELDR introduces expert locality as a second routing axis. It uses a prefill-derived expert signature, partitions the signature space with Hungarian-balanced K-means, and routes via a locality band that balances locality against live load. A block-granular signature cache ensures coherence with prefix caching. On three MoE models and two workloads, ELDR reduces median TPOT by 5.9–13.9% over the best load-balancing baseline, with model outputs unchanged. The approach generalizes across decoder pool sizes and to large MoE models with expert parallelism.

Future directions: Adapting to workload drift via periodic re-clustering, extending to multi-node deployments with heterogeneous hardware, and exploring integration with proactive cache-aware prefill routing.

Related papers