# ELDR: Expert-Locality-Aware Decode Routing for PD-Disaggregated MoE Serving

> ELDR reduces median decode latency by 5.9–13.9% via expert locality routing, grouping requests by prefill-predicted expert activations.

- **Source:** [arXiv](https://arxiv.org/abs/2607.00466)
- **Published:** 2026-07-03
- **Permalink:** https://picx.dev/p/vpvWrV
- **Whiteboard:** https://picx.dev/p/vpvWrV/image

## Summary

- **Problem**: In Prefill-Decode (PD) disaggregated Mixture-of-Experts (MoE) serving, decode latency is dominated by the number of distinct experts activated per step, not just batch size. Existing load-balancing routers ignore expert locality, leading to higher latency when requests with dissimilar expert usage are colocated.
- **Key insight**: Expert activation is structured by domain (task, language) and predictable from prefill activations (correlation 0.70–0.92). This enables a second routing axis: **expert locality**.
- **Solution**: ELDR builds an expert signature from prefill activations, clusters signatures offline with Hungarian-balanced K-means (one centroid per decoder), and uses locality-band routing online to send each request to the least-loaded worker among those whose centroid is within a similarity band.
- **Results**: On three MoE models (Qwen3-30B-A3B, GPT-OSS-120B, Gemma-4-26B-A4B) and two workloads (task, language), ELDR reduces median TPOT by **5.9–13.9%** over the best load-balancing baseline, with model outputs unchanged. A signature cache co-indexed with the KV cache keeps signatures exact under prefix caching.
- **Scalability**: Generalizes to 235B models with expert parallelism (40 GPUs) and across decoder pool sizes (8–24 decoders).

## Introduction and Theoretical Foundation

### Background
Large Language Model (LLM) serving is moving toward **Prefill-Decode (PD) disaggregation**, where compute-bound prefill and memory-bandwidth-bound decode run on separate worker pools (written xPyD for x prefill and y decode workers, e.g., 8P16D). This avoids phase interference but makes routing critical: each request is assigned a prefill worker and later a decode worker. Prefill-side routing often exploits KV cache affinity, whereas decode-side routing only balances load.

For **dense models**, equal load implies equal work. For **Mixture-of-Experts (MoE) models**, this is false. MoE decode is memory-bandwidth bound: each step loads the weights of every distinct expert the batch activates. The **union of active experts**, not token count, governs latency. Growing active experts from 16 to 128 raises MoE-layer latency 4.7× at fixed batch size, while batch size barely moves it (Fig. 2).

### Key Observations
1. **Expert specialization by domain**: Task (code, math, medical, legal) and language (English, Chinese, Russian, French) requests activate distinct expert subsets (Fig. 1). Same-domain batches activate 17–21% (task) / 3–10% (language) fewer experts per step than mixed batches (Fig. 4).
2. **Prefill predicts decode**: Per-expert prefill and decode activation correlate at 0.70–0.92 across models (Fig. 3). The expert footprint is visible at the prefill→decode handoff, the moment the router must act.
3. **Opportunity**: A router that colocates similar requests shrinks each worker’s per-step expert union, reducing latency. Existing load-only routers scatter them.
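
To make the opportunity concrete, here is a toy sketch of how placement changes the per-step expert union on each worker. The expert ids, the top-2 routing, and the helper name `active_expert_union` are illustrative assumptions, not values or code from the paper.

```python
def active_expert_union(batch):
    """Return the set of distinct experts a decode batch touches in one step.

    `batch` is a list of per-request expert sets (hypothetical toy data).
    """
    union = set()
    for experts in batch:
        union |= experts
    return union


# Made-up routing choices for one MoE layer with top-2 gating.
code_a, code_b = {0, 1}, {1, 2}      # two "code" requests share expert 1
legal_a, legal_b = {5, 6}, {6, 7}    # two "legal" requests share expert 6

# Locality-aware placement: each worker serves one domain.
grouped = [[code_a, code_b], [legal_a, legal_b]]
# Load-only placement that happens to mix domains on each worker.
mixed = [[code_a, legal_a], [code_b, legal_b]]

grouped_sizes = [len(active_expert_union(b)) for b in grouped]
mixed_sizes = [len(active_expert_union(b)) for b in mixed]
print(grouped_sizes, mixed_sizes)  # [3, 3] [4, 4]
```

Both placements give every worker two requests, so a load-only router sees them as equivalent, yet the grouped placement loads fewer distinct experts per step.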

### Challenges
- **Expert signature design**: Raw activation counts must be transformed into a vector space where proximity reflects decode-time expert overlap.
- **Locality vs. load**: Locality-only routing overloads popular domains (e.g., English and Chinese together make up 75% of WildChat, Fig. 5), while load-only routing scatters similar requests. The two objectives rely on different information: aggregate statistics for locality, instantaneous state for load.
- **Prefix cache coherence**: Prefix hits skip prefill, so expert activations for cached tokens are missing. ELDR needs a mechanism to recover the full signature.

## Methodology

### Expert Signature
ELDR summarizes each request’s prefill expert activations into a compact **expert signature** \(s_r\). The signature is designed so that cosine distance between signatures predicts decode-time expert overlap.

**Design principles**:
- **Discrete counts**: Use top-\(k\) token counts per layer, not continuous gate scores (which assign mass to never-loaded experts).
- **Downweight common experts**: Apply inverse document frequency (IDF) to each (layer, expert) cell:
  \[
  w(\ell, e) = \log\left(\frac{|C|+1}{\text{df}(\ell, e)+1}\right)
  \]
  where \(\text{df}(\ell, e)\) counts calibration requests in which expert \(e\) fires at layer \(\ell\).
- **Keep informative layers**: Greedy selection of layers that maximize rank correlation \(\rho\) (Eq. 1).
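
As a worked example of the IDF weighting above, the sketch below computes \(w(\ell, e)\) for a tiny calibration set. The input format (each request as a set of `(layer, expert)` pairs that fired at least once) and the function name `idf_weights` are assumptions made for illustration.

```python
import math


def idf_weights(calibration, num_layers, num_experts):
    """Compute w[l][e] = log((|C| + 1) / (df(l, e) + 1)).

    `calibration` is a list of requests; each request is the set of
    (layer, expert) cells that fired at least once (hypothetical format).
    """
    n = len(calibration)
    weights = []
    for layer in range(num_layers):
        row = []
        for expert in range(num_experts):
            # df counts the calibration requests in which this cell fired.
            df = sum(1 for req in calibration if (layer, expert) in req)
            row.append(math.log((n + 1) / (df + 1)))
        weights.append(row)
    return weights


# Three toy calibration requests, one layer, three experts.
calib = [{(0, 0), (0, 1)}, {(0, 0)}, {(0, 0), (0, 2)}]
w = idf_weights(calib, num_layers=1, num_experts=3)
```

Expert 0 fires in every request, so its weight is \(\log(4/4) = 0\) and it drops out of the signature; experts 1 and 2 each fire once and get \(\log(4/2) = \log 2\).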

**Signature construction** (4 steps):
1. Count per-layer per-expert tokens: \(c_r(\ell) \in \mathbb{N}^E\).
2. Reweight: \(\tilde{c}_r(\ell,e) = c_r(\ell,e) \cdot w(\ell,e)\).
3. Select layers \(\mathcal{S}\) (greedy, keep \(N^*\) layers where cumulative \(\rho\) peaks).
4. L2-normalize:
   \[
   s_r = x_r / \|x_r\|_2, \quad x_r = \bigoplus_{\ell \in \mathcal{S}} \tilde{c}_r(\ell)
   \]
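
Steps 2 to 4 above can be sketched in a few lines of plain Python. The nested-list layout, the function name `build_signature`, and the handling of an all-zero vector are assumptions for this sketch; the layer mask is taken as given rather than derived by the greedy search.

```python
import math


def build_signature(counts, weights, selected_layers):
    """Reweight counts, concatenate the selected layers, and L2-normalize.

    counts[l][e]    : tokens routed to expert e at layer l during prefill
    weights[l][e]   : IDF weight for that (layer, expert) cell
    selected_layers : indices of the layers kept by the layer mask
    """
    x = []
    for layer in selected_layers:
        x.extend(c * w for c, w in zip(counts[layer], weights[layer]))
    norm = math.sqrt(sum(v * v for v in x))
    if norm == 0.0:
        # Assumption of this sketch: return the zero vector unchanged.
        return x
    return [v / norm for v in x]


# Toy inputs: 3 layers, 3 experts; layer 1 is masked out.
counts = [[3, 0, 4], [9, 9, 9], [0, 2, 0]]
weights = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]
sig = build_signature(counts, weights, selected_layers=[0, 2])
```

The result has one entry per (selected layer, expert) cell and unit L2 norm, so the dot product of two signatures is their cosine similarity.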

**Signature quality metric**:
\[
\rho = \text{Spearman}\left( \text{cos-dist}(s_i, s_j),\; \text{cos-dist}(p_i, p_j) \right)
\]
where \(p_i\) is the decode-time activation probability vector. High \(\rho\) means nearby signatures correspond to similar decode-time expert usage.
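
The metric can be evaluated with the standard library alone, as in the sketch below. It assumes \(\rho\) is computed over all unordered request pairs and uses average ranks for ties; both choices, the helper names, and the toy vectors are assumptions of this example rather than details from the paper.

```python
import math
from itertools import combinations


def cos_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    # Undefined (division by zero) if either input is constant.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def signature_quality(sigs, decode_probs):
    """Spearman correlation between pairwise signature and decode distances."""
    pairs = list(combinations(range(len(sigs)), 2))
    d_sig = [cos_dist(sigs[i], sigs[j]) for i, j in pairs]
    d_dec = [cos_dist(decode_probs[i], decode_probs[j]) for i, j in pairs]
    return pearson(ranks(d_sig), ranks(d_dec))


# Toy data: three requests, two experts.
sigs = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0]]
decode_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
rho = signature_quality(sigs, decode_probs)
```

In this toy case the pairwise distances are ordered identically in both spaces, so `rho` comes out as 1. Real signatures are scored on far more pairs and, per the validation numbers, land well below that.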

**Validation** (Fig. 7): Count·IDF achieves the highest mean \(\rho\) (0.76), compared with 0.67 for continuous softmax and 0.47 for binary signatures. Layer masking improves \(\rho\) by 0.005–0.032 (Fig. 8).

### Decode Clustering and Routing

**Offline: Balanced K-means**
- Given \(K\) decode workers, cluster calibration signatures into \(K\) groups using **Hungarian-balanced K-means**: each centroid takes at most \(\lceil N/K \rceil\) points, minimizing total cosine distance. This ensures locality and balanced cluster sizes, preventing load imbalance.
- Result: one centroid per decoder, capturing semantic structure (Fig. 9: task domains and languages occupy distinct regions).
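
The capacity-constrained assignment step can be illustrated on a toy input. This sketch swaps the Hungarian solver for exhaustive search, which is only feasible for a handful of points; in practice a linear-assignment solver such as `scipy.optimize.linear_sum_assignment` over replicated centroid slots would take its place. Function names and data are hypothetical, and the centroid update that K-means alternates with this step is omitted.

```python
import math
from itertools import product


def cos_dist(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def balanced_assign(points, centroids):
    """Assign points to centroids with at most ceil(N/K) points each,
    minimizing total cosine distance. Brute force: toy sizes only."""
    n, k = len(points), len(centroids)
    cap = math.ceil(n / k)
    best, best_cost = None, float("inf")
    for labels in product(range(k), repeat=n):
        if any(labels.count(c) > cap for c in range(k)):
            continue  # violates the per-centroid capacity
        cost = sum(cos_dist(p, centroids[c]) for p, c in zip(points, labels))
        if cost < best_cost:
            best, best_cost = labels, cost
    return list(best)


# Three points lean toward centroid 0, one toward centroid 1.
points = [[1.0, 0.0], [0.9, 0.1], [0.6, 0.4], [0.0, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
labels = balanced_assign(points, centroids)
print(labels)  # [0, 0, 1, 1]
```

Plain nearest-centroid assignment would put three points on centroid 0. The capacity of two forces one of them to move, and the optimizer picks the point that loses the least similarity by moving.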

**Online: Locality-band routing**
- For each request, compute cosine similarity to all \(K\) centroids. Let \(s^* = \max_k s_k\).
- Among workers with \(s_k \geq s^* - \tau\) (the **locality band**), select the one with the smallest load (number of in-flight decode requests).
- Default \(\tau = 0.1\). \(\tau=0\) is pure top-1 (locality, ignores load); \(\tau=1\) is pure shortest-queue (load, ignores locality). The band adapts to signature confidence: confident signatures narrow the band, ambiguous ones widen it.
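
The routing rule above fits in a few lines. This is a minimal sketch: the function name `route`, the example numbers, and breaking load ties by the lower worker index are assumptions, and the similarities are taken as precomputed inputs.

```python
def route(similarities, loads, tau=0.1):
    """Pick the least-loaded worker inside the locality band.

    similarities[k] : cosine similarity between the request and centroid k
    loads[k]        : in-flight decode requests on worker k
    Ties on load go to the lower worker index (a choice made for this sketch).
    """
    best = max(similarities)
    band = [k for k, s in enumerate(similarities) if s >= best - tau]
    return min(band, key=lambda k: (loads[k], k))


sims = [0.82, 0.78, 0.40, 0.75]
loads = [12, 5, 0, 3]

print(route(sims, loads, tau=0.1))  # 3: band is {0, 1, 3}, worker 3 is lightest
print(route(sims, loads, tau=0.0))  # 0: pure top-1 locality
print(route(sims, loads, tau=1.0))  # 2: every worker is in the band, so pure load
```

With the default band, the idle but dissimilar worker 2 is excluded, while the request still avoids the most similar worker because it is also the busiest.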

### Prefix Cache Coherence
ELDR stores expert signatures at **KV cache block granularity**: each block carries its tokens’ expert footprint. A request’s signature is the sum over its blocks:
\[
s_r = \sum_{b \in \mathcal{B}(r)} \text{sig}[b]
\]
This is exact whether blocks come from a cache hit (populated by an earlier request) or fresh computation. The signature cache is a preallocated GPU tensor, one int8 per (block, MoE layer, expert), sized at <1% of the KV cache. No additional allocator or eviction state is needed.
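
A small sketch shows why block-granular storage keeps the footprint exact under prefix hits. A Python dict stands in for the preallocated GPU tensor, a single MoE layer is modeled, and the class and method names (`SignatureCache`, `store_block`, `footprint`) are hypothetical. The sketch also assumes the summed per-block counts are what later gets weighted and normalized into the signature.

```python
class SignatureCache:
    """Per-block expert counts keyed by KV block id (single layer, toy layout)."""

    def __init__(self, num_experts):
        self.num_experts = num_experts
        self.blocks = {}

    def store_block(self, block_id, expert_ids):
        """Record which expert each token in a freshly prefilled block used."""
        counts = [0] * self.num_experts
        for e in expert_ids:
            counts[e] += 1
        self.blocks[block_id] = counts

    def footprint(self, block_ids):
        """Sum the stored counts over all blocks a request references."""
        total = [0] * self.num_experts
        for b in block_ids:
            for e, c in enumerate(self.blocks[b]):
                total[e] += c
        return total


cache = SignatureCache(num_experts=4)

# Request A prefills blocks 0 and 1.
cache.store_block(0, [1, 1, 3, 1])
cache.store_block(1, [2, 2])

# Request B hits block 0 in the prefix cache and only prefills block 2.
cache.store_block(2, [3, 0])
footprint_b = cache.footprint([0, 2])
print(footprint_b)  # [1, 3, 0, 2]
```

Request B never ran prefill over block 0, yet its footprint still includes that block's counts because they were stored alongside the KV block when request A computed it.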

## Empirical Validation / Results

### Setup
- **Testbed**: 5-node AMD MI300X cluster (8 GPUs per node, 192 GB HBM per GPU, 400 Gbps InfiniBand).
- **Models**: Qwen3-30B-A3B, GPT-OSS-120B, Gemma-4-26B-A4B (all TP=1). Also Qwen3-235B-A22B (TP=4, EP=4).
- **Workloads**: Task (11,668 prompts: code, math, medical, legal) and Language (14,000 WildChat prompts, skewed by language).
- **Topology**: 8P16D (8 prefill workers, 16 decode workers, 24 GPUs) unless noted.
- **Baselines**: Random, Round-Robin (RR), Join-Shortest-Queue (JSQ), Power-of-Two-Choices (P2C), and Domain (oracle label-based). All share the same prefill policy.

### Main Results

**Task workload (Fig. 11)**:
- ELDR reduces median TPOT by **7.0–13.9%** and tail TPOT by **3.4–6.0%** over the best load balancer.
- Outperforms Domain (which reduces median TPOT 6.8–9.7% over load balancers) by an additional 1.4–6.9% median and 1.6–4.5% tail.

**Language workload (Fig. 12)**:
- ELDR reduces median TPOT by **5.9–10.0%** over the best load balancer; tail TPOT drops by up to 9.6% at peak.
- Domain collapses: language labels are coarse proxies; ELDR’s finer signature clusters capture intra-language sub-structure.

### Overhead
- **Offline fit**: <10 seconds on CPU (greedy mask + balanced K-means on 1,000 calibration prompts).
- **Online per-request**: 0.86 ms, 1.2% of median TTFT (Table 1). Signature cache: 0.24% of HBM.

**Table 1: ELDR runtime overhead (Qwen3-30B-A3B, task, 8P16D, 60 req/s, median TTFT 69 ms)**

| Component | Locus | Time | % TTFT |
|-----------|-------|------|--------|
| record() hook | prefill host | 0.02 ms | <0.1 |
| reduce() scatter | prefill GPU | 0.48 ms | 0.7 |
| stage_sigs() D2H | prefill GPU | 0.21 ms | 0.3 |
| pop_sig fetch | scheduler | 7.0 μs | <0.1 |
| Route (τ-JSQ) | proxy | 0.15 ms | 0.2 |
| **Total** | | **0.86 ms** | **1.2** |

### Design Validation
- **Active-expert reduction**: ELDR reduces mean active experts per decode step by 22.0% vs. RR (Fig. 13).
- **Signature choice**: Count·IDF beats gate-probability signatures by 3 percentage points in average median-TPOT (P50) reduction (Fig. 14).
- **Balanced K-means**: Vanilla K-means regresses tail TPOT (+17.4% in the worst case); balanced K-means recovers both median and tail (Fig. 15).
- **Locality band width**: \(\tau=0.1\) removes tail regression seen at \(\tau=0\) while preserving median gains (Fig. 16).
- **Prefix cache composition**: ELDR’s TPOT advantage is preserved when prefix caching is on (Fig. 17).

### Generalization
- **Decoder pool size**: Median TPOT reduction grows monotonically from 8.0% (8P8D) to 10.2% (8P24D) (Table 2).
- **Large MoE with EP**: On Qwen3-235B-A22B (40 GPUs, EP=4), ELDR reduces median TPOT by 2.7–4.3% and tail TPOT by 0.6–2.0% (Fig. 18).

**Table 2: Topology generalization (Qwen3-30B-A3B, language, mean %Δ vs RR over 20–100 qps)**

| Metric | 8P8D | 8P16D | 8P24D |
|--------|------|-------|-------|
| TPOT P50 | −8.0% | −9.8% | −10.2% |
| TPOT P99 | −2.5% | −0.8% | −1.1% |

## Theoretical and Practical Implications

- **Theoretical**: Establishes expert locality as a first-order latency knob in MoE serving, decoupled from load. Shows that prefill activations provide a predictable signal for decode-time expert overlap, with strong rank correlation (mean ρ of 0.76 for the Count·IDF signature). Splitting the problem into locality (aggregate, offline) and load (instantaneous, online) is a principled way to handle the two conflicting objectives.
- **Practical**: ELDR is lossless: it changes only which worker serves a request, not token-level expert selection, so model outputs are identical to standard top-k gating. It composes with prefix caching, expert parallelism, and existing load balancers. Implementation in vLLM adds only 0.86 ms overhead per request. The offline fit is cheap (<10 s) and can adapt to workload shifts.
- **Impact**: For PD-disaggregated MoE serving, ELDR reduces TPOT (especially median) significantly, improving user-perceived responsiveness. The approach is orthogonal to intra-worker techniques (e.g., expert load balancing) and can be combined.

## Conclusion

Decode routing in PD-disaggregated MoE serving has traditionally optimized only load. ELDR introduces **expert locality** as a second routing axis. It uses a prefill-derived expert signature, partitions the signature space with Hungarian-balanced K-means, and routes via a locality band that balances locality against live load. A block-granular signature cache ensures coherence with prefix caching. On three MoE models and two workloads, ELDR reduces median TPOT by 5.9–13.9% over the best load-balancing baseline, with model outputs unchanged. The approach generalizes across decoder pool sizes and to large MoE models with expert parallelism.

**Future directions**: Adapting to workload drift via periodic re-clustering, extending to multi-node deployments with heterogeneous hardware, and exploring integration with proactive cache-aware prefill routing.

---

_Markdown view of https://picx.dev/p/vpvWrV, served by PicX — AI-generated visual whiteboard summaries of research papers._