IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Summary (Overview)
- Key Problem: The DeepSeek Sparse Attention (DSA) mechanism reduces core attention complexity from $O(L^2)$ to $O(Lk)$ via a lightweight "lightning indexer," but the indexer itself retains $O(L^2)$ complexity and must run at every layer, becoming a bottleneck for long-context inference.
- Core Insight: The top-$k$ token selections produced by the indexer are highly similar across consecutive layers (70-100% overlap), indicating significant redundancy in per-layer indexer computations.
- Main Solution: IndexCache partitions transformer layers into Full (F) layers (which compute fresh indices) and Shared (S) layers (which reuse indices from the nearest preceding F layer), dramatically reducing total indexer cost.
- Methodology: Two complementary approaches:
- Training-free IndexCache: A greedy search algorithm selects which layers to retain as F layers by minimizing language modeling loss on a calibration set, requiring no weight updates.
- Training-aware IndexCache: A multi-layer distillation loss trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple uniform sharing patterns to match full-indexer accuracy.
- Results: On a 30B DSA model, IndexCache removes 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup at 200K context length. Preliminary results on the 744B GLM-5 model confirm scalability.
Introduction and Theoretical Foundation
The self-attention mechanism is a cornerstone of LLMs, but its quadratic complexity ($O(L^2)$ in sequence length $L$) is a fundamental bottleneck for long-context applications. Sparse attention addresses this by selecting only the most relevant tokens per query. DeepSeek Sparse Attention (DSA) is a production-grade solution that introduces a lightweight lightning indexer at each layer. This indexer scores all preceding tokens and selects the top-$k$ ($k \ll L$) for the subsequent sparse core attention, reducing per-layer core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself still operates at $O(L^2)$ at every layer. Across $N$ layers, the total indexer cost is $O(N L^2)$, which becomes a non-negligible fraction of total latency, especially during prefill at long contexts.
Profiling a 30B DSA model reveals that the indexer’s share of total latency rises sharply with context length, particularly during the prefill stage, while the rest of the computations grow only modestly. This indicates that reducing indexer cost is the key to accelerating long-context DSA inference.
A key motivating insight is the cross-layer stability of token selection. Prior work has shown that adjacent layers in transformers share the vast majority of their top-$k$ attention mass. This paper empirically verifies that this stability extends to DSA's indexer output: adjacent layers share 70-100% of their selected tokens. This redundancy presents a simple opportunity: can we remove most indexers and let layers reuse top-$k$ indices from a small number of retained indexer layers without degrading quality?
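The overlap statistic above can be measured directly from indexer scores. The sketch below is illustrative rather than the paper's code: it uses synthetic correlated score vectors for two "adjacent layers" and computes the fraction of shared top-$k$ positions.

```python
import random

def topk_overlap(scores_a, scores_b, k):
    """Fraction of top-k token positions shared by two layers' score vectors."""
    top_a = set(sorted(range(len(scores_a)), key=scores_a.__getitem__)[-k:])
    top_b = set(sorted(range(len(scores_b)), key=scores_b.__getitem__)[-k:])
    return len(top_a & top_b) / k

rng = random.Random(0)
seq_len, k = 1024, 64
# Synthetic stand-in for real indexer scores: adjacent layers are modeled as
# a shared base signal plus small per-layer noise (an assumption of this demo).
base = [rng.gauss(0, 1) for _ in range(seq_len)]
layer_a = [b + 0.1 * rng.gauss(0, 1) for b in base]
layer_b = [b + 0.1 * rng.gauss(0, 1) for b in base]
print(f"adjacent-layer top-{k} overlap: {topk_overlap(layer_a, layer_b, k):.2f}")
```

With real models, `scores_a` and `scores_b` would be the lightning indexer's outputs at two consecutive layers for the same query position; high overlap is what licenses index reuse.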
Methodology
Notation: $N$ transformer layers, sequence length $L$, $k$ selected tokens per query ($k \ll L$). At layer $\ell$, the indexer produces a score vector over all preceding tokens for query position $t$, from which the top-$k$ index set $I_{\ell}(t)$ is extracted. $A_{\ell}$ is the aggregated attention distribution at layer $\ell$, and $P_{\ell}$ is the indexer's output distribution.
IndexCache Overview: Layers are partitioned into two roles encoded by a binary pattern string $\pi \in \{\mathrm{F}, \mathrm{S}\}^{N}$ with $\pi_1 = \mathrm{F}$:
- F (Full): Retains its indexer, computes fresh $I_{\ell}(t)$, and performs sparse core attention.
- S (Shared): Has no indexer. It inherits the index set from the nearest preceding F layer: $I_{\ell}(t) = I_{f(\ell)}(t)$, where $f(\ell) = \max\{\, j < \ell : \pi_j = \mathrm{F} \,\}$.
The first layer is always F. At inference, an S layer skips the indexer forward pass and reuses the cached index tensor. The only change to the inference loop is a single conditional branch.
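The layer loop described above can be sketched in a few lines. This is a toy, runnable illustration with stand-in data structures (per-layer score lists instead of real indexer/attention kernels), not the paper's implementation: F layers refresh the cached top-$k$ set, S layers reuse it, and the conditional is the only change versus plain DSA.

```python
def topk_indices(scores, k):
    """Top-k token positions by indexer score (toy stand-in for the indexer)."""
    return sorted(range(len(scores)), key=scores.__getitem__)[-k:]

def run_layers(token_scores_per_layer, pattern, k):
    assert pattern[0] == "F", "layer 1 always retains its indexer"
    cached, trace = None, []
    for scores, role in zip(token_scores_per_layer, pattern):
        if role == "F":                # fresh O(L^2) indexer pass
            cached = topk_indices(scores, k)
        trace.append(list(cached))     # S layers reuse the cached index set
        # ... sparse core attention over `cached` would run here ...
    return trace

# Three layers, pattern FSF: the middle layer inherits layer 1's indices.
scores = [[3, 1, 4, 1, 5], [2, 7, 1, 8, 2], [9, 0, 2, 6, 5]]
print(run_layers(scores, "FSF", k=2))  # → [[2, 4], [2, 4], [3, 0]]
```

Note that the S layer's trace entry matches the preceding F layer's even though its own scores would have chosen different tokens; the paper's overlap analysis is what makes this substitution safe in practice.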
Key Design Question: How should the pattern $\pi$ be chosen? Two approaches are proposed.
3.1 Training-Free IndexCache
Given a pretrained DSA model, find a pattern that maximizes the number of S layers while minimizing the impact on quality.
Why Uniform Interleaving Is Suboptimal: Naïve uniform interleaving (e.g., retaining every 4th layer's indexer) ignores that indexer importance varies significantly across layers. Certain layers (particularly early and transitional ones) are far more sensitive to removal.
Layer Selection Algorithm (Algorithm 1): A greedy search that incrementally converts F layers to S layers, using LM loss on a small calibration set as a proxy for downstream quality.
- Calibration set: a cached set of mini-batches drawn from the training data.
- Search procedure: Start from the all-F baseline. For $T$ steps (the target number of S layers), iterate over the currently F layers (excluding layer 1), tentatively flip each to S, evaluate the LM loss, and commit the flip yielding the lowest loss.
- Complexity: The full search performs $O(N \cdot T)$ calibration forward passes ($T$ commit steps, each evaluating up to $N$ candidate layers). Pipeline parallelism can accelerate the search by splitting layers into blocks.
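The greedy procedure can be sketched as follows. This is a hedged sketch of Algorithm 1, not the authors' code: `lm_loss` is a hypothetical callable that evaluates calibration LM loss for a candidate pattern (in practice, a forward pass of the model with that pattern applied).

```python
def greedy_pattern_search(num_layers, num_shared, lm_loss):
    """Greedily convert F layers to S, committing the lowest-loss flip each step."""
    pattern = ["F"] * num_layers           # start from the all-F baseline
    for _ in range(num_shared):            # one committed flip per step
        best_loss, best_idx = float("inf"), None
        for i in range(1, num_layers):     # layer 1 (index 0) always stays F
            if pattern[i] != "F":
                continue
            pattern[i] = "S"               # tentative flip
            loss = lm_loss(pattern)
            if loss < best_loss:
                best_loss, best_idx = loss, i
            pattern[i] = "F"               # undo the tentative flip
        pattern[best_idx] = "S"            # commit the best flip
    return "".join(pattern)

# Toy loss: each layer has a fixed "sensitivity" penalty when converted to S.
sensitivity = [10.0, 1.0, 5.0, 0.5, 2.0]
toy_loss = lambda p: sum(s for s, r in zip(sensitivity, p) if r == "S")
print(greedy_pattern_search(5, 2, toy_loss))  # → FSFSF
```

The toy loss makes the intended behavior visible: the search picks the least-sensitive layers (indices 3 and 1) to share, rather than a uniform stride, which mirrors the paper's finding that searched patterns beat uniform interleaving.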
Properties of the greedy solution:
- The searched pattern outperforms uniform interleaving at the same retention ratio.
- The LM validation loss curve reveals a clear separation between "easy" and "critical" layers.
- Results are stable across calibration sets; LM loss is a valid proxy for downstream tasks.
3.2 Training-Aware IndexCache with Multi-Layer Distillation
When training a DSA model from scratch or via continued pre-training, each retained indexer can be explicitly trained to serve multiple layers simultaneously.
From single-layer to multi-layer distillation: Standard DSA training distills each indexer at layer $\ell$ via KL divergence against its own layer's attention distribution $A_{\ell}$: $\mathcal{L}_{\ell} = \mathrm{KL}(A_{\ell} \,\|\, P_{\ell})$.
We generalize to a multi-layer objective. Let layer $\ell$ be a retained (F) layer, and let layers $\ell+1, \dots, \ell+m$ be the subsequent S layers that reuse its index set $I_{\ell}$. The multi-layer distillation loss is:

$$\mathcal{L}^{\mathrm{multi}}_{\ell} = \frac{1}{m+1} \sum_{j=0}^{m} \mathrm{KL}\!\left(A_{\ell+j} \,\middle\|\, P_{\ell}\right) \tag{1}$$
Gradient equivalence to distillation against the averaged distribution: Define the averaged target $\bar{A}_{\ell} = \frac{1}{m+1} \sum_{j=0}^{m} A_{\ell+j}$ and the corresponding single-target loss:

$$\mathcal{L}^{\mathrm{avg}}_{\ell} = \mathrm{KL}\!\left(\bar{A}_{\ell} \,\middle\|\, P_{\ell}\right) \tag{2}$$
Proposition 1: $\nabla_{\theta}\, \mathcal{L}^{\mathrm{multi}}_{\ell} = \nabla_{\theta}\, \mathcal{L}^{\mathrm{avg}}_{\ell}$, where $\theta$ denotes the indexer parameters.
Proof: Since $P_{\ell}$ is the only parameter-dependent term in each KL divergence, the entropy of the target vanishes under differentiation: $\nabla_{\theta}\, \mathrm{KL}(A \,\|\, P_{\ell}) = -\nabla_{\theta} \sum_{i} A(i) \log P_{\ell}(i)$. Applying this to Eq. 1:

$$\nabla_{\theta}\, \mathcal{L}^{\mathrm{multi}}_{\ell} = -\frac{1}{m+1} \sum_{j=0}^{m} \nabla_{\theta} \sum_{i} A_{\ell+j}(i) \log P_{\ell}(i) = -\nabla_{\theta} \sum_{i} \bar{A}_{\ell}(i) \log P_{\ell}(i) = \nabla_{\theta}\, \mathcal{L}^{\mathrm{avg}}_{\ell} \tag{3}$$
Interpretation: Multi-layer distillation is exactly equivalent, in gradient, to distilling the indexer toward the centroid of the target layers' attention distributions. The indexer learns to predict a consensus top-$k$ that jointly covers important tokens across all served layers. The averaged form $\mathcal{L}^{\mathrm{avg}}_{\ell}$ is adopted for implementation efficiency, since it needs only a single KL term per retained layer.
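Proposition 1 can be sanity-checked numerically: Eqs. 1 and 2 differ only by target-entropy terms that are constant in $P_{\ell}$, so the gap $\mathcal{L}^{\mathrm{multi}}_{\ell} - \mathcal{L}^{\mathrm{avg}}_{\ell}$ should be identical for any two choices of $P_{\ell}$ (hence equal gradients). A small pure-Python check with random synthetic distributions:

```python
import math, random

def kl(a, p):
    """KL(a || p) for two discrete distributions over the same support."""
    return sum(ai * math.log(ai / pi) for ai, pi in zip(a, p))

def softmax(z):
    m = max(z); e = [math.exp(x - m) for x in z]; s = sum(e)
    return [x / s for x in e]

rng = random.Random(0)
# Three synthetic target distributions A_{l+j} and their centroid Ā_l.
targets = [softmax([rng.gauss(0, 1) for _ in range(6)]) for _ in range(3)]
avg = [sum(col) / len(targets) for col in zip(*targets)]

def l_multi(p):  # Eq. 1: mean of per-layer KLs
    return sum(kl(a, p) for a in targets) / len(targets)

def l_avg(p):    # Eq. 2: single KL against the centroid
    return kl(avg, p)

# The gap is constant in p, so the two losses have identical gradients.
p1 = softmax([rng.gauss(0, 1) for _ in range(6)])
p2 = softmax([rng.gauss(0, 1) for _ in range(6)])
print(abs((l_multi(p1) - l_avg(p1)) - (l_multi(p2) - l_avg(p2))) < 1e-9)  # → True
```

By convexity of KL in its first argument, $\mathcal{L}^{\mathrm{multi}}_{\ell} \geq \mathcal{L}^{\mathrm{avg}}_{\ell}$ always; only the gradients coincide.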
Training: Follows the standard DSA two-stage procedure:
- Warm-up phase: Train the indexer in each F layer using $\mathcal{L}^{\mathrm{avg}}_{\ell}$, while keeping all other parameters fixed.
- Sparse training phase: Continue training the indexer using $\mathcal{L}^{\mathrm{avg}}_{\ell}$ (KL computed only over the selected top-$k$ tokens) and add the LM loss to train the remaining parameters.
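The sparse-phase objective restricts the KL to the selected top-$k$ support. The paper does not spell out how the restricted distributions are normalized, so the sketch below makes one plausible assumption: both target and prediction are renormalized over the top-$k$ indices before computing the KL.

```python
import math

def sparse_kl(target, pred, topk_idx):
    """KL over the top-k support only; renormalization is an assumption here."""
    t = [target[i] for i in topk_idx]
    p = [pred[i] for i in topk_idx]
    ts, ps = sum(t), sum(p)  # renormalize both distributions on the support
    return sum((ti / ts) * math.log((ti / ts) / (pi / ps))
               for ti, pi in zip(t, p))

# Toy 5-token distributions; the KL is evaluated only on tokens {0, 1, 2}.
target = [0.40, 0.30, 0.20, 0.05, 0.05]  # aggregated attention Ā_l
pred   = [0.35, 0.35, 0.10, 0.10, 0.10]  # indexer output P_l
print(f"sparse KL over top-3: {sparse_kl(target, pred, [0, 1, 2]):.4f}")
```

Restricting the loss to the top-$k$ support keeps the sparse phase consistent with inference, where only the selected tokens ever participate in core attention.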
Empirical Validation / Results
Setup: Experiments on a 30B DSA model derived from GLM-4.7-Flash (47 layers). Evaluated across five long-context benchmarks (MRCR v2, GraphWalks, LongBench v2, RULER, AA-LCR) and four general & reasoning benchmarks (AIME 2025, GPQA-Diamond, LiveCodeBench v6, IFBench).
4.2 End-to-End Inference Speedup
Measured on an NVIDIA H100 node with data-parallel (DP) attention enabled (DP size = 8) in SGLang. Compared the original DSA baseline against IndexCache at 1/2 and 1/4 indexer retention ratios.
Table 1: End-to-end inference performance of the 30B DSA model with IndexCache
| Context Length | 10K | 60K | 120K | 200K |
|---|---|---|---|---|
| Prefill time (s) ↓ | ||||
| DSA | 0.57 | 3.38 | 8.57 | 19.5 |
| + IndexCache (1/2) | 0.47 | 2.86 | 6.57 | 13.7 |
| + IndexCache (1/4) | 0.45 | 2.59 | 5.66 | 10.7 |
| Decode throughput, per request (tok/s) ↑ | ||||
| DSA | 73.5 | 67.0 | 63.0 | 58.0 |
| + IndexCache (1/2) | 84.5 | 80.0 | 77.0 | 73.0 |
| + IndexCache (1/4) | 91.0 | 89.5 | 88.0 | 86.0 |
| Decode throughput, full KV cache (tok/s) ↑ | ||||
| DSA | 2700 | 613 | 341 | 197 |
| + IndexCache (1/2) | 3070 | 750 | 431 | 253 |
| + IndexCache (1/4) | 3310 | 840 | 498 | 297 |
Key Speedup Results:
- Prefill: At 200K tokens, IndexCache (1/4) reduces latency from 19.5s to 10.7s (1.82× speedup). Even at 10K, achieves 1.27× speedup.
- Decode (per request): At 200K, DSA decode speed is 58 tok/s, IndexCache (1/4) achieves 86 tok/s (1.48× speedup).
- Decode (full throughput): At 200K, IndexCache (1/4) improves total throughput from 197 to 297 tok/s (1.51× increase).
Figure 3: Relative speedup of IndexCache over DSA baseline across three inference settings shows gains increase with context length.
4.3 Training-Free IndexCache Results
Table 2: Training-free IndexCache at 1/2, 1/4, and 1/8 indexer retention
| Config | Long Avg | G&R Avg | MRCR | GW | LB2 | RULER | LCR | AIME | GPQA | LCB | IFB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original DSA | 50.2 | 74.6 | 24.5 | 49.6 | 45.5 | 87.9 | 43.6 | 91.0 | 77.6 | 71.4 | 58.4 |
| 1/2 Unif. IndexCache | 47.4 | 74.3 | 22.0 | 46.6 | 46.0 | 83.6 | 38.6 | 92.2 | 76.4 | 69.7 | 59.0 |
| +Search pattern | 50.3 | 74.4 | 24.7 | 49.5 | 46.3 | 87.8 | 43.2 | 91.9 | 76.3 | 71.3 | 58.2 |
| 1/4 Unif. IndexCache | 43.0 | 73.8 | 17.7 | 37.2 | 43.1 | 79.2 | 37.8 | 91.3 | 75.7 | 69.4 | 58.9 |
| +Search pattern | 49.9 | 74.9 | 25.1 | 47.4 | 45.7 | 87.6 | 43.8 | 92.6 | 78.6 | 70.0 | 58.3 |
| 1/8 Unif. IndexCache | 35.3 | 70.0 | 12.9 | 33.1 | 37.7 | 68.8 | 24.0 | 89.1 | 74.1 | 58.7 | 58.0 |
| +Search pattern | 46.1 | 73.7 | 21.7 | 43.8 | 42.3 | 82.0 | 40.8 | 90.7 | 76.5 | 69.6 | 58.1 |
Key Findings:
- Searched patterns close the gap: Uniform interleaving at aggressive ratios incurs significant degradation (Long Avg drops by 2.8 and 7.2 points at 1/2 and 1/4 retention, respectively). The greedy-searched pattern largely eliminates this deficit, recovering Long Avg to 50.3 at 1/2 retention and 49.9 at 1/4 retention (comparable to the original DSA's 50.2).
- Preserved reasoning capabilities: Except for uniform interleaving at the 1/8 ratio, the General & Reasoning (G&R) Avg stays within 1 point of the baseline (73.7-74.9 vs. 74.6). The 1/4 searched pattern even improves on AIME 2025 (92.6 vs. 91.0) and GPQA-Diamond (78.6 vs. 77.6).
- Extreme sparsity limit: At 1/8 retention, the searched pattern still mitigates the loss, but degradation becomes non-negligible (Long Avg 46.1 vs. 50.2).
4.4 Training-Aware IndexCache Results
Table 3: Training-aware IndexCache at 1/2 and 1/4 indexer retention with uniform interleaving
| Config | Long Avg | G&R Avg | MRCR | GW | LB2 | RULER | LCR | AIME | GPQA | LCB | IFB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original DSA | 51.0 | 74.2 | 24.7 | 49.1 | 46.9 | 87.3 | 47.0 | 88.8 | 79.4 | 70.5 | 57.9 |
| 1/2 Unif. IndexCache | 51.6 | 74.5 | 23.8 | 50.2 | 47.2 | 87.0 | 49.8 | 89.3 | 76.7 | 72.2 | 59.9 |
| w/ searched pattern | 50.6 | 73.6 | 23.9 | 48.1 | 47.1 | 87.5 | 46.6 | 89.6 | 78.6 | 68.5 | 57.7 |
| w/o cross-layer loss | 49.8 | 74.5 | 24.6 | 48.3 | 45.0 | 87.1 | 44.0 | 88.8 | 79.4 | 71.7 | 58.0 |
| 1/4 Unif. IndexCache | 50.6 | 74.1 | 23.7 | 48.1 | 46.9 | 86.1 | 48.4 | 89.3 | 78.0 | 70.5 | 58.7 |
Key Findings:
- Matches DSA baseline: Uniform IndexCache at the 1/2 ratio achieves Long Avg 51.6 (surpassing the baseline's 51.0), with comparable G&R Avg (74.5 vs. 74.2). At 1/4 retention, both averages remain within 0.4 points of the baseline.
- Pattern sensitivity vanishes: Uniform interleaving at 1/2 retention performs on par with, or slightly above, the greedy-searched pattern (Long Avg 51.6 vs. 50.6), indicating that once the retained indexers are trained with the multi-layer loss, careful pattern search is no longer necessary.