IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Summary (Overview)
- Key Problem: The DeepSeek Sparse Attention (DSA) mechanism reduces core attention complexity from $O(L^2)$ to $O(Lk)$ via a lightweight "lightning indexer," but the indexer itself retains $O(L^2)$ complexity and must run at every layer, becoming a bottleneck for long-context inference.
- Core Insight: The top-$k$ token selections produced by the indexer are highly similar across consecutive layers (70-100% overlap), indicating significant redundancy in per-layer indexer computations.
- Main Solution: IndexCache partitions transformer layers into Full (F) layers (which compute fresh indices) and Shared (S) layers (which reuse indices from the nearest preceding F layer), dramatically reducing total indexer cost.
- Methodology: Two complementary approaches:
- Training-free IndexCache: A greedy search algorithm selects which layers to retain as F layers by minimizing language modeling loss on a calibration set, requiring no weight updates.
- Training-aware IndexCache: A multi-layer distillation loss trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple uniform sharing patterns to match full-indexer accuracy.
- Results: On a 30B DSA model, IndexCache removes 75% of indexer computations with negligible quality degradation, achieving up to 1.82× prefill speedup and 1.48× decode speedup at 200K context length. Preliminary results on the 744B GLM-5 model confirm scalability.
Introduction and Theoretical Foundation
The self-attention mechanism is a cornerstone of LLMs, but its quadratic complexity ($O(L^2)$ in sequence length $L$) is a fundamental bottleneck for long-context applications. Sparse attention addresses this by selecting only the most relevant tokens per query. DeepSeek Sparse Attention (DSA) is a production-grade solution that introduces a lightweight lightning indexer at each layer. This indexer scores all preceding tokens and selects the top-$k$ ($k \ll L$) for the subsequent sparse core attention, reducing per-layer core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself still operates at $O(L^2)$ at every layer. Across $N$ layers, the total indexer cost is $O(N L^2)$, which becomes a non-negligible fraction of total latency, especially during prefill at long contexts.
Profiling a 30B DSA model reveals that the indexer’s share of total latency rises sharply with context length, particularly during the prefill stage, while the rest of the computations grow only modestly. This indicates that reducing indexer cost is the key to accelerating long-context DSA inference.
A key motivating insight is the cross-layer stability of token selection. Prior work has shown that adjacent layers in transformers share the vast majority of their top-$k$ attention mass. This paper empirically verifies that this stability extends to DSA's indexer output: adjacent layers share 70-100% of their selected tokens. This redundancy presents a simple opportunity: can we remove most indexers and let layers reuse top-$k$ indices from a small number of retained indexer layers without degrading quality?
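The overlap statistic above can be measured directly from indexer scores. The sketch below is illustrative rather than the paper's code: it uses synthetic correlated score vectors for two "adjacent layers" and computes the fraction of shared top-$k$ positions.

```python
import random

def topk_overlap(scores_a, scores_b, k):
    """Fraction of top-k token positions shared by two layers' score vectors."""
    top_a = set(sorted(range(len(scores_a)), key=scores_a.__getitem__)[-k:])
    top_b = set(sorted(range(len(scores_b)), key=scores_b.__getitem__)[-k:])
    return len(top_a & top_b) / k

rng = random.Random(0)
seq_len, k = 1024, 64
# Synthetic stand-in for real indexer scores: adjacent layers are modeled as
# a shared base signal plus small per-layer noise (an assumption of this demo).
base = [rng.gauss(0, 1) for _ in range(seq_len)]
layer_a = [b + 0.1 * rng.gauss(0, 1) for b in base]
layer_b = [b + 0.1 * rng.gauss(0, 1) for b in base]
print(f"adjacent-layer top-{k} overlap: {topk_overlap(layer_a, layer_b, k):.2f}")
```

With real models, `scores_a` and `scores_b` would be the lightning indexer's outputs at two consecutive layers for the same query position; high overlap is what licenses index reuse.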
Methodology
Notation: $N$ transformer layers, sequence length $L$, $k$ selected tokens per query ($k \ll L$). At layer $\ell$, the indexer produces a score vector over all preceding tokens for query position $t$, from which the top-$k$ index set $I_{\ell}(t)$ is extracted. $A_{\ell}$ is the aggregated attention distribution at layer $\ell$, and $P_{\ell}$ is the indexer's output distribution.
IndexCache Overview: Layers are partitioned into two roles encoded by a binary pattern string $\pi \in \{\mathrm{F}, \mathrm{S}\}^{N}$ with $\pi_1 = \mathrm{F}$:
- F (Full): Retains its indexer, computes fresh $I_{\ell}(t)$, and performs sparse core attention.
- S (Shared): Has no indexer. It inherits the index set from the nearest preceding F layer: $I_{\ell}(t) = I_{f(\ell)}(t)$, where $f(\ell) = \max\{\, j < \ell : \pi_j = \mathrm{F} \,\}$.
The first layer is always F. At inference, an S layer skips the indexer forward pass and reuses the cached index tensor. The only change to the inference loop is a single conditional branch.
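The layer loop described above can be sketched in a few lines. This is a toy, runnable illustration with stand-in data structures (per-layer score lists instead of real indexer/attention kernels), not the paper's implementation: F layers refresh the cached top-$k$ set, S layers reuse it, and the conditional is the only change versus plain DSA.

```python
def topk_indices(scores, k):
    """Top-k token positions by indexer score (toy stand-in for the indexer)."""
    return sorted(range(len(scores)), key=scores.__getitem__)[-k:]

def run_layers(token_scores_per_layer, pattern, k):
    assert pattern[0] == "F", "layer 1 always retains its indexer"
    cached, trace = None, []
    for scores, role in zip(token_scores_per_layer, pattern):
        if role == "F":                # fresh O(L^2) indexer pass
            cached = topk_indices(scores, k)
        trace.append(list(cached))     # S layers reuse the cached index set
        # ... sparse core attention over `cached` would run here ...
    return trace

# Three layers, pattern FSF: the middle layer inherits layer 1's indices.
scores = [[3, 1, 4, 1, 5], [2, 7, 1, 8, 2], [9, 0, 2, 6, 5]]
print(run_layers(scores, "FSF", k=2))  # → [[2, 4], [2, 4], [3, 0]]
```

Note that the S layer's trace entry matches the preceding F layer's even though its own scores would have chosen different tokens; the paper's overlap analysis is what makes this substitution safe in practice.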
Key Design Question: How should the pattern $\pi$ be chosen? Two approaches are proposed.
3.1 Training-Free IndexCache
Given a pretrained DSA model, find a pattern that maximizes the number of S layers while minimizing the impact on quality.
Why Uniform Interleaving Is Suboptimal: Naïve uniform interleaving (e.g., retaining every 4th layer's indexer) ignores that indexer importance varies significantly across layers. Certain layers (particularly early and transitional ones) are far more sensitive to removal.
Layer Selection Algorithm (Algorithm 1): A greedy search that incrementally converts F layers to S layers, using LM loss on a small calibration set as a proxy for downstream quality.
- Calibration set: a cached set of mini-batches drawn from the training data.
- Search procedure: Start from the all-F baseline. For $T$ steps (the target number of S layers), iterate over the currently F layers (excluding layer 1), tentatively flip each to S, evaluate the LM loss, and commit the flip yielding the lowest loss.
- Complexity: The full search performs $O(N \cdot T)$ calibration forward passes ($T$ commit steps, each evaluating up to $N$ candidate layers). Pipeline parallelism can accelerate the search by splitting layers into blocks.
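The greedy procedure can be sketched as follows. This is a hedged sketch of Algorithm 1, not the authors' code: `lm_loss` is a hypothetical callable that evaluates calibration LM loss for a candidate pattern (in practice, a forward pass of the model with that pattern applied).

```python
def greedy_pattern_search(num_layers, num_shared, lm_loss):
    """Greedily convert F layers to S, committing the lowest-loss flip each step."""
    pattern = ["F"] * num_layers           # start from the all-F baseline
    for _ in range(num_shared):            # one committed flip per step
        best_loss, best_idx = float("inf"), None
        for i in range(1, num_layers):     # layer 1 (index 0) always stays F
            if pattern[i] != "F":
                continue
            pattern[i] = "S"               # tentative flip
            loss = lm_loss(pattern)
            if loss < best_loss:
                best_loss, best_idx = loss, i
            pattern[i] = "F"               # undo the tentative flip
        pattern[best_idx] = "S"            # commit the best flip
    return "".join(pattern)

# Toy loss: each layer has a fixed "sensitivity" penalty when converted to S.
sensitivity = [10.0, 1.0, 5.0, 0.5, 2.0]
toy_loss = lambda p: sum(s for s, r in zip(sensitivity, p) if r == "S")
print(greedy_pattern_search(5, 2, toy_loss))  # → FSFSF
```

The toy loss makes the intended behavior visible: the search picks the least-sensitive layers (indices 3 and 1) to share, rather than a uniform stride, which mirrors the paper's finding that searched patterns beat uniform interleaving.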
Properties of the greedy solution:
- The searched pattern outperforms uniform interleaving at the same retention ratio.
- The LM validation loss curve reveals a clear separation between "easy" and "critical" layers.
- Results are stable across calibration sets; LM loss is a valid proxy for downstream tasks.
3.2 Training-Aware IndexCache with Multi-Layer Distillation
When training a DSA model from scratch or via continued pre-training, each retained indexer can be explicitly trained to serve multiple layers simultaneously.
From single-layer to multi-layer distillation: Standard DSA training distills each indexer at layer $\ell$ via KL divergence against its own layer's attention distribution $A_{\ell}$: $\mathcal{L}_{\ell} = \mathrm{KL}(A_{\ell} \,\|\, P_{\ell})$.
We generalize to a multi-layer objective. Let layer $\ell$ be a retained (F) layer, and let layers $\ell+1, \dots, \ell+m$ be the subsequent S layers that reuse its index set $I_{\ell}$. The multi-layer distillation loss is:

$$\mathcal{L}^{\mathrm{multi}}_{\ell} = \frac{1}{m+1} \sum_{j=0}^{m} \mathrm{KL}\!\left(A_{\ell+j} \,\middle\|\, P_{\ell}\right) \tag{1}$$
Gradient equivalence to distillation against the averaged distribution: Define the averaged target $\bar{A}_{\ell} = \frac{1}{m+1} \sum_{j=0}^{m} A_{\ell+j}$ and the corresponding single-target loss:

$$\mathcal{L}^{\mathrm{avg}}_{\ell} = \mathrm{KL}\!\left(\bar{A}_{\ell} \,\middle\|\, P_{\ell}\right) \tag{2}$$
Proposition 1: $\nabla_{\theta}\, \mathcal{L}^{\mathrm{multi}}_{\ell} = \nabla_{\theta}\, \mathcal{L}^{\mathrm{avg}}_{\ell}$, where $\theta$ denotes the indexer parameters.
Proof: Since $P_{\ell}$ is the only parameter-dependent term in each KL divergence, the entropy of the target vanishes under differentiation: $\nabla_{\theta}\, \mathrm{KL}(A \,\|\, P_{\ell}) = -\nabla_{\theta} \sum_{i} A(i) \log P_{\ell}(i)$. Applying this to Eq. 1:

$$\nabla_{\theta}\, \mathcal{L}^{\mathrm{multi}}_{\ell} = -\frac{1}{m+1} \sum_{j=0}^{m} \nabla_{\theta} \sum_{i} A_{\ell+j}(i) \log P_{\ell}(i) = -\nabla_{\theta} \sum_{i} \bar{A}_{\ell}(i) \log P_{\ell}(i) = \nabla_{\theta}\, \mathcal{L}^{\mathrm{avg}}_{\ell} \tag{3}$$
Interpretation: Multi-layer distillation is exactly equivalent, in gradient, to distilling the indexer toward the centroid of the target layers' attention distributions. The indexer learns to predict a consensus top-$k$ that jointly covers important tokens across all served layers. The averaged form $\mathcal{L}^{\mathrm{avg}}_{\ell}$ is adopted for implementation efficiency, since it needs only a single KL term per retained layer.
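Proposition 1 can be sanity-checked numerically: Eqs. 1 and 2 differ only by target-entropy terms that are constant in $P_{\ell}$, so the gap $\mathcal{L}^{\mathrm{multi}}_{\ell} - \mathcal{L}^{\mathrm{avg}}_{\ell}$ should be identical for any two choices of $P_{\ell}$ (hence equal gradients). A small pure-Python check with random synthetic distributions:

```python
import math, random

def kl(a, p):
    """KL(a || p) for two discrete distributions over the same support."""
    return sum(ai * math.log(ai / pi) for ai, pi in zip(a, p))

def softmax(z):
    m = max(z); e = [math.exp(x - m) for x in z]; s = sum(e)
    return [x / s for x in e]

rng = random.Random(0)
# Three synthetic target distributions A_{l+j} and their centroid Ā_l.
targets = [softmax([rng.gauss(0, 1) for _ in range(6)]) for _ in range(3)]
avg = [sum(col) / len(targets) for col in zip(*targets)]

def l_multi(p):  # Eq. 1: mean of per-layer KLs
    return sum(kl(a, p) for a in targets) / len(targets)

def l_avg(p):    # Eq. 2: single KL against the centroid
    return kl(avg, p)

# The gap is constant in p, so the two losses have identical gradients.
p1 = softmax([rng.gauss(0, 1) for _ in range(6)])
p2 = softmax([rng.gauss(0, 1) for _ in range(6)])
print(abs((l_multi(p1) - l_avg(p1)) - (l_multi(p2) - l_avg(p2))) < 1e-9)  # → True
```

By convexity of KL in its first argument, $\mathcal{L}^{\mathrm{multi}}_{\ell} \geq \mathcal{L}^{\mathrm{avg}}_{\ell}$ always; only the gradients coincide.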
Training: Follows the standard DSA two-stage procedure:
- Warm-up phase: Train the indexer in each F layer using $\mathcal{L}^{\mathrm{avg}}_{\ell}$, while keeping all other parameters fixed.
- Sparse training phase: Continue training the indexer using $\mathcal{L}^{\mathrm{avg}}_{\ell}$ (KL computed only over the selected top-$k$ tokens) and add the LM loss to train the remaining parameters.
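The sparse-phase objective restricts the KL to the selected top-$k$ support. The paper does not spell out how the restricted distributions are normalized, so the sketch below makes one plausible assumption: both target and prediction are renormalized over the top-$k$ indices before computing the KL.

```python
import math

def sparse_kl(target, pred, topk_idx):
    """KL over the top-k support only; renormalization is an assumption here."""
    t = [target[i] for i in topk_idx]
    p = [pred[i] for i in topk_idx]
    ts, ps = sum(t), sum(p)  # renormalize both distributions on the support
    return sum((ti / ts) * math.log((ti / ts) / (pi / ps))
               for ti, pi in zip(t, p))

# Toy 5-token distributions; the KL is evaluated only on tokens {0, 1, 2}.
target = [0.40, 0.30, 0.20, 0.05, 0.05]  # aggregated attention Ā_l
pred   = [0.35, 0.35, 0.10, 0.10, 0.10]  # indexer output P_l
print(f"sparse KL over top-3: {sparse_kl(target, pred, [0, 1, 2]):.4f}")
```

Restricting the loss to the top-$k$ support keeps the sparse phase consistent with inference, where only the selected tokens ever participate in core attention.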
Empirical Validation / Results
Setup: Experiments on a 30B DSA model derived from GLM-4.7-Flash (47 layers). Evaluated across five long-context benchmarks (MRCR v2, GraphWalks, LongBench v2, RULER, AA-LCR) and four general & reasoning benchmarks (AIME 2025, GPQA-Diamond, LiveCodeBench v6, IFBench).
4.2 End-to-End Inference Speedup
Measured on an NVIDIA H100 node with data-parallel (DP) attention enabled (DP size = 8) in SGLang. Compared the original DSA baseline against IndexCache at 1/2 and 1/4 indexer retention ratios.
Table 1: End-to-end inference performance of the 30B DSA model with IndexCache
| Context Length | 10K | 60K | 120K | 200K |
|---|---|---|---|---|
| Prefill time (s) ↓ | ||||
| DSA | 0.57 | 3.38 | 8.57 | 19.5 |
| + IndexCache (1/2) | 0.47 | 2.86 | 6.57 | 13.7 |
| + IndexCache (1/4) | 0.45 | 2.59 | 5.66 | 10.7 |
| Decode throughput, per request (tok/s) ↑ | ||||
| DSA | 73.5 | 67.0 | 63.0 | 58.0 |
| + IndexCache (1/2) | 84.5 | 80.0 | 77.0 | 73.0 |
| + IndexCache (1/4) | 91.0 | 89.5 | 88.0 | 86.0 |
| Decode throughput, full KV cache (tok/s) ↑ | ||||
| DSA | 2700 | 613 | 341 | 197 |
| + IndexCache (1/2) | 3070 | 750 | 431 | 253 |
| + IndexCache (1/4) | 3310 | 840 | 498 | 297 |
Key Speedup Results:
- Prefill: At 200K tokens, IndexCache (1/4) reduces latency from 19.5s to 10.7s (1.82× speedup). Even at 10K, achieves 1.27× speedup.
- Decode (per request): At 200K, DSA decode speed is 58 tok/s, IndexCache (1/4) achieves 86 tok/s (1.48× speedup).
- Decode (full throughput): At 200K, IndexCache (1/4) improves total throughput from 197 to 297 tok/s (1.51× increase).
Figure 3: Relative speedup of IndexCache over DSA baseline across three inference settings shows gains increase with context length.
4.3 Training-Free IndexCache Results
Table 2: Training-free IndexCache at 1/2, 1/4, and 1/8 indexer retention
| Config | Long Avg | G&R Avg | MRCR | GW | LB2 | RULER | LCR | AIME | GPQA | LCB | IFB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original DSA | 50.2 | 74.6 | 24.5 | 49.6 | 45.5 | 87.9 | 43.6 | 91.0 | 77.6 | 71.4 | 58.4 |
| 1/2 Unif. IndexCache | 47.4 | 74.3 | 22.0 | 46.6 | 46.0 | 83.6 | 38.6 | 92.2 | 76.4 | 69.7 | 59.0 |
| +Search pattern | 50.3 | 74.4 | 24.7 | 49.5 | 46.3 | 87.8 | 43.2 | 91.9 | 76.3 | 71.3 | 58.2 |
| 1/4 Unif. IndexCache | 43.0 | 73.8 | 17.7 | 37.2 | 43.1 | 79.2 | 37.8 | 91.3 | 75.7 | 69.4 | 58.9 |
| +Search pattern | 49.9 | 74.9 | 25.1 | 47.4 | 45.7 | 87.6 | 43.8 | 92.6 | 78.6 | 70.0 | 58.3 |
| 1/8 Unif. IndexCache | 35.3 | 70.0 | 12.9 | 33.1 | 37.7 | 68.8 | 24.0 | 89.1 | 74.1 | 58.7 | 58.0 |
| +Search pattern | 46.1 | 73.7 | 21.7 | 43.8 | 42.3 | 82.0 | 40.8 | 90.7 | 76.5 | 69.6 | 58.1 |
Key Findings:
- Searched patterns close the gap: Uniform interleaving at aggressive ratios incurs significant degradation (Long Avg drops by 2.8 and 7.2 points at 1/2 and 1/4 retention, respectively). The greedy-searched pattern largely eliminates this deficit, recovering Long Avg to 50.3 at 1/2 retention and 49.9 at 1/4 retention (comparable to the original DSA's 50.2).
- Preserved reasoning capabilities: Except for uniform interleaving at the 1/8 ratio, the General & Reasoning (G&R) Avg stays within 1 point of the baseline (73.7-74.9 vs. 74.6). The 1/4 searched pattern even improves on AIME 2025 (92.6 vs. 91.0) and GPQA-Diamond (78.6 vs. 77.6).
- Extreme sparsity limit: At 1/8 retention, the searched pattern still mitigates the loss, but degradation becomes non-negligible (Long Avg 46.1 vs. 50.2).
4.4 Training-Aware IndexCache Results
Table 3: Training-aware IndexCache at 1/2 and 1/4 indexer retention with uniform interleaving
| Config | Long Avg | G&R Avg | MRCR | GW | LB2 | RULER | LCR | AIME | GPQA | LCB | IFB |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Original DSA | 51.0 | 74.2 | 24.7 | 49.1 | 46.9 | 87.3 | 47.0 | 88.8 | 79.4 | 70.5 | 57.9 |
| 1/2 Unif. IndexCache | 51.6 | 74.5 | 23.8 | 50.2 | 47.2 | 87.0 | 49.8 | 89.3 | 76.7 | 72.2 | 59.9 |
| w/ searched pattern | 50.6 | 73.6 | 23.9 | 48.1 | 47.1 | 87.5 | 46.6 | 89.6 | 78.6 | 68.5 | 57.7 |
| w/o cross-layer loss | 49.8 | 74.5 | 24.6 | 48.3 | 45.0 | 87.1 | 44.0 | 88.8 | 79.4 | 71.7 | 58.0 |
| 1/4 Unif. IndexCache | 50.6 | 74.1 | 23.7 | 48.1 | 46.9 | 86.1 | 48.4 | 89.3 | 78.0 | 70.5 | 58.7 |
Key Findings:
- Matches DSA baseline: Uniform IndexCache at the 1/2 ratio achieves Long Avg 51.6 (surpassing the baseline's 51.0), with comparable G&R Avg (74.5 vs. 74.2). At 1/4 retention, both averages remain within 0.4 points of the baseline.
- Pattern sensitivity vanishes: Uniform interleaving at 1/2 retention performs on par with, or slightly above, the greedy-searched pattern (Long Avg 51.6 vs. 50.6), indicating that once the retained indexers are trained with the multi-layer loss, careful pattern search is no longer necessary.