Summary (Overview)

  • New objective for KV cache compression: Proposes "threshold-free" pruning that does not rely on a pre-determined, input-specific budget threshold. Instead, the method automatically adapts the budget to maintain full-cache performance across diverse inputs.
  • ReFreeKV architecture: A two-stage, parallelizable process combining a position-based initial ranking (exploiting attention sink and recency bias) followed by a sequential eviction step controlled by a universal metric (Uni‑Metric) based on the Frobenius norm of the attention matrix.
  • Universal stopping threshold: Empirically identifies a threshold of (T = 1%) that works consistently across models and tasks, enabling near-lossless compression without manual tuning.
  • Strong empirical validation: Evaluated on 13 datasets covering math, reasoning, QA, summarization, and code, using Llama3, Mistral, and Qwen2.5 models (7B–72B). ReFreeKV matches or surpasses full-cache performance while automatically reducing KV cache budget (e.g., average budget 63.68% for Llama3-8B, 86.75% for Mistral-7B, 76.02% for Qwen2.5-7B).
  • Efficiency: Pruning latency is comparable to prior methods (H2O, SnapKV, etc.) and often improves overall generation time due to higher compression on simple tasks.

Introduction and Theoretical Foundation

Transformer-based LLMs store key-value (KV) vectors of all previous tokens during autoregressive decoding. The memory footprint of the KV cache grows linearly with sequence length, becoming a bottleneck for long-context inference. Many recent methods prune the KV cache by discarding less-important tokens, exploiting sparsity in attention. However, almost all prior methods depend on an input-specific budget threshold (e.g., retain 20% of cache, or keep 1024 tokens) that must be tuned per dataset to avoid performance loss.

Motivation: The optimal threshold varies drastically across domains and difficulty levels. For example, a 20% budget may work well on LongBench but fails on math reasoning (GSM8K). In real-world applications, input types are intermixed and unpredictable, making it infeasible to pre-select a single threshold. The paper demonstrates this problem in Table 1:

DatasetBudget = 50% (H2O)Budget = 20% (H2O)
GSM8K41.29%3.23%
GPQA9.99%8.48%
CoQA99.30%96.53%

(Performance relative to full cache, Llama3-8B. H2O's performance collapses on GSM8K at 20% budget.)

New objective: The paper advocates for threshold-free methods that satisfy two principles:

  1. Operate with a universal threshold that is insensitive to input variations.
  2. Consistently achieve performance comparable to full-cache by dynamically adjusting the KV budget.

Only after satisfying these criteria should one pursue aggressive compression.


Methodology

3.1 Threshold-Free Objective

Formally, the paper proposes to lift the dependence on input-specific thresholds. Instead of fixing a budget (B), the method should automatically determine the retention depth (i_{\text{prune}}) such that performance degradation is bounded, independent of the input domain.

3.2 ReFreeKV – Two-Stage Process

Stage 1: Initial Ranking by Position

After prefilling the input sequence (X = {x_1, x_2, \dots, x_n}), ReFreeKV ranks the token positions as: [ \tilde{X} = {x_1, x_2, \dots, x_m, x_n, x_{n-1}, \dots, x_{m+1}} ] where the first (m) tokens (attention sinks, usually (m=4)) are prioritized, followed by the most recent tokens in reverse order. This exploits known positional biases (early tokens and recent tokens receive higher attention).

Stage 2: Eviction by Uni-Metric

ReFreeKV sequentially retains KV vectors from the ranked sequence. For each candidate position (i), it computes a reduced attention matrix (\tilde{A}i) where scores to all positions beyond (i) are masked out. The Uni-Metric is based on the Frobenius norm of the attention matrix (A \in \mathbb{R}^{n \times n}): [ |A|F = \sqrt{\sum{i=1}^n \sum{j=1}^n |A_{i,j}|^2} ]

The stopping criterion finds the leftmost position (i_{\text{prune}}) where the norm difference exceeds a universal threshold (T): [ i_{\text{prune}} = \operatorname{argmin}_{j=1}^n \left( 1 - \frac{|\tilde{A}_j|_F}{|A|_F} < T \right) ] Empirically, (T = 1%) is chosen (Figure 2 shows this balances performance and compression across models).

Computational Optimization: To avoid (O(n^2)) cost, the attention matrix is approximated by averaging only the last (k) rows (practical choice (k=1)): [ A'[i] = \frac{\sum_{j=k}^n A_{i,j}}{\sum_{j=k}^n \mathbf{1}{{A{i,j} \neq 0}}} ] Then the cumulative Frobenius norm for position (i) in the ranked order is: [ |\tilde{A}'i|F = \sqrt{\sum{t=1}^i \left(A'{\text{rank}}[t]\right)^2} ]

The pruning position is found using parallel torch.cumsum and torch.where operators.

Additional detail: The first two layers (bottom layers) are always fully retained, as their attention distributions are more uniform and important for semantic understanding.


Empirical Validation / Results

4.2 Main Results (Table 2)

ReFreeKV is compared against H2O, StreamingLLM (SLM), SnapKV, PyramidKV, and CAKE on 13 datasets. Key observations:

  • ReFreeKV achieves near-lossless performance (and even surpasses full-cache for Llama3-8B by +0.12% and Qwen2.5-7B by +2.63%) with no input-specific threshold tuning. The automatic budget varies per dataset (e.g., 93.2% on GSM8K, 15.0% on QMSum).
  • Fixed-budget baselines: At 90% budget, some methods approach full-cache performance, but at 50% or 20% budgets, performance collapses (e.g., H2O at 20% budget drops to 2.43% on GSM8K, 2.46% on GPQA).
  • The dynamic budget of ReFreeKV naturally reflects task difficulty: complex math/science tasks require >90% budget, while simple QA/summarization tasks can use as little as 15–40% budget.

Comparison with Twilight (Table 3)

Twilight uses a top-(p) inspired adaptive pruning but still requires model-specific tuning of (p). ReFreeKV achieves comparable or better results without such tuning.

ModelMethodGSM8KNQAQasper2WkMQAMusique
Llama3-8BOurs76.5023.4437.3836.2115.96
Llama3-8BTwi (p=0.95)76.0423.3743.0836.1814.92
Mistral-7BOurs31.5427.2038.9333.4622.15
Mistral-7BTwi (p=0.85)30.3327.1738.9033.2322.92

Efficiency Analysis (Table 4)

End-to-end latency and pruning time for Llama3-8B and Llama3-70B across six datasets. ReFreeKV exhibits pruning latency comparable to baselines and achieves the best overall generation time in 8 out of 12 comparisons, due to its adaptive compression.

ModelMethodGSM8K OverallPruneAvg OverallAvg Prune
Llama3-8BOurs4.6380.0343.3000.173
Llama3-8BH2O (0.5)4.6860.0203.3180.176
Llama3-70BOurs15.9750.1548.2901.623
Llama3-70BSnapKV (0.5)16.5170.1178.4871.529

(All times in seconds, lower is better.)

Ablation Studies (Table 5, Figure 3)

  • Effect of (k) (number of rows averaged): Using (k=1) yields the best trade-off between compression and performance. Larger (k) (1%, 5%, 10% of (n)) either reduces compression or degrades performance (e.g., on NarrativeQA, 10%(n) gives 9.74 vs. 23.44 with (k=1)).
  • Alternative ranking by attention scores: Replacing position-based ranking with attention-score ranking (e.g., H2O-style) fails to preserve performance, even with very low thresholds (0.01%).
  • Universal threshold sensitivity (Figure 3): The 1% threshold provides a robust balance. At 0.1%, budget increases but performance does not improve; at 10%, performance drops substantially (e.g., GSM8K retention falls to ~60%).

Generalization to Larger Models (Table 6)

ReFreeKV applied to Llama3-70B, Qwen2.5-32B, and Qwen2.5-72B (same configuration as Table 2) consistently achieves near full-cache performance with average budgets around 50–90%, demonstrating cross-model generalization.


Theoretical and Practical Implications

  • Practical significance: The threshold-free design removes the need for dataset-specific hyperparameter tuning, making KV cache compression viable for open-domain, real-world LLM applications where input characteristics are unknown in advance.
  • Dynamic budget allocation: By naturally allocating more cache to harder tasks and less to easier ones, ReFreeKV efficiently uses memory without sacrificing accuracy.
  • Efficiency: The two-stage process is fully parallelizable (using cumsum and where), introducing negligible overhead (comparable to prior methods) while often improving overall throughput due to higher compression on simple tasks.
  • Potential for broader use: The Uni-Metric (Frobenius norm of attention) provides a principled, input-agnostic signal for pruning quality, which could inspire future work beyond KV cache (e.g., adaptive sparse attention).

Conclusion

ReFreeKV introduces a new objective for KV cache compression: threshold-free pruning that maintains full-cache performance without input-specific budget tuning. The method combines a simple position-based initial ranking with a universal metric based on the attention matrix Frobenius norm and a 1% stopping threshold. Extensive experiments across 13 datasets and multiple LLM families (Llama3, Mistral, Qwen2.5, up to 72B) demonstrate that ReFreeKV consistently achieves near-lossless compression while automatically adjusting the KV budget per input.

Limitations:

  • The achieved compression ratio may still be suboptimal (e.g., on QMSum with Mistral-7B, 84.3% budget is retained while 50% would suffice).
  • No formal guarantee on performance degradation; a small degradation (1.5%) was observed on Mistral-7B.

Future directions: Developing principled approaches to close the gap to the true optimal budget and provide formal performance guarantees.

Related papers