Visual Summary | ReFreeKV: Towards Threshold-Free KV Cache Compression

Summary (Overview)

New objective for KV cache compression: Proposes "threshold-free" pruning that does not rely on a pre-determined, input-specific budget threshold. Instead, the method automatically adapts the budget to maintain full-cache performance across diverse inputs.
ReFreeKV architecture: A two-stage, parallelizable process combining a position-based initial ranking (exploiting attention sink and recency bias) followed by a sequential eviction step controlled by a universal metric (Uni‑Metric) based on the Frobenius norm of the attention matrix.
Universal stopping threshold: Empirically identifies a threshold of (T = 1%) that works consistently across models and tasks, enabling near-lossless compression without manual tuning.
Strong empirical validation: Evaluated on 13 datasets covering math, reasoning, QA, summarization, and code, using Llama3, Mistral, and Qwen2.5 models (7B–72B). ReFreeKV matches or surpasses full-cache performance while automatically reducing KV cache budget (e.g., average budget 63.68% for Llama3-8B, 86.75% for Mistral-7B, 76.02% for Qwen2.5-7B).
Efficiency: Pruning latency is comparable to prior methods (H2O, SnapKV, etc.) and often improves overall generation time due to higher compression on simple tasks.

Introduction and Theoretical Foundation

Transformer-based LLMs store key-value (KV) vectors of all previous tokens during autoregressive decoding. The memory footprint of the KV cache grows linearly with sequence length, becoming a bottleneck for long-context inference. Many recent methods prune the KV cache by discarding less-important tokens, exploiting sparsity in attention. However, almost all prior methods depend on an input-specific budget threshold (e.g., retain 20% of cache, or keep 1024 tokens) that must be tuned per dataset to avoid performance loss.

Motivation: The optimal threshold varies drastically across domains and difficulty levels. For example, a 20% budget may work well on LongBench but fails on math reasoning (GSM8K). In real-world applications, input types are intermixed and unpredictable, making it infeasible to pre-select a single threshold. The paper demonstrates this problem in Table 1:

Dataset	Budget = 50% (H2O)	Budget = 20% (H2O)
GSM8K	41.29%	3.23%
GPQA	9.99%	8.48%
CoQA	99.30%	96.53%

(Performance relative to full cache, Llama3-8B. H2O's performance collapses on GSM8K at 20% budget.)

New objective: The paper advocates for threshold-free methods that satisfy two principles:

Operate with a universal threshold that is insensitive to input variations.
Consistently achieve performance comparable to full-cache by dynamically adjusting the KV budget.

Only after satisfying these criteria should one pursue aggressive compression.

Methodology

3.1 Threshold-Free Objective

Formally, the paper proposes to lift the dependence on input-specific thresholds. Instead of fixing a budget (B), the method should automatically determine the retention depth (i_{\text{prune}}) such that performance degradation is bounded, independent of the input domain.

3.2 ReFreeKV – Two-Stage Process

Stage 1: Initial Ranking by Position

After prefilling the input sequence (X = {x_1, x_2, \dots, x_n}), ReFreeKV ranks the token positions as: [ \tilde{X} = {x_1, x_2, \dots, x_m, x_n, x_{n-1}, \dots, x_{m+1}} ] where the first (m) tokens (attention sinks, usually (m=4)) are prioritized, followed by the most recent tokens in reverse order. This exploits known positional biases (early tokens and recent tokens receive higher attention).

Stage 2: Eviction by Uni-Metric

ReFreeKV sequentially retains KV vectors from the ranked sequence. For each candidate position (i), it computes a reduced attention matrix (\tilde{A}i) where scores to all positions beyond (i) are masked out. The Uni-Metric is based on the Frobenius norm of the attention matrix (A \in \mathbb{R}^{n \times n}): [ |A|F = \sqrt{\sum{i=1}^n \sum{j=1}^n |A_{i,j}|^2} ]

The stopping criterion finds the leftmost position (i_{\text{prune}}) where the norm difference exceeds a universal threshold (T): [ i_{\text{prune}} = \operatorname{argmin}_{j=1}^n \left( 1 - \frac{|\tilde{A}_j|_F}{|A|_F} < T \right) ] Empirically, (T = 1%) is chosen (Figure 2 shows this balances performance and compression across models).

Computational Optimization: To avoid (O(n^2)) cost, the attention matrix is approximated by averaging only the last (k) rows (practical choice (k=1)): [ A'[i] = \frac{\sum_{j=k}^n A_{i,j}}{\sum_{j=k}^n \mathbf{1}{{A{i,j} \neq 0}}} ] Then the cumulative Frobenius norm for position (i) in the ranked order is: [ |\tilde{A}'i|F = \sqrt{\sum{t=1}^i \left(A'{\text{rank}}[t]\right)^2} ]

The pruning position is found using parallel torch.cumsum and torch.where operators.

Additional detail: The first two layers (bottom layers) are always fully retained, as their attention distributions are more uniform and important for semantic understanding.

Empirical Validation / Results

4.2 Main Results (Table 2)

ReFreeKV is compared against H2O, StreamingLLM (SLM), SnapKV, PyramidKV, and CAKE on 13 datasets. Key observations:

ReFreeKV achieves near-lossless performance (and even surpasses full-cache for Llama3-8B by +0.12% and Qwen2.5-7B by +2.63%) with no input-specific threshold tuning. The automatic budget varies per dataset (e.g., 93.2% on GSM8K, 15.0% on QMSum).
Fixed-budget baselines: At 90% budget, some methods approach full-cache performance, but at 50% or 20% budgets, performance collapses (e.g., H2O at 20% budget drops to 2.43% on GSM8K, 2.46% on GPQA).
The dynamic budget of ReFreeKV naturally reflects task difficulty: complex math/science tasks require >90% budget, while simple QA/summarization tasks can use as little as 15–40% budget.

Comparison with Twilight (Table 3)

Twilight uses a top-(p) inspired adaptive pruning but still requires model-specific tuning of (p). ReFreeKV achieves comparable or better results without such tuning.

Model	Method	GSM8K	NQA	Qasper	2WkMQA	Musique
Llama3-8B	Ours	76.50	23.44	37.38	36.21	15.96
Llama3-8B	Twi (p=0.95)	76.04	23.37	43.08	36.18	14.92
Mistral-7B	Ours	31.54	27.20	38.93	33.46	22.15
Mistral-7B	Twi (p=0.85)	30.33	27.17	38.90	33.23	22.92

Efficiency Analysis (Table 4)

End-to-end latency and pruning time for Llama3-8B and Llama3-70B across six datasets. ReFreeKV exhibits pruning latency comparable to baselines and achieves the best overall generation time in 8 out of 12 comparisons, due to its adaptive compression.

Model	Method	GSM8K Overall	Prune	Avg Overall	Avg Prune
Llama3-8B	Ours	4.638	0.034	3.300	0.173
Llama3-8B	H2O (0.5)	4.686	0.020	3.318	0.176
Llama3-70B	Ours	15.975	0.154	8.290	1.623
Llama3-70B	SnapKV (0.5)	16.517	0.117	8.487	1.529

(All times in seconds, lower is better.)

Ablation Studies (Table 5, Figure 3)

Effect of (k) (number of rows averaged): Using (k=1) yields the best trade-off between compression and performance. Larger (k) (1%, 5%, 10% of (n)) either reduces compression or degrades performance (e.g., on NarrativeQA, 10%(n) gives 9.74 vs. 23.44 with (k=1)).
Alternative ranking by attention scores: Replacing position-based ranking with attention-score ranking (e.g., H2O-style) fails to preserve performance, even with very low thresholds (0.01%).
Universal threshold sensitivity (Figure 3): The 1% threshold provides a robust balance. At 0.1%, budget increases but performance does not improve; at 10%, performance drops substantially (e.g., GSM8K retention falls to ~60%).

Generalization to Larger Models (Table 6)

ReFreeKV applied to Llama3-70B, Qwen2.5-32B, and Qwen2.5-72B (same configuration as Table 2) consistently achieves near full-cache performance with average budgets around 50–90%, demonstrating cross-model generalization.

Theoretical and Practical Implications

Practical significance: The threshold-free design removes the need for dataset-specific hyperparameter tuning, making KV cache compression viable for open-domain, real-world LLM applications where input characteristics are unknown in advance.
Dynamic budget allocation: By naturally allocating more cache to harder tasks and less to easier ones, ReFreeKV efficiently uses memory without sacrificing accuracy.
Efficiency: The two-stage process is fully parallelizable (using cumsum and where), introducing negligible overhead (comparable to prior methods) while often improving overall throughput due to higher compression on simple tasks.
Potential for broader use: The Uni-Metric (Frobenius norm of attention) provides a principled, input-agnostic signal for pruning quality, which could inspire future work beyond KV cache (e.g., adaptive sparse attention).

Conclusion

ReFreeKV introduces a new objective for KV cache compression: threshold-free pruning that maintains full-cache performance without input-specific budget tuning. The method combines a simple position-based initial ranking with a universal metric based on the attention matrix Frobenius norm and a 1% stopping threshold. Extensive experiments across 13 datasets and multiple LLM families (Llama3, Mistral, Qwen2.5, up to 72B) demonstrate that ReFreeKV consistently achieves near-lossless compression while automatically adjusting the KV budget per input.

Limitations:

The achieved compression ratio may still be suboptimal (e.g., on QMSum with Mistral-7B, 84.3% budget is retained while 50% would suffice).
No formal guarantee on performance degradation; a small degradation (1.5%) was observed on Mistral-7B.

Future directions: Developing principled approaches to close the gap to the true optimal budget and provide formal performance guarantees.