# ReFreeKV: Towards Threshold-Free KV Cache Compression

> ReFreeKV achieves near-lossless KV cache compression across diverse inputs without pre-defined budgets, using a universal 1% Frobenius norm threshold.

- **Source:** [arXiv](https://arxiv.org/abs/2502.16886)
- **Published:** 2026-07-01
- **Permalink:** https://picx.dev/p/u1MqWG
- **Whiteboard:** https://picx.dev/p/u1MqWG/image

## Summary

## Summary (Overview)

- **New objective for KV cache compression**: Proposes "threshold-free" pruning that does **not** rely on a pre-determined, input-specific budget threshold. Instead, the method automatically adapts the budget to maintain full-cache performance across diverse inputs.
- **ReFreeKV architecture**: A two-stage, parallelizable process combining a position-based initial ranking (exploiting attention sink and recency bias) followed by a sequential eviction step controlled by a universal metric (Uni‑Metric) based on the Frobenius norm of the attention matrix.
- **Universal stopping threshold**: Empirically identifies a threshold of \(T = 1\%\) that works consistently across models and tasks, enabling near-lossless compression without manual tuning.
- **Strong empirical validation**: Evaluated on 13 datasets covering math, reasoning, QA, summarization, and code, using Llama3, Mistral, and Qwen2.5 models (7B–72B). ReFreeKV matches or surpasses full-cache performance while automatically reducing KV cache budget (e.g., average budget 63.68% for Llama3-8B, 86.75% for Mistral-7B, 76.02% for Qwen2.5-7B).
- **Efficiency**: Pruning latency is comparable to prior methods (H2O, SnapKV, etc.) and often improves overall generation time due to higher compression on simple tasks.

---

## Introduction and Theoretical Foundation

Transformer-based LLMs store key-value (KV) vectors of all previous tokens during autoregressive decoding. The memory footprint of the KV cache grows linearly with sequence length, becoming a bottleneck for long-context inference. Many recent methods prune the KV cache by discarding less-important tokens, exploiting sparsity in attention. However, **almost all prior methods depend on an input-specific budget threshold** (e.g., retain 20% of cache, or keep 1024 tokens) that must be tuned per dataset to avoid performance loss.

**Motivation**: The optimal threshold varies drastically across domains and difficulty levels. For example, a 20% budget may work well on LongBench but fails on math reasoning (GSM8K). In real-world applications, input types are intermixed and unpredictable, making it infeasible to pre-select a single threshold. The paper demonstrates this problem in Table 1:

| Dataset | Budget = 50% (H2O) | Budget = 20% (H2O) |
|---------|---------------------|---------------------|
| GSM8K   | 41.29%              | 3.23%              |
| GPQA    | 9.99%               | 8.48%              |
| CoQA    | 99.30%              | 96.53%             |

*(Performance relative to full cache, Llama3-8B. H2O's performance collapses on GSM8K at 20% budget.)*

**New objective**: The paper advocates for *threshold-free* methods that satisfy two principles:
1. Operate with a **universal threshold** that is insensitive to input variations.
2. Consistently achieve **performance comparable to full-cache** by dynamically adjusting the KV budget.

Only after satisfying these criteria should one pursue aggressive compression.

---

## Methodology

### 3.1 Threshold-Free Objective
Formally, the paper proposes to lift the dependence on input-specific thresholds. Instead of fixing a budget \(B\), the method should automatically determine the retention depth \(i_{\text{prune}}\) such that performance degradation is bounded, independent of the input domain.

### 3.2 ReFreeKV – Two-Stage Process

#### Stage 1: Initial Ranking by Position
After prefilling the input sequence \(X = \{x_1, x_2, \dots, x_n\}\), ReFreeKV ranks the token positions as:
\[
\tilde{X} = \{x_1, x_2, \dots, x_m, x_n, x_{n-1}, \dots, x_{m+1}\}
\]
where the first \(m\) tokens (attention sinks, usually \(m=4\)) are prioritized, followed by the most recent tokens in reverse order. This exploits known positional biases (early tokens and recent tokens receive higher attention).

#### Stage 2: Eviction by Uni-Metric
ReFreeKV sequentially retains KV vectors from the ranked sequence. For each candidate position \(i\), it computes a **reduced attention matrix** \(\tilde{A}_i\) where scores to all positions beyond \(i\) are masked out. The **Uni-Metric** is based on the **Frobenius norm** of the attention matrix \(A \in \mathbb{R}^{n \times n}\):
\[
\|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^n |A_{i,j}|^2}
\]

The stopping criterion finds the leftmost position \(i_{\text{prune}}\) where the norm difference exceeds a universal threshold \(T\):
\[
i_{\text{prune}} = \operatorname{argmin}_{j=1}^n \left( 1 - \frac{\|\tilde{A}_j\|_F}{\|A\|_F} < T \right)
\]
Empirically, \(T = 1\%\) is chosen (Figure 2 shows this balances performance and compression across models).

**Computational Optimization**: To avoid \(O(n^2)\) cost, the attention matrix is approximated by averaging **only the last \(k\) rows** (practical choice \(k=1\)):
\[
A'[i] = \frac{\sum_{j=k}^n A_{i,j}}{\sum_{j=k}^n \mathbf{1}_{\{A_{i,j} \neq 0\}}}
\]
Then the cumulative Frobenius norm for position \(i\) in the ranked order is:
\[
\|\tilde{A}'_i\|_F = \sqrt{\sum_{t=1}^i \left(A'_{\text{rank}}[t]\right)^2}
\]

The pruning position is found using parallel `torch.cumsum` and `torch.where` operators.

**Additional detail**: The first two layers (bottom layers) are always fully retained, as their attention distributions are more uniform and important for semantic understanding.

---

## Empirical Validation / Results

### 4.2 Main Results (Table 2)
ReFreeKV is compared against H2O, StreamingLLM (SLM), SnapKV, PyramidKV, and CAKE on 13 datasets. Key observations:

- **ReFreeKV** achieves near-lossless performance (and even surpasses full-cache for Llama3-8B by +0.12% and Qwen2.5-7B by +2.63%) with **no input-specific threshold tuning**. The automatic budget varies per dataset (e.g., 93.2% on GSM8K, 15.0% on QMSum).
- **Fixed-budget baselines**: At 90% budget, some methods approach full-cache performance, but at 50% or 20% budgets, performance collapses (e.g., H2O at 20% budget drops to 2.43% on GSM8K, 2.46% on GPQA).
- The dynamic budget of ReFreeKV naturally reflects task difficulty: complex math/science tasks require >90% budget, while simple QA/summarization tasks can use as little as 15–40% budget.

### Comparison with Twilight (Table 3)
Twilight uses a top-\(p\) inspired adaptive pruning but still requires model-specific tuning of \(p\). ReFreeKV achieves comparable or better results without such tuning.

| Model           | Method       | GSM8K | NQA  | Qasper | 2WkMQA | Musique |
|-----------------|--------------|-------|------|--------|--------|---------|
| Llama3-8B       | Ours         | 76.50 | 23.44| 37.38  | 36.21  | 15.96   |
| Llama3-8B       | Twi \(p=0.95\) | 76.04 | 23.37| 43.08  | 36.18  | 14.92   |
| Mistral-7B      | Ours         | 31.54 | 27.20| 38.93  | 33.46  | 22.15   |
| Mistral-7B      | Twi \(p=0.85\) | 30.33 | 27.17| 38.90  | 33.23  | 22.92   |

### Efficiency Analysis (Table 4)
End-to-end latency and pruning time for Llama3-8B and Llama3-70B across six datasets. ReFreeKV exhibits pruning latency comparable to baselines and achieves the **best overall generation time in 8 out of 12** comparisons, due to its adaptive compression.

| Model          | Method     | GSM8K Overall | Prune | Avg Overall | Avg Prune |
|----------------|------------|---------------|-------|-------------|-----------|
| Llama3-8B      | Ours       | 4.638         | 0.034 | 3.300       | 0.173     |
| Llama3-8B      | H2O (0.5)  | 4.686         | 0.020 | 3.318       | 0.176     |
| Llama3-70B     | Ours       | 15.975        | 0.154 | 8.290       | 1.623     |
| Llama3-70B     | SnapKV (0.5)| 16.517       | 0.117 | 8.487       | 1.529     |

*(All times in seconds, lower is better.)*

### Ablation Studies (Table 5, Figure 3)

- **Effect of \(k\) (number of rows averaged)**: Using \(k=1\) yields the best trade-off between compression and performance. Larger \(k\) (1%, 5%, 10% of \(n\)) either reduces compression or degrades performance (e.g., on NarrativeQA, 10%\(n\) gives 9.74 vs. 23.44 with \(k=1\)).
- **Alternative ranking by attention scores**: Replacing position-based ranking with attention-score ranking (e.g., H2O-style) fails to preserve performance, even with very low thresholds (0.01%).
- **Universal threshold sensitivity (Figure 3)**: The 1% threshold provides a robust balance. At 0.1%, budget increases but performance does not improve; at 10%, performance drops substantially (e.g., GSM8K retention falls to ~60%).

### Generalization to Larger Models (Table 6)
ReFreeKV applied to Llama3-70B, Qwen2.5-32B, and Qwen2.5-72B (same configuration as Table 2) consistently achieves near full-cache performance with average budgets around 50–90%, demonstrating cross-model generalization.

---

## Theoretical and Practical Implications

- **Practical significance**: The threshold-free design removes the need for dataset-specific hyperparameter tuning, making KV cache compression viable for open-domain, real-world LLM applications where input characteristics are unknown in advance.
- **Dynamic budget allocation**: By naturally allocating more cache to harder tasks and less to easier ones, ReFreeKV efficiently uses memory without sacrificing accuracy.
- **Efficiency**: The two-stage process is fully parallelizable (using `cumsum` and `where`), introducing negligible overhead (comparable to prior methods) while often improving overall throughput due to higher compression on simple tasks.
- **Potential for broader use**: The Uni-Metric (Frobenius norm of attention) provides a principled, input-agnostic signal for pruning quality, which could inspire future work beyond KV cache (e.g., adaptive sparse attention).

---

## Conclusion

ReFreeKV introduces a new objective for KV cache compression: **threshold-free** pruning that maintains full-cache performance without input-specific budget tuning. The method combines a simple position-based initial ranking with a universal metric based on the attention matrix Frobenius norm and a 1% stopping threshold. Extensive experiments across 13 datasets and multiple LLM families (Llama3, Mistral, Qwen2.5, up to 72B) demonstrate that ReFreeKV consistently achieves near-lossless compression while automatically adjusting the KV budget per input.

**Limitations**:
- The achieved compression ratio may still be suboptimal (e.g., on QMSum with Mistral-7B, 84.3% budget is retained while 50% would suffice).
- No formal guarantee on performance degradation; a small degradation (1.5%) was observed on Mistral-7B.

**Future directions**: Developing principled approaches to close the gap to the true optimal budget and provide formal performance guarantees.

---

_Markdown view of https://picx.dev/p/u1MqWG, served by PicX — AI-generated visual whiteboard summaries of research papers._